agent-brain: Known Failure Modes

Component: agent-brain
Tags: agent-brain,llm,llm_queue,llm-gateway,rabbitmq,response
Author: phase-f-migration
Updated: 4/26/2026, 3:47:42 PM

agent-brain consumes from the agent.chat.request queue, calls the LLM via llm-gateway (through llm_queue), and publishes the response to rpe.telegram.response.

FAILURE MODES:

1. llm-gateway unreachable: LLM job submission fails and agent-brain cannot generate a response. Check llm_queue for stuck jobs.

2. llm_queue job stalled: the job has been in the "queued" or "processing" state for more than 5 minutes. llm-gateway may have crashed mid-job or may be overloaded. Check the job's row in the llm_queue table and check llm-gateway health.

3. llm_queue job failed: the LLM returned an error (API error, token limit, content policy). Check llm_queue.error_message for the cause.

4. RabbitMQ disconnect: agent-brain cannot consume from agent.chat.request. Messages accumulate in the queue.

5. RabbitMQ publish failure: agent-brain cannot publish the response to rpe.telegram.response. LLM inference completed, but the response is lost.

6. agent.chat.request queue depth rising: agent-brain is not consuming. Consumer count drops to 0.

7. Neo4j unavailable: UserProfile reads for personality adaptation fail gracefully. The response is generated without graph context.

8. MongoDB write failure: the assistant response is not saved to agent_chatlog. Non-fatal for delivery, but it affects conversation continuity.
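
As a rough illustration of how failure modes 2 and 3 could be told apart from a single llm_queue row, here is a minimal Python sketch. It rests on several assumptions: the `status` column name, the "failed" status value, and the use of `created_at` as the start of the 5-minute clock are all hypothetical. Only the "queued" and "processing" states, the 5-minute threshold, and `error_message` come from the list above.

```python
from datetime import datetime, timedelta
from typing import Optional

# Threshold taken from failure mode 2.
STALL_THRESHOLD = timedelta(minutes=5)


def classify_job(
    status: str,
    created_at: datetime,
    now: datetime,
    error_message: Optional[str] = None,
) -> str:
    """Classify one llm_queue row as 'failed: ...', 'stalled', or 'ok'.

    Assumptions (not confirmed by the runbook): the row exposes a `status`
    value, failed jobs use the value "failed", and job age is measured
    from `created_at`.
    """
    if status == "failed":
        # Failure mode 3: surface the recorded cause, if any.
        return "failed: " + (error_message or "no error_message recorded")
    if status in ("queued", "processing") and now - created_at > STALL_THRESHOLD:
        # Failure mode 2: still pending after more than 5 minutes.
        return "stalled"
    return "ok"
```

A job that has been in "processing" for six minutes would be reported as "stalled", while one at exactly five minutes would still be "ok", since the threshold is strictly "more than 5 minutes".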

CRITICAL DISTINCTION:

Check: SELECT * FROM telegram_system.llm_queue WHERE created_at BETWEEN {request_time} - INTERVAL 60 SECOND AND {request_time} + INTERVAL 5 MINUTE ORDER BY created_at;
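
The `{request_time}` window in the check above can be computed in code rather than substituted by hand. The following Python sketch is one possible way to do that. The helper name `llm_queue_window` and the `%s` placeholder style are assumptions (the latter matches the paramstyle of common MySQL drivers, but should be adjusted for whichever driver is actually in use).

```python
from datetime import datetime, timedelta
from typing import Tuple

# Window from the check above: 60 seconds before to 5 minutes after the request.
LOOKBACK = timedelta(seconds=60)
LOOKAHEAD = timedelta(minutes=5)

# Same query as the check, with the two window bounds passed as parameters.
# The %s placeholders are an assumption about the database driver.
LLM_QUEUE_QUERY = (
    "SELECT * FROM telegram_system.llm_queue "
    "WHERE created_at BETWEEN %s AND %s "
    "ORDER BY created_at"
)


def llm_queue_window(request_time: datetime) -> Tuple[datetime, datetime]:
    """Return the (start, end) bounds of the lookup window for a request."""
    return request_time - LOOKBACK, request_time + LOOKAHEAD
```

With a DB-API style cursor, the two pieces would typically be combined as `cursor.execute(LLM_QUEUE_QUERY, llm_queue_window(request_time))`, which avoids interpolating the timestamp into the SQL string.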

FIRST CHECKS:

DEPLOY TYPE: bind-mount. Changes require only a restart.