agent-brain: Known Failure Modes
agent-brain consumes from the agent.chat.request queue, calls the LLM via llm-gateway (through llm_queue), and publishes the response to rpe.telegram.response.
FAILURE MODES:
1. llm-gateway unreachable — LLM job submission fails. agent-brain cannot generate a response. Check llm_queue for stuck jobs.
2. llm_queue job stalled — job is in "queued" or "processing" state for > 5 minutes. llm-gateway may have crashed mid-job or is overloaded. Check the llm_queue table for the specific job and llm-gateway health.
3. llm_queue job failed — LLM returned an error (API error, token limit, content policy). Check llm_queue.error_message for the cause.
4. RabbitMQ disconnect — cannot consume from agent.chat.request. Messages accumulate in queue.
5. RabbitMQ publish failure — cannot publish response to rpe.telegram.response. LLM inference completed but response is lost.
6. agent.chat.request queue depth rising — agent-brain not consuming. Consumer count drops to 0.
7. Neo4j unavailable — UserProfile reads for personality adaptation fail gracefully. Response generated without graph context.
8. MongoDB write failure — assistant response not saved to agent_chatlog. Non-fatal for delivery but affects conversation continuity.
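A minimal sketch of how failure modes 2 and 3 could be told apart for a single llm_queue row. The column names (status, created_at) and the "failed" status value are assumptions; only the "queued" and "processing" states, the 5 minute threshold, and llm_queue.error_message come from the list above. Measuring the stall from created_at is also an assumption, since the table may track state changes differently.

```python
from datetime import datetime, timedelta

# Threshold taken from failure mode 2.
STALL_THRESHOLD = timedelta(minutes=5)


def classify_job(status, created_at, now, error_message=None):
    """Map one llm_queue row to a failure mode (hypothetical helper)."""
    if status == "failed":  # assumed status value
        detail = error_message or "no error_message recorded"
        return f"failed (mode 3): {detail}"
    if status in ("queued", "processing") and now - created_at > STALL_THRESHOLD:
        return "stalled (mode 2)"
    return "ok"
```

A job created 6 minutes ago that is still "processing" is reported as stalled; a job queued 1 minute ago is reported as ok.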
CRITICAL DISTINCTION:
- "agent-brain entered but no dispatch to telegram-sender" → LLM job likely stalled or failed.
Check: SELECT * FROM telegram_system.llm_queue WHERE created_at BETWEEN {request_time} - INTERVAL 60 SECOND AND {request_time} + INTERVAL 5 MINUTE ORDER BY created_at;
- "agent-brain never entered" → RabbitMQ or upstream issue.
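The distinction and the query window above can be sketched as follows. This is illustrative only: triage and llm_queue_window are hypothetical helper names, and the %s placeholders assume a MySQL driver that uses the DB-API "format" parameter style (for example PyMySQL). The window bounds mirror the INTERVAL values in the query above.

```python
from datetime import datetime, timedelta

# Parameterised form of the llm_queue check; bounds are computed in Python.
QUERY = (
    "SELECT * FROM telegram_system.llm_queue "
    "WHERE created_at BETWEEN %s AND %s ORDER BY created_at"
)


def llm_queue_window(request_time):
    """Return (start, end): 60 seconds before to 5 minutes after the request."""
    return (
        request_time - timedelta(seconds=60),
        request_time + timedelta(minutes=5),
    )


def triage(entered_agent_brain, dispatched_to_sender):
    """Translate the two trace observations into the next place to look."""
    if not entered_agent_brain:
        return "RabbitMQ or upstream issue"
    if not dispatched_to_sender:
        return "LLM job likely stalled or failed: check llm_queue"
    return "agent-brain path looks healthy"
```

With a DB-API cursor, the call would look like cursor.execute(QUERY, llm_queue_window(request_time)).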
FIRST CHECKS:
- RabbitMQ: consumer count on agent.chat.request queue (should be 1). Depth rising = agent-brain not consuming.
- llm_queue: check for stuck or failed jobs in the request time window.
- llm-gateway health: GET http://llm-gateway:{port}/health
- Container status: check for OOM kills (LLM context can be large).
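A rough sketch of the first two checks in Python, under several assumptions: the RabbitMQ management plugin is enabled and reachable at a base URL you supply, the default vhost "/" is in use, and llm-gateway answers its health endpoint with HTTP 200 when healthy. The function names are hypothetical. The "consumers" and "messages" fields are those returned by the management API's /api/queues/{vhost}/{name} endpoint.

```python
import base64
import json
import urllib.request


def queue_verdict(queue_info):
    """Interpret a RabbitMQ management API queue object."""
    consumers = queue_info.get("consumers", 0)
    depth = queue_info.get("messages", 0)
    if consumers == 0:
        return f"no consumer, {depth} message(s) waiting: agent-brain is not consuming"
    if consumers != 1:
        return f"unexpected consumer count {consumers} (expected 1)"
    return f"consumer present, depth {depth}"


def fetch_queue_info(base_url, user, password,
                     queue="agent.chat.request", vhost="%2F"):
    """Fetch queue details; base_url and credentials are deployment-specific."""
    req = urllib.request.Request(f"{base_url}/api/queues/{vhost}/{queue}")
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", f"Basic {token}")
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.load(resp)


def gateway_healthy(health_url):
    """True if the health endpoint answers 200 (assumed health contract)."""
    try:
        with urllib.request.urlopen(health_url, timeout=5) as resp:
            return resp.status == 200
    except OSError:  # covers URLError, HTTPError, and socket timeouts
        return False
```

queue_verdict is kept separate from the network call so the interpretation can be exercised against a saved API response.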
DEPLOY TYPE: bind-mount. Changes require only a restart.