rpe-analysis-services: Known Failure Modes (classifier/sentiment/intent/dynamics/extractor)
The 5 RPE analysis services (rpe-classifier, rpe-sentiment, rpe-intent, rpe-dynamics, rpe-extractor) are HTTP servers called in parallel by rpe-context-assembler. Each calls llm-gateway to run LLM inference.
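The parallel call pattern explains why failures show up per service in the assembler's view. The sketch below is only an illustration of that pattern, not the actual rpe-context-assembler code: fan_out and the injected call function are hypothetical names, and the real implementation may differ in language and error handling.

```python
from concurrent.futures import ThreadPoolExecutor

# Service names taken from this document.
SERVICES = [
    "rpe-classifier",
    "rpe-sentiment",
    "rpe-intent",
    "rpe-dynamics",
    "rpe-extractor",
]


def fan_out(call, payload, services=SERVICES):
    """Call every service in parallel and record each outcome separately.

    `call(name, payload)` is a hypothetical function that performs the HTTP
    request to one service and raises on failure.
    """

    def safe_call(name):
        try:
            return name, ("ok", call(name, payload))
        except Exception as exc:
            # One failing service must not abort the other four.
            return name, ("error", str(exc))

    with ThreadPoolExecutor(max_workers=len(services)) as pool:
        return dict(pool.map(safe_call, services))
```

With this shape, a single failing service yields one "error" entry while the other four still return results, which is the signal used in the checks further down to tell a shared-dependency outage from an individual failure.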
SHARED FAILURE MODES (apply to all 5):
1. llm-gateway unreachable: the LLM call fails and the service returns an error response to rpe-context-assembler. If all 5 services fail this way at once, llm-gateway is the likely root cause.
2. LLM timeout: inference takes longer than the callLLMFromTemplate timeout. The service appears up but returns 504/timeout on requests. Check llm_queue for stuck jobs.
3. OOM kill: LLM inference is memory-intensive. The container is killed and restarted with exit code 137 (128 + SIGKILL). Check docker container status, then docker inspect: State.OOMKilled confirms an OOM kill, and HostConfig.OomKillDisable shows the oom_kill_disable setting.
4. Prompt template missing: callLLMFromTemplate cannot find the template in MariaDB. The service returns 500 with a template-not-found error. Check the prompt_templates table for the specific template name.
5. All 5 failing simultaneously: almost certainly a shared dependency (llm-gateway, rabbitmq, or config-service) rather than five independent service failures.
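The symptom-to-cause mapping in the list above can be sketched as a small lookup. This is a rough triage aid under stated assumptions, not part of any service: classify_symptom and its return labels are hypothetical, the substring matches on the error text are guesses at the wording, and treating any other 5xx as a reason to check llm-gateway is an assumption.

```python
def classify_symptom(http_status=None, exit_code=None, error_text=""):
    """Map an observed symptom to the most likely shared failure mode."""
    text = error_text.lower()

    # Exit code 137 = 128 + 9 (SIGKILL); confirm with docker inspect,
    # since a manual kill produces the same code.
    if exit_code == 137:
        return "oom_kill_suspected"

    # Failure mode 2: service is up but the LLM call exceeded its timeout.
    if http_status == 504 or "timeout" in text or "timed out" in text:
        return "llm_timeout"

    # Failure mode 4: assumed that the error text mentions the template.
    if http_status == 500 and "template" in text:
        return "prompt_template_missing"

    # Assumption: remaining server errors point at the upstream LLM call.
    if http_status is not None and http_status >= 500:
        return "check_llm_gateway"

    return "unknown"
```

For example, classify_symptom(http_status=500, error_text="template-not-found: some_template") returns "prompt_template_missing", which points at the prompt_templates table rather than at llm-gateway.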
FIRST CHECKS:
- Hit each health endpoint: http://rpe-{service}:{port}/health
- If all 5 are down: check llm-gateway first, then rabbitmq consumer count.
- If only 1 is down: check that specific container; the likely cause is an OOM kill or a startup crash.
- Check docker container status: running/restarting/exited + exit code.
- RabbitMQ: these are HTTP services, not queue consumers, so there is no per-service queue to check. RabbitMQ matters here only as a shared dependency.
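The first checks above can be scripted as a health sweep followed by the all-down versus one-down decision. This is a minimal sketch, assuming each /health endpoint answers HTTP 200 when healthy. The port numbers are placeholders (the real ports are not listed here), and health_url, probe, sweep, and triage are hypothetical helper names.

```python
import urllib.request

# PLACEHOLDER ports: replace with the real port of each service.
PORTS = {
    "classifier": 8001,
    "sentiment": 8002,
    "intent": 8003,
    "dynamics": 8004,
    "extractor": 8005,
}


def health_url(service, port):
    """Build the health URL in the form given in this document."""
    return f"http://rpe-{service}:{port}/health"


def probe(url, timeout=3.0):
    """Return True if the endpoint answers 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False


def sweep(ports=PORTS, probe=probe):
    """Probe every service; returns {service: is_healthy}."""
    return {svc: probe(health_url(svc, port)) for svc, port in ports.items()}


def triage(health):
    """Turn sweep results into the next step from FIRST CHECKS."""
    down = sorted(svc for svc, ok in health.items() if not ok)
    if not down:
        return "all healthy"
    if len(down) == len(health):
        return "all down: check llm-gateway first, then rabbitmq consumer count"
    if len(down) == 1:
        return f"only rpe-{down[0]} down: check that container (OOM or startup crash likely)"
    # Not covered by the checks above; fall back to per-container inspection.
    return "several down: check each failing container: " + ", ".join(down)
```

Running print(triage(sweep())) from a host that can resolve the rpe-* names gives the suggested next step. The probe argument is injectable so the decision logic can be exercised without network access.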
DEPLOY TYPE: image-copy. All 5 require an image rebuild followed by up -d before code changes take effect.