rpe-analysis-services: Known Failure Modes (classifier/sentiment/intent/dynamics/extractor)

Component: rpe-classifier
Tags: rpe-classifier,rpe-sentiment,rpe-intent,rpe-dynamics,rpe-extractor,llm,analysis
Author: phase-f-migration
Updated: 4/26/2026, 3:47:42 PM

The 5 RPE analysis services (rpe-classifier, rpe-sentiment, rpe-intent, rpe-dynamics, rpe-extractor) are HTTP servers called in parallel by rpe-context-assembler. Each calls llm-gateway to run LLM inference.

SHARED FAILURE MODES (apply to all 5):

1. llm-gateway unreachable: the LLM call fails and the service returns an error response to rpe-context-assembler. If all 5 services fail at the same time, llm-gateway is the likely root cause.

2. LLM timeout: inference takes longer than the callLLMFromTemplate timeout. The service appears up but returns 504 or times out. Check llm_queue for stuck jobs.

3. OOM kill: LLM inference is memory-intensive. The container is killed and restarted with exit code 137 (128 + 9, i.e. SIGKILL). Check the docker container status, then run docker inspect to confirm the OOM kill (State.OOMKilled) and to see the oom_kill_disable setting (HostConfig.OomKillDisable).

4. Prompt template missing: callLLMFromTemplate cannot find the template in MariaDB. The service returns 500 with a template-not-found error. Check the prompt_templates table for the specific template name.

5. All 5 failing simultaneously: almost certainly a shared dependency (llm-gateway, rabbitmq, or config-service) rather than 5 independent service failures.
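
The OOM check in failure mode 3 could be scripted roughly as follows. This is a minimal sketch, not part of the RPE codebase: it assumes you pass in the JSON text printed by docker inspect for one container, and the function name summarize_container and the sample data are hypothetical. The fields it reads (State.Status, State.ExitCode, State.OOMKilled, HostConfig.OomKillDisable) are standard docker inspect fields.

```python
import json

SIGKILL_EXIT_CODE = 137  # 128 + 9: the process was terminated by SIGKILL


def summarize_container(inspect_output: str) -> dict:
    """Summarize OOM-related fields from `docker inspect <container>` JSON text."""
    # docker inspect prints a JSON array with one object per inspected container.
    info = json.loads(inspect_output)[0]
    state = info.get("State", {})
    host_config = info.get("HostConfig", {})
    exit_code = state.get("ExitCode")
    oom_killed = bool(state.get("OOMKilled", False))
    return {
        "status": state.get("Status"),
        "exit_code": exit_code,
        "oom_killed": oom_killed,
        # OomKillDisable may be null in the JSON; treat that as "not disabled".
        "oom_kill_disable": bool(host_config.get("OomKillDisable") or False),
        # Exit code 137 only proves SIGKILL; OOMKilled confirms the OOM killer.
        "likely_oom": oom_killed or exit_code == SIGKILL_EXIT_CODE,
    }


if __name__ == "__main__":
    # Hypothetical sample, shaped like real `docker inspect` output.
    sample = json.dumps([{
        "State": {"Status": "restarting", "ExitCode": 137, "OOMKilled": True},
        "HostConfig": {"OomKillDisable": False},
    }])
    print(summarize_container(sample))
```

Note that likely_oom is deliberately loose: an exit code of 137 without OOMKilled set can also come from a manual docker kill, so treat it as a hint rather than proof.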

FIRST CHECKS:

Hit each health endpoint: http://rpe-{service}:{port}/health
If all 5 are down: check llm-gateway first, then rabbitmq consumer count.
If only 1 is down: check that specific container — likely OOM or startup crash.
Check docker container status: running/restarting/exited + exit code.
RabbitMQ: these are HTTP services, not queue consumers, so none of them has its own queue to check. RabbitMQ matters here only as a shared dependency when all 5 fail.
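
The first checks above can be sketched as a small triage script. Treat it as an illustration under assumptions: the ports in PORTS are placeholders (the real values are not listed here), the names probe, check_all and triage are hypothetical, and any failed or non-200 request is counted as "down". The branch for 2 to 4 services down is not covered by this runbook and is added only so the function is total.

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict

SERVICES = ["classifier", "sentiment", "intent", "dynamics", "extractor"]
# Placeholder ports (assumption): substitute the real port of each service.
PORTS: Dict[str, int] = {name: 8080 for name in SERVICES}


def probe(service: str, timeout: float = 3.0) -> bool:
    """Return True if http://rpe-{service}:{port}/health answers with HTTP 200."""
    url = f"http://rpe-{service}:{PORTS[service]}/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        # Connection refused, DNS failure, timeout, or HTTP error: count as down.
        return False


def check_all(probe_fn: Callable[[str], bool] = probe) -> Dict[str, bool]:
    """Probe all 5 services in parallel and return {service: healthy}."""
    with ThreadPoolExecutor(max_workers=len(SERVICES)) as pool:
        return dict(zip(SERVICES, pool.map(probe_fn, SERVICES)))


def triage(health: Dict[str, bool]) -> str:
    """Map health results to the next step suggested by the first checks."""
    down = sorted(name for name, ok in health.items() if not ok)
    if not down:
        return "all healthy"
    if len(down) == len(health):
        return "all down: check llm-gateway first, then rabbitmq consumer count"
    if len(down) == 1:
        return f"only rpe-{down[0]} down: check that container (likely OOM or startup crash)"
    # Not covered by the runbook (assumption): inspect each failing container.
    return "partial outage: check " + ", ".join(f"rpe-{n}" for n in down)
```

From a host that can resolve the rpe-* names, print(triage(check_all())) gives a one-line summary. Passing a stub as probe_fn allows the triage logic to be exercised without network access.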

DEPLOY TYPE: image-copy. All 5 require an image rebuild followed by up -d for code changes to take effect.