Diagnose manifest builder failure in agent-brain

Component: agent-brain
Category: ai-quality
Version: 1
Author: claude
Last used: 6/3/2026, 9:41:27 PM

Prerequisites

MongoDB access (see reference_mongodb_local_client.md). MariaDB access. Docker log access via tail-logs.js or ssh-exec.js.

Expected Outcome

No "Manifest builder failed" warnings in logs. zoey_detail_log steps.manifest_builder shows non-null output with question_type, context_early, context_reinforce, answer_guidance populated.
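The "populated" check can be sketched as a small helper. This is a minimal illustration, not agent-brain code: the four field names come from the outcome above, while the function name missingManifestFields and the idea of passing in the manifest_builder output object directly are assumptions.

```javascript
// Field names are taken from the Expected Outcome; everything else here
// (function name, treating "" as unpopulated) is an illustrative assumption.
const REQUIRED_FIELDS = [
  "question_type",
  "context_early",
  "context_reinforce",
  "answer_guidance",
];

// Returns the required fields that are absent, null, or an empty string.
// An empty array means the manifest output looks fully populated.
function missingManifestFields(output) {
  if (output === null || typeof output !== "object") {
    return [...REQUIRED_FIELDS];
  }
  return REQUIRED_FIELDS.filter((key) => {
    const value = output[key];
    return value === null || value === undefined || value === "";
  });
}
```

An empty result corresponds to the healthy state described above; any returned name points at the field to inspect in the detail log.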

Steps

1. Confirm the signal: search Docker logs for "Manifest builder failed" or "JSON parse failed after repair attempt".
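If the log output has already been captured as text, the search can be sketched as a filter over its lines. This is a minimal sketch: the two signal strings come from step 1, while the function name findManifestSignals and the returned shape are assumptions. It does not call tail-logs.js or ssh-exec.js, whose options are not described here.

```javascript
// The two signal strings are quoted from step 1 of this procedure.
const SIGNALS = [
  "Manifest builder failed",
  "JSON parse failed after repair attempt",
];

// Takes captured log text and returns the matching lines with their
// 1-based line numbers, so hits can be located in the original output.
function findManifestSignals(logText) {
  return logText
    .split(/\r?\n/)
    .map((line, index) => ({ lineNumber: index + 1, line }))
    .filter(({ line }) => SIGNALS.some((signal) => line.includes(signal)));
}
```

An empty array means neither signal appears in the captured window, which matches the expected outcome.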

2. Check whether this is isolated to one user or all users:

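db.getSiblingDB("zoey").getCollection("zoey_detail_log").aggregate([
  // Keep only detail logs where the manifest builder recorded an error.
  { $match: { "steps.manifest_builder.error": { $ne: null } } },
  // Count failures per user, then list the ten worst-affected users.
  { $group: { _id: "$telegram_user_id", fails: { $sum: 1 } } },
  { $sort: { fails: -1 } },
  { $limit: 10 }
]);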

(Run via mongosh with the connection string in reference_mongodb_local_client.md.)

3. Check the manifest_builder step in the most recent failed detail log:

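db.getCollection("zoey_detail_log")
  .find(
    { "steps.manifest_builder.error": { $ne: null } },
    { "steps.manifest_builder": 1, telegram_user_id: 1, created_at: 1 }
  )
  // Sort newest first so the single result is the most recent failure.
  .sort({ created_at: -1 })
  .limit(1);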

This shows the error, skip_reason, latency_ms, and the raw LLM response.

4. If error = "empty response from manifest-builder":

- Check callLLM is returning a non-empty result.text. Look at the llm_call fields in the step for model, input_tokens, output_tokens.

- Verify the agent.manifest-builder template key exists in telegram_system.prompt_templates:

SELECT template_key, model, temperature FROM telegram_system.prompt_templates WHERE template_key = 'agent.manifest-builder';

- If the template is missing, the manifest builder falls back to calling callLLM directly. That is expected; the template key is advisory only.
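One way to read the llm_call fields from step 4 is sketched below. The heuristic is an assumption rather than documented agent-brain behavior: zero output tokens suggests the model returned nothing, while non-zero output tokens with an empty result.text suggests the text was lost after the API call. The function name triageEmptyResponse and the returned labels are hypothetical.

```javascript
// llmCall is the llm_call object from the manifest_builder step
// (fields model, input_tokens, output_tokens per step 4).
// The three labels returned here are illustrative, not real log values.
function triageEmptyResponse(llmCall) {
  if (llmCall === null || typeof llmCall !== "object") {
    // No call metadata at all: callLLM may never have run.
    return "no_llm_call";
  }
  if (!llmCall.output_tokens) {
    // Missing or zero output tokens: the model produced nothing.
    return "no_output_tokens";
  }
  // Tokens were billed but no text arrived: look at response handling.
  return "text_lost";
}
```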

5. If error = "JSON parse failed after repair attempt":

- Retrieve the raw_response from steps.manifest_builder.prompt_debug.raw_response in the detail log.

- Verify the LLM output is valid JSON. If it contains a prose preamble or markdown fences, repairJsonSimple should handle it. A response that contains no JSON at all (e.g. a refusal) is the likely cause.

- Check whether a specific question_type is over-represented in failures (unusual prompts may be triggering safety filters).
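The distinction in step 5 between repairable output and a response with no JSON can be sketched as a classifier over raw_response. This is an illustration only and is not the implementation of repairJsonSimple, whose actual repair rules are not shown here. The function names and the five labels are assumptions.

```javascript
// Built at runtime so this example contains no literal markdown fence.
const FENCE = "`".repeat(3);
const FENCED_BLOCK = new RegExp(FENCE + "(?:json)?\\s*([\\s\\S]*?)" + FENCE, "i");

// True when the candidate parses to a JSON object or array.
function parsesAsJson(candidate) {
  try {
    const value = JSON.parse(candidate);
    return value !== null && typeof value === "object";
  } catch {
    return false;
  }
}

// Labels: "empty", "valid_json", "fenced_json", "json_with_prose", "no_json".
function classifyRawResponse(raw) {
  if (typeof raw !== "string" || raw.trim() === "") return "empty";
  const text = raw.trim();
  if (parsesAsJson(text)) return "valid_json";

  // JSON wrapped in a markdown fence.
  const fenced = text.match(FENCED_BLOCK);
  if (fenced && parsesAsJson(fenced[1])) return "fenced_json";

  // JSON object surrounded by prose: try the outermost brace pair.
  const start = text.indexOf("{");
  const end = text.lastIndexOf("}");
  if (start !== -1 && end > start && parsesAsJson(text.slice(start, end + 1))) {
    return "json_with_prose";
  }
  return "no_json";
}
```

Under these assumptions, "fenced_json" and "json_with_prose" are the cases a repair pass would be expected to recover, while "no_json" matches the refusal scenario named above.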

6. For all failure modes: agent-brain logs WARN and continues without a manifest. The scratchpad and primary model still run. Users receive a response, just without curated P3.5/P12.9 context.

7. If failures are sustained (>5 in 30 min): check Anthropic API status. Haiku 4.5 outages or rate limits are the most likely root cause.
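The "more than 5 in 30 minutes" threshold can be checked with a sliding window over failure timestamps, for example the created_at values of failed detail logs. This is a minimal sketch: the function name isSustainedFailure and the choice to treat the window as inclusive are assumptions.

```javascript
const WINDOW_MS = 30 * 60 * 1000; // 30 minutes, from step 7
const THRESHOLD = 5;              // "sustained" means more than this many

// failureTimes: Date objects, ISO strings, or epoch milliseconds.
// Returns true if any 30-minute span contains more than THRESHOLD failures.
function isSustainedFailure(failureTimes, windowMs = WINDOW_MS, threshold = THRESHOLD) {
  const times = failureTimes
    .map((t) => new Date(t).getTime())
    .sort((a, b) => a - b);

  let left = 0;
  for (let right = 0; right < times.length; right++) {
    // Shrink the window until it spans at most windowMs.
    while (times[right] - times[left] > windowMs) left++;
    if (right - left + 1 > threshold) return true;
  }
  return false;
}
```

Six failures spaced five minutes apart trip the check; five failures, or six spread over an hour, do not.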

8. Restart agent-brain after any code fix (the code is bind-mounted, so a restart is sufficient):

node ssh-exec.js --compose section45 restart agent-brain