Concept

Failure Modes and Recovery

What goes wrong with multi-agent systems and what to do about it.

Three failure modes account for almost all multi-agent incidents. Doom loops: an agent keeps re-asking the same question without making progress (fix: per-step timeout + max-iterations). Capability leakage: a role accidentally has a tool it shouldn't (fix: role-policy audits run nightly). Context poisoning: bad data from one agent's tool result derails downstream agents (fix: planner re-validates inputs at each hand-off).

Build for these explicitly. Defensive design at the role level beats reactive incident response every time.

Check your understanding
Q1. What's a "doom loop" and how do you prevent it?
· Score 100% on the quiz.