Level 1 answers questions. It doesn't check what's in the question, doesn't notice when a question isn't really a retrieval question, and doesn't know whether its own answer was any good. Level 2 closes those three gaps by turning the flow into an explicit LangGraph state machine: each step is a node, and a shared state is passed between them.
The graph
Every query runs through the same path: check for PII → classify intent → branch. A normal clinical question goes retrieve → generate → evaluate. But not everything should hit the retriever, so the router has four exits:
- Retrieval — a real clinical question; do the full RAG path.
- Direct — a simple conversational turn that needs no documents.
- Clarification — too vague to answer; ask for more.
- Out of scope — not a CKD question; decline politely.
Making that routing explicit (rather than hoping one big prompt handles every case) is the whole point of Level 2: you can see which path fired, log it, and test each branch on its own.
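To make the shape concrete, here is a minimal sketch of the routing in plain Python, so it runs without LangGraph installed. Everything in it is an assumption for illustration: the node names, the `State` fields, and especially the keyword rules in `classify_intent`, which stand in for whatever classifier the real graph uses. In LangGraph, the `ROUTES` lookup would roughly correspond to a conditional edge out of the classifier node.

```python
from typing import Callable, Optional, TypedDict


class State(TypedDict, total=False):
    query: str
    intent: str
    answer: str
    path: list[str]  # names of the nodes that ran, for logging and tests


def check_pii(state: State) -> State:
    # Placeholder: the real node would redact identifiers here.
    return {"query": state["query"]}


def classify_intent(state: State) -> State:
    # Hypothetical keyword rules, standing in for the real classifier.
    q = state["query"].lower().strip()
    if q in {"hi", "hello", "thanks", "thank you"}:
        return {"intent": "direct"}
    if not any(term in q for term in ("ckd", "kidney", "egfr", "dialysis")):
        return {"intent": "out_of_scope"}
    if len(q.split()) < 3:
        return {"intent": "clarification"}
    return {"intent": "retrieval"}


def stub(answer: Optional[str] = None) -> Callable[[State], State]:
    # Builds a do-nothing node; optionally sets a canned answer.
    def node(state: State) -> State:
        return {"answer": answer} if answer else {}
    return node


NODES: dict[str, Callable[[State], State]] = {
    "check_pii": check_pii,
    "classify_intent": classify_intent,
    "retrieve": stub(),
    "generate": stub("(generated answer)"),
    "evaluate": stub(),
    "direct_reply": stub("You're welcome!"),
    "ask_clarification": stub("Could you tell me a bit more?"),
    "decline": stub("Sorry, I can only help with CKD questions."),
}

# Fixed edges: node -> next node. Nodes missing from this map end the run.
EDGES = {"check_pii": "classify_intent", "retrieve": "generate", "generate": "evaluate"}

# The conditional edge: intent label -> first node of that branch.
ROUTES = {
    "retrieval": "retrieve",
    "direct": "direct_reply",
    "clarification": "ask_clarification",
    "out_of_scope": "decline",
}


def run(query: str) -> State:
    state: State = {"query": query, "path": []}
    node: Optional[str] = "check_pii"
    while node is not None:
        state["path"].append(node)
        state.update(NODES[node](state))  # merge the node's partial update
        node = ROUTES[state["intent"]] if node == "classify_intent" else EDGES.get(node)
    return state
```

Because `path` records every node that fired, a test can assert on the branch taken rather than on the wording of the answer, for example that `run("What is the capital of France?")["path"]` ends in `"decline"`.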
PII first, always
The first node runs Microsoft Presidio with custom recognizers for things like NHS numbers, so anything identifying is caught and redacted before the query is logged or sent anywhere. In a health context this can't be bolted on later: it's the first thing that happens, by design.
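As a rough illustration of what an NHS-number recognizer has to do, the sketch below uses only the standard library rather than Presidio itself, so it is a stand-in and not Presidio's API. The function names and the `<NHS_NUMBER>` placeholder are assumptions. It matches 10-digit candidates (optionally grouped 3-3-4) and applies the NHS number's modulus-11 check digit before redacting.

```python
import re

# Ten digits, optionally grouped 3-3-4 with spaces or hyphens.
NHS_PATTERN = re.compile(r"\b(\d{3})[ -]?(\d{3})[ -]?(\d{4})\b")


def is_valid_nhs_number(digits: str) -> bool:
    """Apply the modulus-11 check used by NHS numbers."""
    if len(digits) != 10 or not digits.isdigit():
        return False
    # Weight the first nine digits 10, 9, ..., 2 and sum them.
    total = sum(int(d) * w for d, w in zip(digits[:9], range(10, 1, -1)))
    check = 11 - (total % 11)
    if check == 11:
        check = 0
    # A computed check value of 10 means the number is invalid.
    return check != 10 and check == int(digits[9])


def redact_nhs_numbers(text: str) -> str:
    """Replace checksum-valid NHS numbers with a placeholder token."""
    def replace(match: re.Match) -> str:
        digits = "".join(match.groups())
        return "<NHS_NUMBER>" if is_valid_nhs_number(digits) else match.group(0)

    return NHS_PATTERN.sub(replace, text)
```

Validating the check digit cuts false positives such as phone-number fragments, but it also lets a mistyped NHS number through unredacted. A stricter setup might redact every 10-digit match and accept the extra noise.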
The system scores itself
On the retrieval path, an evaluation node can run RAGAS on the response (faithfulness, context precision, context recall), so quality is measured per answer, not assumed. I also added CKD-specific checks: did it cite a source, did it include the right disclaimer, was the advice appropriate to the stage. Those custom metrics encode "what a good answer looks like here," which generic metrics can't.
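The two simpler custom checks could look something like the sketch below. It assumes citations appear as bracketed numbers like `[1]` and that the disclaimer contains a fixed phrase. Both are hypothetical choices made for the example, not the formats the actual system uses. The stage-appropriateness check is left out because it depends on clinical rules not described here.

```python
import re

# Assumed citation format: a bracketed number such as [1] or [12].
CITATION_PATTERN = re.compile(r"\[\d+\]")

# Assumed disclaimer wording; the real phrase may differ.
DISCLAIMER_PHRASE = "not a substitute for medical advice"


def custom_checks(answer: str) -> dict[str, bool]:
    """Return pass/fail flags for the rule-based checks on one answer."""
    return {
        "cites_source": bool(CITATION_PATTERN.search(answer)),
        "has_disclaimer": DISCLAIMER_PHRASE in answer.lower(),
    }
```

Returning a flat dict of booleans keeps these checks easy to log next to the RAGAS scores for the same answer.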
Why a graph, not just more prompt
The value isn't the LangGraph library — it's that the control flow becomes a thing you can inspect, test edge by edge, and extend. "PII is always checked first" and "out-of-scope questions are declined, not answered" stop being hopes and become structure. That's exactly the routing-control trade-off I wrote about in the ADK vs LangGraph comparison.
Next: when one assistant isn't focused enough — splitting it into specialists. (Part 5)