Level 1 is deliberately plain: take the question, embed it, search the vector store, hand the top chunks to MedGemma, and answer from them with a citation. No agents, no routing. The reason to build it first is that it sets the bar — if the fancier levels don't beat this, they're just complexity.
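
A minimal sketch of that flow, to make the shape concrete. The three callables (embed, search, generate) and the prompt wording are hypothetical stand-ins for EmbeddingGemma, the vector store and MedGemma; they are not the project's real interfaces.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical interfaces, assumed for illustration only.
Embedder = Callable[[str], Sequence[float]]
# Takes a query vector and top_k, returns (source_id, chunk_text) pairs.
Searcher = Callable[[Sequence[float], int], List[Tuple[str, str]]]
Generator = Callable[[str], str]


def answer_question(
    question: str,
    embed: Embedder,
    search: Searcher,
    generate: Generator,
    top_k: int = 4,
) -> str:
    """Embed the question, retrieve top_k chunks, and answer from them."""
    query_vector = embed(question)
    chunks = search(query_vector, top_k)
    # Label each chunk with its source id so the model can cite it.
    context = "\n\n".join(f"[{source_id}] {text}" for source_id, text in chunks)
    prompt = (
        "Answer using only the context below and cite the source id in brackets.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return generate(prompt)
```

There is one straight path from question to answer and nothing to route, which is why it works as a baseline to measure the later levels against.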

Embeddings: one model, several sizes

Queries and chunks are embedded with EmbeddingGemma. It was trained with Matryoshka representation learning, so the full 768-dimension vector from a single pass can be truncated to 512, 256 or 128 dimensions and re-normalized. That lets you trade retrieval quality against speed and storage without swapping models. I ran at 768 for quality, with optional caching so repeated queries aren't re-embedded.
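
Using a Matryoshka embedding at a smaller size usually means keeping the leading dimensions and re-normalizing. Here is a rough sketch of that step plus a simple in-memory cache. The names truncate_embedding and CachedEmbedder are made up for illustration, and embed_fn stands in for whatever actually calls the embedding model.

```python
import math
from typing import Callable, Dict, List, Sequence


def truncate_embedding(vec: Sequence[float], dim: int) -> List[float]:
    """Keep the first `dim` values and rescale the result to unit length."""
    if dim <= 0 or dim > len(vec):
        raise ValueError(f"dim must be between 1 and {len(vec)}, got {dim}")
    head = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # an all-zero vector cannot be normalized
    return [x / norm for x in head]


class CachedEmbedder:
    """Wraps a hypothetical embed_fn so repeated texts are embedded once."""

    def __init__(self, embed_fn: Callable[[str], Sequence[float]], dim: int = 768):
        self._embed_fn = embed_fn
        self.dim = dim
        self._cache: Dict[str, List[float]] = {}

    def __call__(self, text: str) -> List[float]:
        if text not in self._cache:
            full = self._embed_fn(text)
            self._cache[text] = truncate_embedding(full, self.dim)
        return self._cache[text]
```

Changing dim changes what the cache should hold, so a cache that outlives a single configuration would need the dimension in its key.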

The retriever is where the work is

The default retriever does two unglamorous but important things. It expands medical terms — "CKD" also searches "chronic kidney disease," and so on — so a query phrased one way still finds text phrased the other. And it matches on word boundaries, so "AKI" doesn't spuriously match the middle of another word. There's a configurable similarity threshold so genuinely irrelevant chunks are dropped rather than padded in.
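
A sketch of those three behaviours using only the standard library. The abbreviation table, the function names and the choice of case-insensitive matching are assumptions made for illustration, not the project's actual code.

```python
import re
from typing import Dict, List, Tuple

# Illustrative table only; a real one would be much larger.
ABBREVIATIONS: Dict[str, str] = {
    "CKD": "chronic kidney disease",
    "AKI": "acute kidney injury",
}


def contains_term(term: str, text: str) -> bool:
    """True only if `term` appears as a whole word or phrase in `text`."""
    return re.search(rf"\b{re.escape(term)}\b", text, re.IGNORECASE) is not None


def expand_query(query: str, table: Dict[str, str] = ABBREVIATIONS) -> List[str]:
    """Return the query plus variants with each known term swapped for its pair."""
    variants = [query]
    for short, long_form in table.items():
        # Expand in both directions: abbreviation to phrase and phrase to abbreviation.
        for old, new in ((short, long_form), (long_form, short)):
            pattern = re.compile(rf"\b{re.escape(old)}\b", re.IGNORECASE)
            if pattern.search(query):
                variants.append(pattern.sub(lambda _m, new=new: new, query))
    return variants


def filter_by_threshold(
    scored: List[Tuple[str, float]], threshold: float
) -> List[Tuple[str, float]]:
    """Drop chunks scoring below the threshold instead of padding the context."""
    return [(chunk, score) for chunk, score in scored if score >= threshold]
```

The word-boundary check is what stops "AKI" from matching inside "taking", which a plain substring test would do once case is ignored.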

I also built more elaborate retrievers — a hybrid one using reciprocal rank fusion, a tree retriever that routes by section, and RAPTOR-style and contextual variants — and kept them behind a single factory so a level can pick one. But the lesson from the evaluation was sobering: clever retrieval helps less than getting the chunks right in the first place.
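
For reference, reciprocal rank fusion is small enough to show in full. This is a generic sketch rather than the hybrid retriever's real code: each input is a ranked list of chunk ids, for example one from keyword search and one from vector search, and k=60 is the commonly used default, not a value taken from this project.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of chunk ids into one ranking.

    Each chunk scores the sum of 1 / (k + rank) over the lists it appears
    in, with ranks starting at 1. Higher fused score ranks first.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by id so the output is deterministic.
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))
```

Because only ranks are used, the two retrievers' raw scores never need to be put on a common scale.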

Generation, and saying "I don't know"

MedGemma answers only from the retrieved context, and cites it. The prompt's job is to keep it honest — answer from the chunks, and if the chunks don't cover the question, say so rather than improvising. For a clinical assistant that refusal behaviour is a feature, not a failure.
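
One way to express that contract in code, as a sketch. The prompt wording, the REFUSAL text and the function names are illustrative assumptions, and generate is a stand-in for the call to MedGemma.

```python
from typing import Callable, List, Tuple

# Assumed wording; the real refusal message may differ.
REFUSAL = "I don't know: the retrieved context does not cover this question."


def build_grounded_prompt(question: str, chunks: List[Tuple[str, str]]) -> str:
    """Build a prompt that restricts the model to the cited context."""
    context = "\n\n".join(f"[{source_id}] {text}" for source_id, text in chunks)
    return (
        "You are a clinical assistant. Answer ONLY from the context below.\n"
        "Cite the source id in brackets after each claim.\n"
        f"If the context does not answer the question, reply exactly: {REFUSAL}\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


def grounded_answer(
    question: str,
    chunks: List[Tuple[str, str]],
    generate: Callable[[str], str],
) -> str:
    """Refuse without calling the model when retrieval returned nothing."""
    if not chunks:
        return REFUSAL
    return generate(build_grounded_prompt(question, chunks))
```

Short-circuiting on an empty retrieval means the model is never asked to answer from nothing, which is the case where improvisation is most likely.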

The baseline lesson

For a lot of direct factual questions, simple RAG is genuinely enough. Build it, measure it, and only add machinery where it falls short. Everything in the next two parts is an answer to "where does this baseline break?" — not decoration on top of a working system.

Next: wrapping this in a LangGraph workflow that adds the safety and routing a clinical assistant needs. (Part 4)