From 22 Clinical PDFs to a Vector Store (RAG Data Pipeline)

Everyone wants to talk about the model. But for a RAG system, the answer is only ever as good as what you retrieved, and what you retrieve is decided here — in the pipeline that turns source documents into vectors. For the CKD assistant (Part 1) that source is 22 clinical guideline PDFs from NICE, KDIGO, the UK Kidney Association and Kidney Care UK.

Pipeline: 22 clinical PDFs through Docling OCR, section splitting, cleaning, block-aware chunking and EmbeddingGemma into a ChromaDB vector store

OCR that respects tables

Clinical guidelines are full of tables — dosing thresholds, staging tables — and a naive text extract turns those into mush. I used IBM's Docling with table-structure recognition on, so a table comes out as a table, not a scrambled run of numbers. It exports both markdown (readable) and a structured JSON tree. Some pages are scanned, so OCR is on too.
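As a rough illustration, the conversion step might look like the sketch below. It assumes Docling's Python API (`DocumentConverter` configured through `PdfPipelineOptions`). The helper names `output_paths` and `convert_pdf` and the one-folder output layout are made up for this example, not taken from the actual pipeline.

```python
import json
from pathlib import Path


def output_paths(pdf_path: Path, out_dir: Path) -> tuple[Path, Path]:
    """Return (markdown_path, json_path) for one source PDF (hypothetical layout)."""
    stem = pdf_path.stem
    return out_dir / f"{stem}.md", out_dir / f"{stem}.json"


def convert_pdf(pdf_path: Path, out_dir: Path) -> tuple[Path, Path]:
    """Convert one PDF with OCR and table-structure recognition switched on."""
    # Imported lazily so this module still loads where Docling is not installed.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    options = PdfPipelineOptions()
    options.do_ocr = True              # some pages are scanned images
    options.do_table_structure = True  # keep tables as tables, not loose numbers

    converter = DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=options)}
    )
    document = converter.convert(str(pdf_path)).document

    md_path, json_path = output_paths(pdf_path, out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Markdown for human review, JSON tree for the later structural steps.
    md_path.write_text(document.export_to_markdown(), encoding="utf-8")
    json_path.write_text(json.dumps(document.export_to_dict()), encoding="utf-8")
    return md_path, json_path
```

Writing both exports side by side keeps the readable copy and the structured copy in step, which helps when a table looks wrong in review and you need to trace it back to the parsed tree.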

Splitting out the parts that aren't clinical content

A guideline PDF is maybe 60% clinical content and 40% noise for my purposes — title pages, conflict-of-interest declarations, a long bibliography, abbreviations. If you embed all of it, your retriever happily returns a citation list when someone asks about potassium. So each document is split into main_text, references, and front/end matter, and only the clinical content goes forward.

I did the classification with a hybrid approach: MedGemma classifies each heading, with a regex heuristic as a fallback when the model isn't available. The fallback matters — it means the pipeline still runs without a GPU, just a bit more bluntly.
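To make the fallback concrete, here is a minimal sketch of a regex heading classifier with an optional model in front of it. The patterns, the `front_end_matter` label name, and the `llm_classifier` callable (standing in for a MedGemma call) are assumptions for illustration; the real patterns would need tuning against the actual guidelines.

```python
import re

LABELS = {"main_text", "references", "front_end_matter"}

# Optional leading section number such as "7." or "2.1 ".
NUMBERING = r"^\s*(?:\d+(?:\.\d+)*[.)]?\s+)?"

# Hypothetical patterns. References must be the whole heading, so that
# "Reference ranges for potassium" is not mistaken for a bibliography.
REFERENCE_RE = re.compile(NUMBERING + r"(?:references?|bibliography)\s*$", re.IGNORECASE)
MATTER_RE = re.compile(
    NUMBERING
    + r"(?:contents|table of contents|abbreviations|glossary|acknowledge?ments?|"
    r"conflicts? of interest|declarations? of interest|appendix|appendices)\b",
    re.IGNORECASE,
)


def classify_heading_regex(heading: str) -> str:
    """Blunt keyword fallback: label a heading without any model."""
    if REFERENCE_RE.match(heading):
        return "references"
    if MATTER_RE.match(heading):
        return "front_end_matter"
    return "main_text"


def classify_heading(heading: str, llm_classifier=None) -> str:
    """Prefer the model's label; fall back to the regex if it is missing or fails."""
    if llm_classifier is not None:
        try:
            label = llm_classifier(heading)
            if label in LABELS:
                return label
        except Exception:
            pass  # model unavailable or errored: use the heuristic instead
    return classify_heading_regex(heading)
```

Defaulting to `main_text` is a deliberate bias in this sketch: an unrecognised heading is kept rather than dropped, on the view that losing clinical content is worse than keeping a little noise.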

Cleaning and block-aware chunking

Next comes a manual cleaning pass for OCR artifacts. I tracked each document's review status (pass, needs-check, or delete); one KDIGO guideline had been superseded by a newer edition and was dropped. Finally, chunking. The rule I care about: don't slice through a coherent block. A chunk that ends mid-table or mid-recommendation retrieves badly, so the chunker is block-aware rather than a blind fixed-size window.
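The block-aware rule can be sketched as a greedy packer over whole blocks. This is a simplified illustration, not the actual chunker: the `Block` type, the character-based budget, and the default `max_chars` value are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Block:
    kind: str  # e.g. "paragraph", "table", "recommendation"
    text: str


def chunk_blocks(blocks: list[Block], max_chars: int = 1500) -> list[str]:
    """Pack whole blocks into chunks of roughly max_chars; never split a block.

    A single block longer than max_chars becomes its own oversized chunk,
    because cutting a table in half is the thing we are trying to avoid.
    """
    chunks: list[str] = []
    current: list[str] = []
    size = 0
    for block in blocks:
        text = block.text.strip()
        if not text:
            continue
        # +2 accounts for the blank line that joins blocks inside a chunk.
        added = len(text) + (2 if current else 0)
        if current and size + added > max_chars:
            # This block would overflow the chunk: close it and start a new one.
            chunks.append("\n\n".join(current))
            current, size = [], 0
            added = len(text)
        current.append(text)
        size += added
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

A fixed-size window would have cut wherever the character count ran out. Here the only legal cut points are block boundaries, so a table or a recommendation always arrives at the retriever in one piece.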

The clean chunks are embedded with EmbeddingGemma into 768-dimension vectors and stored in ChromaDB under one collection. That collection is the shared foundation every level in this series reads from.
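A sketch of the indexing step, under stated assumptions: the collection name `ckd_guidelines`, the storage path, the Hugging Face model ID `google/embeddinggemma-300m`, and loading it through `sentence-transformers` are guesses for illustration, not details from the actual pipeline. The embedder and the collection are passed in, so the batching logic can be exercised without either library installed.

```python
import hashlib
from typing import Callable, Sequence


def chunk_id(doc_name: str, index: int, text: str) -> str:
    """Stable ID, so re-indexing an unchanged chunk overwrites instead of duplicating."""
    digest = hashlib.sha1(text.encode("utf-8")).hexdigest()[:12]
    return f"{doc_name}-{index:04d}-{digest}"


def index_chunks(
    collection,
    embed: Callable[[Sequence[str]], Sequence[Sequence[float]]],
    doc_name: str,
    chunks: Sequence[str],
    batch_size: int = 64,
) -> int:
    """Embed chunks in batches and upsert them into one collection."""
    total = 0
    for start in range(0, len(chunks), batch_size):
        batch = list(chunks[start:start + batch_size])
        vectors = embed(batch)
        collection.upsert(
            ids=[chunk_id(doc_name, start + i, t) for i, t in enumerate(batch)],
            documents=batch,
            embeddings=[[float(x) for x in v] for v in vectors],
            metadatas=[
                {"source": doc_name, "chunk_index": start + i}
                for i in range(len(batch))
            ],
        )
        total += len(batch)
    return total


def open_store(path: str = "chroma_db", name: str = "ckd_guidelines"):
    """Return (collection, embed). Path, name and model ID are assumptions."""
    # Imported lazily so the helpers above work without these packages.
    import chromadb
    from sentence_transformers import SentenceTransformer

    client = chromadb.PersistentClient(path=path)
    collection = client.get_or_create_collection(name=name)
    model = SentenceTransformer("google/embeddinggemma-300m")
    return collection, lambda texts: model.encode(list(texts)).tolist()
```

Storing the source document and chunk index as metadata costs almost nothing at index time and makes it much easier later to trace a bad retrieval back to the chunk and PDF it came from.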

The point

Most "the RAG isn't working" problems are pipeline problems: bad OCR, references polluting the index, chunks cut through the middle of a table. Spend your effort here before you reach for a bigger model. I found this out the hard way in the evaluation: nearly all the failures were retrieval failures, and those trace straight back to this stage.

Next: the simplest system that uses this store — Level 1, plain RAG. (Part 3)
