Benchmarks
Six public benchmarks. Real numbers.
We run ckem on the benchmarks the retrieval community already agrees on — multi-hop reasoning, multi-step composition, reasoning-intensive search, long-document QA, legal contract retrieval, and financial QA. Same corpus, same queries, same judgement files as everyone else.
Two of those — LegalBench-RAG and FinanceBench — live on their own pages because the corpora and the pitch are domain-specific.
General · multi-hop reasoning
MultiHop-RAG
609 news-corpus questions that each require composing evidence from multiple documents: inference, comparison, temporal, and null-answer queries.
- Hit@10: 98.7% vs. 74.7%
- Hit@4: 92.7% vs. 66.2%
- MRR: 80.3 vs. 58.6
- MAP: 60.6 vs. 47.9
What this tells us: when the answer requires composing evidence across documents, the typed graph keeps the right passages within reach instead of burying them under near-duplicates.
Ongoing work: MultiHop-RAG's corpus is short, single-topic news articles — easier territory than the long, multi-topic contracts and filings the domain benchmarks below cover. Hit@1 at 67% leaves the most room: when two articles cover overlapping sub-claims, the second-best passage is often also correct, which the typed graph could resolve by following derived_from edges. A learned reranker over the top-30 candidates is the next lever; a news-domain LoRA on top of the Qwen3 encoder is queued behind it.
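For readers who want to check the arithmetic behind Hit@k, MRR, and MAP, here is a minimal sketch of the standard per-query definitions. It is not the ckem evaluation harness: the function names and the passage-id representation are assumptions made for illustration, and the upstream benchmark scripts may normalise MAP differently (for example, by truncating at a cutoff).

```python
from typing import Sequence, Set


def hit_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """1.0 if any relevant id appears in the top k results, else 0.0."""
    return float(any(doc_id in relevant for doc_id in ranked[:k]))


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Mean of precision@rank, taken at each rank holding a relevant result."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    # Dividing by all relevant ids penalises relevant passages never retrieved.
    return total / len(relevant)


def mean_over_queries(per_query_scores: Sequence[float]) -> float:
    """Benchmark-level score: the plain average of the per-query scores."""
    return sum(per_query_scores) / len(per_query_scores) if per_query_scores else 0.0
```

With a hypothetical ranking `["d3", "d1", "d7", "d2"]` and relevant set `{"d1", "d2"}`, Hit@1 is 0, Hit@4 is 1, the reciprocal rank is 0.5, and the average precision is 0.5. MRR and MAP are the means of the last two quantities over all queries.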
General · multi-step composition
MuSiQue-Ans (dev)
2–4 hop questions built so the bridging entity has to be recovered before the answer can be, designed to resist shortcut answering from a single passage.
- Hit@1: 86.6% vs. 48%
- Hit@5: 97.9% vs. 60%
- Retrieval F1: 50.9 vs. 30
What this tells us: the gain over a strong single-vector encoder isn't in Hit@5 — both methods can find the right passage given five chances. It's in Hit@1: the right passage ranked first.
Ongoing work: Retrieval F1 at 50.9 is well above baseline but still leaves room. The remaining failures are chains where the bridging entity is lexically dissimilar from the question, for example “the composer of the score Spielberg used for his shark film.” Sub-question decomposition (split the hop, retrieve each half, intersect) and a small graph walk over typed entity edges before the second-hop encode are the two interventions on the bench.
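The "split the hop, retrieve each half, intersect" idea can be sketched in a few lines. This is an illustration under loud assumptions, not the ckem implementation: the retriever is a toy token-overlap scorer standing in for a real encoder, the sub-questions are written by hand rather than produced by a decomposition model, and the four-passage corpus is made up for the example.

```python
import re
from typing import Dict, List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens; a stand-in for a real encoder."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, corpus: Dict[str, str], k: int) -> List[str]:
    """Top-k passage ids by token overlap (toy scorer, ties broken by id)."""
    q = tokenize(query)
    scored = sorted(
        corpus.items(),
        key=lambda item: (-len(q & tokenize(item[1])), item[0]),
    )
    return [doc_id for doc_id, text in scored[:k] if q & tokenize(text)]


def decompose_and_intersect(
    sub_questions: List[str], corpus: Dict[str, str], k: int
) -> List[str]:
    """Retrieve for each sub-question, then keep passages found by all of them."""
    candidate_sets = [set(retrieve(sq, corpus, k)) for sq in sub_questions]
    if not candidate_sets:
        return []
    return sorted(set.intersection(*candidate_sets))


# Hypothetical toy corpus for the shark-film example.
CORPUS = {
    "jaws": "Jaws is a 1975 shark film directed by Steven Spielberg "
            "with a score by John Williams.",
    "williams": "John Williams is an American composer known for film scores.",
    "et": "E.T. is a 1982 film directed by Steven Spielberg.",
    "shark": "The great white shark is a large predatory fish.",
}

SUB_QUESTIONS = [
    "Which shark film did Spielberg direct?",
    "Who composed the score for that film?",
]

bridging = decompose_and_intersect(SUB_QUESTIONS, CORPUS, k=2)
print(bridging)  # ['jaws']
```

The intersection isolates the passage that satisfies both halves of the chain. In a real pipeline the second sub-question would typically be rewritten with the bridging entity recovered from the first hop before it is encoded.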
General · reasoning-intensive retrieval
BRIGHT (biology)
A deliberately hard retrieval benchmark: queries where lexical and semantic similarity are not enough, and the relevant passage requires reasoning to identify. We run the biology subset.
- nDCG@10: 23.2 vs. 17.6
What this tells us: BRIGHT is the benchmark that punishes pure embedding-similarity approaches. Numbers across the field are low here — beating the strongest published baseline by 5+ points is meaningful.
Ongoing work: BRIGHT is the toughest of our public benchmarks. The 23.2 nDCG@10 here is from a stronger encoder configuration than the default ckem stack; folding those gains into the main pipeline (so the rest of the table moves with it) is in flight. The other 11 BRIGHT sub-tasks (Earth Science, Economics, StackExchange, etc.) are queued. A reasoning-aware reranker, which passes the candidate and the original query back through the encoder with chain-of-thought, is the next algorithmic step.
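For reference, nDCG@10 rewards placing relevant passages near the top of the ranking and discounts them logarithmically by position. The sketch below uses the common linear-gain formulation; the function names and the dictionary of graded relevance labels are assumptions for illustration, and the upstream BRIGHT evaluation code may differ in details such as gain scaling.

```python
import math
from typing import Dict, Sequence


def dcg_at_k(gains: Sequence[float], k: int) -> float:
    """Discounted cumulative gain: gain at position i is divided by log2(i + 2)."""
    return sum(gain / math.log2(i + 2) for i, gain in enumerate(gains[:k]))


def ndcg_at_k(ranked: Sequence[str], relevance: Dict[str, float], k: int = 10) -> float:
    """DCG of the system ranking divided by the DCG of the ideal ranking."""
    gains = [relevance.get(doc_id, 0.0) for doc_id in ranked]
    ideal_gains = sorted(relevance.values(), reverse=True)
    ideal = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0
```

A ranking that places every relevant passage first scores 1.0. For a hypothetical ranking `["a", "b", "c"]` where only `b` and `c` are relevant, the score is about 0.69, because both relevant passages sit one position lower than they ideally would.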
On the bench · next sprint
Upcoming benchmarks
Runs queued for the next benchmark sprint. Numbers post here when they land.
General · long-document QA
LongRAG
QA over long retrieval units (full Wikipedia documents instead of short passages), designed to test whether a retrieval system keeps the relevant document on top when units are large.
4–8K-token retrieval units are exactly the regime where averaging-into-noise hurts flat retrieval most. Ingest sprint is mid-flight; we expect this to move the most as the chunking strategy stabilises.
General · grounded visual QA
CiteVQA
Visual question answering with span-level citations into the source page. Every answer has to point at the exact figure, caption, or table cell that supports it — a retrieval problem dressed as a VQA problem.
On the bench: extending the typed graph across page-image regions so a figure caption, the figure itself, and the paragraph that references it stay linked through ingest. Numbers post when the page-region encoder is wired through.
Domain benchmarks
LegalBench-RAG and FinanceBench live on their own pages.
The corpora are domain-specific — long contracts in one case, full 10-K filings in the other — and the pitch is different enough that we keep the numbers next to the surrounding context.
LegalBench-RAG: 776 query-answer pairs over PrivacyQA, CUAD, MAUD, and ContractNLI.
FinanceBench: 150 questions over 84 10-K filings (PatronusAI).
Upcoming domain benchmarks
Corpora and judgement files in evaluation. Pages and numbers post when the runs land.
Clinical-guideline retrieval over UpToDate-style references and discharge notes. Span-level evidence requirement.
Retrieval over trial protocols, FDA labels, and dossiers; inclusion / exclusion criteria with effective-date provenance.
QA over engineering specs, SOPs, and audit trails — long, multi-section technical documents with cross-references.
Statute and case-law retrieval with citation chains. Supersession and amendment modelled as first-class edges.
Methodology
- 1. Public benchmark, untouched. We use the upstream corpus and judgement files verbatim. No re-labelling, no question filtering, no held-out subsets we picked ourselves.
- 2. Same encoder where it matters. When we compare against a single-vector baseline, both sides use the same open-source embedding model — the variable is ckem's retrieval layer, not the encoder.
- 3. Held-out queries. No benchmark query is in the training or tuning loop.
- 4. Reproducible by us. Scripts and configs available for shared review during a pilot. Underlying corpora are shared subject to their original licensing.
- 5. Ongoing work disclosed. Each benchmark notes the open metrics and the specific interventions in flight — fine-tuned encoders, learned rerankers, query decomposition, graph-aware traversal. Where ckem ties a baseline or a metric sits below where we want it, we say so.
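The held-out rule in point 3 is the kind of claim that is easy to spot-check mechanically. A minimal sketch, assuming benchmark and training queries are available as plain lists of strings; the function names are hypothetical, and exact matching after normalisation will not catch paraphrased leaks.

```python
import re
import unicodedata
from typing import Iterable, List


def normalize(query: str) -> str:
    """Case-fold, replace punctuation with spaces, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", query).lower()
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())


def leaked_queries(
    benchmark_queries: Iterable[str], training_queries: Iterable[str]
) -> List[str]:
    """Benchmark queries whose normalised form also appears in the training set."""
    seen = {normalize(q) for q in training_queries}
    return [q for q in benchmark_queries if normalize(q) in seen]
```

An empty result is necessary but not sufficient evidence of a clean split; near-duplicate detection over embeddings would be a stronger check.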
Bench against your own corpus.
Bring 50–200 documents and a labeled query set. We'll run ckem against your current retrieval and share the comparison.