The retrieval graph for long documents.
For teams shipping agents that need to ground in their own documents — contracts, protocols, specs, transcripts, research.
- Auto-merging graph with full provenance
- Typed edges, ranked over MCP
- No SDK glue, no reranker wiring
- Your documents never leave
98.7% Hit@10 on MultiHop-RAG against a 74.7% baseline — same encoder, same corpus, different result.
MCP-native · CLI enabled · Self-hosted or managed
What ckem is
Three pieces of infrastructure, not three features.
Fine-grained matching, not flat lookups.
Long, multi-topic documents stop collapsing into a single noisy point. ckem scores the passage that actually answers the query — not the average of everything else in the document.
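To make the difference concrete, here is a minimal sketch that compares a single averaged document vector with per-passage scoring. It is not ckem's scoring code: the three-dimensional vectors are toy stand-ins for real embeddings, and the function names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def mean_vector(vectors):
    """Element-wise mean: what a single document-level embedding amounts to here."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def doc_level_score(query, passages):
    """Score the query against one averaged vector for the whole document."""
    return cosine(query, mean_vector(passages))

def passage_level_score(query, passages):
    """Score every passage and keep the best one (index, score)."""
    scores = [cosine(query, p) for p in passages]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]

# Toy embeddings (assumption): one document with three unrelated sections.
passages = [
    [1.0, 0.0, 0.0],  # payment terms
    [0.0, 1.0, 0.0],  # termination
    [0.0, 0.0, 1.0],  # confidentiality
]
query = [0.0, 1.0, 0.0]  # a question about termination

flat = doc_level_score(query, passages)
best_index, fine = passage_level_score(query, passages)
print(f"document-level: {flat:.3f}  passage-level: {fine:.3f} (passage {best_index})")
```

In this toy case the termination passage matches the query exactly, while the averaged vector dilutes that match with the two unrelated sections.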
Auto-merge with an audit trail.
The graph bounds itself as the corpus grows: near-duplicates merge on every write, originals soft-archive into derived_from edges. Provenance survives every merge — nothing is destroyed, just superseded.
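One way to picture merge-on-write is the sketch below. It assumes a cosine threshold of 0.95, an averaged vector for the merged node, and a simple in-memory store; none of these are confirmed details of ckem, and every class and field name is hypothetical. Only the derived_from edge type comes from the description above.

```python
import math
from dataclasses import dataclass, field

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

@dataclass
class Node:
    node_id: int
    text: str
    vector: list
    archived: bool = False  # soft-archive flag: the node is kept, not deleted

@dataclass
class Graph:
    threshold: float = 0.95  # assumed near-duplicate cutoff
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)  # (source_id, edge_type, target_id)
    _next_id: int = 0

    def _add(self, text, vector):
        node = Node(self._next_id, text, vector)
        self.nodes[node.node_id] = node
        self._next_id += 1
        return node

    def write(self, text, vector):
        """Store a passage; merge it with its nearest live neighbor if close enough."""
        incoming = self._add(text, vector)
        live = [n for n in self.nodes.values()
                if not n.archived and n.node_id != incoming.node_id]
        match = max(live, key=lambda n: cosine(n.vector, vector), default=None)
        if match is None or cosine(match.vector, vector) < self.threshold:
            return incoming
        # Near-duplicate: create a merged node, archive both originals,
        # and record where the merged node came from.
        merged_vector = [(a + b) / 2 for a, b in zip(match.vector, vector)]
        merged = self._add(incoming.text, merged_vector)
        for original in (match, incoming):
            original.archived = True
            self.edges.append((merged.node_id, "derived_from", original.node_id))
        return merged

g = Graph()
a = g.write("Either party may terminate with 30 days notice.", [1.0, 0.0])
b = g.write("Either party may terminate with 30 days' notice.", [0.99, 0.05])
c = g.write("Payment is due in 60 days.", [0.0, 1.0])
live_nodes = [n for n in g.nodes.values() if not n.archived]
```

After the three writes, two live nodes remain, and both near-duplicate originals are still stored and reachable through derived_from edges.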
Your documents never leave.
Local sentence-transformer embeddings by default, with no third-party API in the default path. Self-host via Docker, run in your own AWS account with the included Terraform, or let us manage it. Same code path.
Where ckem excels
The territory the rest of the industry skips.
End-to-end graph retrieval over long, multi-topic, cross-referencing documents — protocols, contracts, technical specs, transcripts. Built for the corpora where a single embedding per document averages everything that matters into noise.
Long, multi-topic documents
10k+ tokens with several distinct sections. Fine-grained ranking keeps the section that answers the query, where a single document-level embedding would average it into noise.
Cross-document reference resolution
When the answer lives across documents — clauses with their defined terms, amendments with what they supersede — ckem walks the typed graph and returns them together.
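A typed graph walk of this kind can be sketched as a breadth-first expansion that follows only selected edge types. The passage identifiers, the two-hop limit, and the expand function are illustrative assumptions, not ckem's API.

```python
from collections import deque

def expand(edges, start, edge_types, max_hops=2):
    """Collect nodes reachable from `start` via the allowed edge types.

    Returns (node, edge_type, hops) tuples in breadth-first order.
    """
    adjacency = {}
    for source, edge_type, target in edges:
        if edge_type in edge_types:
            adjacency.setdefault(source, []).append((edge_type, target))

    seen = {start}
    found = []
    queue = deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for edge_type, target in adjacency.get(node, []):
            if target not in seen:
                seen.add(target)
                found.append((target, edge_type, hops + 1))
                queue.append((target, hops + 1))
    return found

# Hypothetical passages: an amended clause, the clause it replaces,
# and the defined term that clause relies on.
edges = [
    ("msa:clause-4.2", "references", "msa:definitions:Cause"),
    ("amendment-1:clause-4.2", "supersedes", "msa:clause-4.2"),
    ("msa:clause-4.2", "similarity", "nda:clause-7"),
]
related = expand(edges, "amendment-1:clause-4.2", {"references", "supersedes"})
for node, edge_type, hops in related:
    print(f"{hops} hop(s) via {edge_type}: {node}")
```

Filtering by edge type keeps the merely similar NDA clause out of the result while the superseded clause and its defined term come back together.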
Version-aware retrieval
Supersession, amendments, deprecations modeled as first-class edges. Queries default to current; pin a date and get the corpus as it stood. Originals stay retrievable with provenance.
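Date pinning can be illustrated with a small filter over versioned passages. This sketch assumes each version carries an effective date and an optional pointer to the version it supersedes; the Version type and current_as_of function are hypothetical names, not part of ckem.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass(frozen=True)
class Version:
    passage_id: str
    text: str
    effective: date
    supersedes: Optional[str] = None  # passage_id of the version this one replaces

def current_as_of(versions, as_of=None):
    """Return the versions in force on `as_of` (defaults to today)."""
    as_of = as_of or date.today()
    in_force = [v for v in versions if v.effective <= as_of]
    # A version is hidden only if something already in force supersedes it.
    superseded = {v.supersedes for v in in_force if v.supersedes}
    return [v for v in in_force if v.passage_id not in superseded]

versions = [
    Version("clause-4.2@v1", "Terminate with 30 days notice.", date(2022, 1, 1)),
    Version("clause-4.2@v2", "Terminate with 60 days notice.", date(2023, 6, 1),
            supersedes="clause-4.2@v1"),
]

pinned = current_as_of(versions, date(2022, 12, 31))
latest = current_as_of(versions, date(2024, 1, 1))
print([v.passage_id for v in pinned], [v.passage_id for v in latest])
```

Nothing is removed from the list of versions: the older clause is filtered out of later results but remains available when the query pins an earlier date.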
Agent-native access
Your agents call retrieval directly over MCP. Self-hosted via Docker, run in your own AWS account with the included Terraform, or managed by us. Same code path; you pick who operates it.
How it works
Three steps. No magic.
- 01 Ingest
Any document, any size. Indexed at the passage level. Embeddings produced locally — your text doesn't leave the project.
- 02 Graph
Typed edges — similarity, supersedes, references, derived_from. Near-duplicate passages auto-merge on write, with provenance preserved on every merge.
- 03 Query
Fine-grained scoring across passages, optional graph-neighbor expansion. Returns ranked passages with their typed edges, directly to your agent over MCP.
# Agents call ckem over MCP
hits = await mcp.call("query", {
    "project": "corpus-id",
    "text": "What does section 4 say about termination?",
    "include_neighbors": True,
})
# hits[0].text → matching passage
# hits[0].score → 0.88
# hits[0].neighbors → typed cross-references

See ckem on your corpus.
Bring a sample of your documents — we'll run ckem on them and walk through the graph and retrieval together over MCP.