feat(compiler): opt-in top-K retrieval for concepts-plan (bound plan prompt as the KB grows) by cnndabbler · Pull Request #92 · VectifyAI/OpenKB

cnndabbler · 2026-06-09T19:20:16Z

Motivation

The concepts-plan step injects every existing concept/entity brief into the prompt (_read_concept_briefs / _read_entity_briefs read the whole concepts/ and entities/ dirs). So the plan prompt grows O(N) with the KB. On a 165-doc KB this was observed climbing from ~2k tokens early to ~15–18k tokens at a few hundred concepts — which hurts cost/latency on every subsequent doc and degrades the model's ability to reconcile a new doc against the right existing pages (it starts creating near-duplicate concepts).

What this does (opt-in, default off)

Adds openkb/retrieval.py and a small block in _compile_concepts that, only when enabled, keeps the top-K briefs most relevant to the current doc's summary instead of all of them. Two config keys (default off → behaviour byte-identical):

concepts_plan_retrieval: false     # set true to enable
concepts_plan_retrieval_k: 40

Default ranker is dependency-free TF-IDF cosine over the brief lines (no new deps, no extra API calls).
An optional embedding ranker (select_relevant_briefs_embed, provider injected — no SDK dependency in the module) is included for higher-drift corpora / future hybrid use.
No-ops when the brief set is already within budget, so small KBs are unaffected.
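The selection step described above can be sketched as follows. This is a minimal illustration of dependency-free TF-IDF cosine ranking, not the code in openkb/retrieval.py: the function name select_top_k_briefs, the tokenizer, the smoothed IDF formula, and the choice to return survivors in their original order are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def _tokenize(text: str) -> list[str]:
    # Lowercase alphanumeric tokens; the real tokenizer may differ (assumption).
    return re.findall(r"[a-z0-9]+", text.lower())


def select_top_k_briefs(briefs: list[str], query: str, k: int) -> list[str]:
    """Keep the k brief lines most similar to `query` (hypothetical helper)."""
    if len(briefs) <= k:
        return list(briefs)  # already within budget: behave as a no-op

    docs = [Counter(_tokenize(b)) for b in briefs]
    n = len(docs)
    # Document frequency: each term counted once per brief it appears in.
    df = Counter(term for doc in docs for term in doc)
    # Smoothed IDF so that no term gets a zero or negative weight.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}

    def vectorize(counts: Counter) -> dict[str, float]:
        # Terms unseen in any brief carry no weight and are dropped.
        return {t: tf * idf[t] for t, tf in counts.items() if t in idf}

    def cosine(a: dict[str, float], b: dict[str, float]) -> float:
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        na = math.sqrt(sum(w * w for w in a.values()))
        nb = math.sqrt(sum(w * w for w in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    q = vectorize(Counter(_tokenize(query)))
    scored = [(cosine(q, vectorize(d)), i) for i, d in enumerate(docs)]
    # Highest score first; ties broken by original position for determinism.
    top = sorted(scored, key=lambda s: (-s[0], s[1]))[:k]
    # Return the survivors in their original order so the prompt stays stable.
    return [briefs[i] for i in sorted(i for _, i in top)]
```

With the doc summary as `query` and `k` taken from concepts_plan_retrieval_k, the plan prompt would receive at most k brief lines however large the KB grows.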

Benchmark (recall@K vs. ground-truth concept links)

On a real 335-concept / 489-entity KB, using each summary's [[concepts/X]] links as ground truth and the summary as the query (scripts/bench_retrieval.py):

K	TF-IDF	Embeddings (text-embedding-3-small)	prompt size
20	0.79	0.67	6% of full
40	0.90	0.79	12% of full

TF-IDF wins here (LLM-generated briefs share heavy lexical overlap with summaries) and is free per-doc, so it's the default. K=40 recovers ~90% of the relevant concepts at 12% of the full-inject prompt size.
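For readers who want to reproduce the metric, here is a hedged sketch of how recall@K against summary links could be computed. It is not scripts/bench_retrieval.py: the helper names, the link regex (including its handling of `|alias` and `#anchor` suffixes), and the per-doc averaging are assumptions.

```python
import re

# Captures X from [[concepts/X]], stopping at an optional |alias or #anchor
# (the alias/anchor handling is an assumption about the wikilink syntax).
LINK_RE = re.compile(r"\[\[concepts/([^\]|#]+)")


def ground_truth_concepts(summary: str) -> set[str]:
    """Concept names a summary links to, used as the relevant set."""
    return {name.strip() for name in LINK_RE.findall(summary)}


def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant concepts that appear in the top k ranked names."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def mean_recall_at_k(runs: list[tuple[list[str], set[str]]], k: int) -> float:
    """Average recall@k over docs, skipping docs with no linked concepts."""
    scores = [recall_at_k(ranked, rel, k) for ranked, rel in runs if rel]
    return sum(scores) / len(scores) if scores else 0.0
```

Under this definition, a recall@40 of 0.90 means that on average 90% of the concepts a summary links to are among the 40 briefs the ranker keeps for that doc.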

Tradeoff

At K=40, ~10% of relevant existing concepts may fall outside the window for a given doc (small risk of a duplicate concept) in exchange for a bounded plan prompt as the KB scales. Off by default; tune ..._k higher to trade prompt size for recall.

Tests: tests/test_retrieval.py (ranker behaviour) + full suite green.

The concepts-plan step injects every existing concept/entity brief, so the prompt grows O(N) with the KB (~2k->18k tokens at a few hundred concepts), hurting speed/cost and the model's ability to reconcile against the right existing pages (-> near-duplicate concepts). Add an opt-in top-K relevance filter (openkb/retrieval.py, dependency-free TF-IDF cosine over brief lines, query = doc summary). Off by default (concepts_plan_retrieval / concepts_plan_retrieval_k in config), so behaviour is unchanged unless enabled. The select_relevant_briefs() interface is swappable for an embedding-based ranker later. Prototype for the O(N)->O(K) plan-context scaling discussion.

Benchmark on a real 335-concept KB (ground truth = summary concept links): TF-IDF recall@40=0.90 vs embeddings 0.79. Dependency-free TF-IDF wins on this LLM-generated corpus (high lexical overlap) and is free per-doc, so it stays the default; embedding ranker kept as an option for higher-drift corpora.

KylinMountain · 2026-06-12T10:53:57Z

Took a close look — the approach is sound and the key safety property checks out, so this is good to land with one cleanup.

Verified: filtering the briefs does not shrink the wikilink whitelist. The top-K filter only narrows the plan-step context (concept_briefs/entity_briefs → the plan prompt) at compiler.py#L1411-L1415. The set of valid [[wikilink]] targets is built independently from list_existing_wiki_targets(wiki_dir) at compiler.py#L1576-L1577, so the model can still link to every existing page and the ghost-link stripper won't drop anything. The only cost is reduced reconciliation context (a doc may not "see" a relevant existing concept's brief and so create a near-duplicate), which is exactly the tradeoff the PR documents. Putting the filter here is the right call.

One ask before merge: the embedding ranker is dead code on the prod path. select_relevant_briefs_embed (retrieval.py#L95) is never called by the compiler (only select_relevant_briefs is); its sole caller is the bench script. I'd either wire it behind a config switch (e.g. concepts_plan_retrieval_ranker: tfidf|embed) so it's actually reachable, or drop it from this PR and reintroduce it when the hybrid path lands. I'd prefer the merged surface to be exactly what is used.
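As a rough illustration of the first option, the switch could look something like the sketch below. Everything here is hypothetical: concepts_plan_retrieval_ranker is only the key name suggested above, and resolve_ranker, make_embed_ranker, and the EmbedFn signature are invented for the example and are not taken from retrieval.py.

```python
import math
from typing import Callable, Optional, Sequence

# A ranker takes (briefs, query, k) and returns the briefs to keep.
Ranker = Callable[[list[str], str, int], list[str]]
# Injected embedding provider: texts in, one vector per text out (assumption).
EmbedFn = Callable[[Sequence[str]], list[list[float]]]


def _cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def make_embed_ranker(embed: EmbedFn) -> Ranker:
    """Build an embedding ranker around an injected provider (no SDK import)."""
    def rank(briefs: list[str], query: str, k: int) -> list[str]:
        if len(briefs) <= k:
            return list(briefs)  # within budget: no-op, no provider call
        vectors = embed([query, *briefs])
        q, docs = vectors[0], vectors[1:]
        order = sorted(range(len(briefs)), key=lambda i: (-_cosine(q, docs[i]), i))
        return [briefs[i] for i in sorted(order[:k])]
    return rank


def resolve_ranker(config: dict, rankers: dict[str, Ranker]) -> Optional[Ranker]:
    """Return the configured ranker, or None when retrieval is disabled."""
    if not config.get("concepts_plan_retrieval", False):
        return None  # default off: caller injects every brief as before
    name = config.get("concepts_plan_retrieval_ranker", "tfidf")
    if name not in rankers:
        raise ValueError(f"unknown ranker {name!r}; expected one of {sorted(rankers)}")
    return rankers[name]
```

With a registry such as `{"tfidf": select_relevant_briefs, "embed": make_embed_ranker(provider)}`, both rankers would be reachable from the compiler, and an unset key would still fall back to TF-IDF.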

Minor:

load_config is re-read inside _compile_concepts though the caller already loaded it — harmless, just a tiny extra read per doc.
docs_retrieval_findings.md + scripts/bench_retrieval.py probably should not ship in the repo (the script reads .env/API keys and a hardcoded KB path). Suggest folding the durable conclusion (TF-IDF > embeddings here, K=40 ≈ 90% recall@K) into the retrieval.py docstring and keeping the rest out of the open-source tree.

Nice work — the diagnosis and the opt-in/default-off rollout are exactly right.

cnndabbler added 2 commits June 9, 2026 10:58

cnndabbler wants to merge 2 commits into VectifyAI:main from cnndabbler:feat/retrieval-concepts-plan
