fix: collision-resistant doc_name — same-stem documents no longer overwrite each other by KylinMountain · Pull Request #96 · VectifyAI/OpenKB

KylinMountain · 2026-06-12T14:11:38Z

Problem

doc_name is the filename stem, and it names every artifact (wiki/sources/{doc_name}.md|json, images/{doc_name}/, wiki/summaries/{doc_name}.md, [[summaries/{doc_name}]]). Two different files with the same stem — a/report.md vs b/report.md, or report.md vs report.pdf — silently overwrite each other's pages.
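A minimal sketch of the collision, assuming the stem is taken with `pathlib` (the description does not show the actual derivation). `artifact_path` is a hypothetical helper that mirrors the `wiki/sources/{doc_name}.md` pattern above; it is not a function from the codebase.

```python
from pathlib import PurePosixPath


def artifact_path(source: str) -> str:
    # Hypothetical helper: name the artifact after the bare filename stem,
    # which is the pre-fix behaviour described above.
    doc_name = PurePosixPath(source).stem
    return f"wiki/sources/{doc_name}.md"


sources = ["a/report.md", "b/report.md", "report.pdf"]
targets = {src: artifact_path(src) for src in sources}

# Three different sources, one target file: the last write wins.
assert len(set(targets.values())) == 1
```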

Approach (Scheme A: clean names by default, suffix only on collision)

doc_name stays the sanitized stem (NFKC + [^\w\-]+ → -); only when that name is already claimed by a different registered document does it get a deterministic -{sha256(path)[:8]} suffix. Unique names — the overwhelmingly common case — are unchanged, and existing KBs need zero migration.
Identity is anchored by a registry path index (get_by_path over new path/raw_path/source_path metadata): re-ingesting an edited file or a watch-mode edit maps back to the same doc_name and overwrites in place, replacing the stale content-hash entry.
Legacy entries (no path field) are matched by stem (NFKC-normalized both sides — macOS NFD filenames match NFC entries) and backfilled, so editing a pre-upgrade document does not fork a duplicate.
The registry is the single authority on name ownership: unclaimed on-disk artifacts are adopted, so a retry after a failed compile keeps its clean name instead of drifting to a suffixed one.
index_long_document takes an explicit doc_name; openkb remove locates raw copies via the recorded raw_path (raw copies are now named by doc_name); lint's missing-entry check resolves raw files through the registry before stem matching (fixes false positives for dotted names like arXiv 2509.11420.pdf).
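The naming rule above can be sketched roughly as follows. This is an illustration under assumptions, not the PR's code: `sanitize_stem`, `stems_match`, and `resolve_doc_name` are hypothetical names, and the registry is reduced to a plain dict mapping `doc_name` to the owning path.

```python
import hashlib
import re
import unicodedata


def sanitize_stem(stem: str) -> str:
    # NFKC-normalize, then replace each run of characters outside [\w-] with "-".
    normalized = unicodedata.normalize("NFKC", stem)
    return re.sub(r"[^\w\-]+", "-", normalized)


def stems_match(a: str, b: str) -> bool:
    # Normalize both sides so an NFD filename matches an NFC registry entry.
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)


def resolve_doc_name(path: str, stem: str, registry: dict[str, str]) -> str:
    # Assumed registry shape: doc_name -> path of the document that owns it.
    # 1. Path index: a path that is already registered keeps its name.
    for name, owner in registry.items():
        if owner == path:
            return name
    # 2. Clean name when nobody else has claimed it.
    candidate = sanitize_stem(stem)
    if candidate not in registry:
        return candidate
    # 3. Deterministic suffix only when a different document owns the name.
    digest = hashlib.sha256(path.encode("utf-8")).hexdigest()[:8]
    return f"{candidate}-{digest}"
```

In this sketch, adding `a/report.md` to an empty registry yields `report`; adding `b/report.md` afterwards yields `report-` plus eight hex digits, and resolving either path again returns the name it already owns.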

Hardening found during implementation

dedup early-return is read-only (a duplicate copy can no longer backfill its path onto the original entry)
the CLI constructs its registry instance after conversion, so converter-side legacy backfills aren't clobbered by a stale in-memory snapshot (regression test reproduces the lost-update)
remove's raw-name fallback is restricted to legacy entries without raw_path, eliminating a cross-deletion edge
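The lost update behind the second item can be reproduced in miniature. The sketch below assumes a registry that loads a snapshot at construction and rewrites the whole file on save; the `Registry` class, `convert`, and both `add_*` functions are hypothetical stand-ins, not the project's API.

```python
import json
import tempfile
from pathlib import Path


class Registry:
    # Hypothetical registry: snapshot on construction, full rewrite on save.
    def __init__(self, file: Path):
        self.file = file
        self.entries = json.loads(file.read_text())

    def save(self) -> None:
        self.file.write_text(json.dumps(self.entries))


def convert(file: Path) -> None:
    # Stand-in for the converter backfilling a path onto a legacy entry.
    reg = Registry(file)
    reg.entries["report"]["path"] = "docs/report.md"
    reg.save()


def add_stale(file: Path) -> None:
    reg = Registry(file)  # snapshot taken BEFORE conversion
    convert(file)
    reg.entries["report"]["hash"] = "new"
    reg.save()  # rewrites the file from the stale snapshot


def add_fixed(file: Path) -> None:
    convert(file)
    reg = Registry(file)  # snapshot taken AFTER conversion
    reg.entries["report"]["hash"] = "new"
    reg.save()


def run(add) -> dict:
    # Start from one legacy entry with no path field, then run one add flow.
    with tempfile.TemporaryDirectory() as tmp:
        file = Path(tmp) / "registry.json"
        file.write_text(json.dumps({"report": {"name": "report"}}))
        add(file)
        return json.loads(file.read_text())["report"]
```

With `add_stale` the backfilled `path` is gone from the saved entry; with `add_fixed` it survives alongside the new hash.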

Relation to #30

Supersedes #30, which pioneered this fix (thank you @saccharin98!) but used an unconditional stem-<hash12> name for every document (uglier names + a mixed-naming migration for existing KBs) and predates the current add_single_file (Literal return codes, long-PDF doc_id persistence). Core ideas are carried over and credited via Co-authored-by.

Tests

New coverage (24 new tests): unique name stays clean; same basename in different dirs; same stem different extension; CJK/full-width punctuation sanitization; all-symbol stems; edited re-ingest keeps identity and replaces the stale hash entry; retry-after-failed-compile keeps its name; duplicate-copy skip doesn't poison paths; legacy entry reuse + NFKC path backfill; oldest-generation ({name, type}-only) entries converge to a single registry entry; registry path index; lint registry resolution; remove with renamed raw copies. Full suite: 716 passed (the 4 test_url_ingest failures are pre-existing — missing optional trafilatura in this env — and reproduce on clean main).

Unclaimed on-disk artifacts no longer force a suffix (fixes doc_name drift on retry after a failed compile), and the dedup early-return now derives the name from the stored entry without invoking the resolver (no legacy path backfill from duplicate copies).

The identity model (path-keyed registry metadata, collision-resistant naming) builds on the approach pioneered in PR #30. Co-authored-by: Xinyan Zhou <xinyanzhou938@gmail.com>

KylinMountain · 2026-06-12T14:21:51Z

Final end-to-end review notes (full-lifecycle walkthroughs executed with real code, LLM compile mocked) — all green, plus known boundaries worth recording:

Long-PDF re-add leaks the old PageIndex doc (minor, follow-up): when an edited long PDF is re-added, remove_by_doc_name drops the stale entry including its old doc_id without calling the PageIndex cleanup, so the managed PDF copy + SQLite row become unreachable. Same class of residue existed on main; a follow-up could clean the old doc_id before replacing the entry.
Identity is path-anchored by design: moving a file AND editing it before re-adding mints a new (suffixed) document; the old entry remains under the dead path. Known Scheme A boundary.
Suffixed candidates aren't re-checked for collisions ({stem}-{8hex} clash probability ≈ 2⁻³², deterministic behavior if it ever happens).
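For context on the 2⁻³² figure: 8 hex digits give 16⁸ = 2³² possible suffixes, so two given paths sharing a stem collide with probability 2⁻³² if the truncated SHA-256 is treated as uniform. The helper below is a hypothetical back-of-the-envelope bound (a union bound over pairs) for `n` suffixed documents sharing one stem; it is not part of the PR.

```python
def suffix_clash_probability(n: int, hex_digits: int = 8) -> float:
    # Upper bound on the chance that any two of n suffixes coincide,
    # assuming the truncated hash is uniformly distributed.
    space = 16 ** hex_digits
    pairs = n * (n - 1) // 2
    return min(1.0, pairs / space)


# Two documents: exactly the per-pair figure quoted above.
assert suffix_clash_probability(2) == 2 ** -32
```

Even a thousand suffixed documents with the same stem stay on the order of 1e-4 under this bound.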

Residual pre-existing items (not introduced here): registry has no file locking (remove snapshots at prompt time); openkb list doesn't show the doc_name slug, so same-named docs render identically — surfacing the slug would pair well with this fix.

KylinMountain and others added 15 commits June 12, 2026 19:30

feat(state): index registry entries by path for doc identity

1c0440b

feat(state): match legacy registry entries by stem for path backfill

3389803

docs(state): document first-match-wins + truthy-path semantics in find_legacy_by_stem

f429f7d

feat(converter): portable registry path key

bd22291

feat(converter): collision-resistant doc_name resolution (Scheme A)

31e9c4d

fix(converter,state): NFKC-normalize both sides of name comparisons + edge tests

5ca7375

feat(converter): name all artifacts by collision-resistant doc_name

85760ab

feat(indexer): accept explicit doc_name for long-doc artifacts

c84cf54

feat(cli): persist path identity metadata on add

feff3d3

fix(cli): construct registry after convert so legacy backfills aren't clobbered

5aed0ee

fix(lint): resolve raw files through the registry before stem matching

05d7d3f

fix(cli): remove locates raw copies via recorded raw_path

f63a7c9

fix(cli): restrict raw-name fallback to legacy entries without raw_path

9da4bbc

credit: collision-resistant doc_name groundwork

299e283

KylinMountain mentioned this pull request Jun 13, 2026

feat(cli): import existing PageIndex Cloud indices via add --from-pageindex-cloud (closes #88) #97

Open

5 tasks

KylinMountain wants to merge 15 commits into main from fix/doc-name-collision
