fix: collision-resistant doc_name — same-stem documents no longer overwrite each other#96

Open
KylinMountain wants to merge 15 commits into main from fix/doc-name-collision

@KylinMountain

Collaborator

Problem

doc_name is the filename stem, and it names every artifact (wiki/sources/{doc_name}.md|json, images/{doc_name}/, wiki/summaries/{doc_name}.md, [[summaries/{doc_name}]]). Two different files with the same stem — a/report.md vs b/report.md, or report.md vs report.pdf — silently overwrite each other's pages.

Approach (Scheme A: clean names by default, suffix only on collision)

  • doc_name stays the sanitized stem (NFKC normalization, then each run matching [^\w\-]+ replaced with -); only when that name is already claimed by a different registered document does it get a deterministic -{sha256(path)[:8]} suffix. Unique names, the overwhelmingly common case, are unchanged, and existing KBs need zero migration.
  • Identity is anchored by a registry path index (get_by_path over new path/raw_path/source_path metadata): re-ingesting an edited file or a watch-mode edit maps back to the same doc_name and overwrites in place, replacing the stale content-hash entry.
  • Legacy entries (no path field) are matched by stem (NFKC-normalized both sides — macOS NFD filenames match NFC entries) and backfilled, so editing a pre-upgrade document does not fork a duplicate.
  • The registry is the single authority on name ownership: unclaimed on-disk artifacts are adopted, so a retry after a failed compile keeps its clean name instead of drifting to a suffixed one.
  • index_long_document takes an explicit doc_name; openkb remove locates raw copies via the recorded raw_path (raw copies are now named by doc_name); lint's missing-entry check resolves raw files through the registry before stem matching (fixes false positives for dotted names like arXiv 2509.11420.pdf).
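
The naming rule above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `sanitize_stem`, `resolve_doc_name`, and the `claimed` dict (doc_name to owning path) are hypothetical stand-ins for the real resolver and registry, and the sketch omits legacy-entry matching, adoption of unclaimed on-disk artifacts, and any special handling of all-symbol stems.

```python
import hashlib
import re
import unicodedata
from pathlib import Path


def sanitize_stem(path: str) -> str:
    """NFKC-normalize the filename stem, then replace unsafe runs with '-'."""
    stem = unicodedata.normalize("NFKC", Path(path).stem)
    return re.sub(r"[^\w\-]+", "-", stem)


def resolve_doc_name(path: str, claimed: dict[str, str]) -> str:
    """Return the doc_name for `path`.

    `claimed` is a hypothetical stand-in for the registry: doc_name -> path.
    """
    # Path index first: a path that is already registered keeps its name.
    for name, owner in claimed.items():
        if owner == path:
            return name

    name = sanitize_stem(path)
    if name in claimed:
        # The clean name belongs to a different document: add a
        # deterministic suffix derived from the path.
        digest = hashlib.sha256(path.encode("utf-8")).hexdigest()[:8]
        name = f"{name}-{digest}"

    claimed[name] = path
    return name


claimed: dict[str, str] = {}
first = resolve_doc_name("a/report.md", claimed)   # clean name: "report"
second = resolve_doc_name("b/report.md", claimed)  # "report-" plus 8 hex chars
again = resolve_doc_name("b/report.md", claimed)   # same path maps back to `second`
```

Because the suffix depends only on the path, re-running the resolver for the same file yields the same name, which is what lets an edited file overwrite its own pages in place.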

Hardening found during implementation

  • dedup early-return is read-only (a duplicate copy can no longer backfill its path onto the original entry)
  • the CLI constructs its registry instance after conversion, so converter-side legacy backfills aren't clobbered by a stale in-memory snapshot (regression test reproduces the lost-update)
  • remove's raw-name fallback is restricted to legacy entries without raw_path, eliminating a cross-deletion edge
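
The lost-update fixed by the second item can be reproduced in miniature. The sketch below assumes a registry that loads a whole-file snapshot on construction and writes the whole file on save; the `Registry` class, `convert`, and `run` are hypothetical and only mimic that shape, they are not the project's actual classes.

```python
import json
import tempfile
from pathlib import Path


class Registry:
    """Hypothetical file-backed registry: loads a snapshot, saves it whole."""

    def __init__(self, file: Path) -> None:
        self.file = file
        self.entries = json.loads(file.read_text()) if file.exists() else {}

    def save(self) -> None:
        self.file.write_text(json.dumps(self.entries))


def convert(file: Path) -> None:
    """Stand-in for the converter: backfills a path onto a legacy entry."""
    reg = Registry(file)
    reg.entries["report"]["path"] = "a/report.md"
    reg.save()


def run(construct_before_conversion: bool) -> dict:
    with tempfile.TemporaryDirectory() as tmp:
        file = Path(tmp) / "registry.json"
        # Legacy entry without a path field.
        file.write_text(json.dumps({"report": {"type": "md"}}))
        if construct_before_conversion:
            cli_reg = Registry(file)  # snapshot taken too early
            convert(file)
        else:
            convert(file)
            cli_reg = Registry(file)  # snapshot includes the backfill
        cli_reg.entries["report"]["compiled"] = True
        cli_reg.save()  # whole-file write from the in-memory snapshot
        return json.loads(file.read_text())["report"]


stale = run(construct_before_conversion=True)   # backfilled path is lost
fresh = run(construct_before_conversion=False)  # backfilled path survives
```

Constructing the instance after conversion only narrows the window; it does not replace file locking, which the review notes list as a separate pre-existing gap.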

Relation to #30

Supersedes #30, which pioneered this fix (thank you @saccharin98!) but used an unconditional stem-<hash12> name for every document (uglier names + a mixed-naming migration for existing KBs) and predates the current add_single_file (Literal return codes, long-PDF doc_id persistence). Core ideas are carried over and credited via Co-authored-by.

Tests

New coverage (24 new tests): unique name stays clean; same basename in different dirs; same stem different extension; CJK/full-width punctuation sanitization; all-symbol stems; edited re-ingest keeps identity and replaces the stale hash entry; retry-after-failed-compile keeps its name; duplicate-copy skip doesn't poison paths; legacy entry reuse + NFKC path backfill; oldest-generation ({name, type}-only) entries converge to a single registry entry; registry path index; lint registry resolution; remove with renamed raw copies. Full suite: 716 passed (the 4 test_url_ingest failures are pre-existing — missing optional trafilatura in this env — and reproduce on clean main).

KylinMountain and others added 15 commits June 12, 2026 19:30
Unclaimed on-disk artifacts no longer force a suffix (fixes doc_name
drift on retry after a failed compile), and the dedup early-return now
derives the name from the stored entry without invoking the resolver
(no legacy path backfill from duplicate copies).
The identity model (path-keyed registry metadata, collision-resistant
naming) builds on the approach pioneered in PR #30.

Co-authored-by: Xinyan Zhou <xinyanzhou938@gmail.com>
@KylinMountain

Collaborator Author

Final end-to-end review notes (full-lifecycle walkthroughs executed with real code, LLM compile mocked) — all green, plus known boundaries worth recording:

  1. Long-PDF re-add leaks the old PageIndex doc (minor, follow-up): when an edited long PDF is re-added, remove_by_doc_name drops the stale entry including its old doc_id without calling the PageIndex cleanup, so the managed PDF copy + SQLite row become unreachable. Same class of residue existed on main; a follow-up could clean the old doc_id before replacing the entry.
  2. Identity is path-anchored by design: moving a file AND editing it before re-adding mints a new (suffixed) document; the old entry remains under the dead path. Known Scheme A boundary.
  3. Suffixed candidates aren't re-checked for collisions ({stem}-{8hex} clash probability ≈ 2⁻³², deterministic behavior if it ever happens).
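
For reference, the 2⁻³² figure in item 3 follows from the suffix length: 8 hex characters carry 32 bits. Assuming the truncated SHA-256 digest behaves as uniformly random, the chance that two specific distinct paths with the same sanitized stem get the same suffix is:

```latex
P(\text{clash}) = 16^{-8} = 2^{-32} \approx 2.3 \times 10^{-10}
```

This is the per-pair probability; a knowledge base with many documents sharing one stem would see a correspondingly higher, though still very small, overall chance.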

Residual pre-existing items (not introduced here): registry has no file locking (remove snapshots at prompt time); openkb list doesn't show the doc_name slug, so same-named docs render identically — surfacing the slug would pair well with this fix.
