#293: per-pid tree bitmask filter - eliminate the multi-tree WASM data-scale wall #299
Merged
Broad multi-tree facet selections were filtered via a pid-subquery over the 39M-row
sample_facet_membership (GROUP BY pid HAVING COUNT(DISTINCT facet_type)=N), which
stalls DuckDB-WASM (~45s at global view). Replace it with a columnar bitwise
predicate over a precomputed per-pid mask:
pid IN (SELECT pid FROM sample_facet_masks
WHERE (material_mask & <sel>)<>0 AND (context_mask & <sel>)<>0 ...)
Set-identical to the membership form (membership already encodes the ancestor
closure → a parent node's bit covers its whole subtree), but one 6M-row scan with
no GROUP BY. Measured on deployed 202608 data: 146x faster filter, 25x faster
filtered h3 aggregation; set difference 0 across all dims.
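The set-identity claim above can be checked on toy data. The following is a minimal sketch, not the project's validator: it uses Python's stdlib sqlite3 as a stand-in for DuckDB, invents a tiny membership table and a hypothetical bit assignment, and assumes each dimension gets its own selection mask (the `<sel>` placeholder above is not spelled out per dimension).

```python
import sqlite3

con = sqlite3.connect(":memory:")  # stand-in for DuckDB, for illustration only
con.executescript("""
CREATE TABLE sample_facet_membership (pid TEXT, facet_type TEXT, concept_uri TEXT);
CREATE TABLE sample_facet_masks (pid TEXT, material_mask INTEGER, context_mask INTEGER);
""")

# Toy membership with the ancestor closure already applied: p1 is basalt,
# so it also carries a row for the parent node 'mat/rock'.
membership = [
    ("p1", "material", "mat/rock"), ("p1", "material", "mat/rock/basalt"),
    ("p1", "context", "ctx/marine"),
    ("p2", "material", "mat/soil"), ("p2", "context", "ctx/marine"),
    ("p3", "material", "mat/rock"), ("p3", "context", "ctx/land"),
]
con.executemany("INSERT INTO sample_facet_membership VALUES (?,?,?)", membership)

# Hypothetical concept -> bit map (the role facet_node_bits plays).
bits = {
    "material": {"mat/rock": 0, "mat/rock/basalt": 1, "mat/soil": 2},
    "context": {"ctx/land": 0, "ctx/marine": 1},
}

# Derive one mask per pid and dimension by OR-ing the bits of its member nodes.
masks = {}
for pid, facet_type, uri in membership:
    per_pid = masks.setdefault(pid, {"material": 0, "context": 0})
    per_pid[facet_type] |= 1 << bits[facet_type][uri]
con.executemany(
    "INSERT INTO sample_facet_masks VALUES (?,?,?)",
    [(pid, m["material"], m["context"]) for pid, m in masks.items()],
)

# Selection: material = rock (a parent node) AND context = marine.
membership_pids = {row[0] for row in con.execute("""
    SELECT pid FROM sample_facet_membership
    WHERE (facet_type = 'material' AND concept_uri = 'mat/rock')
       OR (facet_type = 'context' AND concept_uri = 'ctx/marine')
    GROUP BY pid HAVING COUNT(DISTINCT facet_type) = 2
""")}

material_sel = 1 << bits["material"]["mat/rock"]
context_sel = 1 << bits["context"]["ctx/marine"]
mask_pids = {row[0] for row in con.execute(
    "SELECT pid FROM sample_facet_masks "
    "WHERE (material_mask & ?) <> 0 AND (context_mask & ?) <> 0",
    (material_sel, context_sel),
)}

set_difference = membership_pids ^ mask_pids  # empty when the two forms agree
```

On this fixture both forms select only p1: its rock bit is set because membership already contains the ancestor row, so selecting the parent covers the basalt sample without any GROUP BY.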
Artifacts (both derived from membership, additive — no change to existing files):
- facet_node_bits (facet_type, concept_uri, bit_index) ~56 rows — authoritative
concept→bit map the explorer loads to turn a node selection into a mask.
- sample_facet_masks (pid, material_mask, context_mask, object_type_mask) ~6M
rows / ~10 MB. Hard-fails if a dim exceeds 63 nodes (BIGINT mask overflow).
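How the two artifacts could be derived from membership is sketched below. This is an illustration under assumptions, not the build script: bit indexes are taken as 0-based and dense within each facet_type, assigned in sorted concept_uri order, and the function names `assign_bits` and `build_masks` are hypothetical.

```python
from collections import defaultdict

MAX_NODES_PER_DIM = 63  # a signed 64-bit BIGINT leaves 63 usable mask bits


def assign_bits(membership):
    """Return {facet_type: {concept_uri: bit_index}} from (pid, facet_type, uri) rows."""
    nodes = defaultdict(set)
    for _pid, facet_type, uri in membership:
        nodes[facet_type].add(uri)
    node_bits = {}
    for facet_type, uris in nodes.items():
        if len(uris) > MAX_NODES_PER_DIM:
            # Hard-fail rather than silently overflow the mask column.
            raise ValueError(f"{facet_type}: {len(uris)} nodes exceed {MAX_NODES_PER_DIM}")
        node_bits[facet_type] = {uri: i for i, uri in enumerate(sorted(uris))}
    return node_bits


def build_masks(membership, node_bits):
    """Return {pid: {facet_type: mask}} by OR-ing the bit of every member node."""
    masks = defaultdict(lambda: defaultdict(int))
    for pid, facet_type, uri in membership:
        masks[pid][facet_type] |= 1 << node_bits[facet_type][uri]
    return {pid: dict(per_dim) for pid, per_dim in masks.items()}


rows = [
    ("p1", "material", "mat/rock"),
    ("p1", "material", "mat/rock/basalt"),
    ("p2", "material", "mat/soil"),
]
node_bits = assign_bits(rows)
masks = build_masks(rows, node_bits)
```

Because the masks are a pure function of membership plus the bit map, a validator can rebuild them and compare against the published file, which is the re-derivation check described next.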
Validator: node_bits covers exactly membership nodes + dense-unique bit range;
masks re-derived from membership+node_bits (symmetric diff); and a real-node
cross-check that the bitwise filter == the membership pid set per dim.
Tests: bitmask==membership for every fixture node, gate-bites-on-corruption,
--only orchestration. 25 pass.
Explorer: facetFilterSQL prefers the mask predicate; falls back to the membership
collapse when node_bits/masks aren't loaded or a selected node has no bit — so
it's safe to ship before the artifacts are published (no regression).
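The prefer-mask, fall-back-to-membership decision might look like the sketch below. It is written in Python for illustration only (the explorer's facetFilterSQL is not shown in this PR text); the function name `facet_filter_sql`, the `masks_ready` flag, and the `<dim>_mask` column naming derived from the dimension key are assumptions. Returning `None` stands for "use the membership collapse".

```python
def facet_filter_sql(selection, node_bits, masks_ready):
    """Build the mask predicate, or return None to signal the membership fallback.

    selection: {dim: [concept_uri, ...]}
    node_bits: {dim: {concept_uri: bit_index}}
    """
    if not masks_ready or not node_bits:
        return None  # artifacts not loaded: fall back
    clauses = []
    for dim, uris in sorted(selection.items()):
        if not uris:
            continue  # nothing selected in this tree
        sel = 0
        for uri in uris:
            bit = node_bits.get(dim, {}).get(uri)
            if bit is None:
                return None  # a selected node has no bit: fall back
            sel |= 1 << bit
        clauses.append(f"({dim}_mask & {sel}) <> 0")
    if not clauses:
        return None
    return (
        "pid IN (SELECT pid FROM sample_facet_masks WHERE "
        + " AND ".join(clauses)
        + ")"
    )


node_bits = {
    "material": {"mat/rock": 0, "mat/soil": 1},
    "context": {"ctx/land": 0, "ctx/marine": 1},
}
sql = facet_filter_sql(
    {"material": ["mat/rock"], "context": ["ctx/marine"]}, node_bits, masks_ready=True
)
```

Within one tree the selected nodes are OR-ed into a single mask; across trees the per-dimension tests are AND-ed, which mirrors the COUNT(DISTINCT facet_type)=N condition of the membership form.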
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…generation masks
P1.1 (mixed generations): bit assignment is positional (ROW_NUMBER over sorted
concept_uri), so it shifts if the node set changes. A browser pairing fresh
node_bits with a stale-cached masks file would map bits wrong → wrong pids.
Fix: embed a shared build_id (md5 of the node set) in BOTH node_bits and masks.
The explorer enables the mask path ONLY when it can read masks AND the two
build_ids match; otherwise it falls back to the membership scan.
P1.2 (masks not preflighted): the loader now probes masks (the same build_id
query) before advertising readiness, so a missing/broken/unpublished masks file
also falls back to membership instead of failing the query path.
Validator: single-build_id per artifact + node_bits/masks build_id match.
Tests: build_id-mismatch gate bites (stale-generation masks). 23 pass.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
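The mixed-generation hazard is easy to reproduce on paper. The sketch below is illustrative only: it assumes 0-based positional bits over sorted URIs, and the exact serialization hashed into the md5 build_id (sorted URIs joined by newlines) is a guess, as are the helper names.

```python
import hashlib


def assign_bits(uris):
    # Positional assignment: the bit index is the rank in sorted order.
    return {uri: i for i, uri in enumerate(sorted(uris))}


def node_set_build_id(uris):
    # Hypothetical serialization; the source only says "md5 of the node set".
    return hashlib.md5("\n".join(sorted(uris)).encode("utf-8")).hexdigest()


def mask_path_enabled(node_bits_build_id, masks_build_id):
    # Masks must be readable (a build_id was obtained) and of the same generation.
    return masks_build_id is not None and masks_build_id == node_bits_build_id


old_nodes = ["mat/rock", "mat/soil"]                   # rock=0, soil=1
new_nodes = ["mat/rock", "mat/sediment", "mat/soil"]   # rock=0, sediment=1, soil=2

old_bits = assign_bits(old_nodes)
new_bits = assign_bits(new_nodes)

# p1 is a soil sample; its mask was built with the OLD bit positions.
stale_mask_p1 = 1 << old_bits["mat/soil"]

# The user selects sediment, translated with the NEW node_bits.
selection = 1 << new_bits["mat/sediment"]

# Without a gate, the soil sample is wrongly returned for a sediment query.
false_match = (stale_mask_p1 & selection) != 0

# With the gate, the differing build_ids keep the mask path off.
gate_open = mask_path_enabled(node_set_build_id(new_nodes), node_set_build_id(old_nodes))
```

Inserting one node in the middle of the sorted order shifts every later bit, so the stale soil bit lands exactly where the fresh sediment bit now sits; the build_id comparison turns that silent mismatch into a fallback.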
- build_id now fingerprints the FULL membership generation (order-independent
  bit_xor of per-row hashes), not just the node set, so stale masks with an
  unchanged node set but changed pid memberships no longer match and activate.
- Explorer: require node_bits to carry exactly ONE build_id (Set, not
  last-row-wins) before enabling the mask path.
- Validator: compare build_ids only after both single-build_id checks pass
  (avoids a multi-row scalar-subquery throw); multi-valued artifact → clean FAIL.
23 tests pass; validator green on deployed 202608 data.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
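An order-independent fingerprint of this kind can be sketched as an XOR fold over per-row hashes. The details here are assumptions rather than the shipped code: the per-row hash (first 8 bytes of an md5 over the joined fields), the hex formatting, and the helper names are all illustrative.

```python
import hashlib


def row_hash(row):
    # Hash one (pid, facet_type, concept_uri) row to a 64-bit integer.
    data = "\x1f".join(row).encode("utf-8")
    return int.from_bytes(hashlib.md5(data).digest()[:8], "big")


def membership_fingerprint(rows):
    # XOR is commutative and associative, so row order does not matter.
    # Caveat: identical rows cancel in pairs, so this assumes distinct rows.
    acc = 0
    for row in rows:
        acc ^= row_hash(row)
    return f"{acc:016x}"


def single_build_id(build_ids):
    # Accept an artifact only if every row carries the same build_id.
    distinct = set(build_ids)
    return next(iter(distinct)) if len(distinct) == 1 else None


rows = [
    ("p1", "material", "mat/rock"),
    ("p2", "material", "mat/soil"),
    ("p3", "material", "mat/soil"),
]
# Same node set {rock, soil}, but p2 changed membership.
changed = [
    ("p1", "material", "mat/rock"),
    ("p2", "material", "mat/rock"),
    ("p3", "material", "mat/soil"),
]

fp = membership_fingerprint(rows)
fp_reordered = membership_fingerprint(list(reversed(rows)))
fp_changed = membership_fingerprint(changed)
```

A node-set hash would treat `rows` and `changed` as the same generation; fingerprinting every membership row distinguishes them, and collecting build_ids into a set makes a mixed-generation artifact fail the gate instead of depending on which row was read last.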
Resolves the multi-tree FILTER half of #293. (The counts half, #290, shipped in #298.)
What
Broad multi-tree facet selections were filtered via a pid-subquery over the 39M-row
sample_facet_membership (GROUP BY pid HAVING COUNT(DISTINCT facet_type)=N), which
stalls DuckDB-WASM (~45s at global view). Replace it with a columnar bitwise
predicate over a precomputed per-pid mask:
    pid IN (SELECT pid FROM sample_facet_masks
    WHERE (material_mask & <sel>)<>0 AND (context_mask & <sel>)<>0 ...)
Set-identical to the membership form (membership already encodes the ancestor closure,
so a parent node's bit covers its whole subtree), but one ~6M-row scan with no GROUP BY.
Measured on deployed 202608 data: 146× faster filter, 25× faster filtered h3 agg,
set difference 0 across all dims.
Artifacts (additive — already published to R2; no change to existing files)
- facet_node_bits (concept_uri→bit_index + build_id) ~2 KB
- sample_facet_masks (pid, 3 BIGINT masks + build_id) ~10 MB
Both derived from membership; share a build_id (membership fingerprint).
Safety
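- The explorer enables the mask path only when it can read masks and their build_id
matches node_bits; otherwise it falls back to the membership scan (set-identical,
just slower). So no regression even if the artifacts are missing/stale.
- Validator: node_bits coverage, masks re-derivation from membership, real-node
cross-check (bitwise == membership pid set), build_id consistency.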
Verified on rdhyee staging
Warm multi-tree filter renders correctly (3-tree AND), no membership-fallback warning
(mask path engaged). Note: cold first-load latency is a separate, pre-existing issue
(Cloudflare cold-cache range behavior on data.isamples.org), not introduced here.
🤖 Generated with Claude Code