[#70] Fix analyze CRC for cah:/ ContentDirectory resources; add --skip-crc#73
Conversation
Recognize content-addressed stream paths (cah:/<hash>) produced by Unity 6.6 ContentDirectory builds: fold the path (which contains the content hash) into the CRC instead of opening the differently named resource file, which was failing with "Error opening resource file". Legacy CAB-*.resS/.resource streams still read their bytes for the CRC. Also addresses the performance issues from PR 66 without losing CRC coverage of external streams: - Fix UnityFileReader.ComputeCRC chunking (advance the file offset and handle the partial final chunk) so ranges larger than the buffer no longer produce a wrong CRC or over-read. - Fix the ProcessManagedReferenceData CRC size argument (stringSize + 4, not m_Offset + stringSize + 4). - Keep journal_mode = MEMORY (drop PR 66's ineffective WAL change). Add a --skip-crc option, fully independent of --skip-references: --skip-references now only skips reference extraction and no longer skips the CRC. The reference walk still resolves referenced object ids (so the CRC stays stable) but only inserts refs rows when extracting. Add a ComputeCRC unit test for buffer-boundary ranges and update the analyze documentation for the new flag semantics. Fixes #70.
…mption The content-addressed scheme casing is not guaranteed, so compare the prefix with OrdinalIgnoreCase. Also document why path-only CRC is correct: a cah:/ stream always references the entire resource file, and the offset/size fields only remain for backward compatibility with the older format that packed multiple resources into a single file.
Rewrite the class comment to cover both responsibilities (PPtr extraction and CRC fingerprinting, including external stream content), document the constructor arguments and Process(), comment each member variable, and group the fields by lifetime (constructor config, reused caches/scratch, and per-object processing state).
Reorder methods to a more logical ordering.
Refine the CallbackDelegate comment to clarify the id spaces (objectId and the return value are analyzer/database ids; fileId/pathId are raw PPtr fields) and document the return value. Add a block comment in front of ProcessManagedReferenceRegistry with C# and YAML examples showing the [SerializeReference] "references:" layout, and explain throughout that each entry's data follows the referenced type's own TypeTree (obtained via GetRefTypeTypeTreeRoot) - the reason walking the registry jumps between type trees and is more involved than the rest of the object. Also document the version 1/2 layouts, the terminating sentinel, the FQN string reads, and the registry re-entry guard.
…comments Drop the unused referencedTypeDataNode parameter (the referenced object's layout comes from its own TypeTree via GetRefTypeTypeTreeRoot, not from this node) and update both call sites. Add comments for the trickier mechanics: 4-byte alignment, the vector/map/staticvector array wrapper, Array node child layout, the StreamingInfo 32/64-bit offset field and its field-order difference from StreamedResource, PPtr<T> type-name parsing, and the >2GB size truncation/overflow assumptions in the stream and array handling.
|
Note: This PR includes an important performance fix, which was extracted from #66 (@UnityZappy) I got Claude to describe why the CRC bugs were the cause of really bad performance that @UnityZappy hit in a customer project: The two bugs Bug A — wrong size for SerializeReference type-name strings (ProcessManagedReferenceData). The old code: The size argument was m_Offset + stringSize + 4. m_Offset is the current read position — potentially hundreds of MB or gigabytes into the file. So a CRC that should cover ~20 bytes was asked to cover ~ Bug B — ComputeCRC didn't advance through the file. The old loop: For any size larger than the 4 MB buffer, it loaded the first 4 MB once and then re-hashed that same 4 MB over and over — ceil(size / 4MB) times — never reading the rest of the range. How they combined into the hang Bug B is the amplifier for Bug A. Feed the bogus size from A (say m_Offset ≈ 1.5 GB) into the loop from B:
So the performance recovery comes from fixing Bug A (the size). Bug B's fix is primarily correctness (large legitimate ranges now produce the right CRC), but the buggy loop in B is what turned A's bad number into a runaway. The subtlety I glossed over Fixing Bug B is not a free speedup for large reads. For a genuinely large range — like the 4 GB .resS in your log — the old loop read only the first 4 MB (then re-hashed it, wrongly); the fixed loop reads the entire 4 GB from the archive (correctly). So for legitimately large streams the fix does more I/O, not less. The "we made it faster" story applies only to the bogus-size case (Bug A), not to large legitimate CRCs. |
Fixes #70. Also incorporates CRC fixes from #66 and improved the comments and readability of the PPTR / CRC traversal code.
Issue 70 —
cah:/content-addressed resourcesUnity 6.6 ContentDirectory builds write resource streams whose
StreamingInfo/StreamedResourcepath iscah:/<hash>while the file on disk is<hash>.resS/<hash>.resource. The old resource lookup stripped thecah:/prefix and triedarchive:/<hash>and<folder>/<hash>(no extension), both of which fail — producing manyError opening resource filemessages and omitting those bytes from the CRC.The hash in a
cah:/path is the content hash of the whole resource file, so we fold the path string into the CRC instead of opening the file. The prefix is matched case-insensitively. LegacyCAB-*.resS/.resourcestreams still read their bytes as before.A
cah:/stream always references the entire resource file (offset 0, full size); theoffset/sizefields only remain for backward compatibility with the older format that packed multiple resources into one file, which ContentDirectory builds do not do. This is documented in code so the path-only CRC choice is clear.Performance rework (from #66)
Kept the CRC fixes
UnityFileReader.ComputeCRC: fixed the chunking bug — the file offset never advanced and the final partial chunk was mishandled, so any range larger than the buffer produced a wrong CRC and over-read. Added a unit test covering buffer-boundary and partial-final-chunk ranges.ProcessManagedReferenceData: fixed the CRC size argument (stringSize + 4, wasm_Offset + stringSize + 4).New
--skip-crcflag, independent of--skip-references--skip-referencesnow only skips reference extraction; CRC is computed unless--skip-crcis also given. (Previously-sdid both.) The reference walk still resolves referenced object ids so the CRC stays stable, but only insertsrefsrows when extracting references. All four flag combinations were verified on a real ContentDirectory build.Comments and refactoring
The PPtrAndCrcProcessor code was very difficult to understand and actually covers some very fundamental aspects of how Unity serialization works (including SerializeReference support). So this PR adds comments, reorders the code for readability and removes some dead code.
The CRCs calculation fixes means that the database produced by an older build will show spurious differences in a comparing-builds workflow:
Validation
Verified on a content directory build : no resource errors, deterministic CRCs across runs, streamed Texture2D/AudioClip get non-zero CRCs with sizes including the stream bytes, and all
--skip-crc/--skip-referencespermutations behave correctly.Existing test suites pass (
UnityFileSystem.Tests,Analyzer.Tests,UnityDataTool.Tests).Follow-up
find-refsfails to open analyze databases (pre-existing legacy connection string); related to Reduce size of refs table #44.