[#70] Fix analyze CRC for cah:/ ContentDirectory resources; add --skip-crc by SkowronskiAndrew · Pull Request #73 · Unity-Technologies/UnityDataTools

SkowronskiAndrew · 2026-06-11T20:36:00Z

Fixes #70. Also incorporates CRC fixes from #66 and improved the comments and readability of the PPTR / CRC traversal code.

Issue 70 — `cah:/` content-addressed resources

Unity 6.6 ContentDirectory builds write resource streams whose StreamingInfo/StreamedResource path is cah:/<hash> while the file on disk is <hash>.resS / <hash>.resource. The old resource lookup stripped the cah:/ prefix and tried archive:/<hash> and <folder>/<hash> (no extension), both of which fail — producing many Error opening resource file messages and omitting those bytes from the CRC.

The hash in a cah:/ path is the content hash of the whole resource file, so we fold the path string into the CRC instead of opening the file. The prefix is matched case-insensitively. Legacy CAB-*.resS/.resource streams still read their bytes as before.

A cah:/ stream always references the entire resource file (offset 0, full size); the offset/size fields only remain for backward compatibility with the older format that packed multiple resources into one file, which ContentDirectory builds do not do. This is documented in code so the path-only CRC choice is clear.

Performance rework (from #66)

Kept the CRC fixes

UnityFileReader.ComputeCRC: fixed the chunking bug — the file offset never advanced and the final partial chunk was mishandled, so any range larger than the buffer produced a wrong CRC and over-read. Added a unit test covering buffer-boundary and partial-final-chunk ranges.
ProcessManagedReferenceData: fixed the CRC size argument (stringSize + 4, was m_Offset + stringSize + 4).

New `--skip-crc` flag, independent of `--skip-references`

--skip-references now only skips reference extraction; CRC is computed unless --skip-crc is also given. (Previously -s did both.) The reference walk still resolves referenced object ids so the CRC stays stable, but only inserts refs rows when extracting references. All four flag combinations were verified on a real ContentDirectory build.

Comments and refactoring

The PPtrAndCrcProcessor code was very difficult to understand and actually covers some very fundamental aspects of how Unity serialization works (including SerializeReference support). So this PR adds comments, reorders the code for readability and removes some dead code.

⚠️ CRC values are not comparable across this change

The CRCs calculation fixes means that the database produced by an older build will show spurious differences in a comparing-builds workflow:

Validation

Verified on a content directory build : no resource errors, deterministic CRCs across runs, streamed Texture2D/AudioClip get non-zero CRCs with sizes including the stream bytes, and all --skip-crc / --skip-references permutations behave correctly.

Existing test suites pass (UnityFileSystem.Tests, Analyzer.Tests, UnityDataTool.Tests).

Follow-up

find-refs fails to open database: "Connection string keyword 'version' is not supported" #72 — find-refs fails to open analyze databases (pre-existing legacy connection string); related to Reduce size of refs table #44.

Recognize content-addressed stream paths (cah:/<hash>) produced by Unity 6.6 ContentDirectory builds: fold the path (which contains the content hash) into the CRC instead of opening the differently named resource file, which was failing with "Error opening resource file". Legacy CAB-*.resS/.resource streams still read their bytes for the CRC. Also addresses the performance issues from PR 66 without losing CRC coverage of external streams: - Fix UnityFileReader.ComputeCRC chunking (advance the file offset and handle the partial final chunk) so ranges larger than the buffer no longer produce a wrong CRC or over-read. - Fix the ProcessManagedReferenceData CRC size argument (stringSize + 4, not m_Offset + stringSize + 4). - Keep journal_mode = MEMORY (drop PR 66's ineffective WAL change). Add a --skip-crc option, fully independent of --skip-references: --skip-references now only skips reference extraction and no longer skips the CRC. The reference walk still resolves referenced object ids (so the CRC stays stable) but only inserts refs rows when extracting. Add a ComputeCRC unit test for buffer-boundary ranges and update the analyze documentation for the new flag semantics. Fixes #70.

…mption The content-addressed scheme casing is not guaranteed, so compare the prefix with OrdinalIgnoreCase. Also document why path-only CRC is correct: a cah:/ stream always references the entire resource file, and the offset/size fields only remain for backward compatibility with the older format that packed multiple resources into a single file.

Rewrite the class comment to cover both responsibilities (PPtr extraction and CRC fingerprinting, including external stream content), document the constructor arguments and Process(), comment each member variable, and group the fields by lifetime (constructor config, reused caches/scratch, and per-object processing state).

Reorder methods to a more logical ordering.

Refine the CallbackDelegate comment to clarify the id spaces (objectId and the return value are analyzer/database ids; fileId/pathId are raw PPtr fields) and document the return value. Add a block comment in front of ProcessManagedReferenceRegistry with C# and YAML examples showing the [SerializeReference] "references:" layout, and explain throughout that each entry's data follows the referenced type's own TypeTree (obtained via GetRefTypeTypeTreeRoot) - the reason walking the registry jumps between type trees and is more involved than the rest of the object. Also document the version 1/2 layouts, the terminating sentinel, the FQN string reads, and the registry re-entry guard.

…comments Drop the unused referencedTypeDataNode parameter (the referenced object's layout comes from its own TypeTree via GetRefTypeTypeTreeRoot, not from this node) and update both call sites. Add comments for the trickier mechanics: 4-byte alignment, the vector/map/staticvector array wrapper, Array node child layout, the StreamingInfo 32/64-bit offset field and its field-order difference from StreamedResource, PPtr<T> type-name parsing, and the >2GB size truncation/overflow assumptions in the stream and array handling.

SkowronskiAndrew · 2026-06-12T12:20:49Z

Note: This PR includes an important performance fix, which was extracted from #66 (@UnityZappy)
That PR was trying to address performance in multiple ways, but it seems 2 key CRC fixes actually address the main problem and the other changes were not necessary. To confirm I tested with a large customer AssetBundle build and used extra logging to confirm that all the CRC calculations are now optimal (but can take a long real time if there are many GB of resources in a Player or AssetBundle build)

I got Claude to describe why the CRC bugs were the cause of really bad performance that @UnityZappy hit in a customer project:

The two bugs

Bug A — wrong size for SerializeReference type-name strings (ProcessManagedReferenceData). The old code:

var stringSize = m_Reader.ReadInt32(m_Offset);
m_Crc32 = m_Reader.ComputeCRC(m_Offset, (int)(m_Offset + stringSize + 4), m_Crc32);
//                                        ^^^^^^^^ should just be stringSize + 4

The size argument was m_Offset + stringSize + 4. m_Offset is the current read position — potentially hundreds of MB or gigabytes into the file. So a CRC that should cover ~20 bytes was asked to cover ~m_Offset bytes.

Bug B — ComputeCRC didn't advance through the file. The old loop:

var readSize = size > m_Buffer.Length ? m_Buffer.Length : size;
var readBytes = 0;
while (readBytes < size)
{
    var offset = GetBufferOffset(fileOffset, readSize); // fileOffset never changes
    crc32 = Crc32Algorithm.Append(crc32, m_Buffer, offset, readSize);
    readBytes += readSize; // advances the counter, but not the file position
}

For any size larger than the 4 MB buffer, it loaded the first 4 MB once and then re-hashed that same 4 MB over and over — ceil(size / 4MB) times — never reading the rest of the range.

How they combined into the hang

Bug B is the amplifier for Bug A. Feed the bogus size from A (say m_Offset ≈ 1.5 GB) into the loop from B:

1.5 GB / 4 MB ≈ 384 iterations, each doing a CRC32 over 4 MB → ~1.5 GB of CRC compute…
…for a value that should have been a ~20-byte CRC.
There are three such calls per ReferencedManagedType (class, namespace, assembly), and the size scales with m_Offset, so entries deeper in the file cost more. Across many [SerializeReference] entries in a large file you get tens-to-hundreds of GB of pointless CRC work → the >24 h "hang."

So the performance recovery comes from fixing Bug A (the size). Bug B's fix is primarily correctness (large legitimate ranges now produce the right CRC), but the buggy loop in B is what turned A's bad number into a runaway.

The subtlety I glossed over

Fixing Bug B is not a free speedup for large reads. For a genuinely large range — like the 4 GB .resS in your log — the old loop read only the first 4 MB (then re-hashed it, wrongly); the fixed loop reads the entire 4 GB from the archive (correctly). So for legitimately large streams the fix does more I/O, not less. The "we made it faster" story applies only to the bogus-size case (Bug A), not to large legitimate CRCs.

SkowronskiAndrew added 8 commits June 4, 2026 12:32

Add reference files for Unity 6.6 output

cb6cb75

PPtrAndCrcProcessor.cs

541a6e9

Reorder methods to a more logical ordering.

Remove accidentally committed bash.exe.stackdump; ignore *.stackdump

505eb48

SkowronskiAndrew mentioned this pull request Jun 12, 2026

CRC of objects with references is not comparable across separate databases #74

Open

SkowronskiAndrew added 2 commits June 11, 2026 21:00

[#70] Note CRC is a within-database fingerprint (ref #74)

d71c011

Remove contentdirectory-zstd test fixtures (belong on another branch)

4f64dce

SkowronskiAndrew merged commit 99ce157 into main Jun 12, 2026
5 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[#70] Fix analyze CRC for cah:/ ContentDirectory resources; add --skip-crc#73

[#70] Fix analyze CRC for cah:/ ContentDirectory resources; add --skip-crc#73
SkowronskiAndrew merged 10 commits into
mainfrom
analyze-fix-resource-crc

SkowronskiAndrew commented Jun 11, 2026 •

edited

Loading

Uh oh!

Uh oh!

SkowronskiAndrew commented Jun 12, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

SkowronskiAndrew commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Issue 70 — cah:/ content-addressed resources

Performance rework (from #66)

New --skip-crc flag, independent of --skip-references

Comments and refactoring

⚠️ CRC values are not comparable across this change

Validation

Follow-up

Uh oh!

Uh oh!

SkowronskiAndrew commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

SkowronskiAndrew commented Jun 11, 2026 •

edited

Loading

Issue 70 — `cah:/` content-addressed resources

New `--skip-crc` flag, independent of `--skip-references`

SkowronskiAndrew commented Jun 12, 2026 •

edited

Loading