Fix `Ideogram4MRoPE` collapsing under `torch.autocast` (compute rotary in float32) by HaozheZhang6 · Pull Request #13922 · huggingface/diffusers

HaozheZhang6 · 2026-06-11T17:27:07Z

What does this PR do?

Ideogram4MRoPE produces collapsed rotary embeddings under torch.autocast, so denoising inside an autocast context (common in training, and when users wrap pipeline calls) renders a flat single-color image.

Root cause

Image-token positions are IMAGE_POSITION_OFFSET (65536) + (t, h, w). Ideogram4MRoPE.forward casts its operands to float32, but the frequency matmul is on autocast's downcast list, so under torch.autocast("cuda", torch.bfloat16) it executes in bfloat16 anyway. bfloat16's representable step at 65536 is 512, so with round-to-nearest every offset from 0 to 256 rounds to 65536 and every offset from 257 to 511 rounds to 66048. A ≤512-wide grid therefore keeps at most two distinct position values: image tokens get identical (or nearly identical) rotary embeddings, spatial information is lost, and sampling degenerates to a flat field.

Reproduced with the weight-free snippet from the issue (max |cos_autocast − cos_fp32| ≈ 1.93, distinct positions become equal).
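The rounding described under "Root cause" can be checked without PyTorch. The snippet below is a minimal sketch, not the reproduction from the issue: `round_to_bfloat16` and `distinct_positions` are hypothetical helpers that emulate the float32 to bfloat16 conversion (round to nearest, ties to even) on the raw bit pattern, and only the 65536 offset is taken from the PR description.

```python
import struct

IMAGE_POSITION_OFFSET = 65536  # value quoted in the PR description


def round_to_bfloat16(x: float) -> float:
    """Emulate float32 -> bfloat16 rounding (nearest, ties to even) for finite values."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # bfloat16 keeps the top 16 bits of the float32 pattern. Adding this bias
    # before masking turns plain truncation into round-to-nearest-even.
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def distinct_positions(grid_width: int) -> list:
    """Distinct bfloat16 values taken by OFFSET + 0 .. OFFSET + grid_width - 1."""
    values = {round_to_bfloat16(float(IMAGE_POSITION_OFFSET + k)) for k in range(grid_width)}
    return sorted(values)


print(distinct_positions(64))   # [65536.0]
print(distinct_positions(512))  # [65536.0, 66048.0]
print(round_to_bfloat16(257.0), round_to_bfloat16(259.0))  # 256.0 260.0
```

A 64-wide grid collapses to one position value and a 512-wide grid to two, which is why every image token ends up with essentially the same rotary embedding.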

Fix

Wrap the frequency computation in torch.autocast(device_type=..., enabled=False) so the rotary embeddings are always computed in float32 regardless of an ambient autocast — the same guard transformers applies to its RoPE modules. After the fix the autocast and float32 paths are bit-identical (max |Δ| = 0.0).
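As a rough illustration of the guard described above, the sketch below compares a guarded and an unguarded frequency matmul under CPU autocast. It is not the diffusers implementation: `rotary_freqs`, the 64-position grid, and the 8 inverse frequencies are hypothetical stand-ins, and CPU autocast is assumed so the example runs without a GPU.

```python
import torch

IMAGE_POSITION_OFFSET = 65536  # value quoted in the PR description


def rotary_freqs(position_ids: torch.Tensor, inv_freq: torch.Tensor, guard: bool) -> torch.Tensor:
    """Hypothetical stand-in: positions times inverse frequencies via matmul."""
    pos = position_ids.to(torch.float32)[:, None]  # (L, 1)
    inv = inv_freq.to(torch.float32)[None, :]      # (1, D)
    if guard:
        # Disable any ambient autocast so the matmul stays in float32.
        with torch.autocast(device_type=position_ids.device.type, enabled=False):
            return pos @ inv
    return pos @ inv


positions = IMAGE_POSITION_OFFSET + torch.arange(64)
inv_freq = 1.0 / (10000 ** (torch.arange(0, 8, dtype=torch.float32) / 8))

reference = rotary_freqs(positions, inv_freq, guard=True)  # no ambient autocast

with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    unguarded = rotary_freqs(positions, inv_freq, guard=False)
    guarded = rotary_freqs(positions, inv_freq, guard=True)

print("unguarded dtype:", unguarded.dtype)
print("guarded dtype:  ", guarded.dtype)
print("distinct rows unguarded:", torch.unique(unguarded.float(), dim=0).shape[0])
print("distinct rows guarded:  ", torch.unique(guarded, dim=0).shape[0])
print("guarded matches float32 reference:", torch.equal(guarded, reference))
```

The explicit `.to(torch.float32)` casts do not help in the unguarded path, because autocast re-casts the matmul inputs at dispatch time. Only the `enabled=False` context keeps the operation in float32.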

Scope is Ideogram4MRoPE, the catastrophic case (others noted in the issue are far milder without the 65536 offset). Happy to extend the same guard to the sibling RoPE modules in a follow-up if you'd like.

Tests

Added test_ideogram4_mrope_is_autocast_invariant — it fails on main (collapsed positions) and passes with the fix. Full file green:

pytest tests/models/transformers/test_models_transformer_ideogram4.py
38 passed, 4 skipped

Before submitting

Did you read the contributor guideline?
Did you read our philosophy doc?
Was this discussed/approved via a GitHub issue? Ideogram4: Ideogram4MRoPE breaks under torch.autocast: all image positions collapse, producing flat single-color images #13920
Did you make sure to update the documentation with your changes? (n/a — bug fix)
Did you write any new necessary tests?

Who can review?

@DN6 @sayakpaul

…y in float32) Ideogram4 builds image-token positions as IMAGE_POSITION_OFFSET (65536) + (t, h, w). `Ideogram4MRoPE.forward` casts its operands to float32, but the rotary matmul (and cos/sin) is on autocast's downcast list, so under torch.autocast("cuda", bfloat16) — common in training and pipeline code — it runs in bfloat16 anyway. bfloat16's step at 65536 is 512, so every image position in a <=512 grid rounds to the same value: all image tokens get identical rotary embeddings, spatial information is lost, and the decoded image degenerates to a flat color. Wrap the frequency computation in torch.autocast(enabled=False) so the rotary embeddings are always computed in float32, matching how transformers guards its RoPE modules. Added a regression test that fails on main and passes with the fix. Fixes huggingface#13920

dxqb · 2026-06-11T20:03:06Z

Before committing that (and thereby closing my report), please consider that other modules might be affected, just not as badly. bfloat16 becomes inaccurate for integers starting at 257.0 (which is rounded to 256.0).

that's within the range of text token ids

HaozheZhang6 · 2026-06-11T20:57:18Z

You're right — confirmed bf16 rounds 257→256, 259→260, so text positions past 256 lose precision in any RoPE that matmuls raw position ids under autocast. Ideogram4 is just the pathological case: the 65536 offset collapses a whole ≤512-wide grid onto a single value, where the others degrade gradually instead of all-at-once.

I'd checked the other diffusers transformers — Ideogram4 is the only RoPE with a large position offset, so the only catastrophic one — but the gradual loss you describe is real for the rest. I can extend the same autocast(enabled=False) guard to the other RoPE forwards in this PR, or keep this one targeted at the Ideogram4 regression and do a follow-up sweep, whichever you and the maintainers prefer. Either way it shouldn't close your report until the broader case is covered.
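The gradual precision loss discussed above follows from bfloat16 carrying 8 significant bits (7 stored plus the implicit leading one). The sketch below tabulates the resulting gap between representable integers; `bfloat16_integer_spacing` is a hypothetical helper written for this illustration, not a PyTorch or diffusers function.

```python
def bfloat16_integer_spacing(n: int) -> int:
    """Gap between adjacent bfloat16-representable integers near a positive integer n.

    Returns 1 while every integer near n is exactly representable
    (n < 256), and a power of two greater than 1 beyond that.
    """
    if n <= 0:
        raise ValueError("n must be a positive integer")
    # An integer with more than 8 significant bits loses its low-order bits.
    return 2 ** max(0, n.bit_length() - 8)


for n in (255, 257, 1024, 4096, 65536):
    print(n, bfloat16_integer_spacing(n))
# 255 1
# 257 2
# 1024 8
# 4096 32
# 65536 512
```

Positions in the low hundreds lose only their lowest bit, while positions behind a 65536 offset are quantized to multiples of 512, which matches the contrast between gradual degradation and all-at-once collapse.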

sayakpaul · 2026-06-12T02:17:00Z

+        # IMAGE_POSITION_OFFSET (65536), so an ambient autocast would otherwise run the matmul and
+        # cos/sin in bfloat16, rounding every image position to the same value and collapsing the
+        # rotary embeddings (all spatial information is lost).
+        with torch.autocast(device_type=position_ids.device.type, enabled=False):


We don't use autocast within our modeling implementation like this.

HaozheZhang6 · Jun 12, 2026

Good catch. I dropped the autocast guard and now compute the freqs in float64 instead, which autocast doesn't downcast (matching the float64 rope path Flux uses). The autocast and float32 paths come out bit-identical (max|Δ| = 0), and the regression test still passes.

dxqb · Jun 13, 2026 (edited)

> We don't use autocast within our modeling implementation like this.

@sayakpaul Why?
Maybe this can be reconsidered.

It's the right solution. Casting to float64, as @HaozheZhang6's AI suggested below, is a bad workaround.

This is what huggingface transformers does: https://github.com/huggingface/transformers/blob/08a7ef05bcf9723cb2e58855afb8dc2c799323ff/src/transformers/models/qwen3_vl/modular_qwen3_vl.py#L304

sayakpaul · Jun 13, 2026

We are considering this as a library-wide thing, i.e., how to handle these kinds of situations. So, expect a PR soon that will also include this case.

Cc: @dg845 @yiyixuxu @DN6 as we were discussing this.

dg845 · Jun 18, 2026

After some internal discussion, we decided that using torch.autocast(..., enabled=False) makes sense here, so the original implementation which uses it is fine.

Per review: replace the torch.autocast(enabled=False) guard with a float64 computation, which autocast does not downcast — matching the float64 rope path used elsewhere (Flux). The autocast and float32 paths stay bit-identical (max|delta|=0).


dg845

Thanks for the PR! Would you be willing to extend the fix to other RoPE modules? Using torch.autocast(enabled=False) would be fine, as described in #13922 (comment).

Per review, use torch.autocast(enabled=False) around the rotary matmul (as the original implementation did) rather than computing in float64, and adopt the clearer comment describing the bfloat16 collapse at the 65536 offset.

HaozheZhang6 · 2026-06-18T06:04:09Z

Done — switched Ideogram4 back to torch.autocast(..., enabled=False) around the rope matmul and took your comment, thanks @dg845.

On extending it: a fair number of the other transformer RoPE modules build their freqs the same way (matmul of raw position ids), so they'd downcast under an ambient autocast too — most won't collapse as hard as Ideogram4's 65536 offset, but they lose precision once positions pass ~257 in bf16. Happy to wrap each in autocast(enabled=False). Do you want them all in this PR, or a focused first set? A few are # Copied from-linked so I'll edit the source and run make fix-copies.

dg845 · 2026-06-18T07:02:32Z

+        # matmul to bfloat16, the image positions will collapse to only a few distinct values because bfloat16 cannot
+        # represent consecutive integers at this value (after pos 65536 each 512-integer block will collapse to the
+        # same value), which causes the image to become essentially flat. Therefore, we need to disable autocast here.
+        with torch.autocast(device_type=position_ids.device.type, enabled=False):


I think it would be better to tighten the torch.autocast region to just the freqs matmul, since it's the operation that is actually precision-sensitive and needs the guard. So maybe something like

# <explanatory comment from above>
pos = position_ids.permute(2, 0, 1).to(dtype=torch.float32)
inv_freq = self.inv_freq.to(dtype=torch.float32)[None, None, :, None].expand(3, batch_size, -1, 1)
with torch.autocast(device_type=position_ids.device.type, enabled=False):
    freqs = inv_freq @ pos.unsqueeze(2)
freqs = freqs.transpose(2, 3)  # (3, B, L, inv_freq_size)

# Rest of the implementation (setting up interleaved mrope, cos/sin call)
...

dg845 · 2026-06-18T07:08:05Z

Hi @HaozheZhang6, I think fixing all of the RoPE modules that build their freqs like Ideogram 4 and thus have the bug (with the changes propagated via make fix-copies) would be best.

Extend the Ideogram4 fix: ernie_image's `rope` and helios's `get_frequency_batched` build rotary freqs with an unguarded float32 einsum over raw position ids. Under an ambient autocast the einsum runs in bfloat16 on CUDA, which cannot represent consecutive integers past 256, so positions degrade — the same bug, matching the guards mochi/omnigen already have. Wrap each in torch.autocast(enabled=False).

Cosmos3VLTextRotaryEmbedding builds its interleaved-mrope freqs with an unguarded position-id matmul (same shape as Ideogram4), so an ambient autocast downcasts it to bfloat16 and collapses positions past 256. Wrap in torch.autocast(enabled=False).

HaozheZhang6 · 2026-06-18T07:27:21Z

Went through every RoPE module and extended the guard to the unguarded ones.

Fixed (wrapped in torch.autocast(enabled=False)): ideogram4 + cosmos3 (position-id matmul), ernie_image + helios (position-id einsum).

Already handled, left alone: omnigen (matmul already under autocast(enabled=False)), mochi (einsum already forced to fp32), hidream_image (computes in float64, which autocast leaves alone).

The torch.outer-based ones (cogview4, glm_image, qwenimage, cosmos, kandinsky, z_image, longcat, nucleusmoe, joyimage) aren't affected: autocast's lower-precision list covers the GEMM family (matmul/mm/bmm/einsum) but not outer/ger, so their positions stay full-precision (confirmed locally — outer keeps float32 under autocast).

make fix-copies is clean (none of the edited functions are # Copied from sources/copies). On tests: the matmul collapse reproduces on CPU (the existing Ideogram4 test covers it, and cosmos3 is the same path), but the einsum collapse is CUDA-only — CPU autocast doesn't downcast einsum — so I relied on the matmul test plus the mochi/omnigen precedent rather than a CPU test that can't fail there. Happy to add per-module autocast-invariance tests (CUDA-gated for ernie/helios) if you'd prefer.
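The outer versus matmul distinction can be spot-checked with a small dtype probe. This is a sketch under assumptions, not the check run for the PR: it uses CPU autocast so it runs without a GPU, and the 300 positions and 8 inverse frequencies are arbitrary illustrative values.

```python
import torch

positions = torch.arange(300, dtype=torch.float32)
inv_freq = 1.0 / (10000 ** (torch.arange(0, 8, dtype=torch.float32) / 8))

# Float32 reference computed outside any autocast context.
reference = torch.outer(positions, inv_freq)

with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    via_outer = torch.outer(positions, inv_freq)          # elementwise product, not a GEMM
    via_matmul = positions[:, None] @ inv_freq[None, :]   # GEMM family, autocast-eligible

print("outer :", via_outer.dtype)
print("matmul:", via_matmul.dtype)
print("outer matches float32 reference:", torch.equal(via_outer, reference))
```

Both calls compute the same product mathematically, but only the matmul form is routed through autocast's lower-precision list, so only it needs the `enabled=False` guard.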
