fix(compute): keep bf16 GPUStorage reshape on-device (capture-safe) #162
Merged
GPUEngine.Reshape only took the zero-copy GPUStorage[T] view path for float32
(isFloat32[T] gate); a bf16 GPUStorage[bf16] tensor fell through to e.cpu.Reshape,
producing a host tensor. That host tensor then forced the next op -- the Transpose
feeding QKL2Norm -- onto the CPU engine, whose host memcpy breaks CUDA-graph
capture ("operation would make the legacy stream depend on a capturing blocking
stream"). So even with native bf16 transpose kernels (v1.17.0), the bf16 CrossAsset
GPU bench still could not capture.
Reshape is a pure metadata/view operation (no data movement), valid for any
element type backed by GPUStorage[T]. Allow bf16 on the GPU view path
(isFloat32[T] || isBFloat16[T]). CUDA-gated test asserts a GPU-resident bf16
tensor reshaped stays *GPUStorage[bf16] with data preserved.
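The gate described above can be sketched in isolation. This is a minimal illustration, not the ztensor implementation: `bfloat16`, `deviceView`, and `reshapeView` are hypothetical stand-ins, and the bodies of `isFloat32` / `isBFloat16` are assumed (only their names appear in this change). A plain Go slice stands in for device memory so the sketch runs without CUDA.

```go
package main

import (
	"errors"
	"fmt"
)

// bfloat16 is a hypothetical stand-in for the library's bf16 element type.
type bfloat16 uint16

// isFloat32 reports whether T is float32 (assumed shape of the helper named in the PR).
func isFloat32[T any]() bool {
	var zero T
	_, ok := any(zero).(float32)
	return ok
}

// isBFloat16 reports whether T is bfloat16 (assumed shape of the helper named in the PR).
func isBFloat16[T any]() bool {
	var zero T
	_, ok := any(zero).(bfloat16)
	return ok
}

// deviceView models a GPU-resident tensor: a shared buffer handle plus shape metadata.
type deviceView[T any] struct {
	buf   *[]T // stands in for a device pointer; reshape never copies it
	shape []int
}

// numel returns the number of elements implied by shape.
func numel(shape []int) int {
	n := 1
	for _, d := range shape {
		n *= d
	}
	return n
}

// reshapeView returns a new view over the same buffer. It reports ok=false when
// the element type is not admitted to the view path, in which case a caller
// would fall back to another engine (the fall-through this change removes for bf16).
func reshapeView[T any](v deviceView[T], shape []int) (deviceView[T], bool, error) {
	if !(isFloat32[T]() || isBFloat16[T]()) {
		return deviceView[T]{}, false, nil
	}
	if numel(shape) != numel(v.shape) {
		return deviceView[T]{}, false, errors.New("reshape: element count mismatch")
	}
	// Metadata only: the buffer handle is shared, the shape is copied.
	return deviceView[T]{buf: v.buf, shape: append([]int(nil), shape...)}, true, nil
}

func main() {
	data := []bfloat16{1, 2, 3, 4, 5, 6}
	v := deviceView[bfloat16]{buf: &data, shape: []int{2, 3}}
	r, onDevice, err := reshapeView(v, []int{3, 2})
	fmt.Println(onDevice, err, r.shape, r.buf == v.buf)
}
```

With the gate limited to `isFloat32[T]()` alone, the bf16 call in `main` would report `ok=false`, which corresponds to the fall-through to the CPU path described above.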
Final piece letting the bf16 CrossAsset GPU backward run with CUDA-graph capture
ON. ADR-075 lever L4.
What
GPUEngine.Reshape now takes the zero-copy GPU view path for bf16 GPUStorage[bf16], not just f32.

Why
Reshape's GPUStorage[T] view branch was gated on isFloat32[T](), so a bf16 tensor fell through to e.cpu.Reshape, producing a host tensor. That host tensor then forced the next op (the Transpose feeding QKL2Norm) onto the CPU engine, whose host memcpy breaks CUDA-graph capture ("operation would make the legacy stream depend on a capturing blocking stream"). So even with the native bf16 transpose kernels (v1.17.0), the bf16 CrossAsset GPU bench still could not capture.

Fix
Reshape is a pure metadata/view operation (no data movement), valid for any element type backed by GPUStorage[T]. Allow bf16 on the GPU view path: isFloat32[T]() || isBFloat16[T](). Go-only change: no kernel, no .so rebuild.

Verification
compute tests green. TestGPUBF16_ReshapeStaysOnDevice: a GPU-resident bf16 tensor reshaped stays *GPUStorage[bf16] (not CPU), data preserved.

Final piece letting the bf16 CrossAsset GPU backward run with CUDA-graph capture ON (representative s/epoch). Chain: ztensor v1.16.0 (NT/TN) + v1.17.0 (transpose kernels) + zerfoo v1.53.1 (grad-accum). ADR-075 lever L4.
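The property the verification checks (storage kind unchanged, data preserved) can be sketched without a GPU. This is an illustration only and is not the body of TestGPUBF16_ReshapeStaysOnDevice: `storage`, `gpuStorage`, `cpuStorage`, `tensor`, `reshape`, and `checkReshapeStaysOnDevice` are hypothetical names, and host slices stand in for device memory.

```go
package main

import "fmt"

// bfloat16 is a hypothetical stand-in for the library's bf16 element type.
type bfloat16 uint16

// storage is a hypothetical backing-store interface.
type storage[T any] interface{ Slice() []T }

// gpuStorage and cpuStorage model device- and host-resident storage.
type gpuStorage[T any] struct{ dev []T }
type cpuStorage[T any] struct{ host []T }

// Slice on gpuStorage returns a copy, modelling a device-to-host readback.
func (g *gpuStorage[T]) Slice() []T { return append([]T(nil), g.dev...) }
func (c *cpuStorage[T]) Slice() []T { return c.host }

// tensor pairs a storage with shape metadata.
type tensor[T any] struct {
	st    storage[T]
	shape []int
}

// reshape is metadata-only in this sketch: it reuses the input's storage.
func reshape[T any](t tensor[T], shape []int) tensor[T] {
	return tensor[T]{st: t.st, shape: append([]int(nil), shape...)}
}

// checkReshapeStaysOnDevice reports an error if the reshaped tensor is not
// backed by *gpuStorage[T] or if its elements differ from the input's.
func checkReshapeStaysOnDevice[T comparable](in tensor[T], shape []int) error {
	want := in.st.Slice()
	out := reshape(in, shape)
	if _, ok := out.st.(*gpuStorage[T]); !ok {
		return fmt.Errorf("storage is %T, want *gpuStorage", out.st)
	}
	got := out.st.Slice()
	if len(got) != len(want) {
		return fmt.Errorf("length %d, want %d", len(got), len(want))
	}
	for i := range want {
		if got[i] != want[i] {
			return fmt.Errorf("element %d differs after reshape", i)
		}
	}
	return nil
}

func main() {
	onGPU := tensor[bfloat16]{st: &gpuStorage[bfloat16]{dev: []bfloat16{1, 2, 3, 4}}, shape: []int{2, 2}}
	fmt.Println("gpu-backed:", checkReshapeStaysOnDevice(onGPU, []int{4, 1}))

	onCPU := tensor[bfloat16]{st: &cpuStorage[bfloat16]{host: []bfloat16{1, 2, 3, 4}}, shape: []int{2, 2}}
	fmt.Println("cpu-backed:", checkReshapeStaysOnDevice(onCPU, []int{4, 1}))
}
```

The type assertion on the storage is the part that matters for capture safety: a result that silently became host-backed would pass a data-only comparison but still push the next op off the device.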