feat: dropout op (CPU + GB10 GPU) with deterministic Philox mask #169
Merged
Add an inverted-dropout op to ztensor with a deterministic, seedable mask:

- Philox4x32-10 counter-based RNG (compute/philox.go) keyed by (seed, element offset). The mask is a pure function of (seed, offset, p), so it is reproducible and will be bit-identical to the forthcoming CUDA kernel -- the property that makes CPU-GPU parity pass. The CPU Go impl is the reference; the GB10 kernel mirrors the same constants.
- Dropouter[T] optional capability interface (compute/engine.go): Dropout + DropoutBackward. Inverted-dropout semantics match torch.nn.functional.dropout: training mode y = x*mask/(1-p) with mask~Bernoulli(1-p); eval mode / p==0 is exact identity. p in [0,1) is validated.
- CPU implementation (compute/cpu_engine.go): the mask is recomputed in backward from (seed,p) rather than cached, keeping the op capture-safe and avoiding a pinned save across arena resets (ADR 006). Forward and backward share one masked-and-scaled kernel (dropout is linear in its input given the mask).
- EngineProxy delegates the capability; the parity StressEngine relocates the masked output into the host arena like every other op so dropout runs the reset-between-fwd-bwd schedules.

Gates (CPU half):

- gradcheck: Dropout OpInfo (p=0.3, fixed seed, [4,8]) -- the deterministic mask makes dropout an exact linear map, finite-diff == analytic backward. PASS.
- CPU-side parity (testing/parity arena-stress, RegistryGreen): PASS.
- PyTorch oracle: SkipReason -- torch's training-mode mask uses its own Philox word->element mapping, not ztensor's; matching it would mean reimplementing ztensor's Philox in the torch runner (the HadamardTransform precedent). Mask-vs-input math is pinned by gradcheck + parity instead.
- Unit tests: p=0 identity, eval identity, mask determinism, inverted-dropout mean preservation, backward==mask, invalid p. PASS under -race.

GPU kernel (GB10 f32) + GB10 parity/oracle replay are the next milestone.

Claude-Session: https://claude.ai/code/session_01So96MEV1hiThH4XqCd6rLH
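To make the "mask is a pure function of (seed, offset, p)" idea concrete, here is a minimal Go sketch of a Philox-keyed inverted dropout. It is not the code in compute/philox.go: the round function uses the standard published Philox4x32 constants, but the way (seed, offset) is packed into the counter and key, the word-to-uniform conversion, and the names `philox4x32`, `keep`, and `maskScale` are all assumptions made for illustration. The keep rule `u >= p` follows the kernel description in this PR.

```go
package main

import (
	"fmt"
	"math/bits"
)

// Standard Philox4x32 multipliers and Weyl key increments.
const (
	philoxM0 = 0xD2511F53
	philoxM1 = 0xCD9E8D57
	philoxW0 = 0x9E3779B9
	philoxW1 = 0xBB67AE85
)

// philox4x32 runs 10 Philox rounds on a 128-bit counter with a 64-bit key.
func philox4x32(ctr [4]uint32, key [2]uint32) [4]uint32 {
	for round := 0; round < 10; round++ {
		hi0, lo0 := bits.Mul32(philoxM0, ctr[0])
		hi1, lo1 := bits.Mul32(philoxM1, ctr[2])
		ctr = [4]uint32{hi1 ^ ctr[1] ^ key[0], lo1, hi0 ^ ctr[3] ^ key[1], lo0}
		key[0] += philoxW0
		key[1] += philoxW1
	}
	return ctr
}

// keep reports whether the element at offset survives dropout.
// ASSUMPTION: counter = offset/4, key = seed, word = offset%4; the real
// mapping in ztensor may differ.
func keep(seed, offset uint64, p float32) bool {
	block := offset / 4
	out := philox4x32(
		[4]uint32{uint32(block), uint32(block >> 32), 0, 0},
		[2]uint32{uint32(seed), uint32(seed >> 32)},
	)
	// Top 24 bits -> uniform in [0, 1), exactly representable in float32.
	u := float32(out[offset%4]>>8) * (1.0 / (1 << 24))
	return u >= p
}

// maskScale applies the mask and the 1/(1-p) scale. Because dropout is linear
// in its input given the mask, the same routine serves forward (src = x) and
// backward (src = upstream gradient).
func maskScale(src []float32, seed uint64, p float32, training bool) ([]float32, error) {
	if !(p >= 0 && p < 1) {
		return nil, fmt.Errorf("dropout: p must be in [0,1), got %v", p)
	}
	dst := make([]float32, len(src))
	if !training || p == 0 {
		copy(dst, src) // eval mode / p==0: exact identity
		return dst, nil
	}
	invKeep := 1 / (1 - p)
	for i, v := range src {
		if keep(seed, uint64(i), p) {
			dst[i] = v * invKeep
		}
	}
	return dst, nil
}

func main() {
	x := []float32{1, 2, 3, 4, 5, 6, 7, 8}
	y, _ := maskScale(x, 42, 0.3, true)
	dx, _ := maskScale([]float32{1, 1, 1, 1, 1, 1, 1, 1}, 42, 0.3, true)
	fmt.Println(y)  // forward output
	fmt.Println(dx) // backward of an all-ones gradient: the scaled mask itself
}
```

Nothing is cached between the two calls in `main`; the backward pass regenerates the same mask from the same (seed, p), which is the property the commit relies on.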
…B.3a)

Add the f32 GPU half of the dropout op, bit-identical to the CPU reference:

- internal/cuda/kernels/dropout.cu: Philox4x32-10 device function with the same constants/round structure as compute/philox.go, so the (seed, offset) -> mask draw matches the CPU engine word-for-word. dropout_f32 launcher does the masked-and-scaled write in training mode (out = (u>=p)? in*invKeep : 0) and an exact identity copy in eval mode / p==0. invKeep is passed host-side as 1/(1-p) so the scale equals the CPU path's bit-for-bit. Added to Makefile SRCS (sm_121 for GB10).
- purego + cgo wrappers (dropout_purego.go / dropout.go) and the dropout_f32 symbol registration in purego.go, following the argmax kernel pattern.
- gpuapi.Dropouter optional KernelRunner extension (kernels.go) + CUDAKernels impl, mirroring BFloat16Transposer; callers type-assert and report unavailability when absent (no stub fallback).
- GPUEngine.Dropout / DropoutBackward (compute/gpu_dropout.go) reuse the gpu_kernels.go scaffolding (getDevicePtr, dst-reuse, makeGPUResult). The mask is recomputed in backward from (seed,p), never cached -- capture-safe, no save pinned across an arena reset (ADR 006).
- CUDA-gated CPU-GPU parity tests (compute/gpu_dropout_parity_test.go): GPU vs CPU dropout forward (multiple shapes/p/train-eval) and backward must be bit-identical; skip cleanly without a GPU.

CPU suite stays green (go test ./... and -race). The GB10 parity gate runs via the cuda-tagged path on the DGX (Spark); nvcc/CUDA toolkit is not present locally, so the sm_121 kernel build + GPU parity run are the remaining step.

Claude-Session: https://claude.ai/code/session_01So96MEV1hiThH4XqCd6rLH
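The "type-assert and report unavailability" pattern for the optional extension might look roughly like the sketch below. The interface shapes, method signature, and the names `KernelRunner`, `DropoutF32`, `ErrDropoutUnavailable`, and `runDropout` are hypothetical stand-ins, not the actual gpuapi definitions; only the idea (an optional interface discovered by type assertion, with an error instead of a stub fallback) comes from the commit message.

```go
package main

import (
	"errors"
	"fmt"
)

// KernelRunner stands in for the core runner interface (hypothetical, minimal).
type KernelRunner interface {
	Name() string
}

// Dropouter is the optional extension; runners opt in by implementing it.
type Dropouter interface {
	DropoutF32(dst, src []float32, seed uint64, p float32, training bool) error
}

// ErrDropoutUnavailable is returned when the runner lacks the capability.
var ErrDropoutUnavailable = errors.New("kernel runner does not implement Dropouter")

// runDropout discovers the capability by type assertion and reports an error
// when it is absent, rather than silently falling back to a stub.
func runDropout(r KernelRunner, dst, src []float32, seed uint64, p float32, training bool) error {
	d, ok := r.(Dropouter)
	if !ok {
		return fmt.Errorf("%s: %w", r.Name(), ErrDropoutUnavailable)
	}
	return d.DropoutF32(dst, src, seed, p, training)
}

// plainRunner implements only the core interface.
type plainRunner struct{}

func (plainRunner) Name() string { return "plain" }

// dropoutRunner also implements the optional extension.
type dropoutRunner struct{}

func (dropoutRunner) Name() string { return "with-dropout" }

func (dropoutRunner) DropoutF32(dst, src []float32, _ uint64, _ float32, _ bool) error {
	copy(dst, src) // placeholder body: identity copy only, no real kernel launch
	return nil
}

func main() {
	src := []float32{1, 2, 3}
	dst := make([]float32, len(src))
	fmt.Println(runDropout(plainRunner{}, dst, src, 1, 0.5, true))    // capability missing
	fmt.Println(runDropout(dropoutRunner{}, dst, src, 1, 0.5, false)) // capability present
}
```

Keeping the capability out of the core interface means existing runners compile unchanged, and a caller that needs dropout learns immediately when the backend cannot provide it.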
One-pod GB10 verify mirroring deploy/spark/bf16-verify-gb10.yaml: clone the bpb3a-dropout branch, build libkernels.so for sm_121, run the CUDA-gated dropout CPU-GPU parity tests (TestGPUDropout_*) and the full-registry GB10 parity run (testing/parity -run _GPU, which now includes Dropout). Claude-Session: https://claude.ai/code/session_01So96MEV1hiThH4XqCd6rLH
The first GB10 run failed: GPU dropout returned all zeros vs the CPU reference. Root cause is the purego/dlopen launch ABI -- the AAPCS64 trampoline (internal/cuda/purego_linux_arm64.s) loads every argument into integer registers R0-R7 only and never populates the V float registers. A C kernel with `float` parameters reads them from V registers (garbage/zero) AND, because integer and float args consume separate AAPCS64 register sequences, every argument after the first float is shifted -- so seed/training/stream were all misread and the kernel wrote zeros. Fix: pass p and invKeep as their 32-bit IEEE-754 bit patterns in uint32 integer parameters and reinterpret them inside the kernel with __uint_as_float. The ABI is now integer-only and identical between the CGO (-tags cuda) and purego launch paths. Updated dropout.cu, dropout.go (cgo), dropout_purego.go. Claude-Session: https://claude.ai/code/session_01So96MEV1hiThH4XqCd6rLH
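On the Go side, the integer-only ABI comes down to sending each float32 as its raw IEEE-754 bit pattern. The sketch below shows that round trip with the standard library; the helper names `floatArgBits` and `floatFromArgBits` are illustrative and not taken from dropout.go or dropout_purego.go, and the Go-side decode merely stands in for whatever reinterpretation the receiving C/CUDA code performs.

```go
package main

import (
	"fmt"
	"math"
)

// floatArgBits returns the IEEE-754 bit pattern of f, suitable for passing in
// an integer register.
func floatArgBits(f float32) uint32 { return math.Float32bits(f) }

// floatFromArgBits is the inverse reinterpretation, shown here in Go only to
// demonstrate that the round trip is lossless.
func floatFromArgBits(u uint32) float32 { return math.Float32frombits(u) }

func main() {
	p := float32(0.3)
	invKeep := 1 / (1 - p) // computed once on the host

	pBits, keepBits := floatArgBits(p), floatArgBits(invKeep)
	fmt.Printf("p bits=%#08x invKeep bits=%#08x\n", pBits, keepBits)

	// The reinterpretation is exact: no rounding, no conversion.
	fmt.Println(floatFromArgBits(pBits) == p, floatFromArgBits(keepBits) == invKeep)
}
```

Because every argument is now an integer, the trampoline's R0-R7-only marshalling and the cgo path see the same argument order, and the scale value reaching the kernel is the exact one computed on the host.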
…s_float) nvcc rejected calling the device intrinsic __uint_as_float from the __host__ launcher dropout_f32. Reinterpret the uint32 bit patterns to float on the host with a memcpy helper (host_uint_as_float) instead; the resulting float kernel args marshal correctly through the CUDA <<<>>> launch (kernel-arg passing is unaffected by the host dlopen ABI). Claude-Session: https://claude.ai/code/session_01So96MEV1hiThH4XqCd6rLH
Plan task BPB.3a. Adds a general-purpose inverted-dropout op to ztensor with a deterministic, seedable mask, on both the CPU engine and the GB10 GPU (f32). Tracking: #168.
Design
- … `compute/philox.go`) and GPU (`internal/cuda/kernels/dropout.cu`), so masks are bit-identical -- the property that makes CPU-GPU parity and the oracle's eval-identity meaningful. No stateful cuRAND.
- `torch.nn.functional.dropout`: training `y = x*mask/(1-p)` with `mask~Bernoulli(1-p)`; eval / `p==0` exact identity. `p` scalar in [0,1).
- … `compute.Dropouter[T]`) + optional `gpuapi.Dropouter` KernelRunner extension (mirrors `BFloat16Transposer`), so the core `Engine` interface is untouched and absent capability reports an error -- no stub fallback.

Files
- `compute/philox.go`, `compute/cpu_engine.go` (CPU op), `compute/gpu_dropout.go` (GPU op), `compute/engine.go` (Dropouter interface), `compute/engine_proxy.go` (delegation).
- `internal/cuda/kernels/dropout.cu` + `dropout.go` / `dropout_purego.go` wrappers, `purego.go` symbol, `Makefile` SRCS (sm_121).
- `internal/gpuapi/kernels.go` + `cuda_kernels.go` (Dropouter extension).
- `testing/gradcheck/ops.go`, `registry.go`. oracle: `testing/oracle/torchmap.go`, `generate_test.go`. parity: `testing/parity/stress_engine.go`.
- `compute/dropout_test.go` (CPU), `compute/gpu_dropout_parity_test.go` (GB10). Manifest: `deploy/spark/dropout-verify-gb10.yaml`.

Gates
- `Dropout` OpInfo PASS -- deterministic mask => exact linear map, finite-diff == analytic backward.
- `parity PASS Dropout` both arena-stress schedules, fwd & bwd `max_abs=0.000e+00` (bit-identical); `TestGPUDropout_CPUParity` + `TestGPUDropout_Backward_CPUParity` PASS on the GB10. Full registry GPU parity + red-proof + Wolf-pattern training loop all PASS.
- … `-race`.

https://claude.ai/code/session_01So96MEV1hiThH4XqCd6rLH
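The gradcheck gate rests on the observation that, once the mask is fixed, dropout is a linear map, so central finite differences should agree with the analytic backward up to rounding. The sketch below illustrates that check in float64 with a hand-written mask; it is not the testing/gradcheck harness, and `dropoutFixed`, `maxGradErr`, the mask, and the step size are all assumptions chosen for the example.

```go
package main

import (
	"fmt"
	"math"
)

// dropoutFixed applies a fixed keep-mask and the 1/(1-p) scale.
func dropoutFixed(src []float64, keep []bool, p float64) []float64 {
	out := make([]float64, len(src))
	for i, v := range src {
		if keep[i] {
			out[i] = v / (1 - p)
		}
	}
	return out
}

// maxGradErr compares the analytic gradient of L(x) = sum_i w[i]*y[i] with a
// central finite-difference estimate and returns the largest absolute gap.
func maxGradErr(x, w []float64, keep []bool, p, h float64) float64 {
	loss := func(in []float64) float64 {
		s := 0.0
		for i, y := range dropoutFixed(in, keep, p) {
			s += w[i] * y
		}
		return s
	}
	// Backward is the same masked-and-scaled map applied to the upstream gradient w.
	analytic := dropoutFixed(w, keep, p)

	worst := 0.0
	for i := range x {
		plus := append([]float64(nil), x...)
		minus := append([]float64(nil), x...)
		plus[i] += h
		minus[i] -= h
		numeric := (loss(plus) - loss(minus)) / (2 * h)
		worst = math.Max(worst, math.Abs(numeric-analytic[i]))
	}
	return worst
}

func main() {
	x := []float64{0.5, -1.25, 2, 3.5}
	w := []float64{1, -2, 0.5, 4}
	keep := []bool{true, false, true, true} // hand-written mask for illustration
	fmt.Printf("max |numeric - analytic| = %.3e\n", maxGradErr(x, w, keep, 0.3, 1e-3))
}
```

With a stochastic mask that changed between the two loss evaluations, the finite difference would be meaningless; determinism is what makes the comparison well defined.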