feat: dropout op (CPU + GB10 GPU) with deterministic Philox mask by dndungu · Pull Request #169 · zerfoo/ztensor

dndungu · 2026-06-20T22:00:09Z

Plan task BPB.3a. Adds a general-purpose inverted-dropout op to ztensor with a deterministic, seedable mask, on both the CPU engine and the GB10 GPU (f32). Tracking: #168.

Design

Counter-based Philox4x32-10 keyed by (seed, linear element offset). The same (seed, offset) yields the same draw on CPU (compute/philox.go) and GPU (internal/cuda/kernels/dropout.cu), so masks are bit-identical -- the property that makes CPU-GPU parity and the oracle's eval-identity meaningful. No stateful cuRAND.
Mask recomputed in backward, never cached -- pure function of (seed, offset, p). Capture-safe; nothing pinned across an arena reset (ADR 006).
Inverted-dropout matching torch.nn.functional.dropout: training y = x*mask/(1-p) with mask~Bernoulli(1-p); eval / p==0 exact identity. p scalar in [0,1).
Wired as an optional capability interface (compute.Dropouter[T]) + optional gpuapi.Dropouter KernelRunner extension (mirrors BFloat16Transposer), so the core Engine interface is untouched and absent capability reports an error -- no stub fallback.
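The design points above can be sketched in a few lines of Go. This is a minimal illustration, not the code in compute/philox.go: the round function and constants are the published Philox4x32-10 ones, but the way `seed` and `offset` are packed into the key and counter, the choice of the first output word, the 24-bit float conversion, and the names `philox4x32`, `uniform`, and `dropout` are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"math/bits"
)

// Philox4x32 multiplier and Weyl constants (Random123 reference values).
const (
	philoxM0 = 0xD2511F53
	philoxM1 = 0xCD9E8D57
	philoxW0 = 0x9E3779B9
	philoxW1 = 0xBB67AE85
)

// philox4x32 runs ten Philox rounds on a 128-bit counter with a 64-bit key.
// The key bump after the final round is unused and therefore harmless.
func philox4x32(ctr [4]uint32, key [2]uint32) [4]uint32 {
	for round := 0; round < 10; round++ {
		hi0, lo0 := bits.Mul32(philoxM0, ctr[0])
		hi1, lo1 := bits.Mul32(philoxM1, ctr[2])
		ctr = [4]uint32{hi1 ^ ctr[1] ^ key[0], lo1, hi0 ^ ctr[3] ^ key[1], lo0}
		key[0] += philoxW0
		key[1] += philoxW1
	}
	return ctr
}

// uniform maps (seed, offset) to a float32 in [0, 1).
// ASSUMPTION: seed fills the key, offset fills the low counter words, and
// only the first output word is used. The real mapping may differ.
func uniform(seed, offset uint64) float32 {
	key := [2]uint32{uint32(seed), uint32(seed >> 32)}
	ctr := [4]uint32{uint32(offset), uint32(offset >> 32), 0, 0}
	r := philox4x32(ctr, key)
	return float32(r[0]>>8) * (1.0 / (1 << 24)) // top 24 bits, exact in float32
}

// dropout applies inverted dropout. The mask is recomputed from
// (seed, offset, p) on every call, so nothing has to be cached for backward.
func dropout(x []float32, p float32, seed uint64, training bool) []float32 {
	out := make([]float32, len(x))
	if !training || p == 0 {
		copy(out, x) // eval mode and p == 0 are an exact identity
		return out
	}
	invKeep := 1 / (1 - p)
	for i, v := range x {
		if uniform(seed, uint64(i)) >= p { // keep with probability 1-p
			out[i] = v * invKeep
		}
	}
	return out
}

func main() {
	x := []float32{1, 2, 3, 4, 5, 6, 7, 8}
	y := dropout(x, 0.3, 42, true)
	// Backward is the same masked-and-scaled map applied to the upstream
	// gradient, using the same (seed, p).
	dy := []float32{1, 1, 1, 1, 1, 1, 1, 1}
	dx := dropout(dy, 0.3, 42, true)
	fmt.Println("forward: ", y)
	fmt.Println("backward:", dx)
}
```

Because every draw depends only on (seed, offset), the elements can be processed in any order or in parallel and still produce the same mask, which is what lets a GPU kernel reproduce the CPU result.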

Files

compute/philox.go, compute/cpu_engine.go (CPU op), compute/gpu_dropout.go (GPU op), compute/engine.go (Dropouter interface), compute/engine_proxy.go (delegation).
internal/cuda/kernels/dropout.cu + dropout.go/dropout_purego.go wrappers, purego.go symbol, Makefile SRCS (sm_121).
internal/gpuapi/kernels.go + cuda_kernels.go (Dropouter extension).
gradcheck: testing/gradcheck/ops.go, registry.go. oracle: testing/oracle/torchmap.go, generate_test.go. parity: testing/parity/stress_engine.go.
Tests: compute/dropout_test.go (CPU), compute/gpu_dropout_parity_test.go (GB10). Manifest: deploy/spark/dropout-verify-gb10.yaml.

Gates

gradcheck: Dropout OpInfo PASS -- deterministic mask => exact linear map, finite-diff == analytic backward.
CPU-GPU parity (GB10, observed via Spark): parity PASS Dropout both arena-stress schedules, fwd & bwd max_abs=0.000e+00 (bit-identical); TestGPUDropout_CPUParity + TestGPUDropout_Backward_CPUParity PASS on the GB10. Full registry GPU parity + red-proof + Wolf-pattern training loop all PASS.
PyTorch oracle: SkipReason -- torch's training-mode mask uses its own Philox word->element mapping; matching it would mean reimplementing ztensor's Philox in the torch runner (the HadamardTransform precedent). Mask-vs-input math is pinned by gradcheck + parity; eval-mode identity is unit-tested.
CPU suite green under -race.
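The gradcheck gate relies on dropout being an exact linear map once the mask is fixed. A small sketch of that argument, assuming a hand-written keep mask instead of the Philox-derived one and hypothetical helper names (`maskedScale`, `loss`, `maxGradError`) that are not part of testing/gradcheck:

```go
package main

import (
	"fmt"
	"math"
)

// maskedScale applies a fixed keep mask and the 1/(1-p) scale.
// With the mask held fixed this is linear in x.
func maskedScale(x []float64, keep []bool, p float64) []float64 {
	out := make([]float64, len(x))
	inv := 1 / (1 - p)
	for i, v := range x {
		if keep[i] {
			out[i] = v * inv
		}
	}
	return out
}

// loss is a weighted sum of the dropout output; w plays the role of the
// upstream gradient dL/dy.
func loss(x []float64, keep []bool, p float64, w []float64) float64 {
	s := 0.0
	for i, y := range maskedScale(x, keep, p) {
		s += w[i] * y
	}
	return s
}

// maxGradError returns the largest gap between the analytic backward
// (the same masked-and-scaled map applied to w) and central differences.
func maxGradError(x []float64, keep []bool, p float64, w []float64, h float64) float64 {
	analytic := maskedScale(w, keep, p)
	worst := 0.0
	for i := range x {
		xp := append([]float64(nil), x...)
		xm := append([]float64(nil), x...)
		xp[i] += h
		xm[i] -= h
		numeric := (loss(xp, keep, p, w) - loss(xm, keep, p, w)) / (2 * h)
		worst = math.Max(worst, math.Abs(numeric-analytic[i]))
	}
	return worst
}

func main() {
	x := []float64{0.5, -1.25, 2.0, 3.5}
	keep := []bool{true, false, true, true} // illustrative fixed mask
	w := []float64{1.0, 2.0, -0.5, 0.25}
	fmt.Println("max |finite-diff - analytic|:", maxGradError(x, keep, 0.3, w, 1e-3))
}
```

For a linear map the central difference has no truncation error, so any remaining gap is floating-point rounding only. A mask that changed between the two evaluations would break this, which is why the deterministic seed matters for the gate.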

https://claude.ai/code/session_01So96MEV1hiThH4XqCd6rLH

Add an inverted-dropout op to ztensor with a deterministic, seedable mask:

- Philox4x32-10 counter-based RNG (compute/philox.go) keyed by (seed, element offset). The mask is a pure function of (seed, offset, p), so it is reproducible and will be bit-identical to the forthcoming CUDA kernel -- the property that makes CPU-GPU parity pass. The CPU Go impl is the reference; the GB10 kernel mirrors the same constants.
- Dropouter[T] optional capability interface (compute/engine.go): Dropout + DropoutBackward. Inverted-dropout semantics match torch.nn.functional.dropout: training mode y = x*mask/(1-p) with mask~Bernoulli(1-p); eval mode / p==0 is exact identity. p in [0,1) is validated.
- CPU implementation (compute/cpu_engine.go): the mask is recomputed in backward from (seed,p) rather than cached, keeping the op capture-safe and avoiding a pinned save across arena resets (ADR 006). Forward and backward share one masked-and-scaled kernel (dropout is linear in its input given the mask).
- EngineProxy delegates the capability; the parity StressEngine relocates the masked output into the host arena like every other op so dropout runs the reset-between-fwd-bwd schedules.

Gates (CPU half):

- gradcheck: Dropout OpInfo (p=0.3, fixed seed, [4,8]) -- the deterministic mask makes dropout an exact linear map, finite-diff == analytic backward. PASS.
- CPU-side parity (testing/parity arena-stress, RegistryGreen): PASS.
- PyTorch oracle: SkipReason -- torch's training-mode mask uses its own Philox word->element mapping, not ztensor's; matching it would mean reimplementing ztensor's Philox in the torch runner (the HadamardTransform precedent). Mask-vs-input math is pinned by gradcheck + parity instead.
- Unit tests: p=0 identity, eval identity, mask determinism, inverted-dropout mean preservation, backward==mask, invalid p. PASS under -race.

GPU kernel (GB10 f32) + GB10 parity/oracle replay are the next milestone.

Claude-Session: https://claude.ai/code/session_01So96MEV1hiThH4XqCd6rLH
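The optional-capability wiring named in this commit follows a common Go pattern: the core interface stays as it is, and callers type-assert for the extra capability. The sketch below only illustrates that dispatch. `Engine`, `Dropouter`, `ErrNoDropout`, `applyDropout`, and both engine types are hypothetical stand-ins, and the method signature is an assumption rather than the one in compute/engine.go.

```go
package main

import (
	"errors"
	"fmt"
)

// Engine stands in for the core engine interface, which is left untouched.
type Engine interface {
	Name() string
}

// Dropouter is a hypothetical optional capability an engine may implement.
type Dropouter interface {
	Dropout(x []float32, p float32, seed uint64, training bool) ([]float32, error)
}

// ErrNoDropout is returned when the engine lacks the capability.
var ErrNoDropout = errors.New("engine does not implement Dropouter")

// basicEngine implements only the core interface.
type basicEngine struct{}

func (basicEngine) Name() string { return "basic" }

// dropoutEngine also implements the optional capability.
type dropoutEngine struct{}

func (dropoutEngine) Name() string { return "dropout-capable" }

// Dropout validates p and handles the identity cases only; the training
// path is deliberately left out of this sketch.
func (dropoutEngine) Dropout(x []float32, p float32, seed uint64, training bool) ([]float32, error) {
	if p < 0 || p >= 1 {
		return nil, fmt.Errorf("dropout: p=%v outside [0,1)", p)
	}
	if !training || p == 0 {
		return append([]float32(nil), x...), nil // exact identity
	}
	return nil, errors.New("training path omitted in this sketch")
}

// applyDropout type-asserts for the capability and reports an error when
// it is absent, rather than falling back to a stub.
func applyDropout(e Engine, x []float32, p float32, seed uint64, training bool) ([]float32, error) {
	d, ok := e.(Dropouter)
	if !ok {
		return nil, ErrNoDropout
	}
	return d.Dropout(x, p, seed, training)
}

func main() {
	x := []float32{1, 2, 3}
	for _, e := range []Engine{basicEngine{}, dropoutEngine{}} {
		y, err := applyDropout(e, x, 0.3, 42, false)
		fmt.Println(e.Name(), y, err)
	}
}
```

Returning an error when the assertion fails keeps a missing kernel visible to the caller, which is the behaviour the PR describes as "no stub fallback".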

…B.3a)

Add the f32 GPU half of the dropout op, bit-identical to the CPU reference:

- internal/cuda/kernels/dropout.cu: Philox4x32-10 device function with the same constants/round structure as compute/philox.go, so the (seed, offset) -> mask draw matches the CPU engine word-for-word. dropout_f32 launcher does the masked-and-scaled write in training mode (out = (u>=p)? in*invKeep : 0) and an exact identity copy in eval mode / p==0. invKeep is passed host-side as 1/(1-p) so the scale equals the CPU path's bit-for-bit. Added to Makefile SRCS (sm_121 for GB10).
- purego + cgo wrappers (dropout_purego.go / dropout.go) and the dropout_f32 symbol registration in purego.go, following the argmax kernel pattern.
- gpuapi.Dropouter optional KernelRunner extension (kernels.go) + CUDAKernels impl, mirroring BFloat16Transposer; callers type-assert and report unavailability when absent (no stub fallback).
- GPUEngine.Dropout / DropoutBackward (compute/gpu_dropout.go) reuse the gpu_kernels.go scaffolding (getDevicePtr, dst-reuse, makeGPUResult). The mask is recomputed in backward from (seed,p), never cached -- capture-safe, no save pinned across an arena reset (ADR 006).
- CUDA-gated CPU-GPU parity tests (compute/gpu_dropout_parity_test.go): GPU vs CPU dropout forward (multiple shapes/p/train-eval) and backward must be bit-identical; skip cleanly without a GPU.

CPU suite stays green (go test ./... and -race). The GB10 parity gate runs via the cuda-tagged path on the DGX (Spark); nvcc/CUDA toolkit is not present locally, so the sm_121 kernel build + GPU parity run are the remaining step.

Claude-Session: https://claude.ai/code/session_01So96MEV1hiThH4XqCd6rLH

One-pod GB10 verify mirroring deploy/spark/bf16-verify-gb10.yaml: clone the bpb3a-dropout branch, build libkernels.so for sm_121, run the CUDA-gated dropout CPU-GPU parity tests (TestGPUDropout_*) and the full-registry GB10 parity run (testing/parity -run _GPU, which now includes Dropout). Claude-Session: https://claude.ai/code/session_01So96MEV1hiThH4XqCd6rLH

The first GB10 run failed: GPU dropout returned all zeros vs the CPU reference. Root cause is the purego/dlopen launch ABI -- the AAPCS64 trampoline (internal/cuda/purego_linux_arm64.s) loads every argument into integer registers R0-R7 only and never populates the V float registers. A C kernel with `float` parameters reads them from V registers (garbage/zero) AND, because integer and float args consume separate AAPCS64 register sequences, every argument after the first float is shifted -- so seed/training/stream were all misread and the kernel wrote zeros. Fix: pass p and invKeep as their 32-bit IEEE-754 bit patterns in uint32 integer parameters and reinterpret them inside the kernel with __uint_as_float. The ABI is now integer-only and identical between the CGO (-tags cuda) and purego launch paths. Updated dropout.cu, dropout.go (cgo), dropout_purego.go. Claude-Session: https://claude.ai/code/session_01So96MEV1hiThH4XqCd6rLH

…s_float)

nvcc rejected calling the device intrinsic __uint_as_float from the __host__ launcher dropout_f32. Reinterpret the uint32 bit patterns to float on the host with a memcpy helper (host_uint_as_float) instead; the resulting float kernel args marshal correctly through the CUDA <<<>>> launch (kernel-arg passing is unaffected by the host dlopen ABI).

Claude-Session: https://claude.ai/code/session_01So96MEV1hiThH4XqCd6rLH
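The bit-pattern hand-off described in the last two commits can be illustrated from the Go side with the standard `math.Float32bits` and `math.Float32frombits` functions. This is a sketch of the idea only: `packDropoutArgs` and `unpackFloat` are hypothetical names, and the actual wrappers in dropout.go / dropout_purego.go may marshal their arguments differently.

```go
package main

import (
	"fmt"
	"math"
)

// packDropoutArgs converts p and 1/(1-p) into their IEEE-754 bit patterns,
// so every launch argument can travel in an integer register.
func packDropoutArgs(p float32) (pBits, invKeepBits uint32) {
	invKeep := 1 / (1 - p) // computed once on the host
	return math.Float32bits(p), math.Float32bits(invKeep)
}

// unpackFloat mirrors what the receiving side does: reinterpret the 32 bits
// as a float without any numeric conversion.
func unpackFloat(b uint32) float32 {
	return math.Float32frombits(b)
}

func main() {
	pBits, invKeepBits := packDropoutArgs(0.3)
	fmt.Printf("p bits = 0x%08x, invKeep bits = 0x%08x\n", pBits, invKeepBits)
	fmt.Println("round trip:", unpackFloat(pBits), unpackFloat(invKeepBits))
}
```

A reinterpretation is lossless, so the value the kernel sees is the same float32 the host computed, and the scale stays bit-identical to the CPU path.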

dndungu added 5 commits June 20, 2026 14:36

dndungu mentioned this pull request Jun 20, 2026

Dropout op (CPU + GB10 GPU) with deterministic Philox mask #168

Closed

dndungu merged commit d00cc0c into main Jun 20, 2026
1 check passed

dndungu merged 5 commits into main from bpb3a-dropout
