feat: dropout op (CPU + GB10 GPU) with deterministic Philox mask#169

Merged: dndungu merged 5 commits into main from bpb3a-dropout on Jun 20, 2026

dndungu commented Jun 20, 2026

Plan task BPB.3a. Adds a general-purpose inverted-dropout op to ztensor with a deterministic, seedable mask, on both the CPU engine and the GB10 GPU (f32). Tracking: #168.

Design

  • Counter-based Philox4x32-10 keyed by (seed, linear element offset). The same (seed, offset) yields the same draw on CPU (compute/philox.go) and GPU (internal/cuda/kernels/dropout.cu), so masks are bit-identical -- the property that makes CPU-GPU parity and the oracle's eval-identity meaningful. No stateful cuRAND.
  • Mask recomputed in backward, never cached -- pure function of (seed, offset, p). Capture-safe; nothing pinned across an arena reset (ADR 006).
  • Inverted-dropout matching torch.nn.functional.dropout: training y = x*mask/(1-p) with mask~Bernoulli(1-p); eval / p==0 exact identity. p scalar in [0,1).
  • Wired as an optional capability interface (compute.Dropouter[T]) + optional gpuapi.Dropouter KernelRunner extension (mirrors BFloat16Transposer), so the core Engine interface is untouched and absent capability reports an error -- no stub fallback.
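
The design points above can be illustrated with a minimal Go sketch. It is not the code in compute/philox.go: the round constants are the published Philox4x32 ones, but the way (seed, offset) is packed into the key and counter, the use of the top 24 bits of the first output word as the uniform draw, and the function names `uniform` and `dropout` are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"math/bits"
)

// Published Philox4x32 multipliers and Weyl key increments.
const (
	philoxM0 uint32 = 0xD2511F53
	philoxM1 uint32 = 0xCD9E8D57
	philoxW0 uint32 = 0x9E3779B9
	philoxW1 uint32 = 0xBB67AE85
)

// philox4x32 applies 10 Philox rounds to a 128-bit counter under a 64-bit key.
// The key is bumped before every round except the first.
func philox4x32(ctr [4]uint32, key [2]uint32) [4]uint32 {
	for round := 0; round < 10; round++ {
		if round > 0 {
			key[0] += philoxW0
			key[1] += philoxW1
		}
		hi0, lo0 := bits.Mul32(philoxM0, ctr[0])
		hi1, lo1 := bits.Mul32(philoxM1, ctr[2])
		ctr = [4]uint32{hi1 ^ ctr[1] ^ key[0], lo1, hi0 ^ ctr[3] ^ key[1], lo0}
	}
	return ctr
}

// uniform maps (seed, offset) to a float32 in [0,1).
// ASSUMPTION: seed fills the key, offset fills the low counter words, and the
// top 24 bits of output word 0 form the draw. The real packing may differ.
func uniform(seed, offset uint64) float32 {
	key := [2]uint32{uint32(seed), uint32(seed >> 32)}
	ctr := [4]uint32{uint32(offset), uint32(offset >> 32), 0, 0}
	w := philox4x32(ctr, key)[0]
	return float32(w>>8) * (1.0 / (1 << 24))
}

// dropout is inverted dropout: y = x*mask/(1-p) in training, identity otherwise.
// The mask is recomputed from (seed, offset, p) on every call, never stored.
func dropout(x []float32, p float32, seed uint64, training bool) ([]float32, error) {
	if !(p >= 0 && p < 1) { // also rejects NaN
		return nil, fmt.Errorf("dropout: p=%v outside [0,1)", p)
	}
	out := make([]float32, len(x))
	if !training || p == 0 {
		copy(out, x)
		return out, nil
	}
	invKeep := 1 / (1 - p)
	for i, v := range x {
		if uniform(seed, uint64(i)) >= p { // kept with probability 1-p
			out[i] = v * invKeep
		}
	}
	return out, nil
}

func main() {
	x := []float32{1, 2, 3, 4, 5, 6, 7, 8}
	y, _ := dropout(x, 0.3, 42, true)
	fmt.Println(y)
}
```

Because the draw depends only on (seed, offset), a backward pass in this sketch could call the same `dropout` function on the upstream gradient and obtain the identical mask without saving anything from the forward pass.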

Files

  • compute/philox.go, compute/cpu_engine.go (CPU op), compute/gpu_dropout.go (GPU op), compute/engine.go (Dropouter interface), compute/engine_proxy.go (delegation).
  • internal/cuda/kernels/dropout.cu + dropout.go/dropout_purego.go wrappers, purego.go symbol, Makefile SRCS (sm_121).
  • internal/gpuapi/kernels.go + cuda_kernels.go (Dropouter extension).
  • gradcheck: testing/gradcheck/ops.go, registry.go. oracle: testing/oracle/torchmap.go, generate_test.go. parity: testing/parity/stress_engine.go.
  • Tests: compute/dropout_test.go (CPU), compute/gpu_dropout_parity_test.go (GB10). Manifest: deploy/spark/dropout-verify-gb10.yaml.

Gates

  • gradcheck: Dropout OpInfo PASS -- deterministic mask => exact linear map, finite-diff == analytic backward.
  • CPU-GPU parity (GB10, observed via Spark): Dropout passes parity under both arena-stress schedules, with forward and backward max_abs=0.000e+00 (bit-identical); TestGPUDropout_CPUParity and TestGPUDropout_Backward_CPUParity PASS on the GB10. The full-registry GPU parity run, red-proof, and Wolf-pattern training loop all PASS.
  • PyTorch oracle: SkipReason -- torch's training-mode mask uses its own Philox word->element mapping; matching it would mean reimplementing ztensor's Philox in the torch runner (the HadamardTransform precedent). Mask-vs-input math is pinned by gradcheck + parity; eval-mode identity is unit-tested.
  • CPU suite green under -race.
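
The gradcheck gate relies on dropout being an exact linear map once the mask is fixed. A minimal Go sketch of that check follows; it is not the testing/gradcheck code. The fixed `keep` mask, the float64 arithmetic, the weighted-sum loss, and the step size `h` are assumptions chosen to keep the example short.

```go
package main

import (
	"fmt"
	"math"
)

// maskedScale multiplies kept elements by 1/(1-p) and zeroes the rest.
// Forward and backward both use this one map.
func maskedScale(in []float64, keep []bool, p float64) []float64 {
	out := make([]float64, len(in))
	invKeep := 1 / (1 - p)
	for i, v := range in {
		if keep[i] {
			out[i] = v * invKeep
		}
	}
	return out
}

// loss is sum_i w[i]*y[i] with y = dropout(x) under the fixed mask.
func loss(x, w []float64, keep []bool, p float64) float64 {
	var s float64
	for i, y := range maskedScale(x, keep, p) {
		s += w[i] * y
	}
	return s
}

// maxGradError compares a central finite difference of loss with the
// analytic gradient, which is maskedScale applied to the upstream weights.
func maxGradError(x, w []float64, keep []bool, p float64) float64 {
	analytic := maskedScale(w, keep, p)
	const h = 1e-3 // hypothetical step size
	probe := append([]float64(nil), x...)
	var worst float64
	for i := range x {
		probe[i] = x[i] + h
		plus := loss(probe, w, keep, p)
		probe[i] = x[i] - h
		minus := loss(probe, w, keep, p)
		probe[i] = x[i]
		numeric := (plus - minus) / (2 * h)
		worst = math.Max(worst, math.Abs(numeric-analytic[i]))
	}
	return worst
}

func main() {
	x := []float64{0.5, -1.25, 2.0, 3.5}
	w := []float64{1.0, 0.5, -2.0, 0.25}
	keep := []bool{true, false, true, true}
	fmt.Printf("max |finite-diff - analytic| = %.3e\n", maxGradError(x, w, keep, 0.3))
}
```

With a random mask redrawn on every forward call, the two perturbed evaluations would see different masks and the finite difference would be meaningless; fixing the mask by seed is what makes the comparison valid.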

https://claude.ai/code/session_01So96MEV1hiThH4XqCd6rLH

dndungu added 5 commits June 20, 2026 14:36
Add an inverted-dropout op to ztensor with a deterministic, seedable mask:

- Philox4x32-10 counter-based RNG (compute/philox.go) keyed by (seed, element
  offset). The mask is a pure function of (seed, offset, p), so it is
  reproducible and will be bit-identical to the forthcoming CUDA kernel -- the
  property that makes CPU-GPU parity pass. The CPU Go impl is the reference;
  the GB10 kernel mirrors the same constants.
- Dropouter[T] optional capability interface (compute/engine.go): Dropout +
  DropoutBackward. Inverted-dropout semantics match torch.nn.functional.dropout:
  training mode y = x*mask/(1-p) with mask~Bernoulli(1-p); eval mode / p==0 is
  exact identity. p in [0,1) is validated.
- CPU implementation (compute/cpu_engine.go): the mask is recomputed in backward
  from (seed,p) rather than cached, keeping the op capture-safe and avoiding a
  pinned save across arena resets (ADR 006). Forward and backward share one
  masked-and-scaled kernel (dropout is linear in its input given the mask).
- EngineProxy delegates the capability; the parity StressEngine relocates the
  masked output into the host arena like every other op so dropout runs the
  reset-between-fwd-bwd schedules.
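
The optional-capability wiring can be sketched in Go as below. The `Engine` interface, the `Dropouter[T]` method signature, `applyDropout`, and both engine types are hypothetical stand-ins, not the declarations in compute/engine.go; the sketch only shows the type-assert-or-error pattern the commit describes.

```go
package main

import (
	"errors"
	"fmt"
)

// Engine is a hypothetical stand-in for the core engine interface,
// which stays untouched by the new capability.
type Engine interface {
	Name() string
}

// Dropouter is the optional capability. ASSUMPTION: this signature is
// illustrative only.
type Dropouter[T any] interface {
	Dropout(x []T, p float64, seed uint64, training bool) ([]T, error)
}

// ErrNoDropout is reported when an engine lacks the capability.
var ErrNoDropout = errors.New("engine does not implement Dropouter")

// applyDropout type-asserts for the capability and reports an error when
// it is absent, rather than falling back to a stub.
func applyDropout[T any](e Engine, x []T, p float64, seed uint64, training bool) ([]T, error) {
	d, ok := e.(Dropouter[T])
	if !ok {
		return nil, fmt.Errorf("%s: %w", e.Name(), ErrNoDropout)
	}
	return d.Dropout(x, p, seed, training)
}

// plainEngine has no dropout capability.
type plainEngine struct{}

func (plainEngine) Name() string { return "plain" }

// copyEngine implements the capability with a placeholder body that only
// copies its input (eval-mode identity); the point here is the lookup.
type copyEngine struct{}

func (copyEngine) Name() string { return "copy" }

func (copyEngine) Dropout(x []float32, p float64, seed uint64, training bool) ([]float32, error) {
	return append([]float32(nil), x...), nil
}

func main() {
	x := []float32{1, 2, 3}
	_, err := applyDropout[float32](plainEngine{}, x, 0.3, 42, true)
	fmt.Println("plain:", err)
	y, err := applyDropout[float32](copyEngine{}, x, 0.3, 42, false)
	fmt.Println("copy:", y, err)
}
```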

Gates (CPU half):
- gradcheck: Dropout OpInfo (p=0.3, fixed seed, [4,8]) -- the deterministic mask
  makes dropout an exact linear map, finite-diff == analytic backward. PASS.
- CPU-side parity (testing/parity arena-stress, RegistryGreen): PASS.
- PyTorch oracle: SkipReason -- torch's training-mode mask uses its own Philox
  word->element mapping, not ztensor's; matching it would mean reimplementing
  ztensor's Philox in the torch runner (the HadamardTransform precedent).
  Mask-vs-input math is pinned by gradcheck + parity instead.
- Unit tests: p=0 identity, eval identity, mask determinism, inverted-dropout
  mean preservation, backward==mask, invalid p. PASS under -race.

GPU kernel (GB10 f32) + GB10 parity/oracle replay are the next milestone.

Claude-Session: https://claude.ai/code/session_01So96MEV1hiThH4XqCd6rLH

…B.3a)

Add the f32 GPU half of the dropout op, bit-identical to the CPU reference:

- internal/cuda/kernels/dropout.cu: Philox4x32-10 device function with the same
  constants/round structure as compute/philox.go, so the (seed, offset) -> mask
  draw matches the CPU engine word-for-word. dropout_f32 launcher does the
  masked-and-scaled write in training mode (out = (u>=p)? in*invKeep : 0) and an
  exact identity copy in eval mode / p==0. invKeep is passed host-side as
  1/(1-p) so the scale equals the CPU path's bit-for-bit. Added to Makefile SRCS
  (sm_121 for GB10).
- purego + cgo wrappers (dropout_purego.go / dropout.go) and the dropout_f32
  symbol registration in purego.go, following the argmax kernel pattern.
- gpuapi.Dropouter optional KernelRunner extension (kernels.go) + CUDAKernels
  impl, mirroring BFloat16Transposer; callers type-assert and report
  unavailability when absent (no stub fallback).
- GPUEngine.Dropout / DropoutBackward (compute/gpu_dropout.go) reuse the
  gpu_kernels.go scaffolding (getDevicePtr, dst-reuse, makeGPUResult). The mask
  is recomputed in backward from (seed,p), never cached -- capture-safe, no save
  pinned across an arena reset (ADR 006).
- CUDA-gated CPU-GPU parity tests (compute/gpu_dropout_parity_test.go): GPU vs
  CPU dropout forward (multiple shapes/p/train-eval) and backward must be
  bit-identical; skip cleanly without a GPU.

CPU suite stays green (go test ./... and -race). The GB10 parity gate runs via
the cuda-tagged path on the DGX (Spark); nvcc/CUDA toolkit is not present
locally, so the sm_121 kernel build + GPU parity run are the remaining step.

Claude-Session: https://claude.ai/code/session_01So96MEV1hiThH4XqCd6rLH

One-pod GB10 verify mirroring deploy/spark/bf16-verify-gb10.yaml: clone the
bpb3a-dropout branch, build libkernels.so for sm_121, run the CUDA-gated
dropout CPU-GPU parity tests (TestGPUDropout_*) and the full-registry GB10
parity run (testing/parity -run _GPU, which now includes Dropout).

Claude-Session: https://claude.ai/code/session_01So96MEV1hiThH4XqCd6rLH

The first GB10 run failed: GPU dropout returned all zeros vs the CPU reference.
Root cause is the purego/dlopen launch ABI -- the AAPCS64 trampoline
(internal/cuda/purego_linux_arm64.s) loads every argument into integer
registers R0-R7 only and never populates the V float registers. A C function
with `float` parameters (here the dropout_f32 launcher) reads them from V
registers, which hold garbage or zero. In addition, integer and float args
consume separate AAPCS64 register sequences, so every argument after the first
float is shifted: seed/training/stream were all misread and the kernel wrote
zeros.

Fix: pass p and invKeep as their 32-bit IEEE-754 bit patterns in uint32 integer
parameters and reinterpret them inside the kernel with __uint_as_float. The ABI
is now integer-only and identical between the CGO (-tags cuda) and purego
launch paths. Updated dropout.cu, dropout.go (cgo), dropout_purego.go.

Claude-Session: https://claude.ai/code/session_01So96MEV1hiThH4XqCd6rLH

…s_float)

nvcc rejected calling the device intrinsic __uint_as_float from the __host__
launcher dropout_f32. Reinterpret the uint32 bit patterns to float on the host
with a memcpy helper (host_uint_as_float) instead; the resulting float kernel
args marshal correctly through the CUDA <<<>>> launch (kernel-arg passing is
unaffected by the host dlopen ABI).

Claude-Session: https://claude.ai/code/session_01So96MEV1hiThH4XqCd6rLH

dndungu merged commit d00cc0c into main Jun 20, 2026
1 check passed