Improve eigh accuracy and benchmark balance #156
Thread: eigh robustness, benchmark balance, and follow-ups
Starting the upstream discussion thread here so this PR is the canonical place for the latest expert/red-team feedback.
Benchmark shape balance
Bryce's point: the benchmark geomean was dominated by expensive rather than concentrating many structures on two large shapes. This also reduces the gap between correctness-only shapes and benchmarked shapes; if a shape is only in tests, an agent can route it to slow/simple fallback code and optimize only the ranked rows.
Current response in this PR: keep the 39-case correctness set, but prune the slowest/redundant ranked rows so the benchmark list is 10 cases. It still covers dense, mixed, rank-deficient, clustered, and one LAPACK dense-spectrum benchmark.
Open question: should
Accuracy checks
Mark Hoemmen raised that residual/reconstruction/orthogonality are backward-error style gates, but we should also think about explicit eigenvalue error bounds. This PR adds a direct eigenvalue accuracy check against
Potential follow-up references for test design:
Open question: for v1, is this eigenvalue gate enough when combined with residual/reconstruction/orthogonality? For v2, should we add targeted high-relative-accuracy cases with separate thresholds rather than one global tolerance?
Triangle semantics
For a symmetric eigensolver, we should consider testing whether implementations read only the intended triangle. Suggested test: construct an input whose lower triangle plus diagonal and upper triangle plus diagonal imply very different spectra if each triangle is reflected.
Open question: should
Reward-hacking hardening
Bryce's red-team report found attack surfaces similar to the QR object-identity replay issues. Related follow-up PRs already exist:
Open question: should this PR remain focused on the eigenvalue gate + benchmark pruning while those targeted hardening PRs land separately, or should any of them be folded in before merge?
Profiling scope
#157 scopes timed
Open question: should this PR explicitly stay independent of profiling support, or should it wait for #157 before merge?
Future problem sizes / v2 direction
External input so far:
Proposed split:
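The triangle-semantics test suggested in the thread could be sketched roughly as follows. This is only an illustration under assumptions: `make_triangle_probe` and `reads_only_triangle` are hypothetical helper names, the factor of 100 and the tolerance are arbitrary, and `torch.linalg.eigh` with its `UPLO` argument stands in for whatever entry point a submission exposes. Which triangle the task treats as "intended" is not settled here.

```python
import torch


def make_triangle_probe(n, seed=0, dtype=torch.float64):
    """Build a non-symmetric M whose two triangles imply very different spectra."""
    g = torch.Generator().manual_seed(seed)
    M = torch.randn(n, n, generator=g, dtype=dtype)
    # Inflate the strict upper triangle so the two reflections disagree strongly.
    M = torch.tril(M) + 100.0 * torch.triu(M, diagonal=1)
    # Reflect each triangle (sharing the diagonal) into a full symmetric matrix.
    sym_lower = torch.tril(M) + torch.tril(M, diagonal=-1).transpose(-2, -1)
    sym_upper = torch.triu(M) + torch.triu(M, diagonal=1).transpose(-2, -1)
    return M, sym_lower, sym_upper


def reads_only_triangle(eigvals_fn, M, sym_expected, sym_other, rtol=1e-8):
    """True if eigvals_fn(M) matches the spectrum implied by the expected triangle."""
    got = eigvals_fn(M)
    ref = torch.linalg.eigvalsh(sym_expected)
    other = torch.linalg.eigvalsh(sym_other)
    scale = ref.abs().max().clamp_min(1.0)
    matches = (got - ref).abs().max() <= rtol * scale
    # The probe is only meaningful if the two candidate spectra really differ.
    distinct = (ref - other).abs().max() > 1e3 * rtol * scale
    return bool(matches) and bool(distinct)


# Stand-in solvers: torch.linalg.eigh reading the lower or the upper triangle.
lower_fn = lambda M: torch.linalg.eigh(M, UPLO="L").eigenvalues
upper_fn = lambda M: torch.linalg.eigh(M, UPLO="U").eigenvalues

M, sym_lower, sym_upper = make_triangle_probe(8)
lower_ok = reads_only_triangle(lower_fn, M, sym_lower, sym_upper)
upper_ok = reads_only_triangle(upper_fn, M, sym_upper, sym_lower)
```

A solver that symmetrizes the input or reads the wrong triangle would return a spectrum far from the reference, so the same helper can flag it without any eigenvector comparison.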
Summary
- `eigh` outputs using `torch.linalg.eigvalsh(A)`.
- `A @ Q = Q @ diag(L)`, reconstruction, orthogonality, sorting, shapes, devices, and finiteness.
Rationale
The residual invariants validate the returned decomposition, but they do not explicitly bound the returned eigenvalue spectrum. Eigenvalues do not have the sign/eigenspace ambiguity that eigenvectors do, so comparing `L` against `eigvalsh(A)` is a clean extra correctness gate.
The new check uses a loose `n * eps32`-scaled tolerance consistent with the existing residual checks, and scales the error by the larger of `||eigvalsh(A)||_inf`, `||A||_1 / n`, and `1.0`.
Benchmark feedback also pointed out that the previous ranked set was overly concentrated on expensive `n=512,batch=640` and `n=1024,batch=60` cases. This PR keeps the broad 39-case correctness set, but removes three slow/redundant ranked rows:
- `batch=8,n=2048,dense`
- `batch=60,n=1024,case=nearrank`
- `batch=640,n=512,case=lapack_dense_even_spectrum`
The resulting benchmark list has 10 rows covering dense, mixed, rank-deficient, clustered, and one LAPACK dense-spectrum case. The removed structures still remain covered by correctness tests where applicable.
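For readers who want to see the shape of the eigenvalue gate described above, here is a minimal sketch. It is not the code in `eval.py`: the function name `eigenvalue_gate` and the `factor` looseness multiplier are assumptions made for illustration, and only the `n * eps32` scaling and the three-way scale term come from the description.

```python
import torch


def eigenvalue_gate(A, L, factor=10.0):
    """Compare returned eigenvalues L (ascending) against torch.linalg.eigvalsh(A).

    `factor` is a hypothetical looseness multiplier, not a value from this PR.
    Returns (passed, scaled_error) with one error entry per batch element.
    """
    n = A.shape[-1]
    eps32 = torch.finfo(torch.float32).eps
    ref = torch.linalg.eigvalsh(A)  # reference spectrum, ascending order

    # Scale: the larger of ||eigvalsh(A)||_inf, ||A||_1 / n, and 1.0.
    spec_inf = ref.abs().amax(dim=-1)
    norm1_over_n = torch.linalg.matrix_norm(A, ord=1) / n
    scale = torch.maximum(spec_inf, norm1_over_n).clamp_min(1.0)

    scaled_error = (L - ref).abs().amax(dim=-1) / scale
    tol = factor * n * eps32
    return bool((scaled_error <= tol).all()), scaled_error


# Usage on a small symmetric float32 batch (CPU is enough for the sketch).
torch.manual_seed(0)
X = torch.randn(4, 16, 16)
A = 0.5 * (X + X.transpose(-2, -1))
L, Q = torch.linalg.eigh(A)
passed, err = eigenvalue_gate(A, L)
shifted_passed, _ = eigenvalue_gate(A, L + 0.1)  # deliberately wrong spectrum
```

Because the comparison is on sorted eigenvalues only, it does not need to resolve eigenvector sign or eigenspace ambiguity, which is the reason given for treating it as a separate gate.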
Scope Notes
- `reference-kernels`, under `problems/linalg/eigh_py`.
- `problems/linalg/eigh_py/task.yml` `benchmarks:`.
- `eval.py`; profiling capture/range work is tracked separately in `gpu-mode/reference-kernels#157`.
- `gpu-mode/reference-kernels#159`, `#160`, and `#161`.
Validation
- `/Users/mark/Dev/kernelbot/.venv/bin/ruff check problems/linalg/eigh_py`
- `python3 -m py_compile problems/linalg/eigh_py/eval.py problems/linalg/eigh_py/reference.py problems/linalg/eigh_py/task.py problems/linalg/eigh_py/submission.py problems/linalg/eigh_py/submissions/torch_eigh.py problems/linalg/eigh_py/submissions/triton_diagonal_fast_path.py`
- `git diff --check`
- `task.yml` with Ruby YAML: tests=39, benchmarks=10
- `torch_eigh.py` test: 39/39 passed, evaluator duration 7.396s
- `triton_diagonal_fast_path.py` test: 39/39 passed, evaluator duration 7.470s
Need regenerated B200 benchmark measurements for the current 10-case benchmark set.