Improve eigh accuracy and benchmark balance by msaroufim · Pull Request #156 · gpu-mode/reference-kernels

msaroufim · 2026-06-26T22:24:21Z

Summary

Add a direct eigenvalue accuracy gate for eigh outputs using torch.linalg.eigvalsh(A).
Keep the existing invariant checks for A @ Q = Q @ diag(L), reconstruction, orthogonality, sorting, shapes, devices, and finiteness.
Document the eigenvalue check in the problem description.
Prune the ranked benchmark set from 13 to 10 rows after feedback that the geomean was dominated by the slowest dense shapes.
Mark the checked-in local benchmark measurements as pre-prune so stale 13-case numbers are not quoted as current results.

Rationale

The residual invariants validate the returned decomposition, but they do not explicitly bound the returned eigenvalue spectrum. Eigenvalues do not have the sign/eigenspace ambiguity that eigenvectors do, so comparing L against eigvalsh(A) is a clean extra correctness gate.
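For context, the residual-style invariants can be sketched roughly as below. This is a minimal illustration, not the code in eval.py: the name check_invariants, the factor constant, the Frobenius norms, and the float64 promotion are all assumptions made for the example.

```python
import torch


def check_invariants(A, Q, L, factor=10.0):
    """Backward-error style checks for A = Q diag(L) Q^T.

    A: (..., n, n) symmetric, Q: (..., n, n), L: (..., n).
    `factor` is a hypothetical looseness constant, not the PR's value.
    """
    n = A.shape[-1]
    tol = factor * n * torch.finfo(torch.float32).eps
    # Promote to float64 so the check itself adds little rounding error.
    A64, Q64, L64 = A.double(), Q.double(), L.double()
    QL = Q64 * L64.unsqueeze(-2)  # scales column j of Q by L[j], i.e. Q @ diag(L)
    a_norm = torch.linalg.matrix_norm(A64).clamp_min(1.0)  # Frobenius norm
    eye = torch.eye(n, dtype=torch.float64, device=A.device)

    def small(M, scale):
        return bool((torch.linalg.matrix_norm(M) <= tol * scale).all())

    return {
        "shapes": Q.shape == A.shape and L.shape == A.shape[:-1],
        "finite": bool(torch.isfinite(Q).all()) and bool(torch.isfinite(L).all()),
        "sorted": bool((L[..., 1:] >= L[..., :-1]).all()),
        "residual": small(A64 @ Q64 - QL, a_norm),
        "reconstruction": small(QL @ Q64.transpose(-1, -2) - A64, a_norm),
        "orthogonality": small(Q64.transpose(-1, -2) @ Q64 - eye, 1.0),
    }
```

Every entry here measures how well (Q, L) reproduces A or how orthogonal Q is; none of them compares L to an independently computed spectrum.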

The new check uses a loose n * eps32-scaled tolerance consistent with the existing residual checks, and scales the error by the largest of ||eigvalsh(A)||_inf, ||A||_1 / n, and 1.0.
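A minimal sketch of a gate with this shape, assuming a float64 reference spectrum and a hypothetical looseness constant `factor` (the actual constant and dtype handling in eval.py may differ; eigenvalue_gate is an illustrative name):

```python
import torch


def eigenvalue_gate(A, L, factor=10.0):
    """Compare returned eigenvalues L against torch.linalg.eigvalsh(A).

    A: (..., n, n) symmetric, L: (..., n) in ascending order.
    Returns (passed, worst scaled error). `factor` is an assumption.
    """
    n = A.shape[-1]
    eps32 = torch.finfo(torch.float32).eps
    A64 = A.double()
    ref = torch.linalg.eigvalsh(A64)  # float64 reference (assumption)
    err = (L.double() - ref).abs().amax(dim=-1)
    # Scale by max(||eigvalsh(A)||_inf, ||A||_1 / n, 1.0), per matrix.
    ref_inf = ref.abs().amax(dim=-1)
    a_norm1 = torch.linalg.matrix_norm(A64, ord=1)
    scale = torch.stack([ref_inf, a_norm1 / n, torch.ones_like(ref_inf)]).amax(dim=0)
    tol = factor * n * eps32
    passed = bool((err <= tol * scale).all())
    return passed, (err / scale).max().item()
```

The 1.0 floor keeps the tolerance from collapsing toward zero for matrices whose entries and eigenvalues are all tiny.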

Benchmark feedback also pointed out that the previous ranked set was overly concentrated on expensive n=512,batch=640 and n=1024,batch=60 cases. This PR keeps the broad 39-case correctness set, but removes three slow/redundant ranked rows:

batch=8,n=2048,dense
batch=60,n=1024,case=nearrank
batch=640,n=512,case=lapack_dense_even_spectrum

The resulting benchmark list has 10 rows covering dense, mixed, rank-deficient, clustered, and one LAPACK dense-spectrum case. The removed structures still remain covered by correctness tests where applicable.
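To make the weighting argument concrete: a geometric mean gives every ranked row equal weight in log space, so if k of m rows sit on the same few shapes, a speedup on those shapes moves the score by speedup^(k/m). The sketch below checks that identity. The timings and the 8-of-13 split are made-up placeholders, not measurements or row counts from this PR.

```python
import math


def geomean(xs):
    """Geometric mean of positive values."""
    return math.exp(sum(math.log(x) for x in xs) / len(xs))


def geomean_factor(num_rows, rows_affected, speedup):
    """Factor applied to the geomean time when `rows_affected` of
    `num_rows` rows each become `speedup` times faster."""
    return speedup ** (-rows_affected / num_rows)


# Placeholder timings in seconds, NOT real results: 13 rows, 8 on large shapes.
before = [0.5] * 8 + [0.01] * 5
after = [t / 2 for t in before[:8]] + before[8:]  # 2x faster on the 8 large rows
ratio = geomean(after) / geomean(before)
```

Here `ratio` equals `geomean_factor(13, 8, 2.0)`, independent of the absolute timings, which is why the number of rows per shape matters more than how slow each row is.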

Scope Notes

Code changes are in reference-kernels, under problems/linalg/eigh_py.
The active leaderboard shape source is the benchmarks: key in problems/linalg/eigh_py/task.yml.
Profile mode is not implemented in this PR's eval.py; profiling capture/range work is tracked separately in gpu-mode/reference-kernels#157.
Reward-hacking hardening follow-ups are tracked separately in gpu-mode/reference-kernels#159, #160, and #161.

Validation

/Users/mark/Dev/kernelbot/.venv/bin/ruff check problems/linalg/eigh_py
python3 -m py_compile problems/linalg/eigh_py/eval.py problems/linalg/eigh_py/reference.py problems/linalg/eigh_py/task.py problems/linalg/eigh_py/submission.py problems/linalg/eigh_py/submissions/torch_eigh.py problems/linalg/eigh_py/submissions/triton_diagonal_fast_path.py
git diff --check
Parsed task.yml with Ruby YAML: tests=39, benchmarks=10
Local KernelBot debug API on B200 before benchmark pruning:
- torch_eigh.py test: 39/39 passed, evaluator duration 7.396s
- triton_diagonal_fast_path.py test: 39/39 passed, evaluator duration 7.470s

Still needed: regenerated B200 benchmark measurements for the current 10-case benchmark set.

msaroufim · 2026-07-01T23:14:35Z

Thread: eigh robustness, benchmark balance, and follow-ups

Starting the upstream discussion thread here so this PR is the canonical place for the latest expert/red-team feedback.

Benchmark shape balance

Bryce's point: the benchmark geomean was dominated by expensive batch=640,n=512 and batch=60,n=1024 rows. A more balanced future design would be closer to:

(input structure) x (matrix size, batch size)

rather than concentrating many structures on two large shapes. This also reduces the gap between correctness-only shapes and benchmarked shapes; if a shape is only in tests, an agent can route it to slow/simple fallback code and optimize only the ranked rows.
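One way such a cross product could be generated, as a sketch only. The structure identifiers mirror the coverage list in this PR but are spelled illustratively, the third (batch, n) pair is hypothetical, and the dict keys are an assumption rather than the task.yml schema.

```python
from itertools import product

# Structure names follow the coverage list in this PR; identifiers are illustrative.
STRUCTURES = ["dense", "mixed", "rank_deficient", "clustered"]
# (batch, n) pairs: the first two appear in this PR, the third is hypothetical.
SHAPES = [(640, 512), (60, 1024), (4096, 64)]


def cross_product_benchmarks(structures, shapes):
    """Every input structure paired with every (batch, n) shape."""
    return [
        {"batch": batch, "n": n, "case": case}
        for case, (batch, n) in product(structures, shapes)
    ]


rows = cross_product_benchmarks(STRUCTURES, SHAPES)
```

With this layout each shape carries the same number of rows, so no single (batch, n) pair gets extra weight in the geomean.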

Current response in this PR: keep the 39-case correctness set, but prune the slowest/redundant ranked rows so the benchmark list is 10 cases. It still covers dense, mixed, rank-deficient, clustered, and one LAPACK dense-spectrum benchmark.

Open question: should eigh_v2 use an explicit cross product across representative (batch,n) pairs, even if that means dropping some current robustness rows from ranked benchmarking?

Accuracy checks

Mark Hoemmen raised the point that residual/reconstruction/orthogonality are backward-error style gates, but we should also think about explicit eigenvalue error bounds. This PR adds a direct eigenvalue accuracy check against torch.linalg.eigvalsh(A) with norm-scaled tolerance.

Potential follow-up references for test design:

LAPACK LUG eigenvalue error bounds: normwise backward stability and matrix-norm scaling.
LAWN 182 / LAWN 183: symmetric eigenproblem test design after MRRR was introduced.
LAWN 163: MRRR failure modes.
LAWN 7: when symmetric eigenvalues can be computed to high relative accuracy; LAPACK MRRR uses scaled diagonal dominance criteria here.

Open question: for v1, is this eigenvalue gate enough when combined with residual/reconstruction/orthogonality? For v2, should we add targeted high-relative-accuracy cases with separate thresholds rather than one global tolerance?

Triangle semantics

For a symmetric eigensolver, we should consider testing whether implementations read only the intended triangle. Suggested test: construct an input whose lower triangle plus diagonal and upper triangle plus diagonal imply very different spectra if each triangle is reflected.

Open question: should eigh_py specify lower-triangle, upper-triangle, or fully symmetrized input semantics? Today the task says the matrix is symmetric up to FP32 roundoff, so this is probably an eigh_v2 clarification unless we want to harden the current spec.
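A sketch of the suggested probe, assuming float64 and a hypothetical upper_scale factor to push the two spectra apart. make_triangle_probe and classify_triangle are illustrative names, not existing helpers in eigh_py.

```python
import torch


def make_triangle_probe(n, seed=0, upper_scale=10.0):
    """Build a non-symmetric probe plus the two symmetric matrices implied
    by reflecting its lower or its upper triangle."""
    g = torch.Generator().manual_seed(seed)
    M = torch.randn(n, n, generator=g, dtype=torch.float64)
    diag = torch.diag(torch.diagonal(M))
    strict_lower = torch.tril(M, diagonal=-1)
    strict_upper = upper_scale * torch.triu(M, diagonal=1)
    probe = strict_lower + diag + strict_upper         # deliberately non-symmetric
    from_lower = strict_lower + diag + strict_lower.T  # lower triangle reflected
    from_upper = strict_upper.T + diag + strict_upper  # upper triangle reflected
    return probe, from_lower, from_upper


def classify_triangle(eigvals, from_lower, from_upper, atol=1e-8):
    """Report which reflected matrix the returned eigenvalues match."""
    ev = eigvals.double()
    if torch.allclose(ev, torch.linalg.eigvalsh(from_lower), atol=atol):
        return "lower"
    if torch.allclose(ev, torch.linalg.eigvalsh(from_upper), atol=atol):
        return "upper"
    return "neither"
```

For example, torch.linalg.eigvalsh(probe, UPLO="L") should classify as "lower" and UPLO="U" as "upper", while a solver that first symmetrizes the input as (probe + probe.T) / 2 would land in "neither".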

Reward-hacking hardening

Bryce's red-team report found attack surfaces similar to the QR object-identity replay issues. Related follow-up PRs already exist:

#159: eigh_py: reject physically-impossible benchmark times (roofline floor).
#160: eigh_py: regenerate a fresh input each timed benchmark iteration.
#161: eigh_py: reject output-object deferral in the correctness check.
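As an illustration of the roofline-floor idea only (this is not the implementation in #159): any eigh call must at least read A and write Q and L, so bytes moved divided by peak memory bandwidth gives a lower bound on runtime. The function names are hypothetical, and peak_bytes_per_s is a caller-supplied assumption about the hardware.

```python
def roofline_floor_seconds(batch, n, peak_bytes_per_s, dtype_bytes=4):
    """Lower bound on runtime from memory traffic alone (FP32 by default)."""
    read = batch * n * n * dtype_bytes           # input A
    written = batch * (n * n + n) * dtype_bytes  # outputs Q and L
    return (read + written) / peak_bytes_per_s


def is_physically_plausible(measured_s, batch, n, peak_bytes_per_s):
    """Reject timings faster than the memory-traffic floor."""
    return measured_s >= roofline_floor_seconds(batch, n, peak_bytes_per_s)
```

A measured time below this floor suggests the timed region skipped the real work, for example by replaying a cached result.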

Open question: should this PR remain focused on the eigenvalue gate + benchmark pruning while those targeted hardening PRs land separately, or should any of them be folded in before merge?

Profiling scope

#157 scopes timed custom_kernel launches with CUDA profiler capture ranges. The ncu flags likely belong in the Brev / ncu-service path rather than this problem PR, but the evaluator-side range is relevant for clean traces.

Open question: should this PR explicitly stay independent of profiling support, or should it wait for #157 before merge?

Future problem sizes / v2 direction

External input so far:

Reduced-order modeling / POD may produce large snapshot-derived problems.
Some applications need millions of simultaneous 3x3 eigensolves, symmetric/SPD and nonsymmetric.
A future SVD exercise might be easier and more BLAS-like if scoped to two-sided reduction to tridiagonal / bidiagonal form, leaving the tridiagonal / bidiagonal eigensolve as a separate problem.

Proposed split:

v1: keep this PR interactive, robust enough, and not overfit to diagonal/banded shortcuts.
v2: incorporate expert-sourced sizes, a more explicit benchmark cross product, triangle semantics, and relative-accuracy-oriented test families.

msaroufim added 2 commits June 26, 2026 15:24

Add eigh eigenvalue error check

59b98e0

Prune eigh benchmark cases

e1909fe

msaroufim changed the title ~~Add eigh eigenvalue error check~~ Improve eigh accuracy and benchmark balance Jul 1, 2026

msaroufim wants to merge 2 commits into main from add-eigh-eigenvalue-bound
