
Improve eigh accuracy and benchmark balance #156

Open
msaroufim wants to merge 2 commits into main from add-eigh-eigenvalue-bound

@msaroufim msaroufim commented Jun 26, 2026

Member

Summary

  • Add a direct eigenvalue accuracy gate for eigh outputs using torch.linalg.eigvalsh(A).
  • Keep the existing invariant checks for A @ Q = Q @ diag(L), reconstruction, orthogonality, sorting, shapes, devices, and finiteness.
  • Document the eigenvalue check in the problem description.
  • Prune the ranked benchmark set from 13 to 10 rows after feedback that the geomean was dominated by the slowest dense shapes.
  • Mark the checked-in local benchmark measurements as pre-prune so stale 14-case numbers are not quoted as current results.

Rationale

The residual invariants validate the returned decomposition, but they do not explicitly bound the returned eigenvalue spectrum. Eigenvalues do not have the sign/eigenspace ambiguity that eigenvectors do, so comparing L against eigvalsh(A) is a clean extra correctness gate.
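The sign ambiguity mentioned above can be illustrated with a small sketch. It uses plain Python lists and a hand-picked 2x2 matrix (both are assumptions made for illustration, not the evaluator's code): negating an eigenvector column leaves the residual A @ Q - Q @ diag(L) unchanged, so two valid answers can differ entry by entry in Q while agreeing exactly in L.

```python
import math

def matmul(X, Y):
    """Multiply two small dense matrices stored as nested lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def residual(A, Q, L):
    """Largest entry of |A @ Q - Q @ diag(L)|."""
    AQ = matmul(A, Q)
    n = len(L)
    # Q @ diag(L) scales column j of Q by L[j].
    return max(abs(AQ[i][j] - Q[i][j] * L[j])
               for i in range(n) for j in range(n))

# Hypothetical symmetric example with eigenvalues 1 and 3.
A = [[2.0, 1.0], [1.0, 2.0]]
L = [1.0, 3.0]
s = 1.0 / math.sqrt(2.0)
Q = [[s, s], [-s, s]]           # columns: eigenvectors for 1 and 3
Q_flipped = [[-s, s], [s, s]]   # same basis, first column negated

res_q = residual(A, Q, L)
res_flipped = residual(A, Q_flipped, L)
# Both residuals are at roundoff level, yet Q and Q_flipped differ by about 1.41.
max_vector_diff = max(abs(Q[i][j] - Q_flipped[i][j])
                      for i in range(2) for j in range(2))
```

This is why a direct comparison is only proposed for L and not for Q.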

The new check uses a loose n * eps32-scaled tolerance consistent with the existing residual checks, and scales the error by the largest of ||eigvalsh(A)||_inf, ||A||_1 / n, and 1.0.
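A minimal sketch of how such a gate could be computed, assuming ascending eigenvalues and plain Python lists in place of tensors. The function names and the `factor` multiplier are hypothetical; the actual constant in eval.py is not stated in this PR, and the reference spectrum would come from torch.linalg.eigvalsh(A).

```python
EPS32 = 2.0 ** -23  # float32 machine epsilon

def matrix_one_norm(A):
    """Matrix 1-norm: the largest absolute column sum."""
    n = len(A)
    return max(sum(abs(A[i][j]) for i in range(n)) for j in range(n))

def eigenvalue_gate(L, L_ref, A, factor=100.0):
    """Return True if L matches L_ref within a norm-scaled tolerance.

    `factor` is a hypothetical looseness multiplier, not the value
    used by the real evaluator.
    """
    n = len(L_ref)
    # Scale: largest of ||L_ref||_inf, ||A||_1 / n, and 1.0.
    scale = max(max(abs(x) for x in L_ref), matrix_one_norm(A) / n, 1.0)
    err = max(abs(a - b) for a, b in zip(L, L_ref))
    tol = factor * n * EPS32
    return err / scale <= tol

# Hypothetical diagonal input, so the reference spectrum is known exactly.
A = [[2.0, 0.0], [0.0, 3.0]]
L_ref = [2.0, 3.0]
passes = eigenvalue_gate([2.0, 3.0], L_ref, A)
fails = eigenvalue_gate([2.0, 3.1], L_ref, A)
```

The floor of 1.0 in the scale keeps the check from becoming a pure relative-error test on matrices with tiny norm.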

Benchmark feedback also pointed out that the previous ranked set was overly concentrated on expensive n=512,batch=640 and n=1024,batch=60 cases. This PR keeps the broad 39-case correctness set, but removes three slow/redundant ranked rows:

  • batch=8,n=2048,dense
  • batch=60,n=1024,case=nearrank
  • batch=640,n=512,case=lapack_dense_even_spectrum

The resulting benchmark list has 10 rows covering dense, mixed, rank-deficient, clustered, and one LAPACK dense-spectrum case. The removed structures still remain covered by correctness tests where applicable.

Scope Notes

  • Code changes are in reference-kernels, under problems/linalg/eigh_py.
  • The active leaderboard shape source is problems/linalg/eigh_py/task.yml benchmarks:.
  • Profile mode is not implemented in this PR's eval.py; profiling capture/range work is tracked separately in gpu-mode/reference-kernels#157.
  • Reward-hacking hardening follow-ups are tracked separately in gpu-mode/reference-kernels#159, #160, and #161.

Validation

  • /Users/mark/Dev/kernelbot/.venv/bin/ruff check problems/linalg/eigh_py
  • python3 -m py_compile problems/linalg/eigh_py/eval.py problems/linalg/eigh_py/reference.py problems/linalg/eigh_py/task.py problems/linalg/eigh_py/submission.py problems/linalg/eigh_py/submissions/torch_eigh.py problems/linalg/eigh_py/submissions/triton_diagonal_fast_path.py
  • git diff --check
  • Parsed task.yml with Ruby YAML: tests=39, benchmarks=10
  • Local KernelBot debug API on B200 before benchmark pruning:
    • torch_eigh.py test: 39/39 passed, evaluator duration 7.396s
    • triton_diagonal_fast_path.py test: 39/39 passed, evaluator duration 7.470s

Still needed: regenerated B200 benchmark measurements for the current 10-case benchmark set.

@msaroufim msaroufim changed the title from "Add eigh eigenvalue error check" to "Improve eigh accuracy and benchmark balance" Jul 1, 2026
@msaroufim

Member Author

Thread: eigh robustness, benchmark balance, and follow-ups

Starting the upstream discussion thread here so this PR is the canonical place for the latest expert/red-team feedback.

Benchmark shape balance

Bryce's point: the benchmark geomean was dominated by expensive batch=640,n=512 and batch=60,n=1024 rows. A more balanced future design would be closer to:

(input structure) x (matrix size, batch size)

rather than concentrating many structures on two large shapes. This also reduces the gap between correctness-only shapes and benchmarked shapes; if a shape is only in tests, an agent can route it to slow/simple fallback code and optimize only the ranked rows.
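As a rough sketch of the cross-product idea, the ranked list could be generated rather than hand-picked. The structure names follow the categories mentioned in this PR; the first two (batch, n) pairs come from the rows discussed above, while the last two pairs and the dictionary keys are hypothetical placeholders, not a proposal for task.yml.

```python
from itertools import product

# Input structures named in this PR's benchmark discussion.
structures = ["dense", "mixed", "rank_deficient", "clustered"]

# (batch, n) pairs; the last two are hypothetical examples.
shapes = [(640, 512), (60, 1024), (4096, 64), (1024, 128)]

# Every structure is ranked on every shape, so no shape is test-only.
benchmarks = [
    {"batch": batch, "n": n, "case": case}
    for case, (batch, n) in product(structures, shapes)
]

# Each shape gets the same number of ranked rows.
rows_per_shape = {
    shape: sum(1 for b in benchmarks if (b["batch"], b["n"]) == shape)
    for shape in shapes
}
```

With this layout each shape carries equal weight in the ranked set, at the cost of a row count that grows as the product of the two lists.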

Current response in this PR: keep the 39-case correctness set, but prune the slowest/redundant ranked rows so the benchmark list is 10 cases. It still covers dense, mixed, rank-deficient, clustered, and one LAPACK dense-spectrum benchmark.

Open question: should eigh_v2 use an explicit cross product across representative (batch,n) pairs, even if that means dropping some current robustness rows from ranked benchmarking?

Accuracy checks

Mark Hoemmen raised that residual/reconstruction/orthogonality are backward-error style gates, but we should also think about explicit eigenvalue error bounds. This PR adds a direct eigenvalue accuracy check against torch.linalg.eigvalsh(A) with norm-scaled tolerance.

Potential follow-up references for test design:

  • LAPACK LUG eigenvalue error bounds: normwise backward stability and matrix-norm scaling.
  • LAWN 182 / LAWN 183: symmetric eigenproblem test design after MRRR was introduced.
  • LAWN 163: MRRR failure modes.
  • LAWN 7: when symmetric eigenvalues can be computed to high relative accuracy; LAPACK MRRR uses scaled diagonal dominance criteria here.

Open question: for v1, is this eigenvalue gate enough when combined with residual/reconstruction/orthogonality? For v2, should we add targeted high-relative-accuracy cases with separate thresholds rather than one global tolerance?

Triangle semantics

For a symmetric eigensolver, we should consider testing whether implementations read only the intended triangle. Suggested test: construct an input whose lower triangle plus diagonal and upper triangle plus diagonal yield very different spectra when each is reflected into a full symmetric matrix.
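A small sketch of that construction, using a hypothetical 2x2 input and a closed-form symmetric 2x2 eigenvalue formula so it runs without torch. The helper names are made up for illustration.

```python
import math

def reflect_lower(A):
    """Symmetric matrix built from the lower triangle and diagonal of A."""
    n = len(A)
    return [[A[max(i, j)][min(i, j)] for j in range(n)] for i in range(n)]

def reflect_upper(A):
    """Symmetric matrix built from the upper triangle and diagonal of A."""
    n = len(A)
    return [[A[min(i, j)][max(i, j)] for j in range(n)] for i in range(n)]

def sym2x2_eigenvalues(S):
    """Ascending eigenvalues of a symmetric 2x2 matrix [[a, b], [b, d]]."""
    a, b, d = S[0][0], S[0][1], S[1][1]
    mid = (a + d) / 2
    rad = math.hypot((a - d) / 2, b)
    return [mid - rad, mid + rad]

# Deliberately non-symmetric probe: the two triangles disagree.
A = [[1.0, 5.0],
     [0.0, 1.0]]

lower_spectrum = sym2x2_eigenvalues(reflect_lower(A))  # [1.0, 1.0]
upper_spectrum = sym2x2_eigenvalues(reflect_upper(A))  # [-4.0, 6.0]
```

Comparing a submission's eigenvalues against both spectra would show which triangle it read, or whether it symmetrized the input in some other way.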

Open question: should eigh_py specify lower-triangle, upper-triangle, or fully symmetrized input semantics? Today the task says the matrix is symmetric up to FP32 roundoff, so this is probably an eigh_v2 clarification unless we want to harden the current spec.

Reward-hacking hardening

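Bryce's red-team report found attack surfaces similar to the QR object-identity replay issues. Related follow-up PRs already exist; the Scope Notes above list the reward-hacking hardening follow-ups as gpu-mode/reference-kernels#159, #160, and #161.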

Open question: should this PR remain focused on the eigenvalue gate + benchmark pruning while those targeted hardening PRs land separately, or should any of them be folded in before merge?

Profiling scope

#157 scopes timed custom_kernel launches with CUDA profiler capture ranges. The ncu flags likely belong in the Brev / ncu-service path rather than this problem PR, but the evaluator-side range is relevant for clean traces.

Open question: should this PR explicitly stay independent of profiling support, or should it wait for #157 before merge?

Future problem sizes / v2 direction

External input so far:

  • Reduced-order modeling / POD may produce large snapshot-derived problems.
  • Some applications need millions of simultaneous 3x3 eigensolves, symmetric/SPD and nonsymmetric.
  • A future SVD exercise might be easier and more BLAS-like if scoped to two-sided reduction to tridiagonal / bidiagonal form, leaving the tridiagonal / bidiagonal eigensolve as a separate problem.

Proposed split:

  • v1: keep this PR interactive, robust enough, and not overfit to diagonal/banded shortcuts.
  • v2: incorporate expert-sourced sizes, a more explicit benchmark cross product, triangle semantics, and relative-accuracy-oriented test families.
