eigh_py: reward hacks fabricating benchmark time (tracking + the timing-layer gap) #162

@robobryce

Summary

Red-teaming the eigh benchmark harness surfaced a family of reward hacks that leave custom_kernel genuinely correct but fabricate the reported time, along with the structural gaps that make them possible. I've opened focused PRs for the gaps that have clean fixes (#159, #160, #161); this issue tracks the rest, where the right remedy is a judgment call I'd rather leave to the maintainers.

All findings below were confirmed on the live B200 eigh leaderboard (every test submission deleted immediately after its verdict). A demonstration of the most severe one is currently sitting at rank #1 with a displayed score of 0.000.

Confirmed-accepted reward-hack families and their status

| Family | What it does | Status |
| --- | --- | --- |
| Aggregator underflow | Drives one shape's reported time toward 0 → geomean collapses to 0.000000 | PR #159 (roofline floor) |
| In-process cache / file replay | Solves once, returns cached result on reused timed calls | PR #160 (regenerate inputs per iteration) |
| Lazy output (subclass / instance override) | Returns placeholders, defers the real solve into the untimed checker | PR #161 (reject deferral) |
| Timer / stats patch | Leaves the kernel honest but patches Event.elapsed_time / calculate_stats to report 1/100th the time | this issue |
| Forged result object | Forges the Stats object the timed loop returns to the parent | this issue |
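
To make the first row concrete, here is a minimal sketch of why a single near-zero per-shape time is enough to collapse a geometric mean to a displayed 0.000000, and how clamping each shape to a lower bound blunts it. The function names, the sample times, and the floor values are all illustrative assumptions; this is not the code in PR #159 or the harness's actual aggregator.

```python
import math

def geomean(times_s):
    """Geometric mean computed in log space; 0.0 if any entry is non-positive."""
    if any(t <= 0.0 for t in times_s):
        return 0.0
    return math.exp(sum(math.log(t) for t in times_s) / len(times_s))

def geomean_with_floor(times_s, floors_s):
    """Clamp each per-shape time to a (hypothetical) lower bound before reducing."""
    clamped = [max(t, f) for t, f in zip(times_s, floors_s)]
    return geomean(clamped)

# Illustrative numbers only, in seconds.
honest = [2.0e-3, 5.0e-3, 1.1e-2]
hacked = [2.0e-3, 5.0e-3, 1.0e-30]   # one shape's reported time driven toward 0
floors = [1.0e-4, 2.0e-4, 5.0e-4]    # assumed per-shape floors, not real roofline values

print(f"honest:          {geomean(honest):.6f}")
print(f"hacked:          {geomean(hacked):.6f}")
print(f"hacked, floored: {geomean_with_floor(hacked, floors):.6f}")
```

With a floor in place, the hacked shape can contribute no less than its bound, so the other shapes still determine most of the score. A floor only limits the damage, though; it does not make the reported time trustworthy.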

The remaining gap: the reported time is taken on trust

The timing and the stats reduction happen in the same spawned worker that imports the submission, so a submission can reach and tamper with them (directly, or via aliasing / gc). kernelguard has merged detectors for some of these routes (SinatrasC/kernelguard #277, #278), which helps at the static-scan layer, but:

  • those rules are not yet live on the production scanner (a re-test of the aliased-timer hack on 2026-06-28 was still accepted), and
  • a static scanner is a pattern chase; the structural fix is to compute the reported statistic where the submission cannot reach it — e.g. time and reduce in the parent process from durations captured before the submission is imported, in a namespace the worker doesn't expose.
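
The structural idea in the second bullet can be sketched roughly as follows: the parent takes its own timestamps around each request to a worker and reduces them itself, ignoring anything the worker reports about its own timing. This is a toy under stated assumptions, not the prototype mentioned below. The worker source, the line-based pipe protocol, and the name `time_in_parent` are all hypothetical, and a `time.sleep` stands in for the kernel. A real GPU harness would also need the worker to synchronize the device before replying, and the measured round trip includes pipe latency.

```python
import statistics
import subprocess
import sys
import time

# Hypothetical worker: stands in for the process that imports the submission.
WORKER_SRC = r"""
import sys, time

def custom_kernel():              # placeholder for the submitted kernel
    time.sleep(0.01)

for _request in sys.stdin:        # one request per line
    custom_kernel()
    # The worker may claim anything about its own timing; the parent ignores it.
    sys.stdout.write("done reported_ms=0.0001\n")
    sys.stdout.flush()
"""

def round_trip(proc):
    """Send one request and block until the worker replies."""
    proc.stdin.write("run\n")
    proc.stdin.flush()
    proc.stdout.readline()

def time_in_parent(n_iters=5):
    proc = subprocess.Popen(
        [sys.executable, "-c", WORKER_SRC],
        stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True,
    )
    durations = []
    try:
        round_trip(proc)  # warm-up, so interpreter start-up is not timed
        for _ in range(n_iters):
            start = time.perf_counter()   # timestamp taken in the parent
            round_trip(proc)
            durations.append(time.perf_counter() - start)
    finally:
        proc.stdin.close()
        proc.wait()
    # The reduction also happens in the parent, out of the worker's reach.
    return {
        "mean_s": statistics.mean(durations),
        "min_s": min(durations),
        "runs": len(durations),
    }

stats = time_in_parent()
print(stats)
```

The worker here reports a fabricated `reported_ms`, yet the parent's statistics still reflect the roughly 10 ms each call takes, because nothing the worker can patch feeds into them.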

That structural change is more invasive than the three PRs above (it touches the harness's process/timing model), so I haven't sent it as an unsolicited large PR. If you'd welcome it, I have a working prototype and am happy to open it; alternatively this may be best handled at the kernelguard layer once the merged rules deploy. Flagging it so the decision is yours.

Also: no guards/ dir

Unlike qr_v2, eigh_py ships no guards/ (differential-correctness / invariance) directory, so those defenses don't run here. Worth adding as defense-in-depth.
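
As one illustration of what such a guard could check, the sketch below verifies an eigendecomposition independently of how it was produced: each pair must satisfy A v = λ v, the eigenvectors must be orthonormal, and the eigenvalues must sum to the trace. It is pure Python on small lists for readability; the function names and tolerances are assumptions, and it does not reproduce the contents of qr_v2's guards/ directory.

```python
import math

def matvec(A, v):
    """Multiply a dense matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def eigh_residual_ok(A, eigenvalues, eigenvectors, tol=1e-9):
    """eigenvectors[i] is the vector paired with eigenvalues[i]."""
    n = len(A)
    if len(eigenvalues) != n or len(eigenvectors) != n:
        return False
    # Residual check: A v must equal lambda v for every pair.
    for lam, v in zip(eigenvalues, eigenvectors):
        Av = matvec(A, v)
        if max(abs(av - lam * x) for av, x in zip(Av, v)) > tol:
            return False
    # Orthonormality check: rules out zero or duplicated placeholder vectors.
    for i in range(n):
        for j in range(n):
            dot = sum(x * y for x, y in zip(eigenvectors[i], eigenvectors[j]))
            if abs(dot - (1.0 if i == j else 0.0)) > tol:
                return False
    return True

def trace_invariant_ok(A, eigenvalues, tol=1e-9):
    """The eigenvalues of A must sum to its trace."""
    trace = sum(A[i][i] for i in range(len(A)))
    return abs(trace - sum(eigenvalues)) <= tol

# Small symmetric example with known eigenpairs.
A = [[2.0, 1.0], [1.0, 2.0]]
s = 1.0 / math.sqrt(2.0)
good_w, good_V = [1.0, 3.0], [[s, -s], [s, s]]
placeholder_V = [[0.0, 0.0], [0.0, 0.0]]

print(eigh_residual_ok(A, good_w, good_V), trace_invariant_ok(A, good_w))
print(eigh_residual_ok(A, good_w, placeholder_V))
```

Checks like these are cheap next to the solve itself, so they could plausibly run on every timed output rather than on a sample.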

Happy to provide minimal repros for any of the above.
