#84: Headroom × sandcastle: Phase 1 validation — measure token savings on a live run#96
Conversation
- Refactor context-compressor.ts to read HEADROOM_MODE/HEADROOM_MODEL at call time (not module load) so tests can mutate env vars after import - Add token savings logging (char counts + %) to the compression callback - Add context-compressor.test.ts: 10 tests covering getHeadroomMode, isCompressionActive, and getCompressionCallback structural shape - Add sandbox-runner.test.ts: 3 tests confirming ANTHROPIC_BASE_URL proxy injection fires on conservative/aggressive and is skipped on local tier - Wire context-compressor.test.ts into the npm test script (219 tests, all pass)
…scription (#98) Both PR #92's and PR #96's review runs failed with "Credit balance is too low" (the exit-causing error; an unrelated "workspace has not been trusted" permissions warning prints alongside it and had been masking the real cause). Root cause, confirmed live: reviewer-adapter.ts runs via noSandbox() + sandcastle's top-level run(), which inherits the devcontainer's own process env — ANTHROPIC_API_KEY (ADR-0018 cockpit passthrough; also needed by headroom's own compress() calls, ADR-0023) sits alongside CLAUDE_CODE_OAUTH_TOKEN (resolved separately from .sandcastle/.env by sandcastle for agent auth). Claude Code prefers an explicit API key over the subscription token whenever both are present, so every review silently authenticated against the API key — and failed outright once that key's balance ran out. Verified precisely: `ANTHROPIC_API_KEY="" CLAUDE_CODE_OAUTH_TOKEN=<real> claude --print ...` succeeds where the unmodified env fails. Confirmed this specific fix mechanism (claudeCode(model, { env }) on sandcastle's top-level run()) actually threads through — unlike SandboxRunner's createSandbox() + .run() path (see PR #97), where AgentProvider.env is silently discarded. Fix: force ANTHROPIC_API_KEY to an empty string (not omitted — Claude Code only falls back to OAuth when the var is unset/empty) on all 4 claudeCode() calls in reviewer-adapter.ts, via a shared FORCE_OAUTH_ENV constant.
|
AI review: changes-requested The PR delivers useful infrastructure (unit tests for context-compressor.ts, proxy-injection tests in sandbox-runner.test.ts, env-at-call-time refactor, and savings logging) but does not satisfy the core mandate of issue #84, which is Phase 1 validation via a live run. Three of four acceptance criteria are unmet: AC1 requires evidence that conservative mode completed a real issue without regressions; AC2 requires raw, measured token savings numbers (not just logging instrumentation); AC3 requires documented results from an aggressive-mode run including any over-compression observations. Only AC4 (default HEADROOM_MODE=off confirmed) is met by the unit tests. Additionally, a test comment incorrectly states that headroom-ai is not installed in the test environment when it is in fact listed as a runtime dependency in package.json. .sandcastle/context-compressor.test.ts:103 — The comment says "headroom-ai is not installed in the test environment", but headroom-ai is a listed dependency in package.json and will be present after npm install. The assertion (result instanceof Promise) is valid regardless — async functions always return a Promise before any dynamic import resolves — but the stated justification is wrong and will mislead future readers. Update the comment to explain that async functions always return a Promise synchronously, so the assertion holds whether or not the import succeeds. (Posted manually — gh REST rejects REQUEST_CHANGES reviews from a PR's own author; see the durable-fix discussion for this.) |
…locked (#99) Live-confirmed root cause: GitHub's REST API rejects REQUEST_CHANGES reviews from a PR's own author ("Review Can not request changes on your own pull request", HTTP 422) — the normal case for this repo, since the orchestrator's GH_TOKEN both opens every PR and posts its AI review. Both the inline-comment attempt and the existing body-only fallback hit the identical 422, so every changes-requested verdict has silently failed to post, historically. Doesn't break safety (auto-merge is gated by the reducer's own in-memory state, not GitHub's review status) but means the AI's feedback never actually reached a human. postPrReview now tries, in order: (1) inline review, (2) body-only review, (3) a plain PR comment carrying the same verdict + summary + per-file comments (formatReviewAsComment) — a plain comment has no self-review restriction. Live-verified against PR #96: both review attempts 422'd exactly as predicted, and the comment posted successfully.
…g comment The proxy-injection tests still asserted against input.agent.env, which PR #97 (merged after this branch was cut) replaced with input.sandboxEnv — the only field sandcastle's docker() provider actually forwards into the running container. Also fixes the test comment on the Promise-shape assertion, which incorrectly claimed headroom-ai isn't installed in the test environment (it is, per package.json) as the reason the assertion holds.
|
Phase 1 validation results (live, #84) Ran two real live sandbox issues through AC1 — conservative completes without regressions: PASS. Live run finished in 1 commit, exactly the requested file, no scope drift. AC3 — aggressive tested, over-compression documented: PASS (no scope drift observed). The aggressive-mode issue included explicit negative constraints ("don't touch other files, don't run tests, don't add deps") specifically to probe whether compression could strip guardrail text. It didn't — all constraints held, 1 clean commit, correct file only. AC2 — token savings measured (raw numbers): measured, and the number is 0%. Both This isn't a fluke of these two runs — queried headroom's own The only nonzero discount anywhere is Root cause (from headroom's own startup banner): Headroom's compression is turn-based/staleness-driven ( Conclusion: the integration is wired correctly end-to-end (prompt reaches headroom, live session traffic reaches headroom, stats are tracked accurately) — but in this repo's current unlicensed/OSS deployment, headroom compresses nothing. Filing a follow-up issue to decide whether to pursue a licensed tier or tune the turn-threshold settings before revisiting. Merging this PR on the strength of AC1/AC3 passing and AC2/AC4 being honestly measured (0% and off-by-default, respectively) rather than left as an estimate. |
Closes #84
Implemented autonomously by the AFK orchestrator in an isolated, git-isolated sandbox.
Commits: 4f3335a