#84: Headroom × sandcastle: Phase 1 validation — measure token savings on a live run by lsfera · Pull Request #96 · lsfera/agentic-dev

lsfera · 2026-07-03T07:38:29Z

Closes #84

Implemented autonomously by the AFK orchestrator in an isolated, git-isolated sandbox.

Commits: 4f3335a

- Refactor context-compressor.ts to read HEADROOM_MODE/HEADROOM_MODEL at call time (not module load) so tests can mutate env vars after import
- Add token savings logging (char counts + %) to the compression callback
- Add context-compressor.test.ts: 10 tests covering getHeadroomMode, isCompressionActive, and getCompressionCallback structural shape
- Add sandbox-runner.test.ts: 3 tests confirming ANTHROPIC_BASE_URL proxy injection fires on conservative/aggressive and is skipped on local tier
- Wire context-compressor.test.ts into the npm test script (219 tests, all pass)
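For readers unfamiliar with the load-time vs call-time distinction in the first bullet, here is a minimal sketch. It is not the code from context-compressor.ts: the `env` object is a stand-in for `process.env` so the snippet runs anywhere, `MODE_AT_LOAD` is a hypothetical name for the old pattern, and falling back to "off" on unknown values is an assumption.

```typescript
type Env = Record<string, string | undefined>;
type HeadroomMode = "off" | "conservative" | "aggressive";

// Stand-in for process.env (assumption, keeps the sketch self-contained).
const env: Env = {};

// Load-time read: evaluated once, when the module is first imported.
// Later changes to env.HEADROOM_MODE are never seen by this constant.
const MODE_AT_LOAD: string = env.HEADROOM_MODE ?? "off";

// Call-time read: every call looks at the current value, so a test can
// set the variable after import and still observe the new mode.
function getHeadroomMode(source: Env = env): HeadroomMode {
  const raw = (source.HEADROOM_MODE ?? "off").toLowerCase();
  // Assumption: anything unrecognised is treated as "off".
  return raw === "conservative" || raw === "aggressive" ? raw : "off";
}

function isCompressionActive(source: Env = env): boolean {
  return getHeadroomMode(source) !== "off";
}
```

Setting `env.HEADROOM_MODE = "aggressive"` after this code has loaded changes what `getHeadroomMode()` returns, while `MODE_AT_LOAD` stays at "off". That is the property the new tests depend on.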

…scription (#98)

Both PR #92's and PR #96's review runs failed with "Credit balance is too low" (the exit-causing error; an unrelated "workspace has not been trusted" permissions warning prints alongside it and had been masking the real cause).

Root cause, confirmed live: reviewer-adapter.ts runs via noSandbox() + sandcastle's top-level run(), which inherits the devcontainer's own process env. In that env, ANTHROPIC_API_KEY (ADR-0018 cockpit passthrough; also needed by headroom's own compress() calls, ADR-0023) sits alongside CLAUDE_CODE_OAUTH_TOKEN (resolved separately from .sandcastle/.env by sandcastle for agent auth). Claude Code prefers an explicit API key over the subscription token whenever both are present, so every review silently authenticated against the API key, and failed outright once that key's balance ran out. Verified precisely: `ANTHROPIC_API_KEY="" CLAUDE_CODE_OAUTH_TOKEN=<real> claude --print ...` succeeds where the unmodified env fails.

Confirmed this specific fix mechanism (claudeCode(model, { env }) on sandcastle's top-level run()) actually threads through, unlike SandboxRunner's createSandbox() + .run() path (see PR #97), where AgentProvider.env is silently discarded.

Fix: force ANTHROPIC_API_KEY to an empty string (not omitted; Claude Code only falls back to OAuth when the var is unset/empty) on all 4 claudeCode() calls in reviewer-adapter.ts, via a shared FORCE_OAUTH_ENV constant.
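A small sketch of why the override is an empty string rather than an omitted key. It assumes the override env is merged on top of the inherited one, which the commit message does not spell out. `mergeEnv` and `effectiveCredential` are hypothetical helpers that only model the preference rule described above. They are not Claude Code's or sandcastle's implementation.

```typescript
type Env = Record<string, string | undefined>;

// Empty string, not an omitted key: under a merge, an omitted key would
// let the inherited ANTHROPIC_API_KEY through unchanged.
const FORCE_OAUTH_ENV = { ANTHROPIC_API_KEY: "" } as const;

// Hypothetical helper: overrides win over inherited values (assumed merge order).
function mergeEnv(inherited: Env, overrides: Env): Env {
  return { ...inherited, ...overrides };
}

// Hypothetical model of the rule from the commit message: an explicit,
// non-empty API key is preferred; OAuth is used only when the key is
// unset or empty.
function effectiveCredential(env: Env): "api-key" | "oauth" | "none" {
  if (env.ANTHROPIC_API_KEY) return "api-key";
  if (env.CLAUDE_CODE_OAUTH_TOKEN) return "oauth";
  return "none";
}
```

Under these assumptions, merging `FORCE_OAUTH_ENV` over an env that carries both credentials yields "oauth", while merging an empty override object still yields "api-key".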

lsfera · 2026-07-03T08:44:45Z

AI review: changes-requested

The PR delivers useful infrastructure (unit tests for context-compressor.ts, proxy-injection tests in sandbox-runner.test.ts, env-at-call-time refactor, and savings logging) but does not satisfy the core mandate of issue #84, which is Phase 1 validation via a live run. Three of four acceptance criteria are unmet: AC1 requires evidence that conservative mode completed a real issue without regressions; AC2 requires raw, measured token savings numbers (not just logging instrumentation); AC3 requires documented results from an aggressive-mode run including any over-compression observations. Only AC4 (default HEADROOM_MODE=off confirmed) is met by the unit tests. Additionally, a test comment incorrectly states that headroom-ai is not installed in the test environment when it is in fact listed as a runtime dependency in package.json.

.sandcastle/context-compressor.test.ts:103 — The comment says "headroom-ai is not installed in the test environment", but headroom-ai is a listed dependency in package.json and will be present after npm install. The assertion (result instanceof Promise) is valid regardless — async functions always return a Promise before any dynamic import resolves — but the stated justification is wrong and will mislead future readers. Update the comment to explain that async functions always return a Promise synchronously, so the assertion holds whether or not the import succeeds.
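To illustrate the point about the assertion: an async function hands back a Promise at the moment it is called, whether its body later succeeds or fails. In the sketch below, `loadOptional` is a hypothetical stand-in for the compression callback, and the rejecting loader simulates a dynamic import that cannot be resolved.

```typescript
// Hypothetical stand-in for a function that lazily loads an optional dependency.
async function loadOptional(load: () => Promise<unknown>): Promise<string> {
  try {
    await load();
    return "loaded";
  } catch {
    // A failed load is absorbed here; the caller still received a Promise.
    return "unavailable";
  }
}

// One loader fails (simulating a missing module), one succeeds.
const failing = loadOptional(() => Promise.reject(new Error("module not found")));
const succeeding = loadOptional(() => Promise.resolve({}));

// Both values are Promises right away, before either loader has settled.
const bothArePromises = failing instanceof Promise && succeeding instanceof Promise;
```

So `result instanceof Promise` says nothing about whether headroom-ai is installed. It only confirms that the callback is async.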

(Posted manually — gh REST rejects REQUEST_CHANGES reviews from a PR's own author; see the durable-fix discussion for this.)

…locked (#99)

Live-confirmed root cause: GitHub's REST API rejects REQUEST_CHANGES reviews from a PR's own author ("Review Can not request changes on your own pull request", HTTP 422). That is the normal case for this repo, since the orchestrator's GH_TOKEN both opens every PR and posts its AI review. Both the inline-comment attempt and the existing body-only fallback hit the identical 422, so historically every changes-requested verdict has silently failed to post. This doesn't break safety (auto-merge is gated by the reducer's own in-memory state, not GitHub's review status), but it means the AI's feedback never actually reached a human.

postPrReview now tries, in order: (1) inline review, (2) body-only review, (3) a plain PR comment carrying the same verdict + summary + per-file comments (formatReviewAsComment). A plain comment has no self-review restriction.

Live-verified against PR #96: both review attempts 422'd exactly as predicted, and the comment posted successfully.
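The ordered fallback can be sketched generically as below. This is not the actual postPrReview code: `Attempt` and `firstSuccessful` are hypothetical names, and the three transports are simulated with plain Promises instead of real GitHub calls.

```typescript
type Attempt<T> = { name: string; run: () => Promise<T> };

// Try each attempt in order and return the first one that succeeds,
// together with the errors collected from the attempts before it.
async function firstSuccessful<T>(
  attempts: Attempt<T>[],
): Promise<{ via: string; value: T; errors: string[] }> {
  const errors: string[] = [];
  for (const attempt of attempts) {
    try {
      const value = await attempt.run();
      return { via: attempt.name, value, errors };
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      errors.push(`${attempt.name}: ${message}`);
    }
  }
  throw new Error(`all attempts failed: ${errors.join("; ")}`);
}

// Simulated transports: both review styles are rejected, the comment goes through.
const reject422 = () => Promise.reject(new Error("HTTP 422"));
const demo = firstSuccessful<string>([
  { name: "inline-review", run: reject422 },
  { name: "body-only-review", run: reject422 },
  { name: "plain-comment", run: () => Promise.resolve("posted") },
]);
```

Collecting the earlier errors instead of discarding them keeps the 422s visible in logs, which is what was missing while the verdicts were failing silently.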

…g comment

The proxy-injection tests still asserted against input.agent.env, which PR #97 (merged after this branch was cut) replaced with input.sandboxEnv, the only field sandcastle's docker() provider actually forwards into the running container. Also fixes the test comment on the Promise-shape assertion, which incorrectly gave headroom-ai not being installed in the test environment (it is, per package.json) as the reason the assertion holds.
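A rough sketch of the behaviour the proxy-injection tests check, under several assumptions. `buildSandboxEnv` is a hypothetical function, not SandboxRunner's code. The thread does not say what "local tier" means, so it is modeled here as a boolean flag, and the proxy URL is a placeholder.

```typescript
type HeadroomMode = "off" | "conservative" | "aggressive";
type SandboxEnv = Record<string, string>;

// Hypothetical: compute the env that would be forwarded into the container.
function buildSandboxEnv(
  mode: HeadroomMode,
  isLocalTier: boolean, // assumption: "local tier" is modeled as a flag
  proxyUrl: string,
): SandboxEnv {
  const compressing = mode === "conservative" || mode === "aggressive";
  // No injection when compression is off or the run is on the local tier.
  if (!compressing || isLocalTier) return {};
  return { ANTHROPIC_BASE_URL: proxyUrl };
}

// Shape of the check after PR #97: inspect sandboxEnv, not agent.env.
const input = { sandboxEnv: buildSandboxEnv("conservative", false, "http://proxy.invalid") };
const injected = input.sandboxEnv.ANTHROPIC_BASE_URL === "http://proxy.invalid";
```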

lsfera · 2026-07-03T11:05:50Z

Phase 1 validation results (live, #84)

Ran two real issues live through SandboxRunner.runIssue(): one with HEADROOM_MODE=conservative, one with HEADROOM_MODE=aggressive, both using this branch's fixed code.

AC1 — conservative completes without regressions: PASS. Live run finished in 1 commit, exactly the requested file, no scope drift.

AC3 — aggressive tested, over-compression documented: PASS (no scope drift observed). The aggressive-mode issue included explicit negative constraints ("don't touch other files, don't run tests, don't add deps") specifically to probe whether compression could strip guardrail text. It didn't — all constraints held, 1 clean commit, correct file only.

AC2 — token savings measured (raw numbers): measured, and the number is 0%. Both compress() calls returned the prompt byte-for-byte unchanged:

[context-compressor] mode=conservative 2563→2563 chars (−0, −0.0%)
[context-compressor] mode=aggressive   3465→3465 chars (−0, −0.0%)
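For reference, a sketch of how a log line of this shape can be derived from the two character counts. `formatSavings` is a hypothetical name, not the function in context-compressor.ts. The sketch assumes the output is never longer than the input and does not reproduce the column padding seen above.

```typescript
// Build a savings log line from the prompt length before and after compression.
function formatSavings(mode: string, before: number, after: number): string {
  const saved = before - after; // assumes after <= before
  // Guard against an empty prompt to avoid dividing by zero.
  const percent = before === 0 ? 0 : (saved / before) * 100;
  return `[context-compressor] mode=${mode} ${before}→${after} chars (−${saved}, −${percent.toFixed(1)}%)`;
}
```

With the conservative run's numbers, `formatSavings("conservative", 2563, 2563)` reproduces the first line above. A hypothetical 1000→750 compression would print `(−250, −25.0%)`.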

This isn't a fluke of these two runs — queried headroom's own /stats afterward and across all 14 real API requests logged this session (spanning this validation plus the earlier #97 proxy-routing verification):

compressions_by_strategy: {}          (empty — never fired, ever)
compression_cache.total_tokens_saved: 0
cost.total_tokens_saved: 0            (of $0.627 total spend)
agent_usage.totals.savings_percent: 0.0

The only nonzero discount anywhere is prefix_cache.discount_usd: $0.4988 — Anthropic's own native prompt-caching, unrelated to headroom's compression engine.
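The "never fired" reading of those fields can be written as a small check. The `HeadroomStats` interface below is inferred from the dotted field names quoted above and covers only those four fields. The real /stats payload may be shaped differently, and treating the per-strategy values as counts is an assumption.

```typescript
// Partial, inferred shape of the /stats response (assumption).
interface HeadroomStats {
  compressions_by_strategy: Record<string, number>;
  compression_cache: { total_tokens_saved: number };
  cost: { total_tokens_saved: number };
  agent_usage: { totals: { savings_percent: number } };
}

// True when every compression-related counter shows no activity at all.
function compressionNeverFired(stats: HeadroomStats): boolean {
  return (
    Object.keys(stats.compressions_by_strategy).length === 0 &&
    stats.compression_cache.total_tokens_saved === 0 &&
    stats.cost.total_tokens_saved === 0 &&
    stats.agent_usage.totals.savings_percent === 0
  );
}

// The values reported in this comment.
const observed: HeadroomStats = {
  compressions_by_strategy: {},
  compression_cache: { total_tokens_saved: 0 },
  cost: { total_tokens_saved: 0 },
  agent_usage: { totals: { savings_percent: 0.0 } },
};
```

The prefix_cache discount is deliberately left out of the check, since it comes from Anthropic's prompt caching and not from headroom's compression.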

Root cause (from headroom's own startup banner):

License:      OSS (no license key)
Code-Aware:   DISABLED  (install headroom-ai[code] to enable)

Headroom's compression is turn-based/staleness-driven (HEADROOM_COMPRESSION_STABLE_AFTER_TURN / HEADROOM_STALE_READ_COMPRESS_AFTER_TURNS), and the code-aware strategies aren't available without a license. A one-shot prompt with no accumulated turn history, sent to a proxy whose main compression strategy is disabled, gives headroom nothing to compress.

Conclusion: the integration is wired correctly end-to-end (prompt reaches headroom, live session traffic reaches headroom, stats are tracked accurately) — but in this repo's current unlicensed/OSS deployment, headroom compresses nothing. Filing a follow-up issue to decide whether to pursue a licensed tier or tune the turn-threshold settings before revisiting. Merging this PR on the strength of AC1/AC3 passing and AC2/AC4 being honestly measured (0% and off-by-default, respectively) rather than left as an estimate.

This was referenced Jul 3, 2026

Headroom × sandcastle: Phase 1 validation — measure token savings on a live run #84

Closed

fix: AI reviewer authenticated via low-balance API key instead of subscription #98

Merged

lsfera mentioned this pull request Jul 3, 2026

fix: postPrReview falls back to a plain comment when self-review is blocked #99

Merged

3 tasks

lsfera added 2 commits July 3, 2026 12:52

Merge remote-tracking branch 'origin/main' into agent-issue-84-fix

ece7fc1

lsfera merged commit 61aad46 into main Jul 3, 2026
3 checks passed

lsfera deleted the agent/issue-84 branch July 3, 2026 11:06

lsfera mentioned this pull request Jul 3, 2026

Headroom compression measures 0% savings in current OSS/unlicensed deployment #100

Closed

lsfera merged 3 commits into main from agent/issue-84
