Skip to content

feat(harbor): the agent's first baseline eval is budget-free#25

Open
shehabyasser-scale wants to merge 1 commit into
harbor-2-mode-b-sample-timeout-warningfrom
harbor-2-free-baseline-eval
Open

feat(harbor): the agent's first baseline eval is budget-free#25
shehabyasser-scale wants to merge 1 commit into
harbor-2-mode-b-sample-timeout-warningfrom
harbor-2-free-baseline-eval

Conversation

@shehabyasser-scale

@shehabyasser-scale shehabyasser-scale commented Jul 3, 2026

Copy link
Copy Markdown
Collaborator

Stacked on #20 (top of the sidecar chain: #11 -> #19 -> #20 -> this).

The problem, observed twice in live trials

The seeded baseline is the reference every candidate is implicitly compared to, yet it can never win selection (auto_best excludes base_commit, correctly, so "do nothing" cannot win). Metering its evaluation therefore forces the optimizer into a bad choice:

  1. Pay for it: the incident behind fix(harbor): floor rewards on no-candidate outcome instead of erroring #11/docs(harbor): warn in the task instruction that baseline evals create no candidate #12, an optimizer spent its whole budget measuring the baseline, left the candidate pool empty, and the trial died (now floored by fix(harbor): floor rewards on no-candidate outcome instead of erroring #11, warned against by docs(harbor): warn in the task instruction that baseline evals create no candidate #12).
  2. Skip it and fly blind: today's TB2 trial (exp5 v3). The optimizer, warned off baseline evals by docs(harbor): warn in the task instruction that baseline evals create no candidate #12's instruction, never learned the baseline scored 0.375. Its first edit scored 0.375, and with no reference it could not tell a no-op from an improvement; its second edit scored 0.125; it rationally gave up with 3 of its 5 budgeted evals unspent. Final reward: 0.375, exactly baseline.

#12 treated the symptom (don't waste budget); this treats the cause (the reference measurement should not cost budget).

The fix

EvaluationSidecar.evaluate: when the transferred commit resolves to base_commit and the free eval hasn't been used, run it through the engine's unmetered (admin) path while still routing results through the agent's real split tier. Capped at exactly one, so free compute is bounded; subsequent baseline evals debit normally. /status now reports base_commit and free_baseline_available, so agents can discover the mechanism.

Selection integrity is untouched: the baseline experiment is recorded as before and remains excluded from auto_best by the existing base_commit filter.

Tests

Seven new tests: first baseline eval unmetered but tier-routed; second one metered; non-baseline commits always metered; no-base_commit tasks never free; /status surfaces and flips the flag.

Follow-up (separate, compiler layer): update the #12 instruction paragraph to say the first baseline eval is free once this lands.

🤖 Generated with Claude Code

Greptile Summary

This PR makes the agent's first evaluation of the seeded baseline commit budget-free by routing it through the engine's admin (unmetered) path while still writing results to the agent volume via the normal tier. A boolean flag _free_baseline_used caps the free-eval to exactly one use, and /status now surfaces base_commit and free_baseline_available so agents can discover the mechanism.

  • server.py: evaluate() sets admin=True on the first baseline call by detecting sha == self.base_commit and not _free_baseline_used; result routing is unaffected (still uses the agent tier).
  • protocol.py / serve.py: StatusSummary and build_status gain base_commit / free_baseline_available fields wired from ServeConfig.
  • test_harbor_server.py: seven new tests cover the unmetered-first-call, metered-second-call, non-baseline, no-base-commit, and status-flip cases.

Confidence Score: 3/5

The core mechanic is sound, but a transient engine failure during the free baseline call permanently burns the one-time slot, leaving the agent in the exact blind-optimizer state the PR is designed to prevent.

The _free_baseline_used flag is set to True before await self.engine.evaluate(...). Any engine exception (timeout, infra error) leaves the flag consumed with no successful result recorded; subsequent retries are metered and the optimizer loses the reference it needs. The rest of the change — protocol fields, status surfacing, result routing separation — is clean and well-tested.

vero/src/vero/harbor/server.py — specifically the ordering of the flag assignment relative to the awaited engine call in evaluate().

Important Files Changed

Filename Overview
vero/src/vero/harbor/server.py Adds free-baseline-eval mechanic via _free_baseline_used flag; flag is set before the awaited engine call, so a transient engine failure permanently burns the slot.
vero/src/vero/harbor/protocol.py Extends StatusSummary and build_status with base_commit and free_baseline_available fields; clean additive change with correct defaults.
vero/src/vero/harbor/serve.py Wires config.base_commit into EvaluationSidecar; minimal, correct change.
vero/tests/test_harbor_server.py Adds seven tests for the free-baseline-eval feature covering the main cases; missing coverage for admin-call-on-base-commit preserving the agent's free slot, and for engine failure burning the slot.

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant A as Agent
    participant S as EvaluationSidecar
    participant E as EvaluationEngine

    A->>S: evaluate(req) [baseline commit, first call]
    S->>S: "_transfer_commit(req.commit) → sha == base_commit"
    S->>S: "free_baseline=True, _free_baseline_used=True"
    S->>E: "engine.evaluate(req, admin=True) [unmetered]"
    E-->>S: Experiment
    S->>S: "_route_results(exp, admin=False) [agent tier]"
    S-->>A: EvalSummary + budget_remaining unchanged

    A->>S: evaluate(req) [baseline commit, second call]
    S->>S: "_transfer_commit → sha == base_commit"
    S->>S: "free_baseline=False (_free_baseline_used already True)"
    S->>E: "engine.evaluate(req, admin=False) [metered]"
    E-->>S: Experiment (budget debited)
    S->>S: "_route_results(exp, admin=False)"
    S-->>A: EvalSummary + updated budget_remaining

    A->>S: status()
    S-->>A: "StatusSummary{base_commit, free_baseline_available=False}"
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
    participant A as Agent
    participant S as EvaluationSidecar
    participant E as EvaluationEngine

    A->>S: evaluate(req) [baseline commit, first call]
    S->>S: "_transfer_commit(req.commit) → sha == base_commit"
    S->>S: "free_baseline=True, _free_baseline_used=True"
    S->>E: "engine.evaluate(req, admin=True) [unmetered]"
    E-->>S: Experiment
    S->>S: "_route_results(exp, admin=False) [agent tier]"
    S-->>A: EvalSummary + budget_remaining unchanged

    A->>S: evaluate(req) [baseline commit, second call]
    S->>S: "_transfer_commit → sha == base_commit"
    S->>S: "free_baseline=False (_free_baseline_used already True)"
    S->>E: "engine.evaluate(req, admin=False) [metered]"
    E-->>S: Experiment (budget debited)
    S->>S: "_route_results(exp, admin=False)"
    S-->>A: EvalSummary + updated budget_remaining

    A->>S: status()
    S-->>A: "StatusSummary{base_commit, free_baseline_available=False}"
Loading

Fix All in Cursor Fix All in Claude Code Fix All in Codex

Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
vero/src/vero/harbor/server.py:89-93
**Free slot burned on engine failure**

`_free_baseline_used` is set to `True` before the `await self.engine.evaluate(...)` call. If the engine raises (timeout, transient infrastructure error, etc.), the flag remains `True` and the agent permanently loses their one free baseline eval even though no successful result was recorded. On any retry the baseline is metered, which is exactly the "pay for it" failure mode the PR is trying to prevent — and in practice the agent's optimizer is left in the same bad state described in the problem statement. Moving the assignment to after the await (the check-and-set is safe in a single-threaded asyncio loop since there is no await between the read and the write) would let the agent retry freely after a transient failure.

### Issue 2 of 2
vero/tests/test_harbor_server.py:136-174
**Missing test: admin call on base_commit must not consume the free agent slot**

The seven new tests cover the happy path and the boundary cases (second call, non-baseline commit, no `base_commit`), but none of them verify that an admin evaluate call to the base commit leaves `_free_baseline_used = False` so that the agent's free slot is still available afterwards. The logic `not admin and ...` handles this correctly in production, but without a test this invariant is one refactor away from silently regressing — e.g. if someone moves the flag assignment outside the `if free_baseline` guard.

Reviews (1): Last reviewed commit: "feat(harbor): the agent's first baseline..." | Re-trigger Greptile

Greptile also left 2 inline comments on this PR.

The seeded baseline is the reference every candidate is implicitly
compared to, yet it can never win selection (auto_best excludes
base_commit by design), so metering it forced the optimizer to choose
between flying blind and paying a budgeted eval for an unselectable
commit. Observed live twice: the incident behind #11/#12 (an optimizer
burned its whole budget measuring the baseline), and exp5's optimizer
which, warned off by #12, skipped the reference entirely, could not tell
its no-op edit (train 0.375) from an improvement, and quit with 3 of 5
evals unspent.

The first eval whose transferred commit resolves to base_commit now runs
unmetered (engine admin path) while results still route through the
agent's real split tier. Capped at one: later baseline evals debit
normally, so free compute is bounded. /status exposes base_commit and
whether the free reference eval is still available.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment on lines +89 to +93
if free_baseline:
self._free_baseline_used = True
exp = await self.engine.evaluate(
replace(req, commit=sha), admin=admin or free_baseline
)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Free slot burned on engine failure

_free_baseline_used is set to True before the await self.engine.evaluate(...) call. If the engine raises (timeout, transient infrastructure error, etc.), the flag remains True and the agent permanently loses their one free baseline eval even though no successful result was recorded. On any retry the baseline is metered, which is exactly the "pay for it" failure mode the PR is trying to prevent — and in practice the agent's optimizer is left in the same bad state described in the problem statement. Moving the assignment to after the await (the check-and-set is safe in a single-threaded asyncio loop since there is no await between the read and the write) would let the agent retry freely after a transient failure.

Prompt To Fix With AI
This is a comment left during a code review.
Path: vero/src/vero/harbor/server.py
Line: 89-93

Comment:
**Free slot burned on engine failure**

`_free_baseline_used` is set to `True` before the `await self.engine.evaluate(...)` call. If the engine raises (timeout, transient infrastructure error, etc.), the flag remains `True` and the agent permanently loses their one free baseline eval even though no successful result was recorded. On any retry the baseline is metered, which is exactly the "pay for it" failure mode the PR is trying to prevent — and in practice the agent's optimizer is left in the same bad state described in the problem statement. Moving the assignment to after the await (the check-and-set is safe in a single-threaded asyncio loop since there is no await between the read and the write) would let the agent retry freely after a transient failure.

How can I resolve this? If you propose a fix, please make it concise.

Fix in Cursor Fix in Claude Code Fix in Codex

Comment on lines +136 to +174
async def test_first_baseline_eval_is_unmetered(self, tmp_path):
sidecar = _sidecar(tmp_path, split="validation", base_commit="abcdef123456")
await sidecar.evaluate(EvalRequest(dataset_id="ds1", split="validation"))
# engine.evaluate was called with admin=True (bypasses the ledger)
assert sidecar.engine.evaluate.await_args.kwargs["admin"] is True
# but results were routed with the agent tier (summary written)
dest = tmp_path / "agent_vol" / "results" / "validation__abcdef123456"
assert (dest / "summary.json").exists()

@pytest.mark.asyncio
async def test_second_baseline_eval_is_metered(self, tmp_path):
sidecar = _sidecar(tmp_path, split="validation", base_commit="abcdef123456")
await sidecar.evaluate(EvalRequest(dataset_id="ds1", split="validation"))
await sidecar.evaluate(EvalRequest(dataset_id="ds1", split="validation"))
assert sidecar.engine.evaluate.await_args.kwargs["admin"] is False

@pytest.mark.asyncio
async def test_non_baseline_commit_always_metered(self, tmp_path):
sidecar = _sidecar(tmp_path, split="validation", base_commit="other000000")
await sidecar.evaluate(EvalRequest(dataset_id="ds1", split="validation"))
assert sidecar.engine.evaluate.await_args.kwargs["admin"] is False

@pytest.mark.asyncio
async def test_no_base_commit_never_free(self, tmp_path):
sidecar = _sidecar(tmp_path, split="validation") # base_commit=None
await sidecar.evaluate(EvalRequest(dataset_id="ds1", split="validation"))
assert sidecar.engine.evaluate.await_args.kwargs["admin"] is False

def test_status_surfaces_free_baseline(self, tmp_path):
sidecar = _sidecar(tmp_path, split="train", base_commit="abcdef123456")
s = sidecar.status()
assert s.base_commit == "abcdef123456"
assert s.free_baseline_available is True

@pytest.mark.asyncio
async def test_status_flips_after_use(self, tmp_path):
sidecar = _sidecar(tmp_path, split="validation", base_commit="abcdef123456")
await sidecar.evaluate(EvalRequest(dataset_id="ds1", split="validation"))
assert sidecar.status().free_baseline_available is False

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Missing test: admin call on base_commit must not consume the free agent slot

The seven new tests cover the happy path and the boundary cases (second call, non-baseline commit, no base_commit), but none of them verify that an admin evaluate call to the base commit leaves _free_baseline_used = False so that the agent's free slot is still available afterwards. The logic not admin and ... handles this correctly in production, but without a test this invariant is one refactor away from silently regressing — e.g. if someone moves the flag assignment outside the if free_baseline guard.

Prompt To Fix With AI
This is a comment left during a code review.
Path: vero/tests/test_harbor_server.py
Line: 136-174

Comment:
**Missing test: admin call on base_commit must not consume the free agent slot**

The seven new tests cover the happy path and the boundary cases (second call, non-baseline commit, no `base_commit`), but none of them verify that an admin evaluate call to the base commit leaves `_free_baseline_used = False` so that the agent's free slot is still available afterwards. The logic `not admin and ...` handles this correctly in production, but without a test this invariant is one refactor away from silently regressing — e.g. if someone moves the flag assignment outside the `if free_baseline` guard.

How can I resolve this? If you propose a fix, please make it concise.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Fix in Cursor Fix in Claude Code Fix in Codex

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant