feat(harbor): the agent's first baseline eval is budget-free by shehabyasser-scale · Pull Request #25 · scaleapi/vero

shehabyasser-scale · 2026-07-03T15:30:18Z

Stacked on #20 (top of the sidecar chain: #11 -> #19 -> #20 -> this).

The problem, observed twice in live trials

The seeded baseline is the reference every candidate is implicitly compared to, yet it can never win selection (auto_best excludes base_commit, correctly, so "do nothing" cannot win). Metering its evaluation therefore forces the optimizer into a bad choice:

Pay for it: the incident behind fix(harbor): floor rewards on no-candidate outcome instead of erroring #11/docs(harbor): warn in the task instruction that baseline evals create no candidate #12, an optimizer spent its whole budget measuring the baseline, left the candidate pool empty, and the trial died (now floored by fix(harbor): floor rewards on no-candidate outcome instead of erroring #11, warned against by docs(harbor): warn in the task instruction that baseline evals create no candidate #12).
Skip it and fly blind: today's TB2 trial (exp5 v3). The optimizer, warned off baseline evals by docs(harbor): warn in the task instruction that baseline evals create no candidate #12's instruction, never learned the baseline scored 0.375. Its first edit scored 0.375, and with no reference it could not tell a no-op from an improvement; its second edit scored 0.125; it rationally gave up with 3 of its 5 budgeted evals unspent. Final reward: 0.375, exactly baseline.

#12 treated the symptom (don't waste budget); this treats the cause (the reference measurement should not cost budget).

The fix

EvaluationSidecar.evaluate: when the transferred commit resolves to base_commit and the free eval hasn't been used, run it through the engine's unmetered (admin) path while still routing results through the agent's real split tier. Capped at exactly one, so free compute is bounded; subsequent baseline evals debit normally. /status now reports base_commit and free_baseline_available, so agents can discover the mechanism.

Selection integrity is untouched: the baseline experiment is recorded as before and remains excluded from auto_best by the existing base_commit filter.

Tests

Seven new tests: first baseline eval unmetered but tier-routed; second one metered; non-baseline commits always metered; no-base_commit tasks never free; /status surfaces and flips the flag.

Follow-up (separate, compiler layer): update the #12 instruction paragraph to say the first baseline eval is free once this lands.

🤖 Generated with Claude Code

Greptile Summary

This PR makes the agent's first evaluation of the seeded baseline commit budget-free by routing it through the engine's admin (unmetered) path while still writing results to the agent volume via the normal tier. A boolean flag _free_baseline_used caps the free-eval to exactly one use, and /status now surfaces base_commit and free_baseline_available so agents can discover the mechanism.

server.py: evaluate() sets admin=True on the first baseline call by detecting sha == self.base_commit and not _free_baseline_used; result routing is unaffected (still uses the agent tier).
protocol.py / serve.py: StatusSummary and build_status gain base_commit / free_baseline_available fields wired from ServeConfig.
test_harbor_server.py: seven new tests cover the unmetered-first-call, metered-second-call, non-baseline, no-base-commit, and status-flip cases.

Confidence Score: 3/5

The core mechanic is sound, but a transient engine failure during the free baseline call permanently burns the one-time slot, leaving the agent in the exact blind-optimizer state the PR is designed to prevent.

The _free_baseline_used flag is set to True before await self.engine.evaluate(...). Any engine exception (timeout, infra error) leaves the flag consumed with no successful result recorded; subsequent retries are metered and the optimizer loses the reference it needs. The rest of the change — protocol fields, status surfacing, result routing separation — is clean and well-tested.

vero/src/vero/harbor/server.py — specifically the ordering of the flag assignment relative to the awaited engine call in evaluate().

Important Files Changed

Filename	Overview
vero/src/vero/harbor/server.py	Adds free-baseline-eval mechanic via _free_baseline_used flag; flag is set before the awaited engine call, so a transient engine failure permanently burns the slot.
vero/src/vero/harbor/protocol.py	Extends StatusSummary and build_status with base_commit and free_baseline_available fields; clean additive change with correct defaults.
vero/src/vero/harbor/serve.py	Wires config.base_commit into EvaluationSidecar; minimal, correct change.
vero/tests/test_harbor_server.py	Adds seven tests for the free-baseline-eval feature covering the main cases; missing coverage for admin-call-on-base-commit preserving the agent's free slot, and for engine failure burning the slot.

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant A as Agent
    participant S as EvaluationSidecar
    participant E as EvaluationEngine

    A->>S: evaluate(req) [baseline commit, first call]
    S->>S: "_transfer_commit(req.commit) → sha == base_commit"
    S->>S: "free_baseline=True, _free_baseline_used=True"
    S->>E: "engine.evaluate(req, admin=True) [unmetered]"
    E-->>S: Experiment
    S->>S: "_route_results(exp, admin=False) [agent tier]"
    S-->>A: EvalSummary + budget_remaining unchanged

    A->>S: evaluate(req) [baseline commit, second call]
    S->>S: "_transfer_commit → sha == base_commit"
    S->>S: "free_baseline=False (_free_baseline_used already True)"
    S->>E: "engine.evaluate(req, admin=False) [metered]"
    E-->>S: Experiment (budget debited)
    S->>S: "_route_results(exp, admin=False)"
    S-->>A: EvalSummary + updated budget_remaining

    A->>S: status()
    S-->>A: "StatusSummary{base_commit, free_baseline_available=False}"

%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
    participant A as Agent
    participant S as EvaluationSidecar
    participant E as EvaluationEngine

    A->>S: evaluate(req) [baseline commit, first call]
    S->>S: "_transfer_commit(req.commit) → sha == base_commit"
    S->>S: "free_baseline=True, _free_baseline_used=True"
    S->>E: "engine.evaluate(req, admin=True) [unmetered]"
    E-->>S: Experiment
    S->>S: "_route_results(exp, admin=False) [agent tier]"
    S-->>A: EvalSummary + budget_remaining unchanged

    A->>S: evaluate(req) [baseline commit, second call]
    S->>S: "_transfer_commit → sha == base_commit"
    S->>S: "free_baseline=False (_free_baseline_used already True)"
    S->>E: "engine.evaluate(req, admin=False) [metered]"
    E-->>S: Experiment (budget debited)
    S->>S: "_route_results(exp, admin=False)"
    S-->>A: EvalSummary + updated budget_remaining

    A->>S: status()
    S-->>A: "StatusSummary{base_commit, free_baseline_available=False}"

Prompt To Fix All With AI

Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
vero/src/vero/harbor/server.py:89-93
**Free slot burned on engine failure**

`_free_baseline_used` is set to `True` before the `await self.engine.evaluate(...)` call. If the engine raises (timeout, transient infrastructure error, etc.), the flag remains `True` and the agent permanently loses their one free baseline eval even though no successful result was recorded. On any retry the baseline is metered, which is exactly the "pay for it" failure mode the PR is trying to prevent — and in practice the agent's optimizer is left in the same bad state described in the problem statement. Moving the assignment to after the await (the check-and-set is safe in a single-threaded asyncio loop since there is no await between the read and the write) would let the agent retry freely after a transient failure.

### Issue 2 of 2
vero/tests/test_harbor_server.py:136-174
**Missing test: admin call on base_commit must not consume the free agent slot**

The seven new tests cover the happy path and the boundary cases (second call, non-baseline commit, no `base_commit`), but none of them verify that an admin evaluate call to the base commit leaves `_free_baseline_used = False` so that the agent's free slot is still available afterwards. The logic `not admin and ...` handles this correctly in production, but without a test this invariant is one refactor away from silently regressing — e.g. if someone moves the flag assignment outside the `if free_baseline` guard.

_{Reviews (1): Last reviewed commit: "feat(harbor): the agent's first baseline..." | Re-trigger Greptile}

Greptile also left 2 inline comments on this PR.

The seeded baseline is the reference every candidate is implicitly compared to, yet it can never win selection (auto_best excludes base_commit by design), so metering it forced the optimizer to choose between flying blind and paying a budgeted eval for an unselectable commit. Observed live twice: the incident behind #11/#12 (an optimizer burned its whole budget measuring the baseline), and exp5's optimizer which, warned off by #12, skipped the reference entirely, could not tell its no-op edit (train 0.375) from an improvement, and quit with 3 of 5 evals unspent. The first eval whose transferred commit resolves to base_commit now runs unmetered (engine admin path) while results still route through the agent's real split tier. Capped at one: later baseline evals debit normally, so free compute is bounded. /status exposes base_commit and whether the free reference eval is still available. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

greptile-apps · 2026-07-03T15:32:55Z

+        if free_baseline:
+            self._free_baseline_used = True
+        exp = await self.engine.evaluate(
+            replace(req, commit=sha), admin=admin or free_baseline
+        )


Free slot burned on engine failure

_free_baseline_used is set to True before the await self.engine.evaluate(...) call. If the engine raises (timeout, transient infrastructure error, etc.), the flag remains True and the agent permanently loses their one free baseline eval even though no successful result was recorded. On any retry the baseline is metered, which is exactly the "pay for it" failure mode the PR is trying to prevent — and in practice the agent's optimizer is left in the same bad state described in the problem statement. Moving the assignment to after the await (the check-and-set is safe in a single-threaded asyncio loop since there is no await between the read and the write) would let the agent retry freely after a transient failure.

Prompt To Fix With AI

This is a comment left during a code review. Path: vero/src/vero/harbor/server.py Line: 89-93 Comment: **Free slot burned on engine failure** `_free_baseline_used` is set to `True` before the `await self.engine.evaluate(...)` call. If the engine raises (timeout, transient infrastructure error, etc.), the flag remains `True` and the agent permanently loses their one free baseline eval even though no successful result was recorded. On any retry the baseline is metered, which is exactly the "pay for it" failure mode the PR is trying to prevent — and in practice the agent's optimizer is left in the same bad state described in the problem statement. Moving the assignment to after the await (the check-and-set is safe in a single-threaded asyncio loop since there is no await between the read and the write) would let the agent retry freely after a transient failure. How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-07-03T15:32:56Z

+    async def test_first_baseline_eval_is_unmetered(self, tmp_path):
+        sidecar = _sidecar(tmp_path, split="validation", base_commit="abcdef123456")
+        await sidecar.evaluate(EvalRequest(dataset_id="ds1", split="validation"))
+        # engine.evaluate was called with admin=True (bypasses the ledger)
+        assert sidecar.engine.evaluate.await_args.kwargs["admin"] is True
+        # but results were routed with the agent tier (summary written)
+        dest = tmp_path / "agent_vol" / "results" / "validation__abcdef123456"
+        assert (dest / "summary.json").exists()
+
+    @pytest.mark.asyncio
+    async def test_second_baseline_eval_is_metered(self, tmp_path):
+        sidecar = _sidecar(tmp_path, split="validation", base_commit="abcdef123456")
+        await sidecar.evaluate(EvalRequest(dataset_id="ds1", split="validation"))
+        await sidecar.evaluate(EvalRequest(dataset_id="ds1", split="validation"))
+        assert sidecar.engine.evaluate.await_args.kwargs["admin"] is False
+
+    @pytest.mark.asyncio
+    async def test_non_baseline_commit_always_metered(self, tmp_path):
+        sidecar = _sidecar(tmp_path, split="validation", base_commit="other000000")
+        await sidecar.evaluate(EvalRequest(dataset_id="ds1", split="validation"))
+        assert sidecar.engine.evaluate.await_args.kwargs["admin"] is False
+
+    @pytest.mark.asyncio
+    async def test_no_base_commit_never_free(self, tmp_path):
+        sidecar = _sidecar(tmp_path, split="validation")  # base_commit=None
+        await sidecar.evaluate(EvalRequest(dataset_id="ds1", split="validation"))
+        assert sidecar.engine.evaluate.await_args.kwargs["admin"] is False
+
+    def test_status_surfaces_free_baseline(self, tmp_path):
+        sidecar = _sidecar(tmp_path, split="train", base_commit="abcdef123456")
+        s = sidecar.status()
+        assert s.base_commit == "abcdef123456"
+        assert s.free_baseline_available is True
+
+    @pytest.mark.asyncio
+    async def test_status_flips_after_use(self, tmp_path):
+        sidecar = _sidecar(tmp_path, split="validation", base_commit="abcdef123456")
+        await sidecar.evaluate(EvalRequest(dataset_id="ds1", split="validation"))
+        assert sidecar.status().free_baseline_available is False


Missing test: admin call on base_commit must not consume the free agent slot

The seven new tests cover the happy path and the boundary cases (second call, non-baseline commit, no base_commit), but none of them verify that an admin evaluate call to the base commit leaves _free_baseline_used = False so that the agent's free slot is still available afterwards. The logic not admin and ... handles this correctly in production, but without a test this invariant is one refactor away from silently regressing — e.g. if someone moves the flag assignment outside the if free_baseline guard.

Prompt To Fix With AI

This is a comment left during a code review. Path: vero/tests/test_harbor_server.py Line: 136-174 Comment: **Missing test: admin call on base_commit must not consume the free agent slot** The seven new tests cover the happy path and the boundary cases (second call, non-baseline commit, no `base_commit`), but none of them verify that an admin evaluate call to the base commit leaves `_free_baseline_used = False` so that the agent's free slot is still available afterwards. The logic `not admin and ...` handles this correctly in production, but without a test this invariant is one refactor away from silently regressing — e.g. if someone moves the flag assignment outside the `if free_baseline` guard. How can I resolve this? If you propose a fix, please make it concise.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

greptile-apps Bot reviewed Jul 3, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(harbor): the agent's first baseline eval is budget-free#25

feat(harbor): the agent's first baseline eval is budget-free#25
shehabyasser-scale wants to merge 1 commit into
harbor-2-mode-b-sample-timeout-warningfrom
harbor-2-free-baseline-eval

shehabyasser-scale commented Jul 3, 2026 •

edited by greptile-apps Bot

Loading

Uh oh!

greptile-apps Bot Jul 3, 2026

Uh oh!

greptile-apps Bot Jul 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

shehabyasser-scale commented Jul 3, 2026 • edited by greptile-apps Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

The problem, observed twice in live trials

The fix

Tests

Greptile Summary

Confidence Score: 3/5

Important Files Changed

Sequence Diagram

Uh oh!

greptile-apps Bot Jul 3, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jul 3, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

shehabyasser-scale commented Jul 3, 2026 •

edited by greptile-apps Bot

Loading