fix(harbor): salvage completed trials when the nested run times out#23
Open
shehabyasser-scale wants to merge 1 commit into
Open
fix(harbor): salvage completed trials when the nested run times out#23shehabyasser-scale wants to merge 1 commit into
shehabyasser-scale wants to merge 1 commit into
Conversation
A vero-side timeout on the nested `harbor run` raised SubprocessTimeoutError uncaught: the agent got a bare HTTP 500, no partial results were collated, and the already-reserved budget (never refunded by design) bought zero information. Completed trials are on disk, so treat a timeout like a non-zero exit: warn and collate what finished; tasks that were cut off become error samples. With zero completed trials the collate mismatch guard still fails loudly. Also warn whenever a mean-of-k sample scored fewer attempts than configured, so k never shrinks silently (n_scored records the actual k). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
891207c to
8a91465
Compare
b81a902 to
3dd10b8
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Stacked on #22.
The bug
A vero-side timeout on the nested
harbor runraisedSubprocessTimeoutErroruncaught: the agent got a bare HTTP 500, no partial results were collated, and the already-reserved budget (never refunded by design) bought zero information. With mean-of-k (n_attempts=3) tripling nested-run wall clock, this failure mode gets strictly more likely.The fix
n_scoredin the metrics records the actual k).Tests: timeout with partial trials collates the survivor and errors the cut-off task; timeout with zero trials raises; partial-k mean warns.
🤖 Generated with Claude Code
Greptile Summary
This PR fixes the
SubprocessTimeoutErrorfrom a timed-out nestedharbor runbeing propagated as an uncaught HTTP 500. The fix catches the exception in_run_harbor, warns with the stderr tail, and returns early so_collatecan salvage whatever trials landed on disk before the cutoff. A second change adds aWARNINGlog whenever mean-aggregation uses fewer scored attempts thann_attemptsconfigures, so k-shrinkage is never silent._run_harbor):SubprocessTimeoutErroris now caught; the zero-trial case still raises via the existing_collateguard, while partial-completion cases score what finished and emit error samples for the rest._sample_result): fires wheneverlen(scored) < self.config.n_attemptsin mean-aggregation mode;n_scoredin metrics records the actual k.n_scoredcorrect).Confidence Score: 4/5
Safe to merge for the stated fix; a known edge case in the resume-after-repeated-timeout path (already captured in a prior review comment) misfires the wrong collate guard and should be addressed before that flow is exercised in production.
The timeout-salvage logic is correct for first-time invocations: caught exceptions return early, the zero-trial guard still raises, and partial-completion trials are collated correctly. The partial-k warning fires correctly for mean-aggregation mode. The unresolved issue is the resume path: when a task has already been saved as an error sample and retried, its second timeout leaves only the first run's other task results in the shared jobs_dir; the guard's name-match check raises with a misleading 'task names must use canonical form' message instead of the true 'task kept timing out' diagnosis. This was documented in the previous review round and is still present in the current code.
vero/src/vero/harbor/runner.py — specifically the _collate guard interaction with the resume-after-timeout flow when jobs_dir contains stale results from a prior run.
Important Files Changed
Sequence Diagram
%%{init: {'theme': 'neutral'}}%% sequenceDiagram participant PSR as produce_sample_results participant RH as _run_harbor participant Sub as run_subprocess_with_tee participant Col as _collate PSR->>RH: await _run_harbor(pending_tasks, jobs_dir) RH->>Sub: await run_subprocess_with_tee(cmd, timeout) alt "normal exit (rc=0)" Sub-->>RH: "SubprocessResult(rc=0)" RH-->>PSR: return else non-zero exit Sub-->>RH: SubprocessResult(rc≠0) RH->>RH: logger.warning(stderr[:500]) RH-->>PSR: return else timeout (NEW) Sub-->>RH: raise SubprocessTimeoutError(result) RH->>RH: logger.warning(stderr[-500:]) RH-->>PSR: return (early, no raise) end PSR->>Col: "_collate(jobs_dir, pairs, ran=pending_tasks)" Col->>Col: _load_trials(jobs_dir) alt no trials on disk Col-->>PSR: raise RuntimeError(no trial results) else trials exist but none match ran tasks Col-->>PSR: raise RuntimeError(none match requested) else one or more matching trials loop each (sample_id, task_name) alt trial found on disk Col->>Col: "_sample_result -> SampleResult(score)" else task timed out / never ran Col->>Col: SampleResult(error) end Col->>Col: save_sample_result end Col-->>PSR: return end%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%% sequenceDiagram participant PSR as produce_sample_results participant RH as _run_harbor participant Sub as run_subprocess_with_tee participant Col as _collate PSR->>RH: await _run_harbor(pending_tasks, jobs_dir) RH->>Sub: await run_subprocess_with_tee(cmd, timeout) alt "normal exit (rc=0)" Sub-->>RH: "SubprocessResult(rc=0)" RH-->>PSR: return else non-zero exit Sub-->>RH: SubprocessResult(rc≠0) RH->>RH: logger.warning(stderr[:500]) RH-->>PSR: return else timeout (NEW) Sub-->>RH: raise SubprocessTimeoutError(result) RH->>RH: logger.warning(stderr[-500:]) RH-->>PSR: return (early, no raise) end PSR->>Col: "_collate(jobs_dir, pairs, ran=pending_tasks)" Col->>Col: _load_trials(jobs_dir) alt no trials on disk Col-->>PSR: raise RuntimeError(no trial results) else trials exist but none match ran tasks Col-->>PSR: raise RuntimeError(none match requested) else one or more matching trials loop each (sample_id, task_name) alt trial found on disk Col->>Col: "_sample_result -> SampleResult(score)" else task timed out / never ran Col->>Col: SampleResult(error) end Col->>Col: save_sample_result end Col-->>PSR: return endComments Outside Diff (1)
vero/src/vero/harbor/runner.py, line 61 (link)When
produce_sample_resultsis retried after a partial-completion timeout (e.g.,t0finished and was saved with a score,t1got an error sample and will be retried), the second invocation setspending = [(1, "t1")]andran = ["t1"]. Ift1times out again with nothing written to disk,_load_trialsscans the sharedjobs_dirand findst0'sresult.jsonfrom the first run — sotrials = {"t0": ...}. The guard then checksnot any(t in trials for t in ran)→not ("t1" in {"t0": ...})→Trueand raises:This is the wrong diagnosis — there is no keying mismatch;
t1just keeps timing out. Before this PR theSubprocessTimeoutErrorpropagated directly and gave the correct signal. The guard was designed to fire when ALL ran-task results are absent fromtrials, but it cannot distinguish "absent because never written by this run" from "absent because a prior run wrote something else". A targeted fix is to restrict the guard'strialsview to only those entries whose keys appear inran.Prompt To Fix With AI
Reviews (2): Last reviewed commit: "fix(harbor): salvage completed trials wh..." | Re-trigger Greptile