fix(harbor): salvage completed trials when the nested run times out by shehabyasser-scale · Pull Request #23 · scaleapi/vero

shehabyasser-scale · 2026-07-03T12:25:21Z

Stacked on #22.

The bug

A vero-side timeout on the nested harbor run raised SubprocessTimeoutError uncaught: the agent got a bare HTTP 500, no partial results were collated, and the already-reserved budget (never refunded by design) bought zero information. With mean-of-k (n_attempts=3) tripling nested-run wall clock, this failure mode gets strictly more likely.

The fix

Treat a timeout like a non-zero exit: warn (including the captured stderr tail) and collate whatever trials completed before the cutoff; tasks that were cut off become error samples. With zero completed trials, the collate mismatch guard from #16 ("fix(harbor): refuse to score all-zero when nested trials match no task names") still fails loudly rather than recording an all-error experiment (pinned by a test).
Warn whenever a mean-of-k sample scored fewer attempts than configured, so k never shrinks silently (n_scored in the metrics records the actual k).

Tests: timeout with partial trials collates the survivor and errors the cut-off task; timeout with zero trials raises; partial-k mean warns.
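
A minimal sketch of the "treat a timeout like a non-zero exit" idea, not the actual runner code. The real `_run_harbor` is async and uses the project's own `SubprocessTimeoutError`; the exception class, `run_nested`, and the `launch` callable below are hypothetical stand-ins so the example runs on its own.

```python
import logging

logger = logging.getLogger("harbor_sketch")


class SubprocessTimeoutError(Exception):
    """Stand-in for the real exception (assumption: it exposes captured stderr)."""

    def __init__(self, stderr: str) -> None:
        super().__init__("nested run timed out")
        self.stderr = stderr


def run_nested(launch) -> bool:
    """Call `launch`; return True if it finished, False if it timed out.

    A timeout is logged with the tail of stderr instead of being re-raised,
    so the caller can still collate whatever trials reached disk.
    """
    try:
        launch()
    except SubprocessTimeoutError as exc:
        # The tail, not the head: the last lines say what was running at the cutoff.
        logger.warning("nested run timed out; stderr tail: %s", exc.stderr[-500:])
        return False
    return True
```

The return value is only there to make the sketch easy to check; the point is that control returns to the caller, which proceeds to collation either way.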

🤖 Generated with Claude Code

Greptile Summary

This PR stops a SubprocessTimeoutError from a timed-out nested harbor run from propagating uncaught and surfacing as an HTTP 500. The fix catches the exception in _run_harbor, warns with the stderr tail, and returns early so _collate can salvage whatever trials landed on disk before the cutoff. A second change adds a WARNING log whenever mean-aggregation uses fewer scored attempts than n_attempts configures, so k-shrinkage is never silent.

Timeout salvage (_run_harbor): SubprocessTimeoutError is now caught; the zero-trial case still raises via the existing _collate guard, while partial-completion cases score what finished and emit error samples for the rest.
Partial-k warning (_sample_result): fires whenever len(scored) < self.config.n_attempts in mean-aggregation mode; n_scored in metrics records the actual k.
Tests cover timeout with one completed trial (salvage), timeout with no trials (guard raises), and partial-k mean (warning fires + n_scored correct).
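
The partial-k warning can be illustrated roughly as follows. This is a sketch under assumptions: `mean_of_k` is a hypothetical helper, and only the `len(scored) < n_attempts` condition and the `n_scored` metric name come from the summary above.

```python
import logging
from statistics import fmean

logger = logging.getLogger("harbor_sketch")


def mean_of_k(scored: list[float], n_attempts: int) -> dict:
    """Average the scored attempts and record how many there actually were."""
    if not scored:
        raise ValueError("no scored attempts to average")
    if len(scored) < n_attempts:
        # k shrank: say so rather than silently averaging over fewer attempts.
        logger.warning("scored %d of %d configured attempts", len(scored), n_attempts)
    return {"score": fmean(scored), "n_scored": len(scored)}
```

With `n_attempts=3` and only two scored attempts, the mean is taken over two values, the warning fires, and `n_scored` is 2.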

Confidence Score: 4/5

Safe to merge for the stated fix; a known edge case in the resume-after-repeated-timeout path (already captured in a prior review comment) misfires the wrong collate guard and should be addressed before that flow is exercised in production.

The timeout-salvage logic is correct for first-time invocations: caught exceptions return early, the zero-trial guard still raises, and partial-completion trials are collated correctly. The partial-k warning fires correctly for mean-aggregation mode. The unresolved issue is the resume path: when a task has already been saved as an error sample and retried, its second timeout leaves only the first run's other task results in the shared jobs_dir; the guard's name-match check raises with a misleading 'task names must use canonical form' message instead of the true 'task kept timing out' diagnosis. This was documented in the previous review round and is still present in the current code.

vero/src/vero/harbor/runner.py — specifically the _collate guard interaction with the resume-after-timeout flow when jobs_dir contains stale results from a prior run.

Important Files Changed

Filename	Overview
vero/src/vero/harbor/runner.py	Adds SubprocessTimeoutError catch in _run_harbor (returns early for salvage) and partial-k warning in _sample_result; the resume + repeated-timeout case can still trigger the wrong "none match the requested task names" guard (noted in previous comments).
vero/tests/test_harbor_runner.py	Adds TestTimeoutSalvage with three well-targeted cases: partial-completion salvage, zero-trial guard, and partial-k warning; all correctly patch at the module-level import site.

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant PSR as produce_sample_results
    participant RH as _run_harbor
    participant Sub as run_subprocess_with_tee
    participant Col as _collate

    PSR->>RH: await _run_harbor(pending_tasks, jobs_dir)
    RH->>Sub: await run_subprocess_with_tee(cmd, timeout)
    alt "normal exit (rc=0)"
        Sub-->>RH: "SubprocessResult(rc=0)"
        RH-->>PSR: return
    else non-zero exit
        Sub-->>RH: SubprocessResult(rc≠0)
        RH->>RH: logger.warning(stderr[:500])
        RH-->>PSR: return
    else timeout (NEW)
        Sub-->>RH: raise SubprocessTimeoutError(result)
        RH->>RH: logger.warning(stderr[-500:])
        RH-->>PSR: return (early, no raise)
    end
    PSR->>Col: "_collate(jobs_dir, pairs, ran=pending_tasks)"
    Col->>Col: _load_trials(jobs_dir)
    alt no trials on disk
        Col-->>PSR: raise RuntimeError(no trial results)
    else trials exist but none match ran tasks
        Col-->>PSR: raise RuntimeError(none match requested)
    else one or more matching trials
        loop each (sample_id, task_name)
            alt trial found on disk
                Col->>Col: "_sample_result -> SampleResult(score)"
            else task timed out / never ran
                Col->>Col: SampleResult(error)
            end
            Col->>Col: save_sample_result
        end
        Col-->>PSR: return
    end
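
The collate branch of the diagram could be sketched like this. It is a simplification, not the real `_collate`: trials are modelled as a plain name-to-score dict instead of being loaded from `jobs_dir`, results are returned rather than saved, and the error strings are placeholders.

```python
def collate(
    trials: dict[str, float],
    pairs: list[tuple[int, str]],
    ran: list[str],
) -> dict[int, dict]:
    """Map each (sample_id, task_name) pair to a score or an error sample."""
    if not trials:
        raise RuntimeError("no trial results")
    if not any(task in trials for task in ran):
        raise RuntimeError("none match requested")

    results: dict[int, dict] = {}
    for sample_id, task_name in pairs:
        if task_name in trials:
            results[sample_id] = {"score": trials[task_name]}
        else:
            # Cut off by the timeout, or never started.
            results[sample_id] = {"error": "task timed out or never ran"}
    return results
```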

Comments Outside Diff (1)

vero/src/vero/harbor/runner.py, line 61 (link)

Resume + complete-timeout triggers wrong guard message

When produce_sample_results is retried after a partial-completion timeout (e.g., t0 finished and was saved with a score, t1 got an error sample and will be retried), the second invocation sets pending = [(1, "t1")] and ran = ["t1"]. If t1 times out again with nothing written to disk, _load_trials scans the shared jobs_dir and finds t0's result.json from the first run — so trials = {"t0": ...}. The guard then checks not any(t in trials for t in ran) → not ("t1" in {"t0": ...}) → True and raises:

"produced 1 trial result(s), but none match the requested task names (requested e.g. 't1'; recorded e.g. 't0'). Task names must use harbor's canonical <org>/<name> form"

This is the wrong diagnosis — there is no keying mismatch; t1 just keeps timing out. Before this PR the SubprocessTimeoutError propagated directly and gave the correct signal. The guard was designed to fire when ALL ran-task results are absent from trials, but it cannot distinguish "absent because never written by this run" from "absent because a prior run wrote something else". A targeted fix is to restrict the guard's trials view to only those entries whose keys appear in ran.
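
One possible reading of the suggested fix, as a sketch only. `relevant_trials` is a hypothetical helper and the error text is invented for illustration; the idea taken from the comment is to filter `trials` down to keys in `ran` before the guard runs, so stale results from an earlier invocation cannot trigger the name-mismatch message.

```python
def relevant_trials(trials: dict[str, object], ran: list[str]) -> dict[str, object]:
    """Keep only trials for tasks this invocation actually ran."""
    wanted = set(ran)
    relevant = {name: trial for name, trial in trials.items() if name in wanted}
    if not relevant:
        # Stale entries (e.g. t0 from a prior run) no longer count as evidence
        # of a naming problem; report the absence of results instead.
        raise RuntimeError(
            f"no trial results for the requested tasks (e.g. {ran[0]!r}); "
            "the nested run may have timed out before writing any"
        )
    return relevant
```

A trade-off of this approach is that a genuine naming mismatch would also land in this branch, so the message may need to mention both possibilities.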

Reviews (2): Last reviewed commit: "fix(harbor): salvage completed trials wh..."

A vero-side timeout on the nested `harbor run` raised SubprocessTimeoutError uncaught: the agent got a bare HTTP 500, no partial results were collated, and the already-reserved budget (never refunded by design) bought zero information.

Completed trials are on disk, so treat a timeout like a non-zero exit: warn and collate what finished; tasks that were cut off become error samples. With zero completed trials the collate mismatch guard still fails loudly.

Also warn whenever a mean-of-k sample scored fewer attempts than configured, so k never shrinks silently (n_scored records the actual k).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

shehabyasser-scale force-pushed the harbor-3-aggregate-validation branch from 891207c to 8a91465 Compare July 3, 2026 13:01

shehabyasser-scale force-pushed the harbor-3-timeout-salvage branch from b81a902 to 3dd10b8 Compare July 3, 2026 13:01

shehabyasser-scale wants to merge 1 commit into harbor-3-aggregate-validation from harbor-3-timeout-salvage