Retry transient agent-turn failures in the sandbox runner #102
Merged
sandbox.run() calls occasionally fail on transient network blips (e.g. "API Error: Server disconnected") rather than real task failures. Extract withRetry into a standalone retry.ts (avoids a circular import between main.ts and sandbox-runner.ts, which main.ts already imports from), add a shouldRetry predicate so callers can skip retrying errors that retrying can't fix (session limits, auth), and wrap both sandbox.run() call sites in sandbox-runner.ts with a new isTransientAgentError classifier.

Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
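For readers without the diff open, the shape of the change described above might look roughly like the sketch below. It is an illustration under assumptions, not the merged code: the `RetryOptions` name, the `attempts` and `delayMs` fields and their defaults, and the `TRANSIENT_PATTERNS` list are all hypothetical. Only `withRetry`, `shouldRetry`, and `isTransientAgentError` are names taken from this PR.

```typescript
// Hypothetical sketch of retry.ts plus the classifier; the real files may differ.

interface RetryOptions {
  // Total attempts, including the first call (assumed default).
  attempts?: number;
  // Fixed delay between attempts, in milliseconds (assumed default).
  delayMs?: number;
  // Return false to rethrow immediately. Defaults to retrying every error,
  // which keeps existing callers backward compatible.
  shouldRetry?: (error: unknown) => boolean;
}

async function withRetry<T>(
  fn: () => Promise<T>,
  options: RetryOptions = {},
): Promise<T> {
  const { attempts = 3, delayMs = 1000, shouldRetry = () => true } = options;
  const maxAttempts = Math.max(1, attempts);
  let lastError: unknown;

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      // Stop when attempts are exhausted or the caller says retrying cannot help.
      if (attempt === maxAttempts || !shouldRetry(error)) break;
      await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw lastError;
}

// Assumed message patterns; the PR only says the classifier matches
// "server-disconnect/connection-reset style messages".
const TRANSIENT_PATTERNS: RegExp[] = [
  /server disconnected/i,
  /connection reset/i,
  /ECONNRESET/,
];

function isTransientAgentError(error: unknown): boolean {
  const message = error instanceof Error ? error.message : String(error);
  return TRANSIENT_PATTERNS.some((pattern) => pattern.test(message));
}

// Usage at a call site (arguments to sandbox.run() omitted, as in the PR text):
//   await withRetry(() => sandbox.run(/* ... */), { shouldRetry: isTransientAgentError });
```

Keeping the classifier as a predicate passed by the caller means the retry helper itself stays generic, while session-limit and auth errors fall through to the caller on the first failure.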
Summary
- `sandbox.run()` fails on a transient network blip (e.g. `API Error: Server disconnected`, hit live during headroom testing) rather than a real task failure. Previously this burned a full orchestrator-level retry (the `ready-for-agent` re-queue from #76, "Re-queue failed sandboxes: handle SandboxFailed with bounded retry") for something a quick in-place retry could resolve.
- Extracted the `withRetry` helper out of `main.ts` into a standalone `.sandcastle/retry.ts` (avoids a circular import, since `main.ts` already imports from `sandbox-runner.ts`), and added an optional `shouldRetry` predicate so callers can skip retrying errors that retrying can't fix (session limits, auth failures). This is fully backward compatible: it defaults to retrying everything.
- Added `isTransientAgentError` in `sandbox-runner.ts` (matches server-disconnect/connection-reset style messages) and wrapped both `sandbox.run()` call sites with `withRetry(..., { shouldRetry: isTransientAgentError })`.
- `effect` is only bundled/inlined inside `@ai-hero/sandcastle`'s own dist bundle, not a real dependency of this repo or of `sandcastle.run()`'s Promise-based API, so extending the existing helper was the right-sized fix.

Test plan
- `npm test`: 229/229 passing, including new `retry.test.ts` (7 tests) and 5 new `isTransientAgentError` tests in `sandbox-runner.test.ts`
- `npm run typecheck`: clean

🤖 Generated with Claude Code