Retry transient agent-turn failures in the sandbox runner #102

Merged: lsfera merged 1 commit into main from agent-retry-transient-errors on Jul 3, 2026
@lsfera (Owner) commented Jul 3, 2026

Summary

  • sandbox.run() occasionally fails on a transient network blip (e.g. "API Error: Server disconnected", hit live during headroom testing) rather than on a real task failure. Previously, each such failure burned a full orchestrator-level retry (the ready-for-agent re-queue from #76, "Re-queue failed sandboxes: handle SandboxFailed with bounded retry"), even though a quick in-place retry could resolve it.
  • Extracted the existing withRetry helper from main.ts into a standalone .sandcastle/retry.ts. This avoids a circular import, since main.ts already imports from sandbox-runner.ts. Also added an optional shouldRetry predicate so callers can skip errors that a retry can't fix (session limits, auth failures). The change is fully backward compatible: by default, every error is still retried.
  • Added isTransientAgentError in sandbox-runner.ts (matches server-disconnect/connection-reset style messages) and wrapped both sandbox.run() call sites with withRetry(..., { shouldRetry: isTransientAgentError }).
  • Considered using Effect's built-in retry combinators (per suggestion), but effect is only bundled/inlined inside @ai-hero/sandcastle's own dist bundle. It is neither a real dependency of this repo nor part of sandcastle.run()'s Promise-based API, so extending the existing helper was the right-sized fix.
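
For readers without the diff open, here is a minimal sketch of how a withRetry helper with a shouldRetry predicate and an isTransientAgentError classifier could fit together. It is not the code from this PR: the option names retries and delayMs, their defaults, the TRANSIENT_PATTERNS list, and the runAgentTurn stand-in are all assumptions made for illustration. Only the names withRetry, shouldRetry, and isTransientAgentError come from the description above.

```typescript
// Illustrative sketch only; the real retry.ts and sandbox-runner.ts may differ.

interface RetryOptions {
  retries?: number; // assumed option: extra attempts after the first failure
  delayMs?: number; // assumed option: fixed pause between attempts
  shouldRetry?: (error: unknown) => boolean; // omitted means "retry everything"
}

async function withRetry<T>(
  fn: () => Promise<T>,
  options: RetryOptions = {},
): Promise<T> {
  const { retries = 2, delayMs = 0, shouldRetry = () => true } = options;
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (error) {
      // Give up when attempts are exhausted or the error is not retryable.
      if (attempt >= retries || !shouldRetry(error)) throw error;
      if (delayMs > 0) {
        await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
}

// Assumed patterns for "server-disconnect/connection-reset style messages".
const TRANSIENT_PATTERNS: RegExp[] = [
  /server disconnected/i,
  /connection reset/i,
  /ECONNRESET/,
];

function isTransientAgentError(error: unknown): boolean {
  const message = error instanceof Error ? error.message : String(error);
  return TRANSIENT_PATTERNS.some((pattern) => pattern.test(message));
}

// Hypothetical stand-in for a sandbox.run() call site: wraps any
// Promise-returning agent turn so only transient failures are retried.
function runAgentTurn<T>(run: () => Promise<T>): Promise<T> {
  return withRetry(run, { shouldRetry: isTransientAgentError });
}
```

With this shape, an error such as "API Error: Server disconnected" is retried in place, while anything the predicate rejects (for example an auth failure) is rethrown on the first attempt and falls through to the orchestrator-level re-queue.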

Test plan

  • npm test — 229/229 passing, including new retry.test.ts (7 tests) and 5 new isTransientAgentError tests in sandbox-runner.test.ts
  • npm run typecheck — clean

🤖 Generated with Claude Code

sandbox.run() calls occasionally fail on transient network blips (e.g.
"API Error: Server disconnected") rather than real task failures.
Extract withRetry into a standalone retry.ts (avoids a circular import
between main.ts and sandbox-runner.ts, which main.ts already imports
from), add a shouldRetry predicate so callers can skip retrying errors
that retrying can't fix (session limits, auth), and wrap both
sandbox.run() call sites in sandbox-runner.ts with a new
isTransientAgentError classifier.

Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
@lsfera merged commit 1b782f6 into main on Jul 3, 2026
3 checks passed
@lsfera deleted the agent-retry-transient-errors branch on July 3, 2026 16:12