Add automatic provider instance fallback #3482

Open
edoedac0 wants to merge 4 commits into pingdotgg:main from edoedac0:codex/provider-instance-fallback
Conversation

edoedac0 commented Jun 21, 2026

What Changed

Adds opt-in automatic fallback between multiple instances of the same provider driver.

  • Adds a global Automatic fallback setting, disabled by default.
  • Adds a per-instance Use for automatic fallback setting, enabled by default. The control is disabled while global fallback is off and explains why on hover.
  • Retries eligible provider instances in deterministic provider-list order and stops after the first successful attempt.
  • Handles both initial request failures and operational failures that occur during an active task.
  • Updates the thread's active instance and model selection only after a successful handoff.
  • Reports one final success or failure toast, with skipped candidates and their reasons in expandable details.
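
The interaction between the two settings can be sketched roughly as follows. Only `providerFallback.enabled` and `allowFallback` are names that appear in this PR's summaries; the interfaces and helper functions are hypothetical and exist only for illustration.

```typescript
// Hypothetical shapes. Only `providerFallback.enabled` and `allowFallback`
// are names taken from this PR's summaries.
interface ServerSettings {
  providerFallback: { enabled: boolean }; // global opt-in, off by default
}

interface ProviderInstanceSettings {
  id: string;
  allowFallback?: boolean; // per-instance participation, on by default
}

// An instance participates only when the global switch is on AND its own
// flag has not been turned off.
function participatesInFallback(
  settings: ServerSettings,
  instance: ProviderInstanceSettings,
): boolean {
  return settings.providerFallback.enabled && (instance.allowFallback ?? true);
}

// The per-instance control is only interactive while the global switch is on.
function isInstanceControlDisabled(settings: ServerSettings): boolean {
  return !settings.providerFallback.enabled;
}
```

With the global setting off, every instance is excluded regardless of its own flag, which matches the "disabled by default" behavior described above.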

Why

A provider instance can become unusable because of a usage limit, expired authentication, process failure, transport failure, or temporary unavailability. Previously, users with another configured instance of the same provider had to wait for the error to surface, switch instances manually, and restart or continue the task themselves.

This keeps those failures recoverable without changing normal behavior when the feature is disabled or when the error is not operational.

How It Works

Failure classification

Fallback is attempted only for operational failures:

  • rate or usage limits, quotas, and exhausted credits
  • authentication failures
  • network, transport, timeout, and upstream 5xx failures
  • provider process failures
  • unavailable or unsupported provider runtimes

Validation errors, cancellation, permission decisions, malformed prompts, and unrelated provider errors continue through the existing error path.
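
The split between operational and non-operational failures could be modeled along these lines. This is a minimal sketch, not the classifier from this PR: the `FailureKind` names and the HTTP status mapping are assumptions chosen to mirror the lists above.

```typescript
// Hypothetical failure kinds for illustration; the classifier in this PR
// may represent failures differently.
type FailureKind =
  | "rate_limit"
  | "quota_exhausted"
  | "auth"
  | "network"
  | "timeout"
  | "upstream_5xx"
  | "process_crash"
  | "runtime_unavailable"
  | "validation"
  | "cancelled"
  | "permission_decision"
  | "malformed_prompt"
  | "other";

// Kinds that are allowed to trigger fallback.
const OPERATIONAL_KINDS: ReadonlySet<FailureKind> = new Set<FailureKind>([
  "rate_limit",
  "quota_exhausted",
  "auth",
  "network",
  "timeout",
  "upstream_5xx",
  "process_crash",
  "runtime_unavailable",
]);

function isOperationalFailure(kind: FailureKind): boolean {
  return OPERATIONAL_KINDS.has(kind);
}

// Assumed mapping from an HTTP status code to a failure kind.
function kindFromHttpStatus(status: number): FailureKind {
  if (status === 429) return "rate_limit";
  if (status === 401) return "auth";
  if (status >= 500 && status <= 599) return "upstream_5xx";
  return "other"; // everything else stays on the existing error path
}
```

Using an explicit allow-list means an unrecognized failure defaults to the existing error path rather than triggering a handoff.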

Candidate selection

Candidates are considered in provider-list order. An instance is skipped when:

  • it belongs to a different provider driver
  • its per-instance fallback setting is off
  • it is disabled, not installed, unavailable, or in an error state
  • it does not expose the exact model used by the current thread
  • continuation compatibility is required and its provider home/continuation store differs from the active instance
  • starting its session or sending the retry fails

Each skipped instance records a user-facing reason. The workflow stops immediately when one candidate accepts the turn; later candidates are not attempted.
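
The planning step described above can be sketched as a single pass over the provider list. The types, field names, and reason strings here are hypothetical. Skipping the active instance silently is also an assumption. The last skip condition (a failed session start or retry) only becomes known at attempt time, so it is not part of this planning sketch.

```typescript
// Hypothetical instance shape for illustration only.
interface ProviderInstance {
  id: string;
  driver: string;
  allowFallback: boolean;
  status: "ready" | "disabled" | "not_installed" | "unavailable" | "error";
  models: readonly string[];
  continuationGroup: string; // stands in for provider home/continuation store
}

interface SkippedCandidate {
  id: string;
  reason: string; // user-facing explanation
}

interface FallbackPlan {
  candidates: ProviderInstance[];
  skipped: SkippedCandidate[];
}

// Returns a reason when the candidate must be skipped, otherwise undefined.
function skipReason(
  active: ProviderInstance,
  candidate: ProviderInstance,
  model: string,
  requireContinuation: boolean,
): string | undefined {
  if (candidate.driver !== active.driver) return "Different provider driver";
  if (!candidate.allowFallback) return "Automatic fallback is off for this instance";
  if (candidate.status !== "ready") return `Instance is ${candidate.status}`;
  if (!candidate.models.includes(model)) return `Model ${model} is not available`;
  if (requireContinuation && candidate.continuationGroup !== active.continuationGroup) {
    return "Cannot resume this conversation";
  }
  return undefined;
}

// Walks the instances in provider-list order, preserving that order in the plan.
function planCandidates(
  active: ProviderInstance,
  instances: readonly ProviderInstance[],
  model: string,
  requireContinuation: boolean,
): FallbackPlan {
  const plan: FallbackPlan = { candidates: [], skipped: [] };
  for (const candidate of instances) {
    if (candidate.id === active.id) continue; // assumption: never retry the failing instance
    const reason = skipReason(active, candidate, model, requireContinuation);
    if (reason === undefined) plan.candidates.push(candidate);
    else plan.skipped.push({ id: candidate.id, reason });
  }
  return plan;
}
```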

First request versus an active task

For the first user message, a compatible same-driver instance receives the original prompt, attachments, model, runtime mode, and interaction mode.

For an existing conversation or a failure during an active task, fallback additionally requires the same provider-native continuation group. It starts the candidate with the original native resume cursor and sends a hidden "Continue." turn. No summary or synthesized context is inserted, so the provider resumes from its own conversation state.
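
The two retry shapes can be illustrated with a small sketch. The type names and fields are hypothetical, and treating "no resume cursor" as "first user message" is a simplifying assumption. The only detail taken from the description is that a continuation sends a hidden `Continue.` turn alongside the original resume cursor.

```typescript
// Hypothetical request shapes for illustration only.
interface TurnInput {
  prompt: string;
  attachments: readonly string[];
  model: string;
  runtimeMode: string;
  interactionMode: string;
}

type RetryRequest =
  | { kind: "replay"; input: TurnInput }
  | { kind: "continue"; resumeCursor: string; prompt: "Continue."; hidden: true };

// Assumption: a missing resume cursor means this is the first user message.
function buildRetryRequest(
  original: TurnInput,
  resumeCursor: string | undefined,
): RetryRequest {
  if (resumeCursor === undefined) {
    // First request: the candidate receives the original input unchanged.
    return { kind: "replay", input: original };
  }
  // Existing conversation: resume provider-native state, no synthesized context.
  return { kind: "continue", resumeCursor, prompt: "Continue.", hidden: true };
}
```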

Success and total failure

On success, the server commits the candidate session and model selection, persists one fallback activity, and the UI shows:

Switched from Original Instance to New Instance

Switched after “Original Instance” reported: error details

Skipped candidates are available under expandable toast details.

If every candidate is skipped or fails, the server restores the original provider binding when necessary, preserves the original model selection, emits one fallback-failed activity, and then allows the original provider error to follow the existing error path. If fallback infrastructure itself fails unexpectedly, that failure is logged and the original error is still surfaced.
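
The stop-at-first-success loop and the two final outcomes can be sketched as below. The activity names `provider.fallback.succeeded` and `provider.fallback.failed` come from this PR's summaries; everything else is hypothetical. The attempt callback is synchronous here for brevity, whereas starting a real session would be asynchronous.

```typescript
interface SkippedCandidate {
  id: string;
  reason: string;
}

type TrialResult = { ok: true } | { ok: false; reason: string };

type FallbackActivity =
  | { type: "provider.fallback.succeeded"; from: string; to: string; skipped: SkippedCandidate[] }
  | { type: "provider.fallback.failed"; from: string; skipped: SkippedCandidate[] };

// Tries candidates in the given order and stops at the first one that accepts
// the turn. `tryCandidate` is a hypothetical stand-in for "start the session
// and send the retry".
function attemptFallback(
  activeId: string,
  candidateIds: readonly string[],
  tryCandidate: (id: string) => TrialResult,
): FallbackActivity {
  const skipped: SkippedCandidate[] = [];
  for (const id of candidateIds) {
    const result = tryCandidate(id);
    if (result.ok) {
      // Later candidates are never attempted.
      return { type: "provider.fallback.succeeded", from: activeId, to: id, skipped };
    }
    skipped.push({ id, reason: result.reason });
  }
  // Exhausted: the caller keeps the original binding and surfaces the original error.
  return { type: "provider.fallback.failed", from: activeId, skipped };
}
```

Returning exactly one activity per run lines up with the single final toast described above, with `skipped` feeding the expandable details.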

UI Changes

Before this change, the Providers settings page had no fallback controls and operational provider errors surfaced immediately. The screenshots below show the new states.

Global opt-in

Providers settings with automatic fallback

Per-instance participation

Per-instance automatic fallback setting

Disabled control explanation

Disabled per-instance setting tooltip

Exhausted fallback details

Fallback failure toast with skipped instance details

Successful switching demo

The recording shows an operational failure on the active provider instance, the automatic handoff, the final switch toast, and the updated active instance/model selection.

https://raw.githubusercontent.com/edoedac0/t3code/7322c743ac3586fe82839bd25a8bda40b4019c39/pr-assets/provider-instance-fallback/successful-switch.mp4

Validation

  • ./node_modules/.bin/vp check — passed (0 errors; 20 existing warnings)
  • ./node_modules/.bin/vp run typecheck — passed
  • ./node_modules/.bin/vp test — 536 files passed, 2 skipped; 4,073 tests passed, 7 skipped

Checklist

  • This PR is focused on one provider reliability feature
  • I explained what changed and why
  • I included clear screenshots for the UI states
  • I included a video demonstrating the successful switching interaction

Note

Add automatic provider instance fallback when a provider turn fails

  • Introduces a full fallback workflow (providerFallbackWorkflow.ts) that plans candidate provider instances, serializes concurrent attempts per thread, switches the session/model selection, and emits provider.fallback.succeeded or provider.fallback.failed activity records.
  • Adds providerFallbackChain.ts to track attempted instances across a chain per thread, and providerFallbackTrialGate.ts to defer or reject runtime events from a candidate instance until the trial commits.
  • Updates ProviderCommandReactor.ts and ProviderRuntimeIngestion.ts to trigger fallback on classified service and runtime failures, with deduplication and chain reset on new user turns.
  • Adds a global providerFallback.enabled server setting (default false) and a per-instance allowFallback flag so operators control participation.
  • The chat UI shows toasts summarizing fallback success or failure, including skipped instances, via ChatView.tsx.
  • Risk: fallback is off by default but once enabled, a failed turn triggers session replacement and model selection changes on the thread automatically.

Macroscope summarized 43d4637.


Note

High Risk
Changes core provider session binding, turn recovery, and runtime event ingestion; misclassification or trial/handoff bugs could hide errors, leak partial output, or leave threads on the wrong instance.

Overview
Introduces automatic provider instance fallback (off by default) so threads can recover from operational failures by trying other same-driver instances before surfacing errors.

Server orchestration classifies service and runtime failures, plans candidates in provider-list order (model match, availability, per-instance allowFallback, continuation compatibility, no retries within the current chain), and runs attemptProviderFallback under a per-thread lock. Turn-start failures in ProviderCommandReactor attempt fallback before the usual failure activity; ProviderRuntimeIngestion does the same on mid-task failures with a hidden "Continue." turn, filters stale instance events, and uses a trial gate so provisional candidate output is held until a handoff commits or is discarded.
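
The trial gate idea (hold provisional output until the handoff commits or is discarded) can be illustrated with a small buffer. This is a sketch under assumptions, not the contents of providerFallbackTrialGate.ts: the class name, method names, and state names are hypothetical.

```typescript
type GateState = "trial" | "committed" | "discarded";

// Buffers events from a candidate instance while its trial is undecided.
class TrialGate<E> {
  private readonly forward: (event: E) => void;
  private held: E[] = [];
  private state: GateState = "trial";

  constructor(forward: (event: E) => void) {
    this.forward = forward;
  }

  // During the trial events are held; after commit they pass straight
  // through; after discard they are dropped.
  push(event: E): void {
    if (this.state === "committed") this.forward(event);
    else if (this.state === "trial") this.held.push(event);
  }

  // Handoff succeeded: release everything that was held, in arrival order.
  commit(): void {
    if (this.state !== "trial") return;
    this.state = "committed";
    const pending = this.held;
    this.held = [];
    for (const event of pending) this.forward(event);
  }

  // Handoff abandoned: forget the provisional output.
  discard(): void {
    if (this.state !== "trial") return;
    this.state = "discarded";
    this.held = [];
  }
}
```

Holding events this way keeps partial output from a candidate that later fails out of the thread, which is one of the risks called out above.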

Contracts & UI: ServerSettings.providerFallback.enabled and per-instance allowFallback; settings switches and ChatView toasts for provider.fallback.succeeded / provider.fallback.failed with skipped-instance details.

Reviewed by Cursor Bugbot for commit 43d4637.

coderabbitai Bot commented Jun 21, 2026

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: ccb89775-2dc8-425e-ba70-a21586cdbf95


github-actions Bot added the vouch:unvouched (PR author is not yet trusted in the VOUCHED list) and size:XL (500-999 changed lines, additions + deletions) labels Jun 21, 2026
edoedac0 marked this pull request as ready for review June 21, 2026 12:18
edoedac0 changed the title from "[codex] add automatic provider instance fallback" to "Add automatic provider instance fallback" Jun 21, 2026
Comment thread apps/server/src/orchestration/providerFallbackWorkflow.ts
Comment thread apps/server/src/orchestration/Layers/ProviderCommandReactor.ts Outdated
Comment thread apps/server/src/orchestration/providerFallbackWorkflow.ts
Comment thread apps/server/src/orchestration/Layers/ProviderRuntimeIngestion.ts
Comment thread apps/server/src/orchestration/Layers/ProviderCommandReactor.ts
activeThreadKey,
new Set(threadActivities.map((activity) => String(activity.id))),
);
return;


First open skips fallback toast

Medium Severity

The fallback toast effect seeds every existing activity id the first time a thread key is seen, then returns without showing toasts. If automatic fallback finishes while the user is on another thread (or before they open that thread in the session), the success or failure activity is already in the list and no toast is shown.
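
The reported behavior can be reproduced in a minimal model of the seeding logic. This is not the code from ChatView.tsx: the function name, the `Activity` shape, and the map of seen ids are hypothetical, and only the seed-then-return behavior follows the description above.

```typescript
interface Activity {
  id: string;
  kind: string;
}

const FALLBACK_KINDS: ReadonlySet<string> = new Set([
  "provider.fallback.succeeded",
  "provider.fallback.failed",
]);

// Returns the fallback activities that would produce a toast for this render.
function fallbackToastsToShow(
  seenByThread: Map<string, Set<string>>,
  threadKey: string,
  activities: readonly Activity[],
): Activity[] {
  const seen = seenByThread.get(threadKey);
  if (seen === undefined) {
    // First time this thread key is seen: every existing id is seeded and
    // nothing is shown, so a fallback that already finished never toasts.
    seenByThread.set(threadKey, new Set(activities.map((a) => a.id)));
    return [];
  }
  const fresh = activities.filter((a) => !seen.has(a.id));
  for (const a of fresh) seen.add(a.id);
  return fresh.filter((a) => FALLBACK_KINDS.has(a.kind));
}
```

In this model, a `provider.fallback.succeeded` activity that exists before the thread is first opened lands in the seed set and is silently skipped, while one that arrives afterwards is shown once.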

Reviewed by Cursor Bugbot for commit 26e4553.

macroscopeapp Bot commented Jun 21, 2026

Approvability

Verdict: Needs human review

This PR introduces automatic provider instance fallback - a substantial new feature with complex orchestration logic, new state management, and significant runtime behavior changes. New features of this scope warrant human review. An unresolved bug report in the toast handling adds further reason for review.

Comment thread apps/server/src/orchestration/Layers/ProviderRuntimeIngestion.ts Outdated
edoedac0 added 2 commits June 21, 2026 15:54
- Preserve the original instance/session in fallback chains
- Emit restore metadata and clearer chat status when no fallback succeeds
github-actions Bot added the size:XXL label (1,000+ changed lines, additions + deletions) and removed the size:XL label (500-999 changed lines, additions + deletions) Jun 21, 2026
Labels

size:XXL (1,000+ changed lines, additions + deletions), vouch:unvouched (PR author is not yet trusted in the VOUCHED list)
