Add automatic provider instance fallback by edoedac0 · Pull Request #3482 · pingdotgg/t3code

edoedac0 · 2026-06-21T12:14:43Z

What Changed

Adds opt-in automatic fallback between multiple instances of the same provider driver.

Adds a global Automatic fallback setting, disabled by default.
Adds a per-instance Use for automatic fallback setting, enabled by default. The control is disabled while global fallback is off and explains why on hover.
Retries eligible provider instances in deterministic provider-list order and stops after the first successful attempt.
Handles both initial request failures and operational failures that occur during an active task.
Updates the thread's active instance and model selection only after a successful handoff.
Reports one final success or failure toast, with skipped candidates and their reasons in expandable details.

Why

A provider instance can become unusable because of a usage limit, expired authentication, process failure, transport failure, or temporary unavailability. Users with another configured instance of the same provider previously had to notice the error, switch instances manually, and restart or continue the task themselves.

This change makes those failures recoverable without altering normal behavior when the feature is disabled or when the error is not operational.

How It Works

Failure classification

Fallback is attempted only for operational failures:

rate or usage limits, quotas, and exhausted credits
authentication failures
network, transport, timeout, and upstream 5xx failures
provider process failures
unavailable or unsupported provider runtimes

Validation errors, cancellation, permission decisions, malformed prompts, and unrelated provider errors continue through the existing error path.
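The classification rule above can be sketched as a small predicate. This is only an illustration: the `ProviderFailure` shape, the `FailureKind` names, and `isOperationalFailure` are hypothetical and do not come from the PR's code, which is not shown on this page.

```typescript
// Hypothetical failure shape; the PR's real error types are not shown here.
type FailureKind =
  | "rate_limit"
  | "auth"
  | "transport"
  | "process_exit"
  | "runtime_unavailable"
  | "validation"
  | "cancelled"
  | "permission_denied"
  | "malformed_prompt"
  | "other";

type ProviderFailure = { kind: FailureKind; httpStatus?: number };

// Kinds that map to the operational categories listed in the description.
const OPERATIONAL_KINDS: ReadonlySet<FailureKind> = new Set<FailureKind>([
  "rate_limit",
  "auth",
  "transport",
  "process_exit",
  "runtime_unavailable",
]);

function isOperationalFailure(failure: ProviderFailure): boolean {
  if (OPERATIONAL_KINDS.has(failure.kind)) return true;
  // Explicitly non-operational kinds always keep the existing error path.
  if (failure.kind !== "other") return false;
  // Assumption: an otherwise unclassified upstream 5xx counts as operational.
  const status = failure.httpStatus;
  return status !== undefined && status >= 500 && status <= 599;
}
```

Anything for which the predicate returns false would continue through the existing error path unchanged.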

Candidate selection

Candidates are considered in provider-list order. An instance is skipped when:

it belongs to a different provider driver
its per-instance fallback setting is off
it is disabled, not installed, unavailable, or in an error state
it does not expose the exact model used by the current thread
continuation compatibility is required and its provider home/continuation store differs from that of the active instance
starting its session or sending the retry fails

Each skipped instance records a user-facing reason. The workflow stops immediately when one candidate accepts the turn; later candidates are not attempted.
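A minimal sketch of the static part of this selection, assuming a simplified instance record. `ProviderInstance`, `planCandidates`, the status values, and the reason strings are hypothetical; only `allowFallback` is a name taken from the PR summary. The last skip condition (session start or retry send failing) happens at attempt time and is not modeled here.

```typescript
type InstanceStatus = "ready" | "disabled" | "not_installed" | "unavailable" | "error";

// Hypothetical, simplified view of a configured provider instance.
type ProviderInstance = {
  id: string;
  driver: string;
  allowFallback: boolean;
  status: InstanceStatus;
  models: readonly string[];
  continuationGroup: string;
};

type Skipped = { id: string; reason: string };

function planCandidates(
  active: ProviderInstance,
  model: string,
  requireContinuation: boolean,
  providerList: readonly ProviderInstance[],
): { candidates: ProviderInstance[]; skipped: Skipped[] } {
  const candidates: ProviderInstance[] = [];
  const skipped: Skipped[] = [];
  // Iterating the list as given keeps the order deterministic.
  for (const instance of providerList) {
    if (instance.id === active.id) continue; // the failing instance is not its own fallback
    let reason: string | null = null;
    if (instance.driver !== active.driver) {
      reason = "different provider driver";
    } else if (!instance.allowFallback) {
      reason = "fallback is turned off for this instance";
    } else if (instance.status !== "ready") {
      reason = `instance is ${instance.status}`;
    } else if (!instance.models.includes(model)) {
      reason = `model ${model} is not available`;
    } else if (requireContinuation && instance.continuationGroup !== active.continuationGroup) {
      reason = "different continuation store";
    }
    if (reason === null) candidates.push(instance);
    else skipped.push({ id: instance.id, reason });
  }
  return { candidates, skipped };
}
```

Recording a reason for every rejected instance is what lets the UI list skipped candidates in the toast details.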

First request versus an active task

For the first user message, a compatible same-driver instance receives the original prompt, attachments, model, runtime mode, and interaction mode.

For an existing conversation or a failure during an active task, fallback additionally requires that the candidate share the active instance's provider-native continuation group. The server starts the candidate with the original native resume cursor and sends a hidden "Continue." turn. No summary or synthesized context is inserted, so the provider resumes from its own conversation state.
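The two cases can be sketched as one function that builds the retry turn. All names here (`OriginalTurn`, `ThreadPhase`, `RetryTurn`, `buildRetryTurn`) are hypothetical. Carrying the model and modes over to the continuation turn, and sending it without attachments, are assumptions; the description only states that the prompt is a hidden "Continue." and that no synthesized context is added.

```typescript
type OriginalTurn = {
  prompt: string;
  attachments: readonly string[];
  model: string;
  runtimeMode: string;
  interactionMode: string;
};

// A continuation always carries the provider-native resume cursor.
type ThreadPhase =
  | { kind: "first_message" }
  | { kind: "continuation"; resumeCursor: string };

type RetryTurn = OriginalTurn & { hidden: boolean; resumeCursor?: string };

function buildRetryTurn(original: OriginalTurn, phase: ThreadPhase): RetryTurn {
  if (phase.kind === "first_message") {
    // First user message: replay the original request unchanged.
    return { ...original, hidden: false };
  }
  // Existing conversation: resume from the provider's own state.
  return {
    ...original,
    prompt: "Continue.",
    attachments: [], // assumption: nothing is re-sent with the hidden turn
    hidden: true,
    resumeCursor: phase.resumeCursor,
  };
}
```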

Success and total failure

On success, the server commits the candidate session and model selection, persists one fallback activity, and the UI shows:

Switched from Original Instance to New Instance

Switched after “Original Instance” reported: error details

Skipped candidates are available under expandable toast details.

If every candidate is skipped or fails, the server restores the original provider binding when necessary, preserves the original model selection, emits one fallback-failed activity, and then allows the original provider error to follow the existing error path. If fallback infrastructure itself fails unexpectedly, that failure is logged and the original error is still surfaced.
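The stop-on-first-success loop and the two outcomes can be sketched as follows. This is a synchronous simplification under stated assumptions: `runFallback`, `AttemptResult`, and `FallbackOutcome` are hypothetical names, and the real workflow presumably runs asynchronously and persists activities rather than returning a value.

```typescript
type AttemptResult = { ok: true } | { ok: false; reason: string };
type SkippedCandidate = { id: string; reason: string };

type FallbackOutcome =
  | { status: "succeeded"; from: string; to: string; skipped: SkippedCandidate[] }
  | { status: "failed"; restoredTo: string; skipped: SkippedCandidate[] };

function runFallback(
  originalId: string,
  candidateIds: readonly string[],
  tryCandidate: (id: string) => AttemptResult,
): FallbackOutcome {
  const skipped: SkippedCandidate[] = [];
  for (const id of candidateIds) {
    const result = tryCandidate(id);
    if (result.ok) {
      // Commit only now: later candidates are never attempted.
      return { status: "succeeded", from: originalId, to: id, skipped };
    }
    skipped.push({ id, reason: result.reason });
  }
  // Every candidate failed: keep the thread on the original instance so the
  // caller can surface the original error through the existing path.
  return { status: "failed", restoredTo: originalId, skipped };
}
```

Returning a single outcome value mirrors the "one final success or failure toast" behavior: the caller reports once, with the skipped list attached.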

UI Changes

Before this change, the Providers settings page had no fallback controls and operational provider errors surfaced immediately. The screenshots below show the new states.

Global opt-in

Per-instance participation

Disabled control explanation

Exhausted fallback details

Successful switching demo

The recording shows an operational failure on the active provider instance, the automatic handoff, the final switch toast, and the updated active instance/model selection.

https://raw.githubusercontent.com/edoedac0/t3code/7322c743ac3586fe82839bd25a8bda40b4019c39/pr-assets/provider-instance-fallback/successful-switch.mp4

Validation

./node_modules/.bin/vp check — passed (0 errors; 20 existing warnings)
./node_modules/.bin/vp run typecheck — passed
./node_modules/.bin/vp test — 536 files passed, 2 skipped; 4,073 tests passed, 7 skipped

Checklist

This PR is focused on one provider reliability feature
I explained what changed and why
I included clear screenshots for the UI states
I included a video demonstrating the successful switching interaction

Note

Add automatic provider instance fallback when a provider turn fails

Introduces a full fallback workflow (providerFallbackWorkflow.ts) that plans candidate provider instances, serializes concurrent attempts per thread, switches the session/model selection, and emits provider.fallback.succeeded or provider.fallback.failed activity records.
Adds providerFallbackChain.ts to track attempted instances across a chain per thread, and providerFallbackTrialGate.ts to defer or reject runtime events from a candidate instance until the trial commits.
Updates ProviderCommandReactor.ts and ProviderRuntimeIngestion.ts to trigger fallback on classified service and runtime failures, with deduplication and chain reset on new user turns.
Adds a global providerFallback.enabled server setting (default false) and a per-instance allowFallback flag so operators control participation.
The chat UI shows toasts summarizing fallback success or failure, including skipped instances, via ChatView.tsx.
Risk: fallback is off by default, but once enabled, a failed turn automatically triggers session replacement and model selection changes on the thread.
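How the two settings combine can be sketched as below. The names `providerFallback.enabled` and `allowFallback` and their defaults (off globally, on per instance) come from the PR text; the surrounding types and `participatesInFallback` are hypothetical.

```typescript
// Hypothetical, trimmed-down settings shapes; only the two flags are from the PR.
type ServerSettings = { providerFallback?: { enabled?: boolean } };
type InstanceSettings = { allowFallback?: boolean };

function participatesInFallback(server: ServerSettings, instance: InstanceSettings): boolean {
  // Global switch: disabled unless explicitly turned on.
  const globalEnabled = server.providerFallback?.enabled ?? false;
  // Per-instance switch: enabled unless explicitly turned off.
  const instanceAllowed = instance.allowFallback ?? true;
  return globalEnabled && instanceAllowed;
}
```

With these defaults, an untouched configuration never falls back, and enabling the global switch opts in every instance that has not been excluded.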

Macroscope summarized 43d4637.

Note

High Risk
Changes core provider session binding, turn recovery, and runtime event ingestion; misclassification or trial/handoff bugs could hide errors, leak partial output, or leave threads on the wrong instance.

Overview
Introduces automatic provider instance fallback (off by default) so threads can recover from operational failures by trying other same-driver instances before surfacing errors.

Server orchestration classifies service and runtime failures, plans candidates in provider-list order (model match, availability, per-instance allowFallback, continuation compatibility, no retries within the current chain), and runs attemptProviderFallback under a per-thread lock. Turn-start failures in ProviderCommandReactor attempt fallback before the usual failure activity. ProviderRuntimeIngestion does the same for mid-task failures by sending a hidden "Continue." turn; it also filters stale instance events and uses a trial gate so provisional candidate output is held until a handoff commits or is discarded.
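The trial-gate idea mentioned above can be illustrated with a small buffer. This is a sketch of the general technique only, not the contents of providerFallbackTrialGate.ts; the `TrialGate` class and its methods are hypothetical.

```typescript
// Holds events from a candidate instance until the trial is resolved.
class TrialGate<E> {
  private held: E[] = [];
  private state: "trial" | "committed" | "discarded" = "trial";
  private readonly deliver: (event: E) => void;

  constructor(deliver: (event: E) => void) {
    this.deliver = deliver;
  }

  push(event: E): void {
    if (this.state === "committed") {
      this.deliver(event); // after commit, events pass straight through
    } else if (this.state === "trial") {
      this.held.push(event); // provisional output is buffered
    }
    // After discard, events from the rejected candidate are dropped.
  }

  commit(): void {
    if (this.state !== "trial") return;
    this.state = "committed";
    for (const event of this.held) this.deliver(event); // flush in arrival order
    this.held = [];
  }

  discard(): void {
    if (this.state !== "trial") return;
    this.state = "discarded";
    this.held = [];
  }
}
```

Buffering until commit is one way to avoid the "leak partial output" risk called out above: nothing from a candidate reaches the thread unless its handoff succeeds.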

Contracts & UI: ServerSettings.providerFallback.enabled and per-instance allowFallback; settings switches and ChatView toasts for provider.fallback.succeeded / provider.fallback.failed with skipped-instance details.

Reviewed by Cursor Bugbot for commit 43d4637. Bugbot is set up for automated code reviews on this repo.

coderabbitai · 2026-06-21T12:14:52Z

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.



cursor · 2026-06-21T12:24:38Z

+        activeThreadKey,
+        new Set(threadActivities.map((activity) => String(activity.id))),
+      );
+      return;


First open skips fallback toast

Medium Severity

The fallback toast effect seeds every existing activity id the first time a thread key is seen, then returns without showing toasts. If automatic fallback finishes while the user is on another thread (or before they open that thread in the session), the success or failure activity is already in the list and no toast is shown.
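One possible direction for addressing this, sketched under assumptions: on first sight of a thread, seed only the activities that predate the current client session, so fallback activities that completed while the user was elsewhere still produce a toast. `Activity`, `activitiesToToast`, the `createdAt` field, and `sessionStartedAt` are hypothetical and not taken from ChatView.tsx; only the two activity kinds come from the PR.

```typescript
type Activity = { id: string; kind: string; createdAt: number };

const FALLBACK_KINDS: ReadonlySet<string> = new Set([
  "provider.fallback.succeeded",
  "provider.fallback.failed",
]);

function activitiesToToast(
  seenByThread: Map<string, Set<string>>,
  threadKey: string,
  activities: readonly Activity[],
  sessionStartedAt: number,
): Activity[] {
  const existing = seenByThread.get(threadKey);
  // First sight of this thread: treat only pre-session activities as already seen.
  const seen =
    existing ??
    new Set(activities.filter((a) => a.createdAt < sessionStartedAt).map((a) => a.id));
  if (existing === undefined) seenByThread.set(threadKey, seen);

  const fresh = activities.filter((a) => FALLBACK_KINDS.has(a.kind) && !seen.has(a.id));
  // Mark everything as seen so each activity is toasted at most once.
  for (const a of activities) seen.add(a.id);
  return fresh;
}
```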

Reviewed by Cursor Bugbot for commit 26e4553.

macroscopeapp · 2026-06-21T12:29:07Z

Approvability

Verdict: Needs human review

This PR introduces automatic provider instance fallback, a substantial new feature with complex orchestration logic, new state management, and significant runtime behavior changes. New features of this scope warrant human review. An unresolved bug report in the toast handling adds further reason for review.



add automatic provider instance fallback

26e4553

github-actions Bot added the labels vouch:unvouched (PR author is not yet trusted in the VOUCHED list) and size:XL (500-999 changed lines, additions + deletions) on Jun 21, 2026

edoedac0 marked this pull request as ready for review June 21, 2026 12:18

edoedac0 changed the title from "[codex] add automatic provider instance fallback" to "Add automatic provider instance fallback" on Jun 21, 2026

cursor Bot reviewed Jun 21, 2026


harden provider fallback edge cases

3059b56

cursor Bot reviewed Jun 21, 2026


Comment thread apps/server/src/orchestration/Layers/ProviderRuntimeIngestion.ts Outdated

edoedac0 added 2 commits June 21, 2026 15:54

harden provider fallback retry chains

95e66a6

Restore original provider after fallback fails

43d4637

- Preserve the original instance/session in fallback chains
- Emit restore metadata and clearer chat status when no fallback succeeds

github-actions Bot added the label size:XXL (1,000+ changed lines, additions + deletions) and removed size:XL (500-999 changed lines, additions + deletions) on Jun 21, 2026


edoedac0 wants to merge 4 commits into pingdotgg:main from edoedac0:codex/provider-instance-fallback
