test(harness): raise CI Eventually scale ×3→×5 to fix autoplace-convergence flake #176

Draft
Andrei Kvapil (kvaps) wants to merge 1 commit into main from blue/ci-scale-autoplace-flake


@kvaps


What

Raise the CI Eventually budget scale from ×3 to ×5 (per-group base 30s → 150s on CI) in tests/integration/harness/asserts.go.

Why

The ×3 stretch introduced in #173 (30s → 90s on CI) still let the Integration lane rotate-flake under full-suite contention. The heaviest autoplace-convergence cases, TestGroupFRToggleDiskful2DisklessReapsTieBreaker and TestGroupJ/CSICreateVolumeFromEmpty, both timed out at exactly 90s on a loaded GitHub runner (Eventually timed out after 1m30s: ... never reached 2 diskful replicas / autoplace did not converge to placeCount=2), while the same tests complete in ~8s locally and pass on other CI runs. That signature points to CPU starvation under load, not a hang: the placer / mock-satellite reconcile loop is not scheduled often enough within 90s when the whole suite runs concurrently. More wall-clock time is the correct, targeted mitigation.

Fail-safe

Eventually returns the instant its predicate passes, so green runs pay nothing for the larger budget — only genuinely slow/failing runs report later, and those are still capped by the job-level -timeout=15m. So ×5 cannot cause runaway jobs; it only widens the headroom for load-starved convergence.

Scope

Test-infrastructure only — no product code changes. The pin test TestScaledTimeoutStretchesOnCI is updated to the new 150s expectation. The CHANGELOG entry lands in the v0.1.17 release section (this repo writes the CHANGELOG at release time, not per-PR).

The x3 CI budget stretch (30s->90s) still let the heaviest autoplace-
convergence cases rotate-flake the Integration lane under full-suite
contention: TestGroupFRToggleDiskful2DisklessReapsTieBreaker and
TestGroupJ/CSICreateVolumeFromEmpty both timed out at exactly 90s on a
loaded GitHub runner while completing in ~8s locally — CPU starvation,
not a hang. Raise the scale to x5 (30s->150s) so the placer / mock-
satellite reconcile loop gets more wall-clock under contention.

Fail-safe: Eventually returns the instant the predicate passes, so green
runs pay nothing for the larger budget; a genuinely stuck test still
fails at the job-level -timeout=15m ceiling. Pin test updated to 150s.

Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
Co-Authored-By: Claude <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Jul 3, 2026


Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8c8c9f7e-5e2e-4601-b9b0-03a274488578

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request increases the CI timeout scaling factor from 3x to 5x (adjusting a 30-second timeout to 150 seconds) in the integration test harness to prevent flaky test failures caused by resource contention on CI runners. The corresponding test case has been updated to reflect this change. No review comments were provided, so there is no additional feedback to address.

