ci: add ci fail-fast implementation plan by mesakhcienet · Pull Request #4315 · AI-Hypercomputer/maxtext

mesakhcienet · 2026-07-01T09:59:08Z

Description

This PR refactors the MaxText CI pipeline to orchestrate quality checks (linting/pre-commit) and documentation builds as the first phase of a Directed Acyclic Graph (DAG) before triggering heavy functional tests.

Why is this change being made?

Previously, ci_pipeline.yml, code_quality.yml, and check_docs_build.yml ran as entirely independent, concurrent workflows on Pull Requests. If a simple linting or documentation build error occurred, the pipeline would continue running the expensive, long-running TPU/GPU/CPU test suites for upwards of > 20 minutes. This resulted in wasted cloud compute resources and forced developers to wait unnecessarily for runs that were already bound to fail.

Specific Implementation Details

code_quality.yml: Added support for workflow_call and an optional maxtext_sha input, fallback checkout logic (${{ inputs.maxtext_sha || github.sha }}), and removed the separate concurrency block to prevent interference with the caller.
check_docs_build.yml: Relaxed the maxtext_sha input to be optional (required: false) so the workflow can still run independently when needed while gracefully falling back to github.sha if not provided.
ci_pipeline.yml: Added the code_quality_check and docs_build_check jobs at the top of the file, forced the heavy test matrixes (tpu-tests, gpu-tests, cpu-tests, maxtext_tpu_pathways_unit_tests, and maxtext_tpu_pathways_integration_tests) to require their success, and updated the all_tests_passed reporter job.

FIXES: b/527592378

Tests

The changes have been thoroughly validated:

1. Local Syntax and Schema Validation

Validated all modified GitHub Actions files to ensure correct YAML structure:

python -c "import yaml; yaml.safe_load(open('.github/workflows/ci_pipeline.yml'))"
python -c "import yaml; yaml.safe_load(open('.github/workflows/code_quality.yml'))"
python -c "import yaml; yaml.safe_load(open('.github/workflows/check_docs_build.yml'))"

Result: All files parsed successfully with zero syntax errors.

2. Live GitHub Actions Run Verification

We successfully triggered a dry run of the refactored pipeline here:
https://github.com/AI-Hypercomputer/maxtext/actions/runs/28509679093

Observations:
- The fast phase jobs started concurrently at 10:05:16Z.
- docs_build_check / build-docs successfully passed in 21 seconds.
- code_quality_check failed in 46 seconds due to pre-commit issues.
- Immediately after the quality failure, all downstream heavy test jobs were instantly skipped (tpu-tests, gpu-tests, cpu-tests, and pathways tests), and running notebook jobs were automatically cancelled.
- This confirmed a total fail-fast execution time of under 1 minute, preventing any TPU/GPU compute from being scheduled or billed.

Checklist

Before submitting this PR, please make sure (put X in square brackets):

I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
I have necessary comments in my code, particularly in hard-to-understand areas.
I have run end-to-end tests tests and provided workload links above if applicable.
I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

codecov · 2026-07-01T10:04:11Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

shralex

Thanks for the changes!

Please rebase your PR as tpu7x tests were added, and its "needs" should also be updated in your PR

shralex · 2026-07-03T00:43:50Z

Previously, code_quality.yml was configured to run only on pull request events. In this PR, code_quality_check is being integrated as a job directly inside the main ci_pipeline.yml workflow. But ci_pipeline.yml runs on both pull requests and scheduled runs. The pull request-specific environment variables $GITHUB_BASE_REF and $GITHUB_HEAD_REF are empty on scheduled runs. The commands to fetch and branch them in code_quality.yml will fail and cause the scheduled build to fail.

So there are 2 issues here:

On scheduled (or manual) runs on main, we want to collect as much test coverage and signal as possible. Having a minor code quality or documentation warning skip the entire TPU and GPU test suite is counterproductive for nightly or scheduled builds.

Suggestion: Add the following if: condition to all test suites that depend on code_quality_check and docs_build_check (such as tpu-tests, gpu-tests, cpu-tests, and pathways tests):

if: | always() && needs.analyze_code_changes.outputs.run_tests == 'true' && needs.build_and_upload_maxtext_package.result == 'success' && (github.event_name != 'pull_request' || (needs.code_quality_check.result == 'success' && needs.docs_build_check.result == 'success'))

Update .github/workflows/code_quality.yml to check the event name:

name: Run pre-commit checks on just the files that have changed
run: |
. "$GITHUB_WORKSPACE"/venv/bin/activate
if [ "${{ github.event_name }}" == "pull_request" ]; then
git fetch origin "$GITHUB_BASE_REF":"$GITHUB_BASE_REF"
git branch "$GITHUB_HEAD_REF"
pre-commit run --from-ref "$GITHUB_BASE_REF" --to-ref "$GITHUB_HEAD_REF" --show-diff-on-failure
else
echo "Not a pull request event (${{ github.event_name }}), checking all files."
pre-commit run --all-files --show-diff-on-failure
fi

Done, both points addressed:

Gating scoped to PRs. The test suites (tpu-tests, tpu7x-tests, gpu-tests, cpu-tests, and both pathways jobs) now use the condition you suggested:

if: | always() && needs.analyze_code_changes.outputs.run_tests == 'true' && needs.build_and_upload_maxtext_package.result == 'success' && (github.event_name != 'pull_request' || (needs.code_quality_check.result == 'success' && needs.docs_build_check.result == 'success'))

So on schedule / workflow_dispatch the suites run regardless of the code-quality/docs outcome (full nightly signal), and on PRs they still gate.

Event guard in code_quality.yml. The pre-commit step now branches on github.event_name: on pull_request it runs the diff-scoped --from-ref/--to-ref check; otherwise it runs pre-commit run --all-files. (check_docs_build.yml already had the equivalent non-PR guard, so no change needed there.)

mesakhcienet · 2026-07-03T03:58:30Z

Thanks @shralex and @xibinliu for the careful review. Rebased onto latest main (picking up the new tpu7x tests) and pushed fixes for every thread. Summary:

Addressed

Unused dep (@xibinliu): dropped needs: analyze_code_changes from code_quality_check + docs_build_check.
all_tests_passed early-exit (@shralex): removed it — it was dead code (variable-name mismatch) and would have masked quality/docs failures on docs-only runs. contains(needs.*.result, 'failure') already covers every case.
Scheduled-run breakage (@shralex):
- Test suites now use always() && … && (github.event_name != 'pull_request' || (code_quality_check + docs_build_check success)) — nightly/dispatch runs get full test signal even when code-quality/docs warn; PRs still gate.
- code_quality.yml guards the PR-only git fetch/branch steps by github.event_name and runs pre-commit --all-files on non-PR events.
Failure notifications (@shralex): added code_quality_check + docs_build_check to notify_failure and investigate_failure needs, so scheduled quality/docs failures create the tracking issue and trigger investigation.
Rebase: onto latest main; tpu7x-tests needs updated to the fail-fast pattern, and it's wired into all_tests_passed / notify_failure / investigate_failure.
Build-failure masking (follow-up spotted during review): added build_and_upload_maxtext_package to all_tests_passed's needs. A real build failure skips every test job, and skipped ≠ failure, so the aggregate previously reported green on a broken build. It now fails correctly — and as a bonus the release gate blocks properly on an unbuildable package.

mesakhcienet force-pushed the ci/fail-fast-orchestration branch from 1f1c036 to d23b43b Compare July 1, 2026 10:07

mesakhcienet marked this pull request as ready for review July 1, 2026 10:31

mesakhcienet requested review from bvandermoon, darisoy, gobbleturk, khatwanimohit, parambole, richjames0, shralex and xibinliu as code owners July 1, 2026 10:31

xibinliu reviewed Jul 2, 2026

View reviewed changes

Comment thread .github/workflows/ci_pipeline.yml Outdated

shralex requested changes Jul 3, 2026

View reviewed changes

mesakhcienet force-pushed the ci/fail-fast-orchestration branch 2 times, most recently from 3613b05 to fb8f03f Compare July 3, 2026 03:43

ci: add ci fail-fast implementation plan

9002577

mesakhcienet force-pushed the ci/fail-fast-orchestration branch from fb8f03f to 9002577 Compare July 3, 2026 04:24

mesakhcienet requested review from shralex and xibinliu July 3, 2026 04:30

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

ci: add ci fail-fast implementation plan#4315

ci: add ci fail-fast implementation plan#4315
mesakhcienet wants to merge 1 commit into
AI-Hypercomputer:mainfrom
CIeNET-International:ci/fail-fast-orchestration

mesakhcienet commented Jul 1, 2026 •

edited

Loading

Uh oh!

codecov Bot commented Jul 1, 2026

Uh oh!

Uh oh!

shralex left a comment

Uh oh!

Uh oh!

shralex Jul 3, 2026

Uh oh!

mesakhcienet Jul 3, 2026

Uh oh!

Uh oh!

Uh oh!

mesakhcienet commented Jul 3, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

mesakhcienet commented Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Why is this change being made?

Specific Implementation Details

Tests

1. Local Syntax and Schema Validation

2. Live GitHub Actions Run Verification

Checklist

Uh oh!

codecov Bot commented Jul 1, 2026

Codecov Report

Uh oh!

Uh oh!

shralex left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

shralex Jul 3, 2026

Choose a reason for hiding this comment

Uh oh!

mesakhcienet Jul 3, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

mesakhcienet commented Jul 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

mesakhcienet commented Jul 1, 2026 •

edited

Loading

mesakhcienet commented Jul 3, 2026 •

edited

Loading