Skip to content

ci: add ci fail-fast implementation plan#4315

Open
mesakhcienet wants to merge 1 commit into
AI-Hypercomputer:mainfrom
CIeNET-International:ci/fail-fast-orchestration
Open

ci: add ci fail-fast implementation plan#4315
mesakhcienet wants to merge 1 commit into
AI-Hypercomputer:mainfrom
CIeNET-International:ci/fail-fast-orchestration

Conversation

@mesakhcienet

@mesakhcienet mesakhcienet commented Jul 1, 2026

Copy link
Copy Markdown
Collaborator

Description

This PR refactors the MaxText CI pipeline to orchestrate quality checks (linting/pre-commit) and documentation builds as the first phase of a Directed Acyclic Graph (DAG) before triggering heavy functional tests.

Why is this change being made?

Previously, ci_pipeline.yml, code_quality.yml, and check_docs_build.yml ran as entirely independent, concurrent workflows on Pull Requests. If a simple linting or documentation build error occurred, the pipeline would continue running the expensive, long-running TPU/GPU/CPU test suites for upwards of > 20 minutes. This resulted in wasted cloud compute resources and forced developers to wait unnecessarily for runs that were already bound to fail.

Specific Implementation Details

  1. code_quality.yml: Added support for workflow_call and an optional maxtext_sha input, fallback checkout logic (${{ inputs.maxtext_sha || github.sha }}), and removed the separate concurrency block to prevent interference with the caller.
  2. check_docs_build.yml: Relaxed the maxtext_sha input to be optional (required: false) so the workflow can still run independently when needed while gracefully falling back to github.sha if not provided.
  3. ci_pipeline.yml: Added the code_quality_check and docs_build_check jobs at the top of the file, forced the heavy test matrixes (tpu-tests, gpu-tests, cpu-tests, maxtext_tpu_pathways_unit_tests, and maxtext_tpu_pathways_integration_tests) to require their success, and updated the all_tests_passed reporter job.

FIXES: b/527592378


Tests

The changes have been thoroughly validated:

1. Local Syntax and Schema Validation

Validated all modified GitHub Actions files to ensure correct YAML structure:

python -c "import yaml; yaml.safe_load(open('.github/workflows/ci_pipeline.yml'))"
python -c "import yaml; yaml.safe_load(open('.github/workflows/code_quality.yml'))"
python -c "import yaml; yaml.safe_load(open('.github/workflows/check_docs_build.yml'))"

Result: All files parsed successfully with zero syntax errors.

2. Live GitHub Actions Run Verification

We successfully triggered a dry run of the refactored pipeline here:
https://github.com/AI-Hypercomputer/maxtext/actions/runs/28509679093

  • Observations:
    • The fast phase jobs started concurrently at 10:05:16Z.
    • docs_build_check / build-docs successfully passed in 21 seconds.
    • code_quality_check failed in 46 seconds due to pre-commit issues.
    • Immediately after the quality failure, all downstream heavy test jobs were instantly skipped (tpu-tests, gpu-tests, cpu-tests, and pathways tests), and running notebook jobs were automatically cancelled.
    • This confirmed a total fail-fast execution time of under 1 minute, preventing any TPU/GPU compute from being scheduled or billed.

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

@codecov

codecov Bot commented Jul 1, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

@mesakhcienet mesakhcienet force-pushed the ci/fail-fast-orchestration branch from 1f1c036 to d23b43b Compare July 1, 2026 10:07
@mesakhcienet mesakhcienet marked this pull request as ready for review July 1, 2026 10:31
Comment thread .github/workflows/ci_pipeline.yml Outdated

@shralex shralex left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes!

Please rebase your PR as tpu7x tests were added, and its "needs" should also be updated in your PR

Comment thread .github/workflows/ci_pipeline.yml Outdated
Comment thread .github/workflows/code_quality.yml Outdated

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously, code_quality.yml was configured to run only on pull request events. In this PR, code_quality_check is being integrated as a job directly inside the main ci_pipeline.yml workflow. But ci_pipeline.yml runs on both pull requests and scheduled runs. The pull request-specific environment variables $GITHUB_BASE_REF and $GITHUB_HEAD_REF are empty on scheduled runs. The commands to fetch and branch them in code_quality.yml will fail and cause the scheduled build to fail.

So there are 2 issues here:

  1. On scheduled (or manual) runs on main, we want to collect as much test coverage and signal as possible. Having a minor code quality or documentation warning skip the entire TPU and GPU test suite is counterproductive for nightly or scheduled builds.

Suggestion: Add the following if: condition to all test suites that depend on code_quality_check and docs_build_check (such as tpu-tests, gpu-tests, cpu-tests, and pathways tests):

if: |
  always() &&
  needs.analyze_code_changes.outputs.run_tests == 'true' &&
  needs.build_and_upload_maxtext_package.result == 'success' &&
  (github.event_name != 'pull_request' || (needs.code_quality_check.result == 'success' && needs.docs_build_check.result == 'success'))
  1. Update .github/workflows/code_quality.yml to check the event name:

    • name: Run pre-commit checks on just the files that have changed
      run: |
      . "$GITHUB_WORKSPACE"/venv/bin/activate
      if [ "${{ github.event_name }}" == "pull_request" ]; then
      git fetch origin "$GITHUB_BASE_REF":"$GITHUB_BASE_REF"
      git branch "$GITHUB_HEAD_REF"
      pre-commit run --from-ref "$GITHUB_BASE_REF" --to-ref "$GITHUB_HEAD_REF" --show-diff-on-failure
      else
      echo "Not a pull request event (${{ github.event_name }}), checking all files."
      pre-commit run --all-files --show-diff-on-failure
      fi

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, both points addressed:

  1. Gating scoped to PRs. The test suites (tpu-tests, tpu7x-tests, gpu-tests, cpu-tests, and both pathways jobs) now use the condition you suggested:

    if: |
      always() &&
      needs.analyze_code_changes.outputs.run_tests == 'true' &&
      needs.build_and_upload_maxtext_package.result == 'success' &&
      (github.event_name != 'pull_request' || (needs.code_quality_check.result == 'success' && needs.docs_build_check.result == 'success'))

    So on schedule / workflow_dispatch the suites run regardless of the code-quality/docs outcome (full nightly signal), and on PRs they still gate.

  2. Event guard in code_quality.yml. The pre-commit step now branches on github.event_name: on pull_request it runs the diff-scoped --from-ref/--to-ref check; otherwise it runs pre-commit run --all-files. (check_docs_build.yml already had the equivalent non-PR guard, so no change needed there.)

Comment thread .github/workflows/ci_pipeline.yml Outdated
Comment thread .github/workflows/ci_pipeline.yml Outdated
@mesakhcienet mesakhcienet force-pushed the ci/fail-fast-orchestration branch 2 times, most recently from 3613b05 to fb8f03f Compare July 3, 2026 03:43
@mesakhcienet

mesakhcienet commented Jul 3, 2026

Copy link
Copy Markdown
Collaborator Author

Thanks @shralex and @xibinliu for the careful review. Rebased onto latest main (picking up the new tpu7x tests) and pushed fixes for every thread. Summary:

Addressed

  • Unused dep (@xibinliu): dropped needs: analyze_code_changes from code_quality_check + docs_build_check.
  • all_tests_passed early-exit (@shralex): removed it — it was dead code (variable-name mismatch) and would have masked quality/docs failures on docs-only runs. contains(needs.*.result, 'failure') already covers every case.
  • Scheduled-run breakage (@shralex):
    • Test suites now use always() && … && (github.event_name != 'pull_request' || (code_quality_check + docs_build_check success)) — nightly/dispatch runs get full test signal even when code-quality/docs warn; PRs still gate.
    • code_quality.yml guards the PR-only git fetch/branch steps by github.event_name and runs pre-commit --all-files on non-PR events.
  • Failure notifications (@shralex): added code_quality_check + docs_build_check to notify_failure and investigate_failure needs, so scheduled quality/docs failures create the tracking issue and trigger investigation.
  • Rebase: onto latest main; tpu7x-tests needs updated to the fail-fast pattern, and it's wired into all_tests_passed / notify_failure / investigate_failure.
  • Build-failure masking (follow-up spotted during review): added build_and_upload_maxtext_package to all_tests_passed's needs. A real build failure skips every test job, and skipped ≠ failure, so the aggregate previously reported green on a broken build. It now fails correctly — and as a bonus the release gate blocks properly on an unbuildable package.

@mesakhcienet mesakhcienet force-pushed the ci/fail-fast-orchestration branch from fb8f03f to 9002577 Compare July 3, 2026 04:24
@mesakhcienet mesakhcienet requested review from shralex and xibinliu July 3, 2026 04:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants