ci: add ci fail-fast implementation plan#4315
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
1f1c036 to
d23b43b
Compare
shralex
left a comment
There was a problem hiding this comment.
Thanks for the changes!
Please rebase your PR as tpu7x tests were added, and its "needs" should also be updated in your PR
There was a problem hiding this comment.
Previously, code_quality.yml was configured to run only on pull request events. In this PR, code_quality_check is being integrated as a job directly inside the main ci_pipeline.yml workflow. But ci_pipeline.yml runs on both pull requests and scheduled runs. The pull request-specific environment variables $GITHUB_BASE_REF and $GITHUB_HEAD_REF are empty on scheduled runs. The commands to fetch and branch them in code_quality.yml will fail and cause the scheduled build to fail.
So there are 2 issues here:
- On scheduled (or manual) runs on main, we want to collect as much test coverage and signal as possible. Having a minor code quality or documentation warning skip the entire TPU and GPU test suite is counterproductive for nightly or scheduled builds.
Suggestion: Add the following if: condition to all test suites that depend on code_quality_check and docs_build_check (such as tpu-tests, gpu-tests, cpu-tests, and pathways tests):
if: |
always() &&
needs.analyze_code_changes.outputs.run_tests == 'true' &&
needs.build_and_upload_maxtext_package.result == 'success' &&
(github.event_name != 'pull_request' || (needs.code_quality_check.result == 'success' && needs.docs_build_check.result == 'success'))
-
Update .github/workflows/code_quality.yml to check the event name:
- name: Run pre-commit checks on just the files that have changed
run: |
. "$GITHUB_WORKSPACE"/venv/bin/activate
if [ "${{ github.event_name }}" == "pull_request" ]; then
git fetch origin "$GITHUB_BASE_REF":"$GITHUB_BASE_REF"
git branch "$GITHUB_HEAD_REF"
pre-commit run --from-ref "$GITHUB_BASE_REF" --to-ref "$GITHUB_HEAD_REF" --show-diff-on-failure
else
echo "Not a pull request event (${{ github.event_name }}), checking all files."
pre-commit run --all-files --show-diff-on-failure
fi
- name: Run pre-commit checks on just the files that have changed
There was a problem hiding this comment.
Done, both points addressed:
-
Gating scoped to PRs. The test suites (
tpu-tests,tpu7x-tests,gpu-tests,cpu-tests, and both pathways jobs) now use the condition you suggested:if: | always() && needs.analyze_code_changes.outputs.run_tests == 'true' && needs.build_and_upload_maxtext_package.result == 'success' && (github.event_name != 'pull_request' || (needs.code_quality_check.result == 'success' && needs.docs_build_check.result == 'success'))
So on
schedule/workflow_dispatchthe suites run regardless of the code-quality/docs outcome (full nightly signal), and on PRs they still gate. -
Event guard in
code_quality.yml. The pre-commit step now branches ongithub.event_name: onpull_requestit runs the diff-scoped--from-ref/--to-refcheck; otherwise it runspre-commit run --all-files. (check_docs_build.ymlalready had the equivalent non-PR guard, so no change needed there.)
3613b05 to
fb8f03f
Compare
|
Thanks @shralex and @xibinliu for the careful review. Rebased onto latest Addressed
|
fb8f03f to
9002577
Compare
Description
This PR refactors the MaxText CI pipeline to orchestrate quality checks (linting/pre-commit) and documentation builds as the first phase of a Directed Acyclic Graph (DAG) before triggering heavy functional tests.
Why is this change being made?
Previously,
ci_pipeline.yml,code_quality.yml, andcheck_docs_build.ymlran as entirely independent, concurrent workflows on Pull Requests. If a simple linting or documentation build error occurred, the pipeline would continue running the expensive, long-running TPU/GPU/CPU test suites for upwards of > 20 minutes. This resulted in wasted cloud compute resources and forced developers to wait unnecessarily for runs that were already bound to fail.Specific Implementation Details
code_quality.yml: Added support forworkflow_calland an optionalmaxtext_shainput, fallback checkout logic (${{ inputs.maxtext_sha || github.sha }}), and removed the separate concurrency block to prevent interference with the caller.check_docs_build.yml: Relaxed themaxtext_shainput to be optional (required: false) so the workflow can still run independently when needed while gracefully falling back togithub.shaif not provided.ci_pipeline.yml: Added thecode_quality_checkanddocs_build_checkjobs at the top of the file, forced the heavy test matrixes (tpu-tests,gpu-tests,cpu-tests,maxtext_tpu_pathways_unit_tests, andmaxtext_tpu_pathways_integration_tests) to require their success, and updated theall_tests_passedreporter job.FIXES: b/527592378
Tests
The changes have been thoroughly validated:
1. Local Syntax and Schema Validation
Validated all modified GitHub Actions files to ensure correct YAML structure:
Result: All files parsed successfully with zero syntax errors.
2. Live GitHub Actions Run Verification
We successfully triggered a dry run of the refactored pipeline here:
https://github.com/AI-Hypercomputer/maxtext/actions/runs/28509679093
10:05:16Z.docs_build_check / build-docssuccessfully passed in 21 seconds.code_quality_checkfailed in 46 seconds due to pre-commit issues.tpu-tests,gpu-tests,cpu-tests, and pathways tests), and running notebook jobs were automatically cancelled.Checklist
Before submitting this PR, please make sure (put X in square brackets):
gemini-reviewlabel.