Avoid checkpoint work on skipped steps #4326

Open
SujeethJinesh wants to merge 1 commit into main from sujinesh/avoid-skipped-checkpoint-work

@SujeethJinesh

Collaborator

Description

In the Pathways profile of train.py, the checkpointing logic appears to take up the majority of the time in the trace. This is misleading: the trace attributes work to the maybe_save_checkpoint function even on steps where the checkpoint is skipped. That makes the profile harder to read and is a major red herring when debugging.

With this fix, the trace shows a more reasonable amount of work performed by the Pathways CPU node (screenshot).

This should also avoid some unnecessary checkpointing-related logic on other code paths.
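The diff itself is not shown in this conversation, so the following is only a minimal sketch of the general pattern the title describes: check a cheap "is this a checkpoint step?" predicate first and return before doing any preparation work. Every name here (`CheckpointSketch`, `should_save`, `prepare_state`, `maybe_save_checkpoint_sketch`) is hypothetical and is not the MaxText or Orbax API.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class CheckpointSketch:
    """Hypothetical stand-in for a checkpoint manager (not a real library API)."""

    period: int
    saved_steps: List[int] = field(default_factory=list)
    prepare_calls: int = 0

    def should_save(self, step: int) -> bool:
        # Cheap predicate: integer arithmetic only, no state is touched.
        return self.period > 0 and step % self.period == 0

    def prepare_state(self, state: Any) -> Any:
        # Stands in for the expensive part (assumed here), counted so the
        # effect of the early return is observable.
        self.prepare_calls += 1
        return state

    def save(self, step: int, state: Any) -> None:
        self.saved_steps.append(step)


def maybe_save_checkpoint_sketch(manager: CheckpointSketch, state: Any, step: int) -> bool:
    """Return True if a checkpoint was saved, False if the step was skipped."""
    # Early return: on skipped steps nothing below runs, so no checkpoint
    # work is done or recorded for them.
    if not manager.should_save(step):
        return False
    prepared = manager.prepare_state(state)
    manager.save(step, prepared)
    return True
```

Under these assumptions, calling `maybe_save_checkpoint_sketch` for steps 0 through 9 with `period=5` runs the preparation step only twice, on steps 0 and 5; the other eight calls return immediately.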

FIXES: b/530256100

Tests

Tested on internal cluster (details in b/529621972#comment6)

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

@codecov

codecov Bot commented Jul 1, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.

@SujeethJinesh force-pushed the sujinesh/avoid-skipped-checkpoint-work branch 2 times, most recently from cab90e8 to 1cef908 (July 1, 2026 23:01)
@SujeethJinesh force-pushed the sujinesh/avoid-skipped-checkpoint-work branch from 1cef908 to 1cf2be5 (July 2, 2026 00:18)