[discrete diffusion] Add DiffusionGemma pipeline and schedulers #13986
kashif wants to merge 17 commits into
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
zucchini-nlp
left a comment
Looking great! A couple of questions from a quick skim.
yiyixuxu
left a comment
thanks for the PR! i left a few comments
I reviewed this through the lens of diffusers conventions and style. If some of these choices are intentional to keep things familiar for Transformers users, let me know, and we can figure out the right balance together.
    def __call__(
        self,
        prompt: str | list[str] | None = None,
        messages: list[dict[str, str]] | None = None,
I think between prompt and messages, we only need to accept prompt, since it's really cheap to convert a prompt into messages.
it's just this, no?
    messages = [{"role": "user", "content": prompt}]
Makes sense. The one wrinkle is image prompts, which we pass through messages today, so I'll fold the prompt/messages simplification into the image input rework so single-image and text both stay clean. Coming in a follow-up.
Made prompt the primary input and dropped the tokenized intermediates. Kept messages for raw multi-turn/multimodal conversations (per the thread below with zucchini), and added a raw image arg for the simple prompt+image case, so it is all raw inputs now.
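For readers following the thread, here is a rough sketch of what "all raw inputs" could look like when normalized into chat conversations. It is an illustration only, not the code in this PR: `build_conversations` is a hypothetical helper, and the image content layout (`{"type": "image", ...}` next to `{"type": "text", ...}`) is an assumption about the chat-template format.

```python
def build_conversations(prompt=None, messages=None, image=None):
    """Hypothetical helper: turn raw pipeline inputs into a batch of conversations."""
    if (prompt is None) == (messages is None):
        raise ValueError("Pass exactly one of `prompt` or `messages`.")

    if messages is not None:
        if image is not None:
            raise ValueError("Put images inside `messages` when passing `messages`.")
        # A single conversation is a list of dicts; a batch is a list of such lists.
        if messages and isinstance(messages[0], dict):
            return [messages]
        return list(messages)

    prompts = [prompt] if isinstance(prompt, str) else list(prompt)
    conversations = []
    for text in prompts:
        if image is None:
            content = text
        else:
            # Assumed multimodal content layout; the real template may differ.
            content = [
                {"type": "image", "image": image},
                {"type": "text", "text": text},
            ]
        conversations.append([{"role": "user", "content": content}])
    return conversations
```

With this shape, `prompt="hi"` and `messages=[{"role": "user", "content": "hi"}]` normalize to the same single-conversation batch, which is the equivalence raised earlier in the thread.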
Adds a DiffusionGemma block-diffusion pipeline, alongside the schedulers already on this branch (discrete DDIM, entropy bound, and a uniform mode for block refinement).
DiffusionGemma is an encoder-decoder block-diffusion model: the encoder reads the prompt into a KV cache and the decoder denoises a fixed-size canvas by cross-attending to it. The pipeline runs the outer canvas loop and the inner denoising loop, sampling candidates each step, committing the most confident ones via BlockRefinementScheduler in uniform corruption mode, and renoising the rest. Structure mirrors the LLaDA2 and dflash (#13699) pipelines.
The model itself lives in transformers as DiffusionGemmaForBlockDiffusion (released in 5.12.0).
Tested:
Quality on the full google/diffusiongemma-26B-A4B-it checkpoint still needs a GPU run.
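To make the inner denoising loop described above concrete, here is a toy, pure-Python sketch of the sample / commit / renoise idea on a single canvas. It is not the BlockRefinementScheduler implementation and does not call the model: `toy_logits`, `refine_step`, and `denoise_canvas` are hypothetical names, the fixed commit budget per step is an assumption, and "uniform corruption" is modelled simply as replacing uncommitted positions with uniformly random token ids.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_logits(canvas, vocab_size):
    """Stand-in for the decoder: favors token (position % vocab_size)."""
    return [
        [5.0 if tok == pos % vocab_size else 0.0 for tok in range(vocab_size)]
        for pos in range(len(canvas))
    ]


def refine_step(canvas, committed, logits, num_to_commit, vocab_size, rng):
    """Sample candidates, commit the most confident open ones, renoise the rest."""
    candidates, confidence = [], []
    for row in logits:
        probs = softmax(row)
        token = rng.choices(range(vocab_size), weights=probs)[0]
        candidates.append(token)
        confidence.append(probs[token])

    # Only positions that are not yet committed compete for this step's budget.
    open_positions = [i for i, done in enumerate(committed) if not done]
    open_positions.sort(key=lambda i: confidence[i], reverse=True)
    to_commit = set(open_positions[:num_to_commit])

    new_canvas, new_committed = list(canvas), list(committed)
    for i in open_positions:
        if i in to_commit:
            new_canvas[i] = candidates[i]
            new_committed[i] = True
        else:
            # Uniform corruption: draw a random token rather than a mask token.
            new_canvas[i] = rng.randrange(vocab_size)
    return new_canvas, new_committed


def denoise_canvas(canvas_len=8, vocab_size=16, num_steps=4, seed=0):
    """Inner loop for one canvas: start from uniform noise and refine."""
    rng = random.Random(seed)
    canvas = [rng.randrange(vocab_size) for _ in range(canvas_len)]
    committed = [False] * canvas_len
    per_step = math.ceil(canvas_len / num_steps)
    for _ in range(num_steps):
        logits = toy_logits(canvas, vocab_size)
        canvas, committed = refine_step(
            canvas, committed, logits, per_step, vocab_size, rng
        )
    return canvas, committed
```

In the real pipeline the logits would come from the decoder cross-attending to the encoder KV cache, and an outer loop would repeat this per canvas; both are left out here to keep the sketch self-contained.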