Skip to content

[ExecuTorch][WebGPU] Dynamic resize hook for slice_copy (dynamic start/end)#20581

Open
JulianCloudNTH wants to merge 4 commits into
gh/JulianCloudNTH/73/basefrom
gh/JulianCloudNTH/73/head
Open

[ExecuTorch][WebGPU] Dynamic resize hook for slice_copy (dynamic start/end)#20581
JulianCloudNTH wants to merge 4 commits into
gh/JulianCloudNTH/73/basefrom
gh/JulianCloudNTH/73/head

Conversation

@JulianCloudNTH

@JulianCloudNTH JulianCloudNTH commented Jun 28, 2026

Copy link
Copy Markdown
Contributor

Stack from ghstack (oldest at bottom):

Make slice_copy support a dynamic gather range so the RoPE-freqs slice [input_pos : input_pos + S] works under one dynamic graph.

Problem: the static slice handler read start via a scalar reader that throws on a SymInt and ignored end (output length baked AOT). The RoPE-freqs slice uses a SymInt input_pos for start and a live S for the range, so the static op could neither build nor resize for it.

Solution: read start/end as possibly-dynamic SymInts and add a resize hook that recomputes the gather offset and live output length each step.

  • Before: start is a static scalar (SymInt throws); end ignored; output length fixed at the serialized max.
  • After: start/end read via a SymInt-aware reader; a hook recomputes out[dim] = (end - start + step - 1) / step, rewrites out_meta/in_meta/params UBOs + the dispatch count, and sets the output cur_dims.

Implementation:

  • Hook registered on the start/end value-ids when they are SymInts and on the input tensor always (inert until resized, so a static slice is byte-identical).
  • Output/input TensorMeta rebuilt from live dims; dim/step stay static.
  • Keep the uniforms alive via own_uniform_buffer so the hook can rewrite them.
  • Mirrors Vulkan resize_slice_copy_node.

Constraints: fp32-only; dim/step static; numerics + layout unchanged; inert on a static graph.

NOTE (stacking): this diff sits on top of the in-review slice_copy op (D108793168); rebase onto it once that op lands on master.

Co-authored-with: Claude Code.
@exported-using-ghexport

Differential Revision: D109906092

Differential Revision: D109906092

[ghstack-poisoned]
@pytorch-bot

pytorch-bot Bot commented Jun 28, 2026

Copy link
Copy Markdown

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20581

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 99a8cbe with merge base db5f1ba (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@github-actions

Copy link
Copy Markdown

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 28, 2026
@JulianCloudNTH

Copy link
Copy Markdown
Contributor Author

@claude review and check for any areas or opportunities for modularization

@claude

claude Bot commented Jun 29, 2026

Copy link
Copy Markdown

Claude finished @JulianCloudNTH's task in 2m 32s —— View job


Code Review: Dynamic resize hook for slice_copy

  • Read PR diff and full Slice.cpp
  • Check WebGPUGraph resize API (read_symint, add_resize_hook, add_tensor_resize_hook, own_uniform_buffer, propagate_resize)
  • Compare against Vulkan resize_slice_copy_node
  • Identify correctness issues
  • Identify modularization opportunities

Overall this is clean and matches the established WebGPU dynamic-shape pattern well (SymInt hooks → tensor hooks, idempotent recompute, own_uniform_buffer to keep UBOs writable). The hook is correctly inert on a static graph. A few items below — one I'd consider a real (latent) correctness bug, the rest are quality/modularization.


1. Output non-slice dims should be derived from live input dims (divergence from Vulkan)

Slice.cpp:224-226

std::vector<int64_t> od = g.cur_dims(out_id);   // <- seeds from output's *prior* dims
od[dim] = len;
g.set_cur_dims(out_id, od);

The Vulkan reference this claims to mirror derives the output shape from the input sizes each resize:

std::vector<int64_t> new_out_sizes = in_sizes;        // backends/vulkan/.../Slice.cpp:63-64
new_out_sizes.at(dim) = (end - start + step - 1) / step;

For slice_copy, every non-dim axis of the output is by definition equal to the input's. By seeding od from the output's prior cur_dims and only rewriting od[dim], this hook leaves the other axes stale. The in_meta it writes does track in_dims, so if any non-slice axis of the input is also dynamic, out_meta and in_meta disagree on those axes — a silent shape mismatch.

For the RoPE-freqs case only dim is dynamic, so it happens to work today, but it's fragile and diverges from the reference. Suggest mirroring Vulkan:

std::vector<int64_t> od = in_dims;   // already have `const auto& in_dims = g.cur_dims(in_id);`
od[dim] = len;

Fix this →


2. Redundant hook executions (minor; idempotent so not a bug)

The same recompute closure is registered up to three times — on start_id, end_id, and in_id. When more than one trigger is dirty in the same step (e.g. input_pos SymInt + an input resize), recompute runs 2–3× per propagate_resize. It's fully idempotent (recomputes from live state), so output is correct, just wasted wgpuQueueWriteBuffer calls. Worth a one-line comment noting the intentional redundancy, or skip if the duplicate writes are negligible.


3. Modularization opportunities (the explicit ask)

a. Extract shared slice geometry — highest value. The build path (Slice.cpp:100-116) and the hook (Slice.cpp:217-242) independently compute start/clamp and fill SliceParams (dim/start/step). A small helper would remove the duplication and make the build path and resize path provably consistent:

struct SliceGeom { std::vector<int64_t> out_dims; SliceParams params; };
SliceGeom compute_slice_geom(const std::vector<int64_t>& in_dims,
                             int64_t dim, int64_t start, int64_t end, int64_t step);

Both sites call it; the hook additionally writes the UBOs/dispatch. This also naturally fixes finding #1, since the helper would build out_dims from in_dims.

b. read_index / is_symint / norm_clamp are reusable across transfer ops. slice, and soon select/narrow/index_select, all need "read a possibly-dynamic index" + "normalize+clamp into [0, size]". Vulkan already centralizes this as normalize_idx. Consider hoisting these three free functions into a shared WebGPU helper header (e.g. alongside TensorMeta.h or a new transfer helper) rather than re-declaring them per op. Even within this file, read_index and read_scalar (Slice.cpp:34-61) share the Int → INT64_MAX→default branch — read_scalar could be expressed as the strict variant of read_index.

c. Bind-group / pipeline boilerplate (out of scope here). Slice.cpp:122-200 is ~80 lines of BGL + pipeline + bind-group setup that is near-identical to other WebGPU ops. Not for this PR, but a make_compute_pipeline(...) / make_bind_group(...) helper would shrink every op file substantially. Flagging as a follow-up direction.


Minor

  • Slice.cpp:76 comment still reads end unread (out shape is AOT) — now stale, end_id is read in the hook. Update or drop.
  • len == 0 path → om.numel == 0compute_1d_workgroup_count(..., 0, ...). Confirm a 0-element dispatch is a safe no-op (Vulkan tolerates it); a quick assertion/comment would document the intent.

No blocking issues beyond #1, which I'd recommend addressing before this lands given it directly contradicts the "Mirrors Vulkan resize_slice_copy_node" claim in the description.
• Base: origin/gh/JulianCloudNTH/73/base

[ghstack-poisoned]
[ghstack-poisoned]
[ghstack-poisoned]
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. meta-exported

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants