
feat: enable prefill cudagraph by default #1352

Open
sufubao wants to merge 1 commit into ModelTC:main from sufubao:prefill_cudagraph

@sufubao sufubao commented Jun 15, 2026

Collaborator

Make prefill cudagraph the default. Replaces --enable_prefill_cudagraph with opt-out --disable_prefill_cudagraph; auto-skips ep moe and dp prefill balance.
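
The switch from an opt-in flag to an opt-out flag can be sketched with `argparse`. This is a minimal illustration, not the project's actual argument parser: only the flag name `--disable_prefill_cudagraph` comes from this PR, while `build_parser` and `prefill_cudagraph_requested` are hypothetical names introduced here.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical stand-in for the server's real argument parser.
    parser = argparse.ArgumentParser()
    # Opt-out flag: when it is absent, the feature stays enabled.
    parser.add_argument(
        "--disable_prefill_cudagraph",
        action="store_true",
        help="turn off prefill cudagraph (enabled by default)",
    )
    return parser


def prefill_cudagraph_requested(argv: list) -> bool:
    """Return True unless the user passed the opt-out flag."""
    args = build_parser().parse_args(argv)
    return not args.disable_prefill_cudagraph
```

With a `store_true` opt-out flag, an empty command line leaves the feature on, and passing `--disable_prefill_cudagraph` is the only way to request that it be turned off.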

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor


Code Review

This pull request changes the prefill CUDA Graph feature to be enabled by default, replacing the --enable_prefill_cudagraph option with --disable_prefill_cudagraph and updating the relevant models, CLI arguments, documentation, and tests. Feedback on these changes highlights two main issues: first, enabling this feature by default may cause initialization failures or crashes on unsupported models (such as ChatGLM or Baichuan), so a check should be added to restrict it to supported models; second, the test script changes make the baseline and test runs identical, so --disable_prefill_cudagraph should be explicitly added to the baseline run to maintain the distinction.


Comment on lines +278 to 295
+        args = get_env_start_args()
+        enable_prefill_cudagraph = not args.disable_prefill_cudagraph
+        # prefill cudagraph currently only supports llama/qwen-style models. It is incompatible with
+        # ep moe and dp prefill balance, so auto-skip those configs even when not explicitly disabled.
+        if enable_prefill_cudagraph and args.enable_ep_moe:
+            logger.warning("prefill cudagraph does not support ep moe, disabling it.")
+            enable_prefill_cudagraph = False
+        if enable_prefill_cudagraph and args.enable_dp_prefill_balance:
+            logger.warning("prefill cudagraph does not support dp prefill balance, disabling it.")
+            enable_prefill_cudagraph = False
+
         self.prefill_graph = (
-            None
-            if not get_env_start_args().enable_prefill_cudagraph
-            else PrefillCudaGraph(decode_cuda_graph=self.graph, tp_world_size=self.tp_world_size_)
+            PrefillCudaGraph(decode_cuda_graph=self.graph, tp_world_size=self.tp_world_size_)
+            if enable_prefill_cudagraph
+            else None
         )
         if self.prefill_graph is not None:
             if get_env_start_args().enable_prefill_microbatch_overlap:


high

Since prefill CUDA Graph is now enabled by default, it will attempt to initialize and run warmup for all models. However, as noted in the comments, prefill CUDA Graph currently only supports llama/qwen-style models. Running it on unsupported models (e.g., ChatGLM, Baichuan) will lead to initialization failures or runtime crashes during warmup.

We should add a check to restrict the default enablement to supported models (such as llama, qwen, yi, mistral). Additionally, we can use getattr for safer attribute access on the args object and reuse the args variable instead of calling get_env_start_args() multiple times.

        args = get_env_start_args()
        enable_prefill_cudagraph = not getattr(args, "disable_prefill_cudagraph", False)
        
        # prefill cudagraph currently only supports llama/qwen-style models.
        model_class_name = self.__class__.__name__.lower()
        is_supported_model = any(m in model_class_name for m in ["llama", "qwen", "yi", "mistral"])
        if enable_prefill_cudagraph and not is_supported_model:
            enable_prefill_cudagraph = False

        # It is incompatible with ep moe and dp prefill balance, so auto-skip those configs even when not explicitly disabled.
        if enable_prefill_cudagraph and getattr(args, "enable_ep_moe", False):
            logger.warning("prefill cudagraph does not support ep moe, disabling it.")
            enable_prefill_cudagraph = False
        if enable_prefill_cudagraph and getattr(args, "enable_dp_prefill_balance", False):
            logger.warning("prefill cudagraph does not support dp prefill balance, disabling it.")
            enable_prefill_cudagraph = False

        self.prefill_graph = (
            PrefillCudaGraph(decode_cuda_graph=self.graph, tp_world_size=self.tp_world_size_)
            if enable_prefill_cudagraph
            else None
        )
        if self.prefill_graph is not None:
            if getattr(args, "enable_prefill_microbatch_overlap", False):
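
For unit testing, the gating logic suggested above could be pulled out into a pure function. The following is a sketch under stated assumptions: `resolve_prefill_cudagraph` and `SUPPORTED_MODEL_KEYWORDS` are hypothetical names, the keyword list is copied from the suggestion rather than verified against the code base, and the class names in the usage note are illustrative only.

```python
import logging

logger = logging.getLogger(__name__)

# Assumed allow-list, copied from the review suggestion; not verified against the code base.
SUPPORTED_MODEL_KEYWORDS = ("llama", "qwen", "yi", "mistral")


def resolve_prefill_cudagraph(args, model_class_name: str) -> bool:
    """Decide whether prefill cudagraph should be built for this model and config."""
    # Explicit opt-out always wins.
    if getattr(args, "disable_prefill_cudagraph", False):
        return False

    # Substring match on the lowercased class name, as in the suggestion.
    name = model_class_name.lower()
    if not any(keyword in name for keyword in SUPPORTED_MODEL_KEYWORDS):
        logger.warning("prefill cudagraph does not support %s, disabling it.", model_class_name)
        return False

    # Incompatible runtime modes are auto-skipped with a warning.
    incompatible = (
        ("enable_ep_moe", "ep moe"),
        ("enable_dp_prefill_balance", "dp prefill balance"),
    )
    for attr, label in incompatible:
        if getattr(args, attr, False):
            logger.warning("prefill cudagraph does not support %s, disabling it.", label)
            return False
    return True
```

Called with `types.SimpleNamespace()` and a name such as `"LlamaModel"`, the function returns `True`; with `"ChatGLMModel"` it returns `False`. Substring matching is loose, though: any class whose name happens to contain `yi` would pass, so an explicit per-model attribute may be a safer alternative.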

Comment thread test/acc/test_qwen3.5.sh
Comment on lines 13 to +16
  LOADWORKER=18 CUDA_VISIBLE_DEVICES=6,7 python -m lightllm.server.api_server \
  --model_dir /root/.cache/huggingface/hub/models--Qwen--Qwen3.5-0.8B/snapshots/2fc06364715b967f1860aea9cf38778875588b17 \
  --tp 2 \
- --port 8089 \
- --enable_prefill_cudagraph
+ --port 8089


medium

By removing --enable_prefill_cudagraph from this second test run, it becomes completely identical to the first test run (which also has no prefill cudagraph flags and thus now runs with prefill cudagraph enabled by default).

To preserve the distinction between the baseline test (without prefill cudagraph) and the prefill cudagraph test, we should add --disable_prefill_cudagraph to this second run (making it the baseline test) and keep the first run as the default prefill cudagraph test.

Suggested change
- LOADWORKER=18 CUDA_VISIBLE_DEVICES=6,7 python -m lightllm.server.api_server \
- --model_dir /root/.cache/huggingface/hub/models--Qwen--Qwen3.5-0.8B/snapshots/2fc06364715b967f1860aea9cf38778875588b17 \
- --tp 2 \
- --port 8089 \
- --enable_prefill_cudagraph
- --port 8089
+ LOADWORKER=18 CUDA_VISIBLE_DEVICES=6,7 python -m lightllm.server.api_server \
+ --model_dir /root/.cache/huggingface/hub/models--Qwen--Qwen3.5-0.8B/snapshots/2fc06364715b967f1860aea9cf38778875588b17 \
+ --tp 2 \
+ --port 8089 \
+ --disable_prefill_cudagraph
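
A quick way to catch this kind of regression is to check that no two launch commands in a test script are identical. The helper below is a hypothetical sketch, not part of the repository; `find_duplicate_runs` is an illustrative name, and `/path/to/model` is a placeholder for the real `--model_dir` value.

```python
import shlex


def find_duplicate_runs(commands: list) -> list:
    """Return (first_index, repeat_index) pairs for command lines that tokenize identically."""
    seen = {}
    duplicates = []
    for index, command in enumerate(commands):
        # Tokenize like a POSIX shell so spacing differences do not hide duplicates.
        key = tuple(shlex.split(command))
        if key in seen:
            duplicates.append((seen[key], index))
        else:
            seen[key] = index
    return duplicates


# Placeholder command line; the model path is illustrative only.
default_run = "python -m lightllm.server.api_server --model_dir /path/to/model --tp 2 --port 8089"
baseline_run = default_run + " --disable_prefill_cudagraph"
```

Here `find_duplicate_runs([default_run, default_run])` flags the pair as a duplicate, while `find_duplicate_runs([default_run, baseline_run])` returns an empty list because the baseline run carries the opt-out flag.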
