feat: enable prefill cudagraph by default #1352
Code Review
This pull request changes the prefill CUDA Graph feature to be enabled by default, replacing the --enable_prefill_cudagraph option with --disable_prefill_cudagraph and updating the relevant models, CLI arguments, documentation, and tests. Feedback on these changes highlights two main issues: first, enabling this feature by default may cause initialization failures or crashes on unsupported models (such as ChatGLM or Baichuan), so a check should be added to restrict it to supported models; second, the test script changes make the baseline and test runs identical, so --disable_prefill_cudagraph should be explicitly added to the baseline run to maintain the distinction.
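A minimal sketch of how an opt-out flag of this kind resolves to a final on/off decision, assuming an argparse-based CLI. The trimmed-down parser and the helper `resolve_prefill_cudagraph` are hypothetical names used for illustration; only the three flag names come from the PR.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser containing only the flags named in this PR.
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--disable_prefill_cudagraph",
        action="store_true",
        help="opt out of prefill cudagraph, which is on by default",
    )
    parser.add_argument("--enable_ep_moe", action="store_true")
    parser.add_argument("--enable_dp_prefill_balance", action="store_true")
    return parser


def resolve_prefill_cudagraph(args: argparse.Namespace) -> bool:
    """Return True only when prefill cudagraph should actually be used."""
    if args.disable_prefill_cudagraph:
        return False
    # Configurations the PR lists as incompatible switch the feature off.
    if args.enable_ep_moe or args.enable_dp_prefill_balance:
        return False
    return True


default_args = build_parser().parse_args([])
opted_out = build_parser().parse_args(["--disable_prefill_cudagraph"])
ep_moe = build_parser().parse_args(["--enable_ep_moe"])
```

With no flags the helper returns `True`, while either the explicit opt-out or an incompatible option yields `False`. This also shows why a test script that passes no flag now exercises the cudagraph path.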
+ args = get_env_start_args()
+ enable_prefill_cudagraph = not args.disable_prefill_cudagraph
+ # prefill cudagraph currently only supports llama/qwen-style models. It is incompatible with
+ # ep moe and dp prefill balance, so auto-skip those configs even when not explicitly disabled.
+ if enable_prefill_cudagraph and args.enable_ep_moe:
+     logger.warning("prefill cudagraph does not support ep moe, disabling it.")
+     enable_prefill_cudagraph = False
+ if enable_prefill_cudagraph and args.enable_dp_prefill_balance:
+     logger.warning("prefill cudagraph does not support dp prefill balance, disabling it.")
+     enable_prefill_cudagraph = False

  self.prefill_graph = (
-     None
-     if not get_env_start_args().enable_prefill_cudagraph
-     else PrefillCudaGraph(decode_cuda_graph=self.graph, tp_world_size=self.tp_world_size_)
+     PrefillCudaGraph(decode_cuda_graph=self.graph, tp_world_size=self.tp_world_size_)
+     if enable_prefill_cudagraph
+     else None
  )
  if self.prefill_graph is not None:
      if get_env_start_args().enable_prefill_microbatch_overlap:
Since prefill CUDA Graph is now enabled by default, it will attempt to initialize and run warmup for all models. However, as noted in the comments, prefill CUDA Graph currently only supports llama/qwen-style models. Running it on unsupported models (e.g., ChatGLM, Baichuan) will lead to initialization failures or runtime crashes during warmup.
We should add a check to restrict the default enablement to supported models (such as llama, qwen, yi, mistral). Additionally, we can use getattr for safer attribute access on the args object and reuse the args variable instead of calling get_env_start_args() multiple times.
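One way to keep such a model check testable is to pull it out as a pure function. The sketch below is an illustration under stated assumptions, not code from the PR: the function name, the prefix-matching rule, and the example class names are all hypothetical. It uses a prefix match because a plain substring test for a short token such as "yi" also matches unrelated names.

```python
# Model families taken from the review comment; the matching rule is an assumption.
SUPPORTED_PREFIXES = ("llama", "qwen", "yi", "mistral")


def supports_prefill_cudagraph(model_class_name: str) -> bool:
    """Return True if the (hypothetical) class name starts with a supported family."""
    # str.startswith accepts a tuple and returns True if any prefix matches.
    return model_class_name.lower().startswith(SUPPORTED_PREFIXES)


# Illustrative class names only; real class names in the code base may differ.
llama_ok = supports_prefill_cudagraph("LlamaTpPartModel")
chatglm_ok = supports_prefill_cudagraph("ChatGlm2TpPartModel")

# A substring test is looser: "myimagemodel" contains "yi" without being a Yi model.
substring_hit = "yi" in "MyImageModel".lower()
prefix_hit = supports_prefill_cudagraph("MyImageModel")
```

Whether prefix matching fits depends on how model classes are actually named. An explicit per-class attribute would avoid name matching altogether, at the cost of touching each model class.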
args = get_env_start_args()
enable_prefill_cudagraph = not getattr(args, "disable_prefill_cudagraph", False)
# prefill cudagraph currently only supports llama/qwen-style models.
model_class_name = self.__class__.__name__.lower()
is_supported_model = any(m in model_class_name for m in ["llama", "qwen", "yi", "mistral"])
if enable_prefill_cudagraph and not is_supported_model:
    enable_prefill_cudagraph = False
# It is incompatible with ep moe and dp prefill balance, so auto-skip those configs even when not explicitly disabled.
if enable_prefill_cudagraph and getattr(args, "enable_ep_moe", False):
    logger.warning("prefill cudagraph does not support ep moe, disabling it.")
    enable_prefill_cudagraph = False
if enable_prefill_cudagraph and getattr(args, "enable_dp_prefill_balance", False):
    logger.warning("prefill cudagraph does not support dp prefill balance, disabling it.")
    enable_prefill_cudagraph = False
self.prefill_graph = (
    PrefillCudaGraph(decode_cuda_graph=self.graph, tp_world_size=self.tp_world_size_)
    if enable_prefill_cudagraph
    else None
)
if self.prefill_graph is not None:
    if getattr(args, "enable_prefill_microbatch_overlap", False):

  LOADWORKER=18 CUDA_VISIBLE_DEVICES=6,7 python -m lightllm.server.api_server \
      --model_dir /root/.cache/huggingface/hub/models--Qwen--Qwen3.5-0.8B/snapshots/2fc06364715b967f1860aea9cf38778875588b17 \
      --tp 2 \
-     --port 8089 \
-     --enable_prefill_cudagraph
+     --port 8089
By removing --enable_prefill_cudagraph from this second test run, it becomes completely identical to the first test run (which also has no prefill cudagraph flags and thus now runs with prefill cudagraph enabled by default).
To preserve the distinction between the baseline test (without prefill cudagraph) and the prefill cudagraph test, we should add --disable_prefill_cudagraph to this second run (making it the baseline test) and keep the first run as the default prefill cudagraph test.
LOADWORKER=18 CUDA_VISIBLE_DEVICES=6,7 python -m lightllm.server.api_server \
    --model_dir /root/.cache/huggingface/hub/models--Qwen--Qwen3.5-0.8B/snapshots/2fc06364715b967f1860aea9cf38778875588b17 \
    --tp 2 \
    --port 8089 \
    --disable_prefill_cudagraph
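The point about the two runs becoming identical can be guarded against by building both launch commands from one helper and checking that they differ. This is a rough sketch only: `server_command` and the placeholder model path are hypothetical, while the module name and flags are the ones shown in the test script.

```python
from typing import List


def server_command(
    model_dir: str, prefill_cudagraph: bool, tp: int = 2, port: int = 8089
) -> List[str]:
    """Build the api_server launch command for one test run."""
    cmd = [
        "python", "-m", "lightllm.server.api_server",
        "--model_dir", model_dir,
        "--tp", str(tp),
        "--port", str(port),
    ]
    # Prefill cudagraph is on by default, so only the baseline needs a flag.
    if not prefill_cudagraph:
        cmd.append("--disable_prefill_cudagraph")
    return cmd


# "/path/to/model" is a placeholder, not a path from the PR.
baseline_cmd = server_command("/path/to/model", prefill_cudagraph=False)
cudagraph_cmd = server_command("/path/to/model", prefill_cudagraph=True)

# The comparison is only meaningful if the two runs really differ.
assert baseline_cmd != cudagraph_cmd
```

Generating both commands from a single function keeps every shared argument in sync, so the opt-out flag is the only difference between the baseline and the default run.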
Make prefill cudagraph the default. Replaces --enable_prefill_cudagraph with opt-out --disable_prefill_cudagraph; auto-skips ep moe and dp prefill balance.