fix: preserve recurrent/hybrid model state when the full prompt is already cached#2306

Merged
abetlen merged 1 commit into abetlen:main from allthatido:bugfix/hybrid_model_state_reset
Jun 22, 2026

@allthatido

Contributor

Summary

When the full prompt is already cached, generate() always resets the recurrent state for hybrid models. Its prefix matching compares self._input_ids (N tokens) against tokens[:-1] (N-1 tokens), so longest_prefix is at most N-1, which is always less than self.n_tokens = N, and the reset fires.
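The off-by-one can be reproduced in isolation. The following is a minimal sketch, not the library's code: `longest_common_prefix`, `cached_ids`, and `needs_reset` are hypothetical stand-ins for the prefix matching, `self._input_ids`, and the reset condition described above.

```python
def longest_common_prefix(a, b):
    """Count how many leading tokens two sequences share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Hypothetical stand-in for self._input_ids: the N = 4 tokens already evaluated.
cached_ids = [1, 15, 27, 42]
# The caller passes exactly the same prompt again.
tokens = [1, 15, 27, 42]
n_tokens = len(cached_ids)

# The comparison drops the last prompt token, so at most N-1 tokens can match.
longest_prefix = longest_common_prefix(cached_ids, tokens[:-1])

# Assumed shape of the reset condition for recurrent/hybrid models.
needs_reset = longest_prefix < n_tokens

print(longest_prefix, n_tokens, needs_reset)
```

Even though the prompt and the cache are identical, the comparison can never reach N, so the condition is true and the state is discarded.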

Impact

This breaks multimodal models like MiniCPM-V 4.6 where MTMDChatHandler pre-evaluates image embeddings into the state via its manual eval loop. When generate() resets, those embeddings are wiped and the model responds with "blank image".

Fix

Before triggering the reset, check whether the full prompt is identical, token for token, to the cached input IDs. If it is, skip the reset and set tokens=[] so generation proceeds directly from the existing state.
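A rough sketch of that check follows. `plan_prefill` and its return shape are hypothetical, and the merged commit differs from this original proposal, so read it as an illustration of the idea rather than the actual patch.

```python
def longest_common_prefix(a, b):
    """Count how many leading tokens two sequences share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def plan_prefill(cached_ids, tokens, is_recurrent):
    """Return (reset_state, tokens_to_eval) for a new generate() call.

    Hypothetical helper: the names and return shape are assumptions.
    """
    # Proposed fix: an exact match means the state already holds the
    # whole prompt, so keep it and evaluate nothing.
    if list(cached_ids) == list(tokens):
        return False, []

    prefix = longest_common_prefix(cached_ids, tokens[:-1])

    # In this sketch a recurrent state is only reusable when the whole
    # cache is a prefix of the new prompt; otherwise start over.
    if is_recurrent and prefix < len(cached_ids):
        return True, list(tokens)

    # Otherwise only the unmatched suffix needs evaluating.
    return False, list(tokens[prefix:])


print(plan_prefill([1, 2, 3], [1, 2, 3], is_recurrent=True))
print(plan_prefill([1, 2, 3], [1, 2, 3, 4], is_recurrent=True))
print(plan_prefill([1, 2, 3], [1, 9, 3], is_recurrent=True))
```

The first call is the case from the summary: the prompt is fully cached, so the pre-evaluated state (for example image embeddings) survives.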

@abetlen abetlen force-pushed the bugfix/hybrid_model_state_reset branch 10 times, most recently from 57c7683 to 4faeb81 Compare June 22, 2026 04:09

abetlen commented Jun 22, 2026

Owner

Hey @allthatido, thank you for this! I had to make a few changes to the original PR to make it correct.

Unfortunately this isn't as simple as it looks. The Llama class supports loading serialised llama context state, but that state doesn't include the logits at the final position, which we need for sampling. As a consequence, if the matched prefix is <= the length of the history and there are no new tokens to eval as part of the prefill, we need to "back up" and eval one token so that we have logits at that final position. For transformer models this worked fine because we could always do it, but it became a problem for hybrid / recurrent models.

The solution is to also keep a flag that records whether we still need to evaluate the loaded prompt history. This removes the need to always "back up" the sequence history, so it should work a little better for regular transformer models too.
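To make the flag idea concrete, here is a small sketch of the decision for a prompt that is already fully cached. Everything in it is an assumption: `prefill_for_cached_prompt` and `logits_valid` are hypothetical names, and the recurrent branch (re-evaluating the prompt from scratch when logits are missing) is a conservative guess, since the comment above does not say what the merged code does in that case.

```python
def prefill_for_cached_prompt(tokens, logits_valid, is_recurrent):
    """Return (n_past, tokens_to_eval) for a prompt already in the cache.

    logits_valid is the hypothetical flag: True once the final position
    has been evaluated in this process, False right after restoring a
    serialised state that carries no logits.
    """
    n = len(tokens)

    # Logits for the last position exist: sample directly, no back-up.
    if logits_valid:
        return n, []

    # Transformer: drop one position and re-evaluate the final token
    # to recover its logits.
    if not is_recurrent:
        return n - 1, list(tokens[-1:])

    # Recurrent/hybrid: this sketch assumes the state cannot be rewound
    # by a single token, so it re-evaluates the whole prompt.
    return 0, list(tokens)


prompt = [1, 2, 3]
print(prefill_for_cached_prompt(prompt, logits_valid=True, is_recurrent=True))
print(prefill_for_cached_prompt(prompt, logits_valid=False, is_recurrent=False))
print(prefill_for_cached_prompt(prompt, logits_valid=False, is_recurrent=True))
```

With the flag set, neither model family needs the one-token back-up, which matches the remark that regular transformer models benefit as well.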

@abetlen abetlen force-pushed the bugfix/hybrid_model_state_reset branch from 4faeb81 to e78de05 Compare June 22, 2026 05:17
@abetlen abetlen merged commit 9be3cd1 into abetlen:main Jun 22, 2026
15 checks passed