fix: preserve recurrent/hybrid model state when the full prompt is already cached by allthatido · Pull Request #2306 · abetlen/llama-cpp-python

allthatido · 2026-06-14T21:53:16Z

Summary

When the full prompt is already cached, generate() always resets the recurrent state for hybrid models. Its prefix matching compares self._input_ids (N tokens) against tokens[:-1] (N-1 tokens), so longest_prefix can be at most N-1. That is always less than self.n_tokens = N, so the reset fires every time.
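The off-by-one described here can be reproduced in isolation. The snippet below is a minimal sketch in plain Python, not the library's code: `longest_common_prefix`, `cached_ids`, and `would_reset` are illustrative names, and the lists merely stand in for self._input_ids and the incoming tokens.

```python
from typing import Sequence


def longest_common_prefix(a: Sequence[int], b: Sequence[int]) -> int:
    """Count how many leading tokens two sequences share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Stand-ins (assumptions for illustration only):
cached_ids = [1, 15, 27, 99, 4]   # plays the role of self._input_ids, N = 5
tokens = list(cached_ids)         # the full prompt is already cached
n_tokens = len(cached_ids)        # plays the role of self.n_tokens

# Matching against tokens[:-1] caps the prefix at N-1 ...
longest_prefix = longest_common_prefix(cached_ids, tokens[:-1])

# ... so a "prefix shorter than cached length" check is always true here.
would_reset = longest_prefix < n_tokens
```

With an identical prompt, `longest_prefix` ends up one short of `n_tokens`, which is exactly the condition that triggers the reset.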

Impact

This breaks multimodal models like MiniCPM-V 4.6 where MTMDChatHandler pre-evaluates image embeddings into the state via its manual eval loop. When generate() resets, those embeddings are wiped and the model responds with "blank image".

Fix

Check that the full prompt is byte-identical to the cached state before pulling the reset trigger. If it is, skip reset and set tokens=[] so generation proceeds directly from the existing state.
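A rough sketch of that guard, written as a standalone function rather than as the actual patch: `plan_prefill` and its return shape are hypothetical, and plain list equality stands in for the "identical to the cached state" check.

```python
from typing import List, Sequence, Tuple


def longest_common_prefix(a: Sequence[int], b: Sequence[int]) -> int:
    """Count how many leading tokens two sequences share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def plan_prefill(
    cached_ids: Sequence[int], tokens: Sequence[int], is_recurrent: bool
) -> Tuple[bool, List[int]]:
    """Return (reset_state, tokens_still_to_eval). Hypothetical helper."""
    # Proposed guard: the whole prompt is already in the state, so keep it
    # and evaluate nothing.
    if list(cached_ids) == list(tokens):
        return False, []

    prefix = longest_common_prefix(cached_ids, tokens[:-1])

    # Recurrent state cannot be rewound to an earlier position, so a
    # partial match forces a reset and a full re-evaluation.
    if is_recurrent and prefix < len(cached_ids):
        return True, list(tokens)

    # Otherwise reuse the matched prefix and evaluate only the remainder.
    return False, list(tokens[prefix:])
```

For example, `plan_prefill([1, 2, 3], [1, 2, 3], True)` returns `(False, [])`, while a prompt that diverges from the cached tokens still resets.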

abetlen · 2026-06-22T05:14:52Z

Hey @allthatido thank you for this! There were a few changes I had to make to the original PR to make it correct.

Unfortunately this isn't as simple as it looks. The Llama class supports loading serialised llama context state, but that state doesn't include the logits at the final position, which we need for sampling. As a consequence, if the matched prefix is <= the length of the history and there are no new tokens to eval as part of the prefill, we need to "back up" and eval one token so that we have logits at that final position. For transformer models this worked fine because we could always do this, but it became a problem for hybrid / recurrent models.

The solution is to also keep a flag that tracks whether we need to evaluate the loaded prompt history. This removes the need to always "back up" the sequence history, so it should work a little better for regular transformer models too.
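One way to picture the flag is the toy model below. It is only a sketch under stated assumptions: `ToyContext`, `has_final_logits`, and `prefill` are invented names, the truncation step models a KV-cache-style transformer, and the handling specific to recurrent models is not shown because the comment does not spell it out.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyContext:
    """Toy stand-in for a model context; not the real Llama class."""
    input_ids: List[int] = field(default_factory=list)
    has_final_logits: bool = False  # hypothetical flag name

    def eval(self, tokens: List[int]) -> None:
        # Evaluating tokens yields logits at the last evaluated position.
        if tokens:
            self.input_ids.extend(tokens)
            self.has_final_logits = True

    def load_state(self, input_ids: List[int]) -> None:
        # A loaded state restores the history but carries no logits.
        self.input_ids = list(input_ids)
        self.has_final_logits = False


def prefill(ctx: ToyContext, prompt: List[int]) -> int:
    """Ensure logits exist at the end of prompt; return tokens evaluated."""
    if not prompt:
        raise ValueError("prompt must contain at least one token")

    prefix = 0
    for a, b in zip(ctx.input_ids, prompt):
        if a != b:
            break
        prefix += 1

    fully_cached = prefix == len(prompt) == len(ctx.input_ids)
    if fully_cached and ctx.has_final_logits:
        return 0  # nothing to evaluate, no need to back up

    if prefix == len(prompt):
        prefix -= 1  # back up one token so the final logits get recomputed

    ctx.input_ids = ctx.input_ids[:prefix]  # drop everything past the prefix
    pending = prompt[prefix:]
    ctx.eval(pending)
    return len(pending)
```

In this toy, repeating an identical prompt right after evaluating it costs no work, whereas repeating it after `load_state` re-evaluates exactly one token.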


abetlen force-pushed the bugfix/hybrid_model_state_reset branch 10 times, most recently from 57c7683 to 4faeb81 · June 22, 2026 04:09

fix: preserve recurrent/hybrid model state when the full prompt is already cached

e78de05

abetlen force-pushed the bugfix/hybrid_model_state_reset branch from 4faeb81 to e78de05 · June 22, 2026 05:17

abetlen merged commit 9be3cd1 into abetlen:main Jun 22, 2026
15 checks passed

abetlen merged 1 commit into abetlen:main from allthatido:bugfix/hybrid_model_state_reset
