fix: preserve recurrent/hybrid model state when the full prompt is already cached #2306
57c7683 to 4faeb81
Hey @allthatido, thank you for this! I had to make a few changes to the original PR to make it correct, because unfortunately this isn't as simple as it looks. The issue is that the […] The solution is to also keep a flag to check if we need to evaluate the loaded prompt history. This removes the need to always "back up" the sequence history, so it should work a little better for regular transformer models too.
4faeb81 to e78de05
Summary
`generate()` always resets the recurrent state for hybrid models because its prefix matching compares `self._input_ids` (N tokens) against `tokens[:-1]` (N-1 tokens). When the full prompt is already cached, `longest_prefix` is N-1, which is always `< self.n_tokens = N`, so the reset always fires.
Impact
This breaks multimodal models like MiniCPM-V 4.6, where `MTMDChatHandler` pre-evaluates image embeddings into the state via its manual eval loop. When `generate()` resets, those embeddings are wiped and the model responds with "blank image".
Fix
Check that the full prompt is byte-identical to the cached state before pulling the reset trigger. If it is, skip the reset and set `tokens=[]` so generation proceeds directly from the existing state.
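The arithmetic behind the summary and the fix can be reproduced with a small standalone sketch. This is a toy model, not the library's code: `longest_common_prefix`, `plan_generation`, and the `guard` parameter are hypothetical names introduced here for illustration, and the sketch assumes that after a reset the whole prompt is evaluated again.

```python
from typing import List, Tuple


def longest_common_prefix(cached: List[int], new: List[int]) -> int:
    """Length of the longest shared prefix of two token sequences."""
    n = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        n += 1
    return n


def plan_generation(
    cached: List[int], prompt: List[int], guard: bool
) -> Tuple[bool, List[int]]:
    """Return (reset_state, tokens_to_evaluate) for a recurrent/hybrid model.

    Hypothetical helper that mirrors the logic described above.
    """
    n_tokens = len(cached)
    # Proposed guard: the whole prompt is already held in the cached state,
    # so nothing needs to be reset or re-evaluated.
    if guard and n_tokens > 0 and cached == prompt:
        return False, []
    # Prefix match against prompt[:-1], as in the summary.
    prefix = longest_common_prefix(cached, prompt[:-1])
    if prefix < n_tokens:
        # Assumption: recurrent state cannot be rewound, so it is reset
        # and the full prompt is evaluated from scratch.
        return True, list(prompt)
    return False, list(prompt[prefix:])


prompt = [1, 15, 27, 42]   # N = 4 tokens
cached = list(prompt)      # full prompt already evaluated

# Without the guard: the prefix is N-1 = 3 < 4, so the state is reset.
print(plan_generation(cached, prompt, guard=False))  # (True, [1, 15, 27, 42])

# With the guard: the state is kept and no tokens are re-evaluated.
print(plan_generation(cached, prompt, guard=True))   # (False, [])
```

When the cache holds only a strict prefix of the prompt, both variants behave the same: the state is kept and only the remaining tokens are evaluated.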