-
-
Notifications
You must be signed in to change notification settings - Fork 3.7k
Fix beam search for Llama models by adding reorder_cache method #2753
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
When using beam search with Unsloth-optimized Llama models, users encounter: NotImplementedError: Make sure that a `reorder_cache` function is correctly implemented in transformers.models.llama.modeling_llama This occurs because Unsloth patches LlamaForCausalLM but doesn't preserve the reorder_cache static method required for beam search operations. The fix adds the missing reorder_cache method after Unsloth's patching, ensuring compatibility with transformers' beam search functionality. This allows users to use generation methods like model.generate(num_beams=N) without errors. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
The license field was using a simple string format which causes installation failures with modern setuptools. PEP 621 requires the license to be specified as either {file: "path"} or {text: "value"}. This change updates the license field to use the correct object format, allowing successful package installation from source. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
Thanks! I think pyproject.toml doesn't need changes hopefully right? |
Oops I wasn't trying to merge it in yet, it's not ready. I do think pyproject.toml needs changes (unrelated to the inference issue). Whenever I try to pip install, I get:
And the change here fixes it. |
The transformers library expects the method to be named _reorder_cache (with underscore) for beam search functionality. The previous commit incorrectly used reorder_cache without the underscore, causing the NotImplementedError to persist. This corrects the method name to match transformers' expectations, properly enabling beam search for Unsloth-optimized Llama models. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
Moved the _llama_reorder_cache function outside of the pre_patch method to make it a clean module-level function. This improves code organization and follows Python best practices by avoiding nested function definitions where unnecessary. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
I'd also like to come up with a more general solution so we don't have to do this for every model, since I know Gemma has had issues too. Will update and ping you when it's ready :) |
…ForCausalLM The beam search error was occurring because PEFT models wrap the base model and need their own _reorder_cache method. This commit adds the same _reorder_cache implementation to PeftModelForCausalLM class. 🤖 Generated with Claude Code Co-Authored-By: Claude <noreply@anthropic.com>
Added _reorder_cache method to: 1. Model instances in from_pretrained 2. PEFT model instances in patch_peft_model 3. Base model of PEFT models (model.base_model) This ensures beam search works regardless of how the model is accessed during generation. 🤖 Generated with Claude Code Co-Authored-By: Claude <noreply@anthropic.com>
- Created general_reorder_cache function in _utils.py that works for all transformer models - Created patch_model_for_beam_search function to patch any model instance - Updated llama.py to use the general solution instead of model-specific implementation - This fix now automatically works for all models (Gemma, Mistral, Qwen, etc.) that inherit from FastLlamaModel - Models that don't inherit can easily use patch_model_for_beam_search to add beam search support This is a more maintainable solution that fixes beam search for all Unsloth models. 🤖 Generated with Claude Code Co-Authored-By: Claude <noreply@anthropic.com>
- Added explanation of why we need to provide our own _reorder_cache implementation - Added None check in general_reorder_cache for safety - Documented that newer models rely on Cache classes but beam search might still use legacy format - Clarified that our implementation is based on working models from transformers library 🤖 Generated with Claude Code Co-Authored-By: Claude <noreply@anthropic.com>
The transformers library expects _reorder_cache to be a staticmethod on the model class, not a regular method. This matches the pattern used by models like GPT2 and OPT. 🤖 Generated with Claude Code Co-Authored-By: Claude <noreply@anthropic.com>
The key issue was that we were only patching our local import of LlamaForCausalLM, but the error was looking for _reorder_cache in transformers.models.llama.modeling_llama. Now we patch both the local import AND the transformers module's version. 🤖 Generated with Claude Code Co-Authored-By: Claude <noreply@anthropic.com>
…ints - Added detailed debugging history to CLAUDE.md including all attempted fixes - Documented the root cause and current state of the beam search issue - Removed debug print statements from llama.py - Ready to transfer to Windows machine for further debugging The beam search fix attempts to patch _reorder_cache in multiple places: 1. Local LlamaForCausalLM class 2. PeftModelForCausalLM class 3. transformers.models.llama.modeling_llama.LlamaForCausalLM 🤖 Generated with Claude Code Co-Authored-By: Claude <noreply@anthropic.com>
@amrothemich Actually can you move the pyrpoject.toml PR to a new PR thanks :) |
This notebook provides a minimal test case to debug the beam search issue on a GPU-enabled environment. It: - Clones our fix-reorder-cache branch - Installs unsloth from source - Tests beam search with a small model - Includes debugging code to check module paths To use: Upload to Google Colab and run with GPU runtime. 🤖 Generated with Claude Code Co-Authored-By: Claude <noreply@anthropic.com>
@amrothemich I think it's best to open a new PR - I think you included Claude Code :) |
When using beam search with Unsloth-optimized Llama models, users encounter: NotImplementedError: Make sure that a
reorder_cache
function is correctly implemented in transformers.models.llama.modeling_llamaThis occurs because Unsloth patches LlamaForCausalLM but doesn't preserve the reorder_cache static method required for beam search operations.
The fix adds the missing reorder_cache method after Unsloth's patching, ensuring compatibility with transformers' beam search functionality. This allows users to use generation methods like model.generate(num_beams=N) without errors.