diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index eb60a5a20..abbaedd5b 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -3,27 +3,30 @@ Thank you for not only using Unsloth but also for being interested in helping out! We value all contributions, whether they come in the form of code, ideas, support for others or just by simply spreading the word of Unsloth! 💕 - **[Support the Community](https://github.com/unslothai/unsloth/issues)**: Answer questions, review pull requests, or assist others in discussions. -- **Fix Bugs**: Identify and resolve issues with the existing codebase. -- **Submit Ideas**: Request new features or share enhancements you'd like to see. +- **Fix Bugs**: Identify and resolve issues with the existing codebase. +- **Submit Ideas**: Request new features or share enhancements you'd like to see. - **Develop Features**: Implement new functionality or improve existing tools which can be done via PRs. - **[Improve Documentation](https://docs.unsloth.ai/)**: Help by creating guides, FAQs, or enhancing clarity. One of the best ways to support us is by spreading the word about Unsloth! Share how it’s powering your amazing projects in blog posts or social media, and inspire others to explore its potential. Even a simple star on our repo goes a long way in showing your support and helping the community grow. 🌟 -## Submitting Issues -If you find a bug or have a feature idea, we’d love to hear from you! Here’s how to make your submission stand out: +## Submitting Issues +If you find a bug or have a feature idea, we’d love to hear from you! Here’s how to make your submission stand out: -### Reporting Bugs -1. **Search First**: Check if the issue has already been reported using GitHub’s search bar under Issues. -2. **Details Matter**: Is this on Google Colab, Kaggle, or on another platform service? Are you using Unsloth's official notebook? Include your OS, Python version, and other relevant details. 
For bugs, a concise code snippet that reproduces the issue is incredibly helpful. +### Reporting Bugs +1. **Search First**: Check if the issue has already been reported using GitHub’s search bar under Issues. +2. **Details Matter**: Is this on Google Colab, Kaggle, or on another platform service? Are you using Unsloth's official notebook? Include your OS, Python version, and other relevant details. For bugs, a concise code snippet that reproduces the issue is incredibly helpful. 3. **Be Thorough**: Attach screenshots, traceback logs, or any additional information that might speed up resolution. ## Spread the Word -Your support extends beyond code: -- Spread the word by writing about Unsloth in blogs or social media. -- Share how Unsloth powers your projects. -- Star our repository to show your appreciation. +Your support extends beyond code: +- Spread the word by writing about Unsloth in blogs or social media. +- Share how Unsloth powers your projects. +- Star our repository to show your appreciation. -Finally, please be mindful of our [Code of Conduct](https://github.com/unslothai/unsloth/blob/main/CODE_OF_CONDUCT.md) to ensure a welcoming and inclusive environment for everyone. +## Note +The `README.md` now has a section under "✨ Finetune for Free" titled "Exporting Models from Colab to Local Machine" with detailed steps for exporting models from Colab to your local machine. + +Finally, please be mindful of our [Code of Conduct](https://github.com/unslothai/unsloth/blob/main/CODE_OF_CONDUCT.md) to ensure a welcoming and inclusive environment for everyone. Thank you so much for reading and we hope you have lots of fun using Unsloth! 🦥 diff --git a/README.md index 1314cb1c5..83eb45ad1 100644 --- a/README.md +++ b/README.md @@ -1,191 +1,152 @@
- + unsloth logo - - - + + + -### Train gpt-oss, DeepSeek, Gemma, Qwen & Llama 2x faster with 70% less VRAM! +### Finetune Llama 3.3, Mistral, Phi-4, Qwen 2.5 & Gemma 2x faster with 80% less memory! ![](https://i.ibb.co/sJ7RhGG/image-41.png)
-## ✨ Train for Free +## ✨ Finetune for Free -Notebooks are beginner friendly. Read our [guide](https://unsloth.ai/docs/get-started/fine-tuning-llms-guide). Add dataset, run, then deploy your trained model. +All notebooks are **beginner friendly**! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model which can be exported to GGUF, Ollama, vLLM or uploaded to Hugging Face. -| Model | Free Notebooks | Performance | Memory use | +| Unsloth supports | Free Notebooks | Performance | Memory use | |-----------|---------|--------|----------| -| **Qwen3.5 (4B)** | [▶️ Start for free](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen3_5_(4B)_Vision.ipynb) | 1.5x faster | 60% less | -| **gpt-oss (20B)** | [▶️ Start for free](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/gpt-oss-(20B)-Fine-tuning.ipynb) | 2x faster | 70% less | -| **gpt-oss (20B): GRPO** | [▶️ Start for free](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/gpt-oss-(20B)-GRPO.ipynb) | 2x faster | 80% less | -| **Qwen3: Advanced GRPO** | [▶️ Start for free](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen3_(4B)-GRPO.ipynb) | 2x faster | 50% less | -| **Gemma 3 (4B) Vision** | [▶️ Start for free](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Gemma3_(4B)-Vision.ipynb) | 1.7x faster | 60% less | -| **embeddinggemma (300M)** | [▶️ Start for free](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/EmbeddingGemma_(300M).ipynb) | 2x faster | 20% less | -| **Mistral Ministral 3 (3B)** | [▶️ Start for free](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Ministral_3_VL_(3B)_Vision.ipynb) | 1.5x faster | 60% less | -| **Llama 3.1 (8B) Alpaca** | [▶️ Start for free](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-Alpaca.ipynb) | 2x faster | 70% less | -| **Llama 3.2 
Conversational** | [▶️ Start for free](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(1B_and_3B)-Conversational.ipynb) | 2x faster | 70% less | -| **Orpheus-TTS (3B)** | [▶️ Start for free](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Orpheus_(3B)-TTS.ipynb) | 1.5x faster | 50% less | +| **Llama 3.2 (3B)** | [▶️ Start for free](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(1B_and_3B)-Conversational.ipynb) | 2x faster | 70% less | +| **GRPO (reasoning)** | [▶️ Start for free](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb) | 2x faster | 80% less | +| **Phi-4 (14B)** | [▶️ Start for free](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Phi_4-Conversational.ipynb) | 2x faster | 70% less | +| **Llama 3.2 Vision (11B)** | [▶️ Start for free](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb) | 2x faster | 50% less | +| **Llama 3.1 (8B)** | [▶️ Start for free](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-Alpaca.ipynb) | 2x faster | 70% less | +| **Gemma 2 (9B)** | [▶️ Start for free](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Gemma2_(9B)-Alpaca.ipynb) | 2x faster | 70% less | +| **Qwen 2.5 (7B)** | [▶️ Start for free](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen2.5_(7B)-Alpaca.ipynb) | 2x faster | 70% less | +| **Mistral v0.3 (7B)** | [▶️ Start for free](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Mistral_v0.3_(7B)-Conversational.ipynb) | 2.2x faster | 75% less | +| **Ollama** | [▶️ Start for free](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb) | 1.9x faster | 60% less | +| **DPO Zephyr** | [▶️ Start for 
free](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Zephyr_(7B)-DPO.ipynb) | 1.9x faster | 50% less | -- See all our notebooks for: [Kaggle](https://github.com/unslothai/notebooks?tab=readme-ov-file#-kaggle-notebooks), [GRPO](https://unsloth.ai/docs/get-started/unsloth-notebooks#grpo-reasoning-rl-notebooks), [TTS](https://unsloth.ai/docs/get-started/unsloth-notebooks#text-to-speech-tts-notebooks), [embedding](https://unsloth.ai/docs/new/embedding-finetuning) & [Vision](https://unsloth.ai/docs/get-started/unsloth-notebooks#vision-multimodal-notebooks) -- See [all our models](https://unsloth.ai/docs/get-started/unsloth-model-catalog) and [all our notebooks](https://unsloth.ai/docs/get-started/unsloth-notebooks) -- See detailed documentation for Unsloth [here](https://unsloth.ai/docs) +- See [all our notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks) and [all our models](https://docs.unsloth.ai/get-started/all-our-models) +- **Kaggle Notebooks** for [Llama 3.2 Kaggle notebook](https://www.kaggle.com/danielhanchen/kaggle-llama-3-2-1b-3b-unsloth-notebook), [Llama 3.1 (8B)](https://www.kaggle.com/danielhanchen/kaggle-llama-3-1-8b-unsloth-notebook), [Gemma 2 (9B)](https://www.kaggle.com/code/danielhanchen/kaggle-gemma-7b-unsloth-notebook/), [Mistral (7B)](https://www.kaggle.com/code/danielhanchen/kaggle-mistral-7b-unsloth-notebook) +- Run notebooks for [Llama 3.2 conversational](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(1B_and_3B)-Conversational.ipynb), [Llama 3.1 conversational](https://colab.research.google.com/drive/15OyFkGoCImV9dSsewU1wa2JuKB4-mDE_?usp=sharing) and [Mistral v0.3 ChatML](https://colab.research.google.com/drive/15F1xyn8497_dUbxZP4zWmPZ3PJx1Oymv?usp=sharing) +- This [continued pretraining notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Mistral_v0.3_(7B)-CPT.ipynb) is for learning another language +- Click 
[here](https://docs.unsloth.ai/) for detailed documentation for Unsloth. -## ⚡ Quickstart -### Linux or WSL -```bash -pip install unsloth +## Exporting Models from Colab to Local Machine + +If you have fine-tuned a model in Colab and want to use it locally on your machine, follow these steps: + +1. **Save the Model in Colab**: Ensure you have saved the model in a format that can be easily downloaded. You can use the `unsloth_save_model` function to save the model in the desired format. + +2. **Connect to Google Drive**: Mount your Google Drive in Colab to save the model files there. This allows you to download the files to your local machine later. + +```python +from google.colab import drive +drive.mount('/content/drive') ``` -### Windows -For Windows, `pip install unsloth` works only if you have Pytorch installed. Read our [Windows Guide](https://unsloth.ai/docs/get-started/install/windows-installation). -### Docker -Use our official [Unsloth Docker image](https://hub.docker.com/r/unsloth/unsloth) ```unsloth/unsloth``` container. Read our [Docker Guide](https://unsloth.ai/docs/get-started/install/docker). +3. **Save Model to Google Drive**: Save the model files to a directory in your Google Drive. -### AMD, Intel, Blackwell & DGX Spark -For RTX 50x, B200, 6000 GPUs: `pip install unsloth`. Read our guides for: [Blackwell](https://unsloth.ai/docs/blog/fine-tuning-llms-with-blackwell-rtx-50-series-and-unsloth) and [DGX Spark](https://unsloth.ai/docs/blog/fine-tuning-llms-with-nvidia-dgx-spark-and-unsloth).
-To install Unsloth on **AMD** and **Intel** GPUs, follow our [AMD Guide](https://unsloth.ai/docs/get-started/install/amd) and [Intel Guide](https://unsloth.ai/docs/get-started/install/intel). +```python +model.save_pretrained('/content/drive/MyDrive/your_model_directory') +tokenizer.save_pretrained('/content/drive/MyDrive/your_model_directory') +``` -## 🦥 Unsloth News -- **Qwen3.5** - 0.8B, 2B, 4B, 9B, 27B, 35-A3B, 112B-A10B are now supported. [Guide + notebooks](https://unsloth.ai/docs/models/qwen3.5/fine-tune) -- Train **MoE LLMs 12x faster** with 35% less VRAM - DeepSeek, GLM, Qwen and gpt-oss. [Blog](https://unsloth.ai/docs/new/faster-moe) -- **Embedding models**: Unsloth now supports ~1.8-3.3x faster embedding fine-tuning. [Blog](https://unsloth.ai/docs/new/embedding-finetuning) • [Notebooks](https://unsloth.ai/docs/get-started/unsloth-notebooks#embedding-models) -- New **7x longer context RL** vs. all other setups, via our new batching algorithms. [Blog](https://unsloth.ai/docs/new/grpo-long-context) -- New RoPE & MLP **Triton Kernels** & **Padding Free + Packing**: 3x faster training & 30% less VRAM. [Blog](https://unsloth.ai/docs/new/3x-faster-training-packing) -- **500K Context**: Training a 20B model with >500K context is now possible on an 80GB GPU. [Blog](https://unsloth.ai/docs/blog/500k-context-length-fine-tuning) -- **FP8 & Vision RL**: You can now do FP8 & VLM GRPO on consumer GPUs. [FP8 Blog](https://unsloth.ai/docs/get-started/reinforcement-learning-rl-guide/fp8-reinforcement-learning) • [Vision RL](https://unsloth.ai/docs/get-started/reinforcement-learning-rl-guide/vision-reinforcement-learning-vlm-rl) -- **Docker**: Use Unsloth with no setup & environment issues with our new image. 
[Guide](https://unsloth.ai/docs/blog/how-to-fine-tune-llms-with-unsloth-and-docker) • [Docker image](https://hub.docker.com/r/unsloth/unsloth) -- **gpt-oss** by OpenAI: Read our [RL blog](https://unsloth.ai/docs/models/gpt-oss-how-to-run-and-fine-tune/gpt-oss-reinforcement-learning), [Flex Attention](https://unsloth.ai/docs/models/gpt-oss-how-to-run-and-fine-tune/long-context-gpt-oss-training) blog and [Guide](https://unsloth.ai/docs/models/gpt-oss-how-to-run-and-fine-tune). +4. **Download Model Files**: After saving the model files to Google Drive, you can download them to your local machine. Go to your Google Drive, locate the model directory, and download the files. +5. **Load Model Locally**: Once you have downloaded the model files to your local machine, you can load the model using the `from_pretrained` method. + +```python +from transformers import AutoModel, AutoTokenizer + +model = AutoModel.from_pretrained('path_to_your_model_directory') +tokenizer = AutoTokenizer.from_pretrained('path_to_your_model_directory') +``` + +By following these steps, you can easily export a fine-tuned model from Colab and use it locally on your machine. + +## 🦥 Unsloth.ai News +- 📣 NEW! Introducing [Reasoning](https://unsloth.ai/blog/r1-reasoning) in Unsloth. You can now reproduce DeepSeek-R1's "aha" moment with just 7GB VRAM. Transform Llama, Phi, Mistral etc. into reasoning LLMs! +- 📣 NEW! [DeepSeek-R1](https://unsloth.ai/blog/deepseek-r1) - the most powerful open reasoning models with Llama & Qwen distillations. Run or fine-tune them now! More details: [unsloth.ai/blog/deepseek-r1](https://unsloth.ai/blog/deepseek-r1). All model uploads: [here](https://huggingface.co/collections/unsloth/deepseek-r1-all-versions-678e1c48f5d2fce87892ace5). +- 📣 NEW! [Phi-4](https://unsloth.ai/blog/phi4) by Microsoft is now supported. 
We also [fixed bugs](https://unsloth.ai/blog/phi4) in Phi-4 and [uploaded GGUFs, 4-bit](https://huggingface.co/collections/unsloth/phi-4-all-versions-677eecf93784e61afe762afa). Try the [Phi-4 Colab notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Phi_4-Conversational.ipynb) +- 📣 NEW! [Llama 3.3 (70B)](https://huggingface.co/collections/unsloth/llama-33-all-versions-67535d7d994794b9d7cf5e9f), Meta's latest model is supported. +- 📣 NEW! We worked with Apple to add [Cut Cross Entropy](https://arxiv.org/abs/2411.09009). Unsloth now supports 89K context for Meta's Llama 3.3 (70B) on a 80GB GPU - 13x longer than HF+FA2. For Llama 3.1 (8B), Unsloth enables 342K context, surpassing its native 128K support. +- 📣 Introducing Unsloth [Dynamic 4-bit Quantization](https://unsloth.ai/blog/dynamic-4bit)! We dynamically opt not to quantize certain parameters and this greatly increases accuracy while only using <10% more VRAM than BnB 4-bit. See our collection on [Hugging Face here.](https://huggingface.co/collections/unsloth/unsloth-4-bit-dynamic-quants-67503bb873f89e15276c44e7) +- 📣 [Vision models](https://unsloth.ai/blog/vision) now supported! [Llama 3.2 Vision (11B)](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb), [Qwen 2.5 VL (7B)](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen2_VL_(7B)-Vision.ipynb) and [Pixtral (12B) 2409](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Pixtral_(12B)-Vision.ipynb)
Click for more news - -- **Quantization-Aware Training**: We collabed with Pytorch, recovering ~70% accuracy. [Read blog](https://unsloth.ai/docs/blog/quantization-aware-training-qat) -- **Memory-efficient RL**: We're introducing even better RL. Our new kernels & algos allows faster RL with 50% less VRAM & 10× more context. [Read blog](https://unsloth.ai/docs/get-started/reinforcement-learning-rl-guide/memory-efficient-rl) -- **Mistral 3**: Run Ministral 3 or Devstral 2 and fine-tune with vision/RL sudoku notebooks. [Guide](https://unsloth.ai/docs/models/tutorials/ministral-3) • [Notebooks](https://unsloth.ai/docs/models/ministral-3#fine-tuning-ministral-3) -- **Gemma 3n** by Google: [Read Blog](https://unsloth.ai/docs/models/gemma-3-how-to-run-and-fine-tune/gemma-3n-how-to-run-and-fine-tune). We [uploaded GGUFs, 4-bit models](https://huggingface.co/collections/unsloth/gemma-3n-685d3874830e49e1c93f9339). -- **[Text-to-Speech (TTS)](https://unsloth.ai/docs/basics/text-to-speech-tts-fine-tuning)** is now supported, including `sesame/csm-1b` and STT `openai/whisper-large-v3`. -- **[Qwen3](https://unsloth.ai/docs/models/qwen3-how-to-run-and-fine-tune)** is now supported. Qwen3-30B-A3B fits on 17.5GB VRAM. -- Introducing **[Dynamic 2.0](https://unsloth.ai/docs/basics/unsloth-dynamic-2.0-ggufs)** quants that set new benchmarks on 5-shot MMLU & Aider Polyglot. -- [**EVERYTHING** is now supported](https://unsloth.ai/blog/gemma3#everything) - all models (TTS, BERT, Mamba), FFT, etc. [MultiGPU](https://unsloth.ai/docs/basics/multi-gpu-training-with-unsloth) is now supported. Enable FFT with `full_finetuning = True`, 8-bit with `load_in_8bit = True`. -- 📣 [DeepSeek-R1](https://unsloth.ai/blog/deepseek-r1) - run or fine-tune them [with our guide](https://unsloth.ai/blog/deepseek-r1). All model uploads: [here](https://huggingface.co/collections/unsloth/deepseek-r1-all-versions-678e1c48f5d2fce87892ace5). 
-- 📣 Introducing Long-context [Reasoning (GRPO)](https://unsloth.ai/blog/grpo) in Unsloth. Train your own reasoning model with just 5GB VRAM. Transform Llama, Phi, Mistral etc. into reasoning LLMs! -- 📣 Introducing Unsloth [Dynamic 4-bit Quantization](https://unsloth.ai/blog/dynamic-4bit)! We dynamically opt not to quantize certain parameters and this greatly increases accuracy while only using <10% more VRAM than BnB 4-bit. See our collection on [Hugging Face here.](https://huggingface.co/collections/unsloth/unsloth-4-bit-dynamic-quants-67503bb873f89e15276c44e7) -- 📣 **[Llama 4](https://unsloth.ai/blog/llama4)** by Meta, including Scout & Maverick are now supported. -- 📣 [Phi-4](https://unsloth.ai/blog/phi4) by Microsoft: We also [fixed bugs](https://unsloth.ai/blog/phi4) in Phi-4 and [uploaded GGUFs, 4-bit](https://huggingface.co/collections/unsloth/phi-4-all-versions-677eecf93784e61afe762afa). -- 📣 [Vision models](https://unsloth.ai/blog/vision) now supported! [Llama 3.2 Vision (11B)](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb), [Qwen 2.5 VL (7B)](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen2_VL_(7B)-Vision.ipynb) and [Pixtral (12B) 2409](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Pixtral_(12B)-Vision.ipynb) -- 📣 [Llama 3.3 (70B)](https://huggingface.co/collections/unsloth/llama-33-all-versions-67535d7d994794b9d7cf5e9f), Meta's latest model is supported. -- 📣 We worked with Apple to add [Cut Cross Entropy](https://arxiv.org/abs/2411.09009). Unsloth now supports 89K context for Meta's Llama 3.3 (70B) on a 80GB GPU - 13x longer than HF+FA2. For Llama 3.1 (8B), Unsloth enables 342K context, surpassing its native 128K support. + - 📣 We found and helped fix a [gradient accumulation bug](https://unsloth.ai/blog/gradient)! Please update Unsloth and transformers. 
+- 📣 Try out [Chat interface](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Unsloth_Studio.ipynb)! +- 📣 NEW! Qwen-2.5 including [Coder](https://unsloth.ai/blog/qwen-coder) models are now supported with bugfixes. 14b fits in a Colab GPU! [Qwen 2.5 conversational notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen2.5_Coder_(14B)-Conversational.ipynb) +- 📣 NEW! [Mistral Small 22b notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Mistral_Small_(22B)-Alpaca.ipynb) finetuning fits in under 16GB of VRAM! +- 📣 NEW! `pip install unsloth` now works! Head over to [pypi](https://pypi.org/project/unsloth/) to check it out! This allows non git pull installs. Use `pip install unsloth[colab-new]` for non dependency installs. +- 📣 NEW! Continued Pretraining [notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Mistral_v0.3_(7B)-CPT.ipynb) for other languages like Korean! +- 📣 [2x faster inference](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-Inference.ipynb) added for all our models - 📣 We cut memory usage by a [further 30%](https://unsloth.ai/blog/long-context) and now support [4x longer context windows](https://unsloth.ai/blog/long-context)!
## 🔗 Links and Resources -| Type | Links | -| ----------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------ | -|   **r/unsloth Reddit** | [Join Reddit community](https://reddit.com/r/unsloth) | -| 📚 **Documentation & Wiki** | [Read Our Docs](https://unsloth.ai/docs) | -|   **Twitter (aka X)** | [Follow us on X](https://twitter.com/unslothai) | -| 💾 **Installation** | [Pip & Docker Install](https://unsloth.ai/docs/get-started/install) | -| 🔮 **Our Models** | [Unsloth Catalog](https://unsloth.ai/docs/get-started/unsloth-model-catalog) | -| ✍️ **Blog** | [Read our Blogs](https://unsloth.ai/blog) | +| Type | Links | +| ------------------------------- | --------------------------------------- | +| 📚 **Documentation & Wiki** | [Read Our Docs](https://docs.unsloth.ai) | +|   **Twitter (aka X)** | [Follow us on X](https://twitter.com/unslothai)| +| 💾 **Installation** | [unsloth/README.md](https://github.com/unslothai/unsloth/tree/main#-installation-instructions)| +| 🥇 **Benchmarking** | [Performance Tables](https://github.com/unslothai/unsloth/tree/main#-performance-benchmarking) +| 🌐 **Released Models** | [Unsloth Releases](https://docs.unsloth.ai/get-started/all-our-models)| +| ✍️ **Blog** | [Read our Blogs](https://unsloth.ai/blog)| +|   **Reddit** | [Join our Reddit page](https://reddit.com/r/unsloth)| ## ⭐ Key Features +- All kernels written in [OpenAI's Triton](https://openai.com/index/triton/) language. **Manual backprop engine**. +- **0% loss in accuracy** - no approximation methods - all exact. +- No change of hardware. Supports NVIDIA GPUs since 2018+. Minimum CUDA Capability 7.0 (V100, T4, Titan V, RTX 20, 30, 40x, A100, H100, L40 etc) [Check your GPU!](https://developer.nvidia.com/cuda-gpus) GTX 1070, 1080 works, but is slow. +- Works on **Linux** and **Windows** via WSL. 
+- Supports 4bit and 16bit QLoRA / LoRA finetuning via [bitsandbytes](https://github.com/TimDettmers/bitsandbytes). +- Open source trains 5x faster - see [Unsloth Pro](https://unsloth.ai/) for up to **30x faster training**! +- If you trained a model with 🦥Unsloth, you can use this cool sticker!   -* Supports **full-finetuning**, pretraining, 4-bit, 16-bit and **FP8** training -* Supports **all models** including [TTS](https://unsloth.ai/docs/basics/text-to-speech-tts-fine-tuning), multimodal, [embedding](https://unsloth.ai/docs/new/embedding-finetuning) and more! Any model that works in transformers, works in Unsloth. -* The most efficient library for [Reinforcement Learning (RL)](https://unsloth.ai/docs/get-started/reinforcement-learning-rl-guide), using 80% less VRAM. Supports GRPO, GSPO, DrGRPO, DAPO etc. -* **0% loss in accuracy** - no approximation methods - all exact. -* Export and [deploy your model](https://unsloth.ai/docs/basics/inference-and-deployment) to [GGUF](https://unsloth.ai/docs/basics/inference-and-deployment/saving-to-gguf) llama.cpp, [vLLM](https://unsloth.ai/docs/basics/inference-and-deployment/vllm-guide), [SGLang](https://unsloth.ai/docs/basics/inference-and-deployment/sglang-guide) and Hugging Face. -* Supports NVIDIA (since 2018), [AMD](https://unsloth.ai/docs/get-started/install/amd) and [Intel](https://unsloth.ai/docs/get-started/install/intel) GPUs. Minimum CUDA Capability 7.0 (V100, T4, Titan V, RTX 20, 30, 40x, A100, H100, L40 etc) -* Works on **Linux**, WSL and **[Windows](https://unsloth.ai/docs/get-started/install/windows-installation)** -* All kernels written in OpenAI's Triton language. Manual backprop engine. -* If you trained a model with 🦥Unsloth, you can use this cool sticker!   -## 💾 Install Unsloth -You can also see our docs for more detailed installation and updating instructions [here](https://unsloth.ai/docs/get-started/install). 
+## 🥇 Performance Benchmarking +- For our most detailed benchmarks, read our [Llama 3.3 Blog](https://unsloth.ai/blog/llama3-3). +- Benchmarking of Unsloth was also conducted by [🤗Hugging Face](https://huggingface.co/blog/unsloth-trl). -Unsloth supports Python 3.13 or lower. +We tested using the Alpaca Dataset, a batch size of 2, gradient accumulation steps of 4, rank = 32, and applied QLoRA on all linear layers (q, k, v, o, gate, up, down): + +| Model | VRAM | 🦥 Unsloth speed | 🦥 VRAM reduction | 🦥 Longer context | 😊 Hugging Face + FA2 | +|----------------|-------|-----------------|----------------|----------------|--------------------| +| Llama 3.3 (70B)| 80GB | 2x | >75% | 13x longer | 1x | +| Llama 3.1 (8B) | 80GB | 2x | >70% | 12x longer | 1x | -### Pip Installation -**Install with pip (recommended) for Linux devices:** -``` -pip install unsloth -``` -**To update Unsloth:** -``` -pip install --upgrade --force-reinstall --no-cache-dir unsloth unsloth_zoo -``` -See [here](#advanced-pip-installation) for advanced pip install instructions. -### Windows Installation -For this method, we will be utilizing Anaconda. You can view the [full guide with screenshots here](https://unsloth.ai/docs/get-started/install/windows-installation). -1. **Install Miniconda (or Anaconda):** Miniconda is recommended. Install [Miniconda](https://www.anaconda.com/docs/getting-started/miniconda/install) or [Anaconda](https://www.anaconda.com/download), then open Anaconda PowerShell Prompt to continue. +
-2. **Create a Conda Environment:** Create and activate a fresh Python 3.12 environment for Unsloth. +![](https://i.ibb.co/sJ7RhGG/image-41.png) - ```bash - conda create --name unsloth_env python==3.12 -y - conda activate unsloth_env - ``` +## 💾 Installation Instructions -3. **Check Your GPU and CUDA Version:** Run `nvidia-smi` to confirm that your NVIDIA GPU is detected and note the CUDA version shown in the output. If `nvidia-smi` does not work, reinstall the latest [NVIDIA drivers](https://www.nvidia.com/en-us/drivers/). +For stable releases, use `pip install unsloth`. We recommend `pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"` for most installations though. -4. **Install PyTorch:** Install the Windows pip build of PyTorch that matches your CUDA version. Use [Install PyTorch](https://pytorch.org/get-started/locally/) to select the correct command for your system, then verify that PyTorch can see your GPU. - - ```python - import torch - print(torch.cuda.is_available()) - A = torch.ones((10, 10), device="cuda") - B = torch.ones((10, 10), device="cuda") - A @ B - ``` - -5. **Install Unsloth:** Only install Unsloth after PyTorch is working correctly. - - ```bash - pip install unsloth - ``` - -#### Advanced/Troubleshooting -For **advanced installation instructions** or if you see weird errors during installations: - -First try using an isolated environment via then `pip install unsloth` +### Conda Installation +`⚠️Only use Conda if you have it. If not, use Pip`. Select either `pytorch-cuda=11.8,12.1` for CUDA 11.8 or CUDA 12.1. We support `python=3.10,3.11,3.12`. ```bash -python -m venv unsloth -source unsloth/bin/activate -pip install unsloth -``` - -1. Install `torch` and `triton`. Go to https://pytorch.org to install it. For example `pip install torch torchvision torchaudio triton` -2. Confirm if CUDA is installed correctly. Try `nvcc`. If that fails, you need to install `cudatoolkit` or CUDA drivers. -3. 
Install `xformers` manually via: - ```bash - pip install ninja - pip install -v --no-build-isolation -U git+https://github.com/facebookresearch/xformers.git@main#egg=xformers - ``` - Check if `xformers` succeeded with `python -m xformers.info` Go to https://github.com/facebookresearch/xformers. Another option is to install `flash-attn` for Ampere GPUs and ignore `xformers` - -4. For GRPO runs, you can try installing `vllm` and seeing if `pip install vllm` succeeds. -5. Double check that your versions of Python, CUDA, CUDNN, `torch`, `triton`, and `xformers` are compatible with one another. The [PyTorch Compatibility Matrix](https://github.com/pytorch/pytorch/blob/main/RELEASE.md#release-compatibility-matrix) may be useful. -6. Finally, install `bitsandbytes` and check it with `python -m bitsandbytes` - -### Conda Installation (Optional) -`⚠️Only use Conda if you have it. If not, use Pip`. We support `python=3.10,3.11,3.12,3.13`. -```bash -conda create --name unsloth_env python==3.12 -y +conda create --name unsloth_env \ + python=3.11 \ + pytorch-cuda=12.1 \ + pytorch cudatoolkit xformers -c pytorch -c nvidia -c xformers \ + -y conda activate unsloth_env + +pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git" +pip install --no-deps trl peft accelerate bitsandbytes ``` -Use `nvidia-smi` to get the correct CUDA version like 13.0 which becomes `cu130` -```bash -pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130 -pip3 install unsloth -``` +
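After installing via either path, a quick check that PyTorch actually sees your GPU can save debugging time later; this mirrors the verification step used elsewhere in this README, and the printed values depend entirely on your build and driver:

```python
import torch

# Versions bundled with this PyTorch build
print(torch.__version__)   # e.g. "2.5.0+cu124"
print(torch.version.cuda)  # e.g. "12.4" (None on CPU-only builds)

# True only if a CUDA device is visible to this process
print(torch.cuda.is_available())
```

If `torch.cuda.is_available()` is `False`, fix your drivers or PyTorch install before touching Unsloth.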
If you're looking to install Conda in a Linux environment, read here, or run the below 🔽 @@ -199,10 +160,10 @@ pip3 install unsloth ```
-### Advanced Pip Installation -`⚠️Do **NOT** use this if you have Conda.` Pip is a bit more complex since there are dependency issues. The pip command is different for `torch 2.2,2.3,2.4,2.5,2.6,2.7,2.8,2.9,2.10` and CUDA versions. +### Pip Installation +`⚠️Do **NOT** use this if you have Conda.` Pip is a bit more complex since there are dependency issues. The pip command is different for `torch 2.2,2.3,2.4,2.5` and CUDA versions. -For other torch versions, we support `torch211`, `torch212`, `torch220`, `torch230`, `torch240`, `torch250`, `torch260`, `torch270`, `torch280`, `torch290`, `torch2100` and for CUDA versions, we support `cu118` and `cu121` and `cu124`. For Ampere devices (A100, H100, RTX3090) and above, use `cu118-ampere` or `cu121-ampere` or `cu124-ampere`. Note: torch 2.10 only supports CUDA 12.6, 12.8, and 13.0. +For other torch versions, we support `torch211`, `torch212`, `torch220`, `torch230`, `torch240` and for CUDA versions, we support `cu118` and `cu121` and `cu124`. For Ampere devices (A100, H100, RTX3090) and above, use `cu118-ampere` or `cu121-ampere` or `cu124-ampere`. 
For example, if you have `torch 2.4` and `CUDA 12.1`, use: ```bash @@ -210,16 +171,10 @@ pip install --upgrade pip pip install "unsloth[cu121-torch240] @ git+https://github.com/unslothai/unsloth.git" ``` -Another example, if you have `torch 2.9` and `CUDA 13.0`, use: +Another example, if you have `torch 2.5` and `CUDA 12.4`, use: ```bash pip install --upgrade pip -pip install "unsloth[cu130-torch290] @ git+https://github.com/unslothai/unsloth.git" -``` - -Another example, if you have `torch 2.10` and `CUDA 12.6`, use: -```bash -pip install --upgrade pip -pip install "unsloth[cu126-torch2100] @ git+https://github.com/unslothai/unsloth.git" +pip install "unsloth[cu124-torch250] @ git+https://github.com/unslothai/unsloth.git" ``` And other examples: @@ -246,81 +201,79 @@ Or, run the below manually in a Python REPL: try: import torch except: raise ImportError('Install torch via `pip install torch`') from packaging.version import Version as V -import re -v = V(re.match(r"[0-9\.]{3,}", torch.__version__).group(0)) +v = V(torch.__version__) cuda = str(torch.version.cuda) is_ampere = torch.cuda.get_device_capability()[0] >= 8 -USE_ABI = torch._C._GLIBCXX_USE_CXX11_ABI -if cuda not in ("11.8", "12.1", "12.4", "12.6", "12.8", "13.0"): raise RuntimeError(f"CUDA = {cuda} not supported!") +if cuda != "12.1" and cuda != "11.8" and cuda != "12.4": raise RuntimeError(f"CUDA = {cuda} not supported!") if v <= V('2.1.0'): raise RuntimeError(f"Torch = {v} too old!") elif v <= V('2.1.1'): x = 'cu{}{}-torch211' elif v <= V('2.1.2'): x = 'cu{}{}-torch212' elif v < V('2.3.0'): x = 'cu{}{}-torch220' elif v < V('2.4.0'): x = 'cu{}{}-torch230' elif v < V('2.5.0'): x = 'cu{}{}-torch240' -elif v < V('2.5.1'): x = 'cu{}{}-torch250' -elif v <= V('2.5.1'): x = 'cu{}{}-torch251' -elif v < V('2.7.0'): x = 'cu{}{}-torch260' -elif v < V('2.7.9'): x = 'cu{}{}-torch270' -elif v < V('2.8.0'): x = 'cu{}{}-torch271' -elif v < V('2.8.9'): x = 'cu{}{}-torch280' -elif v < V('2.9.1'): x = 'cu{}{}-torch290' 
-elif v < V('2.9.2'): x = 'cu{}{}-torch291' -elif v < V('2.10.1'): x = 'cu{}{}-torch2100' +elif v < V('2.6.0'): x = 'cu{}{}-torch250' else: raise RuntimeError(f"Torch = {v} too new!") -if v > V('2.6.9') and cuda not in ("11.8", "12.6", "12.8", "13.0"): raise RuntimeError(f"CUDA = {cuda} not supported!") -if v >= V('2.10.0') and cuda not in ("12.6", "12.8", "13.0"): raise RuntimeError(f"Torch 2.10 requires CUDA 12.6, 12.8, or 13.0! Got CUDA = {cuda}") -x = x.format(cuda.replace(".", ""), "-ampere" if False else "") # is_ampere is broken due to flash-attn -print(f'pip install --upgrade pip && pip install --no-deps git+https://github.com/unslothai/unsloth-zoo.git && pip install "unsloth[{x}] @ git+https://github.com/unslothai/unsloth.git" --no-build-isolation') -``` -### Docker Installation -You can use our pre-built Docker container with all dependencies to use Unsloth instantly with no setup required. -[Read our guide](https://unsloth.ai/docs/get-started/install/docker). - -This container requires installing [NVIDIA's Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html). - -```bash -docker run -d -e JUPYTER_PASSWORD="mypassword" \ - -p 8888:8888 -p 2222:22 \ - -v $(pwd)/work:/workspace/work \ - --gpus all \ - unsloth/unsloth +x = x.format(cuda.replace(".", ""), "-ampere" if is_ampere else "") +print(f'pip install --upgrade pip && pip install "unsloth[{x}] @ git+https://github.com/unslothai/unsloth.git"') ``` -Access Jupyter Lab at `http://localhost:8888` and start fine-tuning! 
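As a sanity check, the version-to-tag selection in the script above can be reduced to a small pure function. The sketch below is illustrative only: the function name `unsloth_extras_tag` is made up, and versions are compared as plain tuples rather than with `packaging` so it runs with no dependencies.

```python
# Illustrative helper mirroring the REPL snippet above: pick the Unsloth
# pip extras tag for a given torch + CUDA version pair.
def _v(version: str):
    # "2.4.0+cu121" -> (2, 4, 0); pad short versions with zeros
    parts = version.split("+")[0].split(".")
    return tuple(int(p) for p in (parts + ["0", "0"])[:3])

def unsloth_extras_tag(torch_version: str, cuda: str, is_ampere: bool = False) -> str:
    v = _v(torch_version)
    if cuda not in ("11.8", "12.1", "12.4"):
        raise RuntimeError(f"CUDA = {cuda} not supported!")
    if v <= (2, 1, 0):
        raise RuntimeError(f"Torch = {torch_version} too old!")
    elif v <= (2, 1, 1): x = "cu{}{}-torch211"
    elif v <= (2, 1, 2): x = "cu{}{}-torch212"
    elif v <  (2, 3, 0): x = "cu{}{}-torch220"
    elif v <  (2, 4, 0): x = "cu{}{}-torch230"
    elif v <  (2, 5, 0): x = "cu{}{}-torch240"
    elif v <  (2, 6, 0): x = "cu{}{}-torch250"
    else:
        raise RuntimeError(f"Torch = {torch_version} too new!")
    return x.format(cuda.replace(".", ""), "-ampere" if is_ampere else "")
```

For instance, `unsloth_extras_tag("2.4.0", "12.1")` gives `cu121-torch240`, matching the pip command shown earlier, and passing `is_ampere = True` yields the `-ampere` variants.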
+### Windows Installation -## 📜 Documentation -* Go to our official [Documentation](https://unsloth.ai/docs) for [running models](https://unsloth.ai/docs/basics/inference-and-deployment), [saving to GGUF](https://unsloth.ai/docs/basics/inference-and-deployment/saving-to-gguf), [checkpointing](https://unsloth.ai/docs/basics/finetuning-from-last-checkpoint), [evaluation](https://unsloth.ai/docs/get-started/fine-tuning-llms-guide#evaluation) and more! -* Read our Guides for: [Fine-tuning](https://unsloth.ai/docs/get-started/fine-tuning-llms-guide), [Reinforcement Learning](https://unsloth.ai/docs/get-started/reinforcement-learning-rl-guide), [Text-to-Speech (TTS)](https://unsloth.ai/docs/basics/text-to-speech-tts-fine-tuning), [Vision](https://unsloth.ai/docs/basics/vision-fine-tuning) and [any model](https://unsloth.ai/docs/models/tutorials). -* We support Huggingface's transformers, TRL, Trainer, Seq2SeqTrainer and Pytorch code. +To run Unsloth directly on Windows: +- Install Triton from this Windows fork and follow the instructions: https://github.com/woct0rdho/triton-windows +- In the SFTTrainer, set `dataset_num_proc=1` to avoid a crashing issue: +```python +trainer = SFTTrainer( + dataset_num_proc=1, + ... +) +``` -Unsloth example code to fine-tune gpt-oss-20b: +For **advanced installation instructions** or if you see weird errors during installation: + +1. Install `torch` and `triton`. Go to https://pytorch.org to install them, for example `pip install torch torchvision torchaudio triton` +2. Confirm that CUDA is installed correctly. Try `nvcc`. If that fails, you need to install `cudatoolkit` or the CUDA drivers. +3. Install `xformers` manually. You can try installing `vllm` and seeing if it succeeds. Check that `xformers` installed correctly with `python -m xformers.info`. See https://github.com/facebookresearch/xformers. Another option is to install `flash-attn` for Ampere GPUs. +4.
Finally, install `bitsandbytes` and check it with `python -m bitsandbytes` + +## 📜 [Documentation](https://docs.unsloth.ai) +- Go to our official [Documentation](https://docs.unsloth.ai) for saving to GGUF, checkpointing, evaluation and more! +- We support Huggingface's TRL, Trainer, Seq2SeqTrainer or even Pytorch code! +- We're in 🤗Hugging Face's official docs! Check out the [SFT docs](https://huggingface.co/docs/trl/main/en/sft_trainer#accelerate-fine-tuning-2x-using-unsloth) and [DPO docs](https://huggingface.co/docs/trl/main/en/dpo_trainer#accelerate-dpo-fine-tuning-using-unsloth)! +- If you want to download models from the ModelScope community, please use an environment variable: `UNSLOTH_USE_MODELSCOPE=1`, and install the modelscope library with `pip install modelscope -U`. + +> unsloth_cli.py also supports `UNSLOTH_USE_MODELSCOPE=1` to download models and datasets. Please remember to use the model and dataset IDs from the ModelScope community. ```python -from unsloth import FastLanguageModel, FastModel, FastVisionModel +from unsloth import FastLanguageModel +from unsloth import is_bfloat16_supported import torch -from trl import SFTTrainer, SFTConfig +from trl import SFTTrainer +from transformers import TrainingArguments from datasets import load_dataset -max_seq_length = 2048 # Supports RoPE Scaling internally, so choose any! +max_seq_length = 2048 # Supports RoPE Scaling internally, so choose any! # Get LAION dataset url = "https://huggingface.co/datasets/laion/OIG/resolve/main/unified_chip2.jsonl" dataset = load_dataset("json", data_files = {"train" : url}, split = "train") # 4bit pre quantized models we support for 4x faster downloading + no OOMs. fourbit_models = [ - "unsloth/gpt-oss-20b-unsloth-bnb-4bit", #or choose any model - + "unsloth/mistral-7b-v0.3-bnb-4bit", # New Mistral v3 2x faster! + "unsloth/mistral-7b-instruct-v0.3-bnb-4bit", + "unsloth/llama-3-8b-bnb-4bit", # Llama-3 15 trillion tokens model 2x faster!
+ "unsloth/llama-3-8b-Instruct-bnb-4bit", + "unsloth/llama-3-70b-bnb-4bit", + "unsloth/Phi-3-mini-4k-instruct", # Phi-3 2x faster! + "unsloth/Phi-3-medium-4k-instruct", + "unsloth/mistral-7b-bnb-4bit", + "unsloth/gemma-7b-bnb-4bit", # Gemma 2.2x faster! ] # More models at https://huggingface.co/unsloth model, tokenizer = FastLanguageModel.from_pretrained( - model_name = "unsloth/gpt-oss-20b", - max_seq_length = max_seq_length, # Choose any for long context! - load_in_4bit = True, # 4-bit quantization. False = 16-bit LoRA. - load_in_8bit = False, # 8-bit quantization - load_in_16bit = False, # 16-bit LoRA - full_finetuning = False, # Use for full fine-tuning. - trust_remote_code = False, # Enable to support new models - # token = "hf_...", # use one if using gated models + model_name = "unsloth/llama-3-8b-bnb-4bit", + max_seq_length = max_seq_length, + dtype = None, + load_in_4bit = True, ) # Do model patching and add fast LoRA weights @@ -343,13 +296,16 @@ model = FastLanguageModel.get_peft_model( trainer = SFTTrainer( model = model, train_dataset = dataset, + dataset_text_field = "text", + max_seq_length = max_seq_length, tokenizer = tokenizer, - args = SFTConfig( - max_seq_length = max_seq_length, + args = TrainingArguments( per_device_train_batch_size = 2, gradient_accumulation_steps = 4, warmup_steps = 10, max_steps = 60, + fp16 = not is_bfloat16_supported(), + bf16 = is_bfloat16_supported(), logging_steps = 1, output_dir = "outputs", optim = "adamw_8bit", @@ -358,42 +314,79 @@ trainer = SFTTrainer( ) trainer.train() -# Go to https://unsloth.ai/docs for advanced tips like -# (1) Saving to GGUF / merging to 16bit for vLLM or SGLang +# Go to https://github.com/unslothai/unsloth/wiki for advanced tips like +# (1) Saving to GGUF / merging to 16bit for vLLM # (2) Continued training from a saved LoRA adapter # (3) Adding an evaluation loop / OOMs # (4) Customized chat templates ``` - -## 💡 Reinforcement Learning 
-[RL](https://unsloth.ai/docs/get-started/reinforcement-learning-rl-guide) including [GRPO](https://unsloth.ai/docs/get-started/reinforcement-learning-rl-guide#training-with-grpo), [GSPO](https://unsloth.ai/docs/get-started/reinforcement-learning-rl-guide/gspo-reinforcement-learning), [**FP8** training](https://unsloth.ai/docs/get-started/reinforcement-learning-rl-guide/fp8-reinforcement-learning), DrGRPO, DAPO, PPO, Reward Modelling, Online DPO all work with Unsloth. + +## DPO Support +DPO (Direct Preference Optimization), PPO and Reward Modelling all seem to work, as per independent third-party testing from [Llama-Factory](https://github.com/hiyouga/LLaMA-Factory). We have a preliminary Google Colab notebook for reproducing Zephyr on a Tesla T4 here: [notebook](https://colab.research.google.com/drive/15vttTpzzVXv_tJwEk-hIcQ0S9FcEWvwP?usp=sharing). -Read our [Reinforcement Learning Guide](https://unsloth.ai/docs/get-started/reinforcement-learning-rl-guide) or our [advanced RL docs](https://unsloth.ai/docs/get-started/reinforcement-learning-rl-guide/advanced-rl-documentation) for batching, generation & training parameters. +We're in 🤗Hugging Face's official docs! We're on the [SFT docs](https://huggingface.co/docs/trl/main/en/sft_trainer#accelerate-fine-tuning-2x-using-unsloth) and the [DPO docs](https://huggingface.co/docs/trl/main/en/dpo_trainer#accelerate-dpo-fine-tuning-using-unsloth)!
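For the DPO example below, the `YOUR_DATASET_HERE` placeholder is expected to hold preference pairs. A minimal sketch of the row shape TRL's `DPOTrainer` consumes — the `prompt`/`chosen`/`rejected` column names follow TRL's convention, while the example texts themselves are invented:

```python
# Each row pairs a prompt with a preferred ("chosen") and a less preferred
# ("rejected") completion; DPO learns from the contrast between the two.
preference_rows = [
    {
        "prompt":   "What is the capital of France?",
        "chosen":   "The capital of France is Paris.",
        "rejected": "France is a country in Europe.",
    },
]

# With 🤗 datasets installed, this becomes a training set via e.g.:
# from datasets import Dataset
# train_dataset = Dataset.from_list(preference_rows)
```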
-List of RL notebooks: -- gpt-oss GRPO notebook: [Link](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/gpt-oss-(20B)-GRPO.ipynb) -- ***FP8*** Qwen3-8B GRPO notebook (L4): [Link](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen3_8B_FP8_GRPO.ipynb) -- Qwen3-VL GSPO notebook: [Link](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen3_VL_(8B)-Vision-GRPO.ipynb) -- Advanced Qwen3 GRPO notebook: [Link](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen3_(4B)-GRPO.ipynb) -- ORPO notebook: [Link](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-ORPO.ipynb) -- DPO Zephyr notebook: [Link](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Zephyr_(7B)-DPO.ipynb) -- KTO notebook: [Link](https://colab.research.google.com/drive/1MRgGtLWuZX4ypSfGguFgC-IblTvO2ivM?usp=sharing) -- SimPO notebook: [Link](https://colab.research.google.com/drive/1Hs5oQDovOay4mFA6Y9lQhVJ8TnbFLFh2?usp=sharing) +```python +import os +os.environ["CUDA_VISIBLE_DEVICES"] = "0" # Optional set GPU device ID -## 🥇 Performance Benchmarking -- For our most detailed benchmarks, read our [Llama 3.3 Blog](https://unsloth.ai/blog/llama3-3). -- Benchmarking of Unsloth was also conducted by [🤗Hugging Face](https://huggingface.co/blog/unsloth-trl). 
+from unsloth import FastLanguageModel, PatchDPOTrainer +from unsloth import is_bfloat16_supported +PatchDPOTrainer() +import torch +from transformers import TrainingArguments +from trl import DPOTrainer -We tested using the Alpaca Dataset, a batch size of 2, gradient accumulation steps of 4, rank = 32, and applied QLoRA on all linear layers (q, k, v, o, gate, up, down): - -| Model | VRAM | 🦥 Unsloth speed | 🦥 VRAM reduction | 🦥 Longer context | 😊 Hugging Face + FA2 | -|----------------|-------|-----------------|----------------|----------------|--------------------| -| Llama 3.3 (70B)| 80GB | 2x | >75% | 13x longer | 1x | -| Llama 3.1 (8B) | 80GB | 2x | >70% | 12x longer | 1x | +model, tokenizer = FastLanguageModel.from_pretrained( + model_name = "unsloth/zephyr-sft-bnb-4bit", + max_seq_length = max_seq_length, + dtype = None, + load_in_4bit = True, +) +# Do model patching and add fast LoRA weights +model = FastLanguageModel.get_peft_model( + model, + r = 64, + target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", + "gate_proj", "up_proj", "down_proj",], + lora_alpha = 64, + lora_dropout = 0, # Supports any, but = 0 is optimized + bias = "none", # Supports any, but = "none" is optimized + # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes! 
+ use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context + random_state = 3407, + max_seq_length = max_seq_length, +) + +dpo_trainer = DPOTrainer( + model = model, + ref_model = None, + args = TrainingArguments( + per_device_train_batch_size = 4, + gradient_accumulation_steps = 8, + warmup_ratio = 0.1, + num_train_epochs = 3, + fp16 = not is_bfloat16_supported(), + bf16 = is_bfloat16_supported(), + logging_steps = 1, + optim = "adamw_8bit", + seed = 42, + output_dir = "outputs", + ), + beta = 0.1, + train_dataset = YOUR_DATASET_HERE, + # eval_dataset = YOUR_DATASET_HERE, + tokenizer = tokenizer, + max_length = 1024, + max_prompt_length = 512, +) +dpo_trainer.train() +``` + +## 🥇 Detailed Benchmarking Tables ### Context length benchmarks - #### Llama 3.1 (8B) max. context length We tested Llama 3.1 (8B) Instruct and did 4bit QLoRA on all linear layers (Q, K, V, O, gate, up and down) with rank = 32 with a batch size of 1. We padded all sequences to a certain maximum sequence length to mimic long context finetuning workloads. | GPU VRAM | 🦥Unsloth context length | Hugging Face + FA2 | @@ -426,13 +419,14 @@ You can cite the Unsloth repo as follows: @software{unsloth, author = {Daniel Han, Michael Han and Unsloth team}, title = {Unsloth}, - url = {https://github.com/unslothai/unsloth}, + url = {https://github.com/unslothai/unsloth}, year = {2023} } ``` ### Thank You to -- The [llama.cpp library](https://github.com/ggml-org/llama.cpp) that lets users save models with Unsloth -- The Hugging Face team and their libraries: [transformers](https://github.com/huggingface/transformers) and [TRL](https://github.com/huggingface/trl) -- The Pytorch and [Torch AO](https://github.com/unslothai/unsloth/pull/3391) team for their contributions -- And of course for every single person who has contributed or has used Unsloth!
+- [Erik](https://github.com/erikwijmans) for his help adding [Apple's ML Cross Entropy](https://github.com/apple/ml-cross-entropy) in Unsloth +- [HuyNguyen-hust](https://github.com/HuyNguyen-hust) for making [RoPE Embeddings 28% faster](https://github.com/unslothai/unsloth/pull/238) +- [RandomInternetPreson](https://github.com/RandomInternetPreson) for confirming WSL support +- [152334H](https://github.com/152334H) for experimental DPO support +- [atgctg](https://github.com/atgctg) for syntax highlighting diff --git a/unsloth/save.py b/unsloth/save.py index 6e38d1e95..1eaf3ddc2 100644 --- a/unsloth/save.py +++ b/unsloth/save.py @@ -13,32 +13,10 @@ # limitations under the License. from unsloth_zoo.utils import Version -from importlib.metadata import version as importlib_version -from unsloth_zoo.hf_utils import dtype_from_config, HAS_TORCH_DTYPE -from unsloth_zoo.llama_cpp import ( - convert_to_gguf, - quantize_gguf, - use_local_gguf, - install_llama_cpp, - check_llama_cpp, - _download_convert_hf_to_gguf, -) - -# H4: Defensive imports -- these were added in unsloth-zoo PR #526 -# and may not exist on older versions -try: - from unsloth_zoo.llama_cpp import LLAMA_CPP_DEFAULT_DIR, IS_WINDOWS -except ImportError: - import sys - - IS_WINDOWS = sys.platform == "win32" - LLAMA_CPP_DEFAULT_DIR = "llama.cpp" from bitsandbytes.nn import Linear4bit as Bnb_Linear4bit from peft.tuners.lora import Linear4bit as Peft_Linear4bit from peft.tuners.lora import Linear as Peft_Linear from typing import Optional, Callable, Union, List -import sys -import requests import torch import os import shutil @@ -51,22 +29,14 @@ import psutil import re from transformers.models.llama.modeling_llama import logger from .tokenizer_utils import fix_sentencepiece_gguf -from .models.loader_utils import get_model_name -from .models._utils import _convert_torchao_model -from .ollama_template_mappers import OLLAMA_TEMPLATES, MODEL_TO_OLLAMA_TEMPLATE_MAPPER -from transformers import ProcessorMixin from 
huggingface_hub import HfApi - try: - from huggingface_hub import get_token + from huggingface_hub.utils import get_token except: - try: - from huggingface_hub.utils import get_token - except: - # For older versions of huggingface_hub - from huggingface_hub.utils._token import get_token + # Old HF Hub versions <= 0.0.25 + from huggingface_hub.utils._token import get_token +pass from pathlib import Path -from peft import PeftModelForCausalLM, PeftModel __all__ = [ "print_quantization_methods", @@ -74,90 +44,70 @@ __all__ = [ "save_to_gguf", "patch_saving_functions", "create_huggingface_repo", + "export_model_to_local", ] # llama.cpp specific targets - all takes 90s. Below takes 60s -LLAMA_CPP_TARGETS = [ - "llama-quantize", - "llama-cli", - "llama-server", -] +LLAMA_CPP_TARGETS = ["llama-quantize", "llama-export-lora", "llama-cli",] # Check environments keynames = "\n" + "\n".join(os.environ.keys()) -IS_COLAB_ENVIRONMENT = "\nCOLAB_" in keynames +IS_COLAB_ENVIRONMENT = "\nCOLAB_" in keynames IS_KAGGLE_ENVIRONMENT = "\nKAGGLE_" in keynames KAGGLE_TMP = "/tmp" del keynames # Weights LLAMA_WEIGHTS = ( - "self_attn.q_proj", - "self_attn.k_proj", - "self_attn.v_proj", - "self_attn.o_proj", - "mlp.gate_proj", - "mlp.up_proj", - "mlp.down_proj", + "self_attn.q_proj", "self_attn.k_proj", "self_attn.v_proj", "self_attn.o_proj", + "mlp.gate_proj", "mlp.up_proj", "mlp.down_proj", ) LLAMA_LAYERNORMS = ( - "input_layernorm", - "post_attention_layernorm", - "pre_feedforward_layernorm", - "post_feedforward_layernorm", - "self_attn.q_norm", - "self_attn.k_norm", + "input_layernorm", "post_attention_layernorm", + "pre_feedforward_layernorm", "post_feedforward_layernorm", ) # https://github.com/ggerganov/llama.cpp/blob/master/examples/quantize/quantize.cpp#L19 # From https://mlabonne.github.io/blog/posts/Quantize_Llama_2_models_using_ggml.html -ALLOWED_QUANTS = { - "not_quantized": "Recommended. Fast conversion. Slow inference, big files.", - "fast_quantized": "Recommended. 
Fast conversion. OK inference, OK file size.", - "quantized": "Recommended. Slow conversion. Fast inference, small files.", - "f32": "Not recommended. Retains 100% accuracy, but super slow and memory hungry.", - "bf16": "Bfloat16 - Fastest conversion + retains 100% accuracy. Slow and memory hungry.", - "f16": "Float16 - Fastest conversion + retains 100% accuracy. Slow and memory hungry.", - "q8_0": "Fast conversion. High resource use, but generally acceptable.", - "q4_k_m": "Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K", - "q5_k_m": "Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K", - "q2_k": "Uses Q4_K for the attention.vw and feed_forward.w2 tensors, Q2_K for the other tensors.", - "q3_k_l": "Uses Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K", - "q3_k_m": "Uses Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K", - "q3_k_s": "Uses Q3_K for all tensors", - "q4_0": "Original quant method, 4-bit.", - "q4_1": "Higher accuracy than q4_0 but not as high as q5_0. However has quicker inference than q5 models.", - "q4_k_s": "Uses Q4_K for all tensors", - "q4_k": "alias for q4_k_m", - "q5_k": "alias for q5_k_m", - "q5_0": "Higher accuracy, higher resource usage and slower inference.", - "q5_1": "Even higher accuracy, resource usage and slower inference.", - "q5_k_s": "Uses Q5_K for all tensors", - "q6_k": "Uses Q8_K for all tensors", +ALLOWED_QUANTS = \ +{ + "not_quantized" : "Recommended. Fast conversion. Slow inference, big files.", + "fast_quantized" : "Recommended. Fast conversion. OK inference, OK file size.", + "quantized" : "Recommended. Slow conversion. Fast inference, small files.", + "f32" : "Not recommended. Retains 100% accuracy, but super slow and memory hungry.", + "bf16" : "Bfloat16 - Fastest conversion + retains 100% accuracy. 
Slow and memory hungry.", + "f16" : "Float16 - Fastest conversion + retains 100% accuracy. Slow and memory hungry.", + "q8_0" : "Fast conversion. High resource use, but generally acceptable.", + "q4_k_m" : "Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K", + "q5_k_m" : "Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K", + "q2_k" : "Uses Q4_K for the attention.vw and feed_forward.w2 tensors, Q2_K for the other tensors.", + "q3_k_l" : "Uses Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K", + "q3_k_m" : "Uses Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K", + "q3_k_s" : "Uses Q3_K for all tensors", + "q4_0" : "Original quant method, 4-bit.", + "q4_1" : "Higher accuracy than q4_0 but not as high as q5_0. However has quicker inference than q5 models.", + "q4_k_s" : "Uses Q4_K for all tensors", + "q4_k" : "alias for q4_k_m", + "q5_k" : "alias for q5_k_m", + "q5_0" : "Higher accuracy, higher resource usage and slower inference.", + "q5_1" : "Even higher accuracy, resource usage and slower inference.", + "q5_k_s" : "Uses Q5_K for all tensors", + "q6_k" : "Uses Q8_K for all tensors", # "iq2_xxs" : "2.06 bpw quantization", # Not supported sadly # "iq2_xs" : "2.31 bpw quantization", # "iq3_xxs" : "3.06 bpw quantization", - "q3_k_xs": "3-bit extra small quantization", + "q3_k_xs" : "3-bit extra small quantization", } - -def has_curl(): - return shutil.which("curl") is not None - - -CURL_FLAG = "-DLLAMA_CURL=ON" if has_curl() else "-DLLAMA_CURL=OFF" - - def print_quantization_methods(): for key, value in ALLOWED_QUANTS.items(): print(f'"{key}" ==> {value}') + pass +pass -def check_if_sentencepiece_model( - model, temporary_location = "_unsloth_sentencepiece_temp" -): - if not hasattr(model, "_saved_temp_tokenizer"): - return False +def check_if_sentencepiece_model(model, temporary_location = "_unsloth_sentencepiece_temp"): + if 
not hasattr(model, "_saved_temp_tokenizer"): return False temp_tokenizer = model._saved_temp_tokenizer sentencepiece_model = False @@ -166,17 +116,19 @@ def check_if_sentencepiece_model( if not os.path.exists(file_location): created_folder = True os.makedirs(file_location) + pass temp_tokenizer.save_pretrained(file_location) if os.path.isfile(f"{file_location}/tokenizer.model"): sentencepiece_model = True + pass if created_folder: shutil.rmtree(file_location, ignore_errors = True) return sentencepiece_model +pass def _free_cached_model(model): from huggingface_hub import scan_cache_dir - cached_repos = list(scan_cache_dir().repos) # Go through every cached repo, and delete the one that matches the model we want to save. @@ -184,27 +136,27 @@ def _free_cached_model(model): for cached_repo in cached_repos: if cached_repo.repo_id == model.config._name_or_path: remove_cache_commit = list(cached_repo.revisions)[0].commit_hash - delete_strategy = scan_cache_dir().delete_revisions( - remove_cache_commit, - ) + delete_strategy = scan_cache_dir().delete_revisions(remove_cache_commit,) logger.warning_once( - "Unsloth: Will remove a cached repo with size " - + delete_strategy.expected_freed_size_str, + "Unsloth: Will remove a cached repo with size " + \ + delete_strategy.expected_freed_size_str, ) delete_strategy.execute() + pass + pass +pass def _merge_lora(layer, name): + bias = getattr(layer, "bias", None) if isinstance(layer, (Bnb_Linear4bit, Peft_Linear4bit, Peft_Linear)): # Is LoRA so we need to merge! 
W, quant_state, A, B, s, bias = get_lora_parameters_bias(layer) if quant_state is not None: - dtype = ( - quant_state.dtype if type(quant_state) is not list else quant_state[2] - ) + dtype = quant_state.dtype if type(quant_state) is not list else quant_state[2] W = fast_dequantize(W, quant_state) else: dtype = W.dtype @@ -219,13 +171,13 @@ def _merge_lora(layer, name): # if not torch.isfinite(W).all(): maximum_element = torch.max(W.min().abs(), W.max()) if not torch.isfinite(maximum_element).item(): - raise ValueError( - f"Unsloth: Merge failed.\n{name} has some elements = infinity." - ) + raise ValueError(f"Unsloth: Merge failed.\n{name} has some elements = infinity.") + pass W = W.t().to(dtype) else: W = layer.weight return W, bias +pass def fast_save_pickle(shard, name): @@ -239,41 +191,41 @@ def fast_save_pickle(shard, name): # pickle_protocol = pickle.HIGHEST_PROTOCOL, ) return +pass @torch.inference_mode def unsloth_save_model( model, tokenizer, - save_directory: Union[str, os.PathLike], - save_method: str = "lora", # ["lora", "merged_16bit", "merged_4bit"] - push_to_hub: bool = False, - token: Optional[Union[str, bool]] = None, - is_main_process: bool = True, - state_dict: Optional[dict] = None, - save_function: Callable = torch.save, - max_shard_size: Union[int, str] = "5GB", - safe_serialization: bool = True, - variant: Optional[str] = None, - save_peft_format: bool = True, - # Push to hub - use_temp_dir: Optional[bool] = None, - commit_message: Optional[str] = "Trained with Unsloth", - private: Optional[bool] = None, - create_pr: bool = False, - revision: str = None, - commit_description: str = "Upload model trained with Unsloth 2x faster", - tags: List[str] = None, - # Our functions - temporary_location: str = "_unsloth_temporary_saved_buffers", - maximum_memory_usage: float = 0.9, - datasets: Optional[List[str]] = None, -): - if token is None: - token = get_token() + save_directory : Union[str, os.PathLike], + save_method : str = "lora", # ["lora", 
"merged_16bit", "merged_4bit"] + push_to_hub : bool = False, + token : Optional[Union[str, bool]] = None, + is_main_process : bool = True, + state_dict : Optional[dict] = None, + save_function : Callable = torch.save, + max_shard_size : Union[int, str] = "5GB", + safe_serialization : bool = True, + variant : Optional[str] = None, + save_peft_format : bool = True, - if commit_message is None: - commit_message = "" + # Push to hub + use_temp_dir : Optional[bool] = None, + commit_message : Optional[str] = "Trained with Unsloth", + private : Optional[bool] = None, + create_pr : bool = False, + revision : str = None, + commit_description : str = "Upload model trained with Unsloth 2x faster", + tags : List[str] = None, + + # Our functions + temporary_location : str = "_unsloth_temporary_saved_buffers", + maximum_memory_usage : float = 0.9, +): + if token is None: token = get_token() + + if commit_message is None: commit_message = "" if "Unsloth" not in commit_message: commit_message += " (Trained with Unsloth)" commit_message = commit_message.lstrip() @@ -282,214 +234,185 @@ def unsloth_save_model( commit_description = "Upload model trained with Unsloth 2x faster" elif "Unsloth 2x faster" not in commit_description: commit_description += " (Trained with Unsloth 2x faster)" + pass if save_method == "merged_4bit": raise RuntimeError( - "Unsloth: Merging into 4bit will cause your model to lose accuracy if you plan\n" - "to merge to GGUF or others later on. I suggest you to do this as a final step\n" - "if you're planning to do multiple saves.\n" + "Unsloth: Merging into 4bit will cause your model to lose accuracy if you plan\n"\ + "to merge to GGUF or others later on. I suggest you to do this as a final step\n"\ + "if you're planning to do multiple saves.\n"\ "If you are certain, change `save_method` to `merged_4bit_forced`." 
) elif save_method == "merged_4bit_forced": save_method = "merged_4bit" + pass save_pretrained_settings = dict(locals()) - for deletion in ( - "model", - "tokenizer", - "save_method", - "temporary_location", - "maximum_memory_usage", - "datasets", - ): + for deletion in ("model", "tokenizer", "save_method", "temporary_location", "maximum_memory_usage"): del save_pretrained_settings[deletion] + pass # First check for a token! if push_to_hub: from huggingface_hub import whoami - - try: + try: username = whoami(token = token)["name"] except: raise RuntimeError( - "Unsloth: Please supply a token!\n" + "Unsloth: Please supply a token!\n"\ "Go to https://huggingface.co/settings/tokens" ) + pass + pass - assert maximum_memory_usage > 0 and maximum_memory_usage <= 0.95 + assert(maximum_memory_usage > 0 and maximum_memory_usage <= 0.95) # Clean memory up first for _ in range(3): torch.cuda.empty_cache() gc.collect() + pass save_method = save_method.lower().replace(" ", "_") - if ( - save_method != "lora" - and save_method != "merged_16bit" - and save_method != "merged_4bit" - ): + if save_method != "lora" and save_method != "merged_16bit" and save_method != "merged_4bit": raise RuntimeError( - "Unsloth: You must select one of 3 options when saving models:\n" - '"lora" ==> This is the fastest and easiet. Just saves LoRA modules.\n' - '"merged_16bit" ==> This merges LoRA weights and saves to float16. Needed for llama.cpp / GGUF.\n' + "Unsloth: You must select one of 3 options when saving models:\n"\ + '"lora" ==> This is the fastest and easiest. Just saves LoRA modules.\n'\ + '"merged_16bit" ==> This merges LoRA weights and saves to float16. Needed for llama.cpp / GGUF.\n'\ '"merged_4bit" ==> This merges LoRA weights and saves to 4bit. Useful for DPO / inference.' ) + pass if save_method == "merged_4bit": + print("Unsloth: Merging 4bit and LoRA weights to 4bit...") print("This might take 5 minutes...") # Counteract no LoRA adapters!
if hasattr(model, "merge_and_unload"): model = model.merge_and_unload() + pass print("Done.") + pass if tags is not None: - assert isinstance(tags, (list, tuple)) - tags = list(tags) + [ - "unsloth", - ] + assert(isinstance(tags, (list, tuple))) + tags = list(tags) + ["unsloth",] else: - tags = [ - "unsloth", - ] + tags = ["unsloth",] + pass save_pretrained_settings["tags"] = tags if ((save_method == "lora") or (save_method == "merged_4bit")) and push_to_hub: if token is None: raise RuntimeError( - "Unsloth: Pushing to HF requires a token. Pass `token = 'hf_....'`\n" + "Unsloth: Pushing to HF requires a token. Pass `token = 'hf_....'`\n"\ "Go to https://huggingface.co/settings/tokens." ) + pass if save_method == "lora": print("Unsloth: Saving LoRA adapters. Please wait...") elif save_method == "merged_4bit": print("Unsloth: Saving 4bit Bitsandbytes model. Please wait...") + pass # Update model tag _ = upload_to_huggingface( - model, - save_directory, - token, - "finetuned", - "trl", - file_location = None, - old_username = None, - private = private, - datasets = datasets, + model, save_directory, token, + "finetuned", "trl", file_location = None, + old_username = None, private = private, ) - getattr(model, "original_push_to_hub", model.push_to_hub)( - repo_id = save_directory, - use_temp_dir = use_temp_dir, - commit_message = commit_message, - private = private, - token = token, - max_shard_size = max_shard_size, - create_pr = create_pr, + getattr(model, "original_push_to_hub", tokenizer.push_to_hub)\ + ( + repo_id = save_directory, + use_temp_dir = use_temp_dir, + commit_message = commit_message, + private = private, + token = token, + max_shard_size = max_shard_size, + create_pr = create_pr, safe_serialization = safe_serialization, - revision = revision, + revision = revision, commit_description = commit_description, - tags = tags, + tags = tags, ) if tokenizer is not None: # Set padding side to left for inference old_padding_side = tokenizer.padding_side 
tokenizer.padding_side = "left" - getattr(tokenizer, "original_push_to_hub", tokenizer.push_to_hub)( - repo_id = save_directory, - use_temp_dir = use_temp_dir, - commit_message = commit_message, - private = private, - token = token, - max_shard_size = max_shard_size, - create_pr = create_pr, + getattr(tokenizer, "original_push_to_hub", tokenizer.push_to_hub)\ + ( + repo_id = save_directory, + use_temp_dir = use_temp_dir, + commit_message = commit_message, + private = private, + token = token, + max_shard_size = max_shard_size, + create_pr = create_pr, safe_serialization = safe_serialization, - revision = revision, + revision = revision, commit_description = commit_description, - tags = tags, + tags = tags, ) # Revert back padding side tokenizer.padding_side = old_padding_side + pass if hasattr(model, "config"): - print( - f"Saved {save_method} model to https://huggingface.co/" + save_directory - ) + print(f"Saved {save_method} model to https://huggingface.co/" + save_directory) + pass return save_directory, None + pass # Tokenizer has different saving arguments - tokenizer_save_settings = { - "save_directory": save_pretrained_settings["save_directory"], - "legacy_format": None, - "filename_prefix": None, - "push_to_hub": save_pretrained_settings["push_to_hub"], - "private": save_pretrained_settings["private"], - "token": save_pretrained_settings["token"], + tokenizer_save_settings = \ + { + "save_directory" : save_pretrained_settings["save_directory"], + "legacy_format" : None, + "filename_prefix" : None, + "push_to_hub" : save_pretrained_settings["push_to_hub"], + "private" : save_pretrained_settings["private"], + "token" : save_pretrained_settings["token"], } # Check if PEFT Model or not - if yes, 3 levels. If not 2 levels. from peft import PeftModelForCausalLM - if isinstance(model, PeftModelForCausalLM): internal_model = model.model else: internal_model = model - + pass + # Cannot be converted properly! 
- if ( - (save_method == "merged_4bit") - or (save_method == "lora") - or (not hasattr(model, "model") or not hasattr(internal_model.model, "layers")) + if (save_method == "merged_4bit") or (save_method == "lora") or ( + not hasattr(model, "model") or \ + not hasattr(internal_model.model, "layers") ): # Do general saving # Edit save_pretrained_settings # [TODO] _create_repo has errors due to **kwargs getting accepted # commit_description does not seem to work? - what_to_delete = ( - ( - "use_temp_dir", - "commit_message", - "create_pr", - "revision", - "commit_description", - "tags", - ) - if save_pretrained_settings["push_to_hub"] is False - else ( - "use_temp_dir", - "create_pr", - "revision", - "tags", - "commit_description", - ) - ) + what_to_delete = ("use_temp_dir", "commit_message", "create_pr", "revision", "commit_description", "tags",) \ + if save_pretrained_settings["push_to_hub"] is False else \ + ("use_temp_dir", "create_pr", "revision", "tags", "commit_description",) for deletion in what_to_delete: del save_pretrained_settings[deletion] + pass if hasattr(model, "add_model_tags"): - model.add_model_tags( - [ - "unsloth", - ] - ) + model.add_model_tags(["unsloth",]) # Update model tag if push_to_hub: - _ = upload_to_huggingface( - model, - save_pretrained_settings["save_directory"], - token, - "finetuned", - "trl", - file_location = None, - old_username = None, - private = private, - datasets = datasets, + _ = upload_to_huggingface( + model, save_pretrained_settings["save_directory"], token, + "finetuned", "trl", file_location = None, + old_username = None, private = private, ) + pass if tokenizer is not None: print("Unsloth: Saving tokenizer...", end = "") @@ -508,48 +431,47 @@ def unsloth_save_model( print() print("Unsloth: Saving model...", end = "") - if save_method != "lora": - print(" This might take 10 minutes for Llama-7b...", end = "") + if save_method != "lora": print(" This might take 10 minutes for Llama-7b...", end = "") # [TODO] Is this 
correct? if save_method == "lora": save_pretrained_settings["selected_adapters"] = None + pass model.save_pretrained(**save_pretrained_settings) if push_to_hub and hasattr(model, "config"): - print( - "Saved to https://huggingface.co/" - + save_pretrained_settings["save_directory"] - ) + print("Saved to https://huggingface.co/" + save_pretrained_settings["save_directory"]) + pass print(" Done.") return save_directory, None + pass # If push_to_hub, we must remove the .../ part of a repo username = None if push_to_hub and "/" in save_directory: + # +1 solves absolute path issues new_save_directory = save_directory - username = new_save_directory[: new_save_directory.find("/")] - new_save_directory = new_save_directory[new_save_directory.find("/") + 1 :] + username = new_save_directory[:new_save_directory.find("/")] + new_save_directory = new_save_directory[new_save_directory.find("/")+1:] if IS_KAGGLE_ENVIRONMENT: - new_save_directory = os.path.join( - KAGGLE_TMP, new_save_directory[new_save_directory.find("/") + 1 :] - ) + new_save_directory = os.path.join(KAGGLE_TMP, new_save_directory[new_save_directory.find("/")+1:]) logger.warning_once( - "Unsloth: You are pushing to hub in Kaggle environment.\n" + "Unsloth: You are pushing to hub in Kaggle environment.\n"\ f"To save memory, we shall move {save_directory} to {new_save_directory}" ) else: logger.warning_once( - f"Unsloth: You are pushing to hub, but you passed your HF username = {username}.\n" + f"Unsloth: You are pushing to hub, but you passed your HF username = {username}.\n"\ f"We shall truncate {save_directory} to {new_save_directory}" ) save_pretrained_settings["save_directory"] = new_save_directory - tokenizer_save_settings["save_directory"] = new_save_directory + tokenizer_save_settings ["save_directory"] = new_save_directory save_directory = new_save_directory + pass print("Unsloth: Merging 4bit and LoRA weights to 16bit...") @@ -557,25 +479,18 @@ def unsloth_save_model( max_ram = 
psutil.virtual_memory().available
    sharded_ram_usage = 5 * 1024 * 1024 * 1024
    if type(max_shard_size) is str:
-        gb_found = re.match(
-            r"([0-9]{1,})[\s]{0,}GB", max_shard_size, flags = re.IGNORECASE
-        )
-        mb_found = re.match(
-            r"([0-9]{1,})[\s]{0,}MB", max_shard_size, flags = re.IGNORECASE
-        )
-        if gb_found:
-            sharded_ram_usage = int(gb_found.group(1)) * 1024 * 1024 * 1024
-        elif mb_found:
-            sharded_ram_usage = int(mb_found.group(1)) * 1024 * 1024
+        gb_found = re.match(r"([0-9]{1,})[\s]{0,}GB", max_shard_size, flags = re.IGNORECASE)
+        mb_found = re.match(r"([0-9]{1,})[\s]{0,}MB", max_shard_size, flags = re.IGNORECASE)
+        if gb_found: sharded_ram_usage = int(gb_found.group(1)) * 1024 * 1024 * 1024
+        elif mb_found: sharded_ram_usage = int(mb_found.group(1)) * 1024 * 1024
    elif type(max_shard_size) is int:
-        sharded_ram_usage = max_shard_size
+        sharded_ram_usage = max_shard_size
+    pass

    # Switch to our fast saving modules if it's a slow PC!
    n_cpus = psutil.cpu_count(logical = False)
-    if n_cpus is None:
-        n_cpus = psutil.cpu_count()
-    if n_cpus is None:
-        n_cpus = 1
+    if n_cpus is None: n_cpus = psutil.cpu_count()
+    if n_cpus is None: n_cpus = 1

    if safe_serialization is None:
        safe_serialization = True
@@ -583,27 +498,27 @@ def unsloth_save_model(

    elif safe_serialization and (n_cpus <= 2):
        logger.warning_once(
-            f"Unsloth: You have {n_cpus} CPUs. 
Using `safe_serialization` is 10x slower.\n"\ + f"We shall switch to Pytorch saving, which might take 3 minutes and not 30 minutes.\n"\ f"To force `safe_serialization`, set it to `None` instead.", ) safe_serialization = False save_function = fast_save_pickle save_pretrained_settings["safe_serialization"] = safe_serialization - save_pretrained_settings["save_function"] = save_function + save_pretrained_settings["save_function"] = save_function + pass # Only safe_serialization uses more RAM if safe_serialization: max_ram -= sharded_ram_usage else: - max_ram -= sharded_ram_usage * 0.25 # Uses much less + max_ram -= sharded_ram_usage*0.25 # Uses much less + pass max_ram = int(max(0, max_ram) * maximum_memory_usage) - print( - f"Unsloth: Will use up to " - f"{round(max_ram/1024/1024/1024, 2)} out of " - f"{round(psutil.virtual_memory().total/1024/1024/1024, 2)} RAM for saving." - ) + print(f"Unsloth: Will use up to "\ + f"{round(max_ram/1024/1024/1024, 2)} out of "\ + f"{round(psutil.virtual_memory().total/1024/1024/1024, 2)} RAM for saving.") # Move temporary_location to /tmp in Kaggle if IS_KAGGLE_ENVIRONMENT: @@ -612,41 +527,36 @@ def unsloth_save_model( # Max directory for disk saving if not os.path.exists(temporary_location): os.makedirs(temporary_location) + pass # Check if Kaggle or Colab, since only 20GB of Disk space allowed. if IS_KAGGLE_ENVIRONMENT or IS_COLAB_ENVIRONMENT: # We free up 4GB of space logger.warning_once( - "Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded\n" + "Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded\n"\ "model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab." 
        )
        _free_cached_model(internal_model)
+    pass

    # HF also uses a OrderedDict
    from collections import OrderedDict
-    state_dict = OrderedDict()
-    torch_dtype = dtype_from_config(internal_model.config)
+    state_dict = OrderedDict()
+    torch_dtype = internal_model.config.torch_dtype
    if type(torch_dtype) is str:
-        if torch_dtype == "float16":
-            torch_dtype = torch.float16
-        elif torch_dtype == "bfloat16":
-            torch_dtype = torch.bfloat16
+        if torch_dtype == "float16": torch_dtype = torch.float16
+        elif torch_dtype == "bfloat16": torch_dtype = torch.bfloat16
+    pass

    # Check modules to save float32 dtype
-    state_dict["model.embed_tokens.weight"] = (
-        internal_model.model.embed_tokens.weight.data.to(torch_dtype)
-    )
+    state_dict["model.embed_tokens.weight"] = internal_model.model.embed_tokens.weight.data.to(torch_dtype)

-    max_vram = int(
-        torch.cuda.get_device_properties(0).total_memory * maximum_memory_usage
-    )
+    max_vram = int(torch.cuda.get_device_properties(0).total_memory * maximum_memory_usage)

    print("Unsloth: Saving model... This might take 5 minutes ...")

    from tqdm import tqdm as ProgressBar
-
    for j, layer in enumerate(ProgressBar(internal_model.model.layers)):
        for item in LLAMA_WEIGHTS:
            proj = eval(f"layer.{item}")
@@ -656,6 +566,7 @@ def unsloth_save_model(
            # Bias term
            if bias is not None:
                state_dict[f"model.layers.{j}.{item}.bias"] = bias
+            pass

            if (torch.cuda.memory_allocated() + W.nbytes) < max_vram:
                # Save to GPU memory
@@ -670,104 +581,70 @@ def unsloth_save_model(
                # Save to Disk
                logger.warning_once("\nWe will save to Disk and not RAM now.")
                filename = os.path.join(temporary_location, f"{name}.pt")
-                torch.save(
-                    W,
-                    filename,
-                    pickle_module = pickle,
-                    pickle_protocol = pickle.HIGHEST_PROTOCOL,
-                )
+                torch.save(W, filename, pickle_module = pickle, pickle_protocol = pickle.HIGHEST_PROTOCOL,) # weights_only = True weirdly fails?
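The per-tensor GPU/RAM/disk decision in the saving loop above can be sketched as a standalone policy. This is a simplification with assumed names; the RAM-budget branch lives in an elided hunk, so its exact condition here is a guess:

```python
def choose_placement(gpu_allocated, tensor_nbytes, max_vram, ram_used, max_ram):
    # Try GPU memory first, then CPU RAM, and finally spill the tensor to disk.
    # Mirrors the `torch.cuda.memory_allocated() + W.nbytes < max_vram` check above;
    # the RAM threshold is an assumption, not taken verbatim from the source.
    if gpu_allocated + tensor_nbytes < max_vram:
        return "gpu"
    if ram_used + tensor_nbytes < max_ram:
        return "ram"
    return "disk"

print(choose_placement(0, 100, 1000, 0, 1000))
```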
- state_dict[name] = torch.load( - filename, map_location = "cpu", mmap = True, weights_only = False - ) + state_dict[name] = torch.load(filename, map_location = "cpu", mmap = True, weights_only = False) + pass for item in LLAMA_LAYERNORMS: try: # Skip for Gemma 2 - state_dict[f"model.layers.{j}.{item}.weight"] = eval( - f"layer.{item}.weight.data" - ) + state_dict[f"model.layers.{j}.{item}.weight"] = eval(f"layer.{item}.weight.data") except: continue + pass + pass state_dict["model.norm.weight"] = internal_model.model.norm.weight.data # Check for modules_to_save float32 dtype # Check for tied weights - if ( - internal_model.model.embed_tokens.weight.data_ptr() - != internal_model.lm_head.weight.data_ptr() - ): - state_dict["lm_head.weight"] = internal_model.lm_head.weight.data.to( - torch_dtype - ) + if internal_model.model.embed_tokens.weight.data_ptr() != internal_model.lm_head.weight.data_ptr(): + state_dict["lm_head.weight"] = internal_model.lm_head.weight.data.to(torch_dtype) + pass # All tensors MUST be type torch.Tensor and not torch.nn.parameter.Parameter for key, value in state_dict.items(): - if hasattr(value, "data"): - state_dict[key] = value = value.data + if hasattr(value, "data"): state_dict[key] = value = value.data if type(value) is not torch.Tensor: logger.warning_once(f"Unsloth: {key} is not a Tensor but a {type(value)}.") + pass + pass # Edit save_pretrained_settings # [TODO] _create_repo has errors due to **kwargs getting accepted save_pretrained_settings["state_dict"] = state_dict - + # commit_description does not seem to work? 
- what_to_delete = ( - ( - "use_temp_dir", - "commit_message", - "create_pr", - "revision", - "commit_description", - "tags", - ) - if not push_to_hub - else ( - "use_temp_dir", - "create_pr", - "revision", - "tags", - "commit_description", - ) - ) + what_to_delete = ("use_temp_dir", "commit_message", "create_pr", "revision", "commit_description", "tags",) \ + if not push_to_hub else \ + ("use_temp_dir", "create_pr", "revision", "tags", "commit_description",) for deletion in what_to_delete: del save_pretrained_settings[deletion] + pass if hasattr(model, "add_model_tags"): - model.add_model_tags( - [ - "unsloth", - ] - ) + model.add_model_tags(["unsloth",]) # Update model tag if push_to_hub: _ = upload_to_huggingface( - model, - save_pretrained_settings["save_directory"], - token, - "finetuned", - "trl", - file_location = None, - old_username = username, - private = private, - datasets = datasets, + model, save_pretrained_settings["save_directory"], token, + "finetuned", "trl", file_location = None, + old_username = username, private = private, ) + pass # First check if we're pushing to an organization! save_directory = save_pretrained_settings["save_directory"] if save_pretrained_settings["push_to_hub"]: - new_save_directory, new_username = _determine_username( - save_directory, username, token - ) + new_save_directory, new_username = _determine_username(save_directory, username, token) if token is not None: from huggingface_hub import whoami - actual_username = whoami(token = token)["name"] else: actual_username = username + pass # Check if pushing to an organization if save_pretrained_settings["push_to_hub"] and (username != actual_username): @@ -775,6 +652,7 @@ def unsloth_save_model( # We upload everything at the end! 
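The `what_to_delete` pruning above drops hub-only kwargs that `save_pretrained` would reject. A minimal standalone sketch of the same pruning (hypothetical function name):

```python
def prune_push_settings(settings, push_to_hub):
    # Keys accepted only by push_to_hub must be stripped before a local
    # save_pretrained call; when pushing, commit_message is kept instead.
    local_only = ("use_temp_dir", "commit_message", "create_pr",
                  "revision", "commit_description", "tags")
    hub = ("use_temp_dir", "create_pr", "revision", "tags", "commit_description")
    for key in (hub if push_to_hub else local_only):
        settings.pop(key, None)
    return settings
```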
tokenizer_save_settings["push_to_hub"] = False tokenizer_save_settings["save_directory"] = new_save_directory + pass # Save tokenizer if tokenizer is not None: @@ -788,10 +666,11 @@ def unsloth_save_model( # Revert back padding side tokenizer.padding_side = old_padding_side - + print(" Done.") else: print() + pass # Since merged, edit quantization_config old_config = model.config @@ -830,11 +709,12 @@ def unsloth_save_model( path_in_repo = ".", repo_id = new_save_directory, repo_type = "model", - commit_message = "(Trained with Unsloth)", + commit_message = "(Trained with Unsloth)", ignore_patterns = "*.md", ) else: internal_model.save_pretrained(**save_pretrained_settings) + pass # Revert config back original_model = model @@ -845,9 +725,8 @@ def unsloth_save_model( print("Done.") if push_to_hub and hasattr(model, "config"): - print( - f"Saved merged model to https://huggingface.co/{username}/{save_directory.lstrip('/').split('/')[-1]}" - ) + print(f"Saved merged model to https://huggingface.co/{username}/{save_directory.lstrip('/').split('/')[-1]}") + pass save_pretrained_settings["state_dict"] = None @@ -856,6 +735,8 @@ def unsloth_save_model( if j % 10 == 0: torch.cuda.empty_cache() gc.collect() + pass + pass state_dict = None del state_dict torch.cuda.empty_cache() @@ -863,26 +744,20 @@ def unsloth_save_model( # Remove temporary location import shutil - shutil.rmtree(temporary_location, ignore_errors = True) for _ in range(3): torch.cuda.empty_cache() gc.collect() return save_directory, username +pass def install_llama_cpp_clone_non_blocking(): - full_command = [ - "git", - "clone", - "--recursive", - "https://github.com/ggerganov/llama.cpp", - ] - run_installer = subprocess.Popen( - full_command, stdout = subprocess.DEVNULL, stderr = subprocess.STDOUT - ) + full_command = ["git", "clone", "--recursive", "https://github.com/ggerganov/llama.cpp"] + run_installer = subprocess.Popen(full_command, stdout = subprocess.DEVNULL, stderr = subprocess.STDOUT) return 
run_installer +pass def install_llama_cpp_make_non_blocking(): @@ -894,90 +769,71 @@ def install_llama_cpp_make_non_blocking(): IS_CMAKE = False if check == 0: # Uses old MAKE - n_jobs = max(int((psutil.cpu_count() or 1) * 1.5), 1) - full_command = ["make", "all", "-j" + str(n_jobs), "-C", "llama.cpp"] + n_jobs = max(int(psutil.cpu_count()*1.5), 1) + full_command = ["make", "all", "-j"+str(n_jobs), "-C", "llama.cpp"] IS_CMAKE = False else: # Uses new CMAKE - n_jobs = max(int(psutil.cpu_count() or 1), 1) # Use less CPUs since 1.5x faster - check = os.system( - f"cmake llama.cpp -B llama.cpp/build -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=OFF {CURL_FLAG}" - ) - + n_jobs = max(int(psutil.cpu_count()), 1) # Use less CPUs since 1.5x faster + check = os.system("cmake llama.cpp -B llama.cpp/build -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=OFF -DLLAMA_CURL=ON") if check != 0: - raise RuntimeError( - f"*** Unsloth: Failed compiling llama.cpp using os.system(...) with error {check}. Please report this ASAP!" - ) + raise RuntimeError(f"*** Unsloth: Failed compiling llama.cpp using os.system(...) with error {check}. Please report this ASAP!") + pass # f"cmake --build llama.cpp/build --config Release -j{psutil.cpu_count()*2} --clean-first --target {' '.join(LLAMA_CPP_TARGETS)}", full_command = [ - "cmake", - "--build", - "llama.cpp/build", - "--config", - "Release", - "-j" + str(n_jobs), + "cmake", "--build", "llama.cpp/build", + "--config", "Release", + "-j"+str(n_jobs), "--clean-first", "--target", ] + LLAMA_CPP_TARGETS IS_CMAKE = True + pass # https://github.com/ggerganov/llama.cpp/issues/7062 # Weirdly GPU conversion for GGUF breaks?? 
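The non-blocking installer helpers above share one pattern: launch the long step with `subprocess.Popen`, discard output, and hand the process handle back so the caller can keep working. A self-contained sketch:

```python
import subprocess
import sys

def run_non_blocking(full_command):
    # Start a long-running step (e.g. a git clone or a build) without blocking;
    # stdout/stderr are discarded and the caller waits/polls on the handle later.
    return subprocess.Popen(full_command, stdout = subprocess.DEVNULL, stderr = subprocess.STDOUT)

proc = run_non_blocking([sys.executable, "-c", "print('building...')"])
proc.wait()  # a real caller would do other work first, then wait
```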
# run_installer = subprocess.Popen(full_command, env = env, stdout = subprocess.DEVNULL, stderr = subprocess.STDOUT) - run_installer = subprocess.Popen( - full_command, stdout = subprocess.DEVNULL, stderr = subprocess.STDOUT - ) + run_installer = subprocess.Popen(full_command, stdout = subprocess.DEVNULL, stderr = subprocess.STDOUT) return run_installer, IS_CMAKE +pass def install_python_non_blocking(packages = []): full_command = ["pip", "install"] + packages - run_installer = subprocess.Popen( - full_command, stdout = subprocess.DEVNULL, stderr = subprocess.STDOUT - ) + run_installer = subprocess.Popen(full_command, stdout = subprocess.DEVNULL, stderr = subprocess.STDOUT) return run_installer +pass def try_execute(commands, force_complete = False): for command in commands: - with subprocess.Popen( - command, - shell = True, - stdout = subprocess.PIPE, - stderr = subprocess.STDOUT, - bufsize = 1, - ) as sp: + with subprocess.Popen(command, shell = True, stdout = subprocess.PIPE, stderr = subprocess.STDOUT, bufsize = 1) as sp: for line in sp.stdout: line = line.decode("utf-8", errors = "replace") if "undefined reference" in line: - raise RuntimeError( - f"*** Unsloth: Failed compiling llama.cpp with {line}. Please report this ASAP!" - ) + raise RuntimeError(f"*** Unsloth: Failed compiling llama.cpp with {line}. Please report this ASAP!") elif "deprecated" in line: return "CMAKE" elif "Unknown argument" in line: - raise RuntimeError( - f"*** Unsloth: Failed compiling llama.cpp with {line}. Please report this ASAP!" - ) + raise RuntimeError(f"*** Unsloth: Failed compiling llama.cpp with {line}. Please report this ASAP!") elif "***" in line: - raise RuntimeError( - f"*** Unsloth: Failed compiling llama.cpp with {line}. Please report this ASAP!" - ) + raise RuntimeError(f"*** Unsloth: Failed compiling llama.cpp with {line}. 
Please report this ASAP!") print(line, flush = True, end = "") + pass if force_complete and sp.returncode is not None and sp.returncode != 0: raise subprocess.CalledProcessError(sp.returncode, sp.args) + pass + pass return None +pass def install_llama_cpp_old(version = -10): # Download the 10th latest release since the latest might be broken! # FALLBACK mechanism - releases = subprocess.check_output( - ["git", "ls-remote", "--tags", "https://github.com/ggerganov/llama.cpp.git"] - ) + releases = subprocess.check_output(["git", "ls-remote", "--tags", "https://github.com/ggerganov/llama.cpp.git"]) releases = releases.decode("utf-8").replace("\t", " ").split("\n") for i, x in enumerate(releases): - if "refs/tags/b" not in x: - break + if "refs/tags/b" not in x: break releases = releases[:i] latest = releases[-1] version = releases[version].split(" ")[0] @@ -985,18 +841,17 @@ def install_llama_cpp_old(version = -10): # Check if the llama.cpp exists if os.path.exists("llama.cpp"): print( - "**[WARNING]** You have a llama.cpp directory which is broken.\n" - "Unsloth will DELETE the broken directory and install a new one.\n" + "**[WARNING]** You have a llama.cpp directory which is broken.\n"\ + "Unsloth will DELETE the broken directory and install a new one.\n"\ "Press CTRL + C / cancel this if this is wrong. We shall wait 30 seconds.\n" ) import time - for i in range(30): print(f"**[WARNING]** Deleting llama.cpp directory... {30-i} seconds left.") time.sleep(1) import shutil - shutil.rmtree("llama.cpp", ignore_errors = True) + pass # Clone a specific commit # Also don't use the GPU! 
@@ -1009,33 +864,27 @@ def install_llama_cpp_old(version = -10): # Try using MAKE commands = [ "make clean -C llama.cpp", - f"make all -j{(psutil.cpu_count() or 1)*2} -C llama.cpp", + f"make all -j{psutil.cpu_count()*2} -C llama.cpp", ] if try_execute(commands) == "CMAKE": # Instead use CMAKE commands = [ - f"cmake llama.cpp -B llama.cpp/build -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=OFF {CURL_FLAG}", - f"cmake --build llama.cpp/build --config Release -j{(psutil.cpu_count() or 1)*2} --clean-first --target {' '.join(LLAMA_CPP_TARGETS)}", + "cmake llama.cpp -B llama.cpp/build -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=OFF -DLLAMA_CURL=ON", + f"cmake --build llama.cpp/build --config Release -j{psutil.cpu_count()*2} --clean-first --target {' '.join(LLAMA_CPP_TARGETS)}", "cp llama.cpp/build/bin/llama-* llama.cpp", "rm -rf llama.cpp/build", ] - try_execute(commands) + pass # Check if successful - if not ( - os.path.exists("llama.cpp/llama-quantize.exe") - or os.path.exists("llama.cpp/llama-quantize") - or os.path.exists("llama.cpp/quantize.exe") - or os.path.exists("llama.cpp/quantize") - or os.path.exists("llama.cpp/build/bin/llama-quantize") - or os.path.exists("llama.cpp/build/bin/quantize") - ): + if not os.path.exists("llama.cpp/quantize") and not os.path.exists("llama.cpp/llama-quantize"): raise RuntimeError( - "Unsloth: The file 'llama.cpp/llama-quantize' or `llama.cpp/quantize` does not exist.\n" - "We've also double checked the building directory under 'llama.cpp/build/bin/'.\n" - "But we expect this file to exist! Check if the file exists under llama.cpp and investigate the building process of llama.cpp (make/cmake)!" + "Unsloth: The file 'llama.cpp/llama-quantize' or `llama.cpp/quantize` does not exist.\n"\ + "But we expect this file to exist! Maybe the llama.cpp developers changed the name?" 
) + pass +pass def install_llama_cpp_blocking(use_cuda = False): @@ -1047,26 +896,27 @@ def install_llama_cpp_blocking(use_cuda = False): "git clone --recursive https://github.com/ggerganov/llama.cpp", "pip install gguf protobuf", ] - if os.path.exists("llama.cpp"): - return + if os.path.exists("llama.cpp"): return try_execute(commands) commands = [ "make clean -C llama.cpp", # https://github.com/ggerganov/llama.cpp/issues/7062 # Weirdly GPU conversion for GGUF breaks?? - # f"{use_cuda} make all -j{(psutil.cpu_count() or 1)*2} -C llama.cpp", - f"make all -j{(psutil.cpu_count() or 1)*2} -C llama.cpp", + # f"{use_cuda} make all -j{psutil.cpu_count()*2} -C llama.cpp", + f"make all -j{psutil.cpu_count()*2} -C llama.cpp", ] if try_execute(commands) == "CMAKE": # Instead use CMAKE commands = [ - f"cmake llama.cpp -B llama.cpp/build -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=OFF {CURL_FLAG}", - f"cmake --build llama.cpp/build --config Release -j{(psutil.cpu_count() or 1)*2} --clean-first --target {' '.join(LLAMA_CPP_TARGETS)}", + "cmake llama.cpp -B llama.cpp/build -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=OFF -DLLAMA_CURL=ON", + f"cmake --build llama.cpp/build --config Release -j{psutil.cpu_count()*2} --clean-first --target {' '.join(LLAMA_CPP_TARGETS)}", "cp llama.cpp/build/bin/llama-* llama.cpp", "rm -rf llama.cpp/build", ] try_execute(commands) + pass +pass def get_executable(executables): @@ -1077,80 +927,73 @@ def get_executable(executables): for executable in executables: path = os.path.join(directory, executable) # Check if the executable exists and is executable - if os.path.exists(path) and os.access(path, os.X_OK): - return path + if os.path.exists(path) and os.access(path, os.X_OK): return path + pass + pass return None +pass def save_to_gguf( - model_name: str, - model_type: str, - model_dtype: str, - is_sentencepiece: bool = False, - model_directory: str = "unsloth_finetuned_model", - quantization_method = "fast_quantized", # Can be a list of options! 
["q4_k_m", "q8_0", "q5_k_m"] - first_conversion: str = None, - is_vlm: bool = False, - is_gpt_oss: bool = False, + model_type : str, + model_dtype : str, + is_sentencepiece : bool = False, + model_directory : str = "unsloth_finetuned_model", + quantization_method = "fast_quantized", # Can be a list of options! ["q4_k_m", "q8_0", "q5_k_m"] + first_conversion : str = None, + _run_installer = None, # Non blocking install of llama.cpp ): - """ - Orchestrates the complete GGUF conversion process. - Handles installation, conversion, and quantization. - """ - # print_output True only if UNSLOTH_ENABLE_LOGGING=1 - if os.environ.get("UNSLOTH_ENABLE_LOGGING", "0") == "1": - print_output = True - else: - print_output = False - - # Validate model dtype - assert model_dtype == "float16" or model_dtype == "bfloat16" + # logger.warning( + # "NOTICE: llama.cpp GGUF conversion is currently unstable, since llama.cpp is\n"\ + # "undergoing some major bug fixes as at 5th of May 2024. This is not an Unsloth issue.\n"\ + # "Please be patient - GGUF saving should still work, but might not work as well." 
+ # ) + assert(model_dtype == "float16" or model_dtype == "bfloat16") model_dtype = "f16" if model_dtype == "float16" else "bf16" # Convert quantization_method to list - if isinstance(quantization_method, list): - pass - elif isinstance(quantization_method, str): - quantization_method = [ - quantization_method, - ] - elif isinstance(quantization_method, tuple): - quantization_method = list(quantization_method) + if isinstance(quantization_method, list): pass + elif isinstance(quantization_method, str): quantization_method = [ quantization_method, ] + elif isinstance(quantization_method, tuple): quantization_method = list(quantization_method) else: - raise TypeError( - "Unsloth: quantization_method can only be a string or a list of strings" - ) - + raise TypeError("Unsloth: quantization_method can only be a string or a list of strings") + pass + # Check if bfloat16 is supported if model_dtype == "bf16" and not torch.cuda.is_bf16_supported(): logger.warning( - "Unsloth: Cannot convert to bf16 GGUF since your computer doesn't support it.\n" + "Unsloth: Cannot convert to bf16 GGUF since your computer doesn't support it.\n"\ "We shall switch instead to f16." ) model_dtype = "f16" + pass # Check first_conversion as well if first_conversion is None: first_conversion = model_dtype + pass # Check I quants - for quant_method in quantization_method: + for quant_method in quantization_method: if quant_method.startswith("iq2"): - raise RuntimeError( - "Unsloth: Currently iq2 type quantizations aren't supported yet - sorry!" - ) + raise RuntimeError("Unsloth: Currently iq2 type quantizations aren't supported yet - sorry!") + pass + + # Careful convert.py is only for Llama / Mistral based archs + use_fast_convert = False + if not is_sentencepiece: use_fast_convert = False # Llama-3 + elif model_type == "llama": use_fast_convert = True + elif model_type == "mistral": use_fast_convert = True + pass + logger.warning_once(f"Unsloth: Converting {model_type} model. 
Can use fast conversion = {use_fast_convert}.") # Map quant methods - new_quantization_methods = [] + new_quantization_method = [] for quant_method in quantization_method: - if quant_method == "not_quantized": - quant_method = model_dtype - elif quant_method == "fast_quantized": - quant_method = "q8_0" - elif quant_method == "quantized": - quant_method = "q4_k_m" - elif quant_method is None: - quant_method = "q8_0" + if quant_method == "not_quantized": quant_method = model_dtype + elif quant_method == "fast_quantized": quant_method = "q8_0" + elif quant_method == "quantized": quant_method = "q4_k_m" + elif quant_method is None: quant_method = "q8_0" # Check if wrong method if quant_method not in ALLOWED_QUANTS.keys(): @@ -1158,249 +1001,234 @@ def save_to_gguf( for key, value in ALLOWED_QUANTS.items(): error += f"[{key}] => {value}\n" raise RuntimeError(error) + pass - new_quantization_methods.append(quant_method) - quantization_method = new_quantization_methods + new_quantization_method.append(quant_method) + pass + quantization_method = new_quantization_method - # Determine optimal first_conversion - if is_gpt_oss: - print("Unsloth: GPT-OSS model detected - using special conversion settings") - first_conversion = "None" # No quantization for GPT-OSS - # Only keep one conversion method since GPT-OSS doesn't quantize - quantization_method = ["None"] - else: - if first_conversion is None: - # Check if q8_0 is the ONLY quantization method requested - if len(quantization_method) == 1 and quantization_method[0] == "q8_0": - first_conversion = "None" # Let llama-quantize do the direct conversion - else: - # For all other cases, choose the highest precision format - # that can be requantized to all requested formats - strength = 0 - for quant_method in quantization_method: - if quant_method == "f32": - strength = max(strength, 3) - elif quant_method == "f16": - strength = max(strength, 2) - elif quant_method == "bf16": - strength = max(strength, 1) - # Note: we don't set 
strength for q8_0 here since we handle it above - - if strength >= 3: - first_conversion = "f32" - elif strength >= 2: - first_conversion = "f16" - elif strength >= 1: - first_conversion = "bf16" - else: - first_conversion = "bf16" # requantizing from q8_0 disallowed in new llama.cpp default to bf16. - - # Check bfloat16 support again for first_conversion - if first_conversion == "bf16" and not torch.cuda.is_bf16_supported(): - logger.warning("Unsloth: Switching bf16 to f16 due to hardware limitations") - first_conversion = "f16" - - first_conversion_dtype = "" if first_conversion == "None" else first_conversion - # Print conversion info - print_info = ( - f"==((====))== Unsloth: Conversion from HF to GGUF information\n" - f" {chr(92)}{chr(92)} /| [0] Installing llama.cpp might take 3 minutes.\n" - f"O^O/ {chr(92)}_/ {chr(92)} [1] Converting HF to GGUF {first_conversion_dtype} might take 3 minutes.\n" - f"{chr(92)} / [2] Converting GGUF {first_conversion_dtype} to {quantization_method} might take 10 minutes each.\n" + print_info = \ + f"==((====))== Unsloth: Conversion from QLoRA to GGUF information\n"\ + f" \\\ /| [0] Installing llama.cpp might take 3 minutes.\n"\ + f"O^O/ \_/ \\ [1] Converting HF to GGUF 16bits might take 3 minutes.\n"\ + f"\ / [2] Converting GGUF 16bits to {quantization_method} might take 10 minutes each.\n"\ f' "-____-" In total, you will have to wait at least 16 minutes.\n' - ) print(print_info) - # Step 1: Ensure llama.cpp is installed - try: - quantizer_location, converter_location = check_llama_cpp() - print("Unsloth: llama.cpp found in the system. Skipping installation.") - except: - print("Unsloth: Installing llama.cpp. 
This might take 3 minutes...") - if IS_KAGGLE_ENVIRONMENT: - # Kaggle: no CUDA support due to environment limitations - quantizer_location, converter_location = install_llama_cpp( - gpu_support = False, print_output = print_output - ) + # Check first_conversion format + if first_conversion == "f16" : pass + elif first_conversion == "bf16" : pass + elif first_conversion == "f32" : pass + elif first_conversion == "q8_0" : pass + else: + raise RuntimeError( + f"Unsloth: `first_conversion` can only be one of ['f16', 'bf16', 'f32', 'q8_0'] and not `{first_conversion}`." + ) + pass + + # Determine maximum first_conversion state + if first_conversion == "f32" : strength = 3 + elif first_conversion == "f16" : strength = 2 + elif first_conversion == "bf16" : strength = 1 + elif first_conversion == "q8_0" : strength = 0 + + for quant_method in quantization_method: + if quant_method == "f32": strength = max(strength, 3) + elif quant_method == "f16": strength = max(strength, 2) + elif quant_method == "bf16": strength = max(strength, 1) + elif quant_method == "q8_0": strength = max(strength, 0) else: - quantizer_location, converter_location = install_llama_cpp( - gpu_support = False, # GGUF conversion doesn't need CUDA - print_output = print_output, - ) - - # Step 2: Download and patch converter script - print("Unsloth: Preparing converter script...") - with use_local_gguf(): - converter_path, supported_text_archs, supported_vision_archs = ( - _download_convert_hf_to_gguf() - ) - - # Step 3: Initial GGUF conversion - print( - f"Unsloth: [1] Converting model into {first_conversion_dtype} GGUF format." 
-        )
-        print(f"This might take 3 minutes...")
-
-        initial_files, is_vlm_update = convert_to_gguf(
-            model_name = model_name,
-            input_folder = model_directory,
-            model_dtype = model_dtype,
-            quantization_type = first_conversion,
-            converter_location = converter_path,
-            supported_text_archs = supported_text_archs,
-            supported_vision_archs = supported_vision_archs,
-            is_vlm = is_vlm,
-            is_gpt_oss = is_gpt_oss,
-            max_shard_size = "50GB",
-            print_output = print_output,
-        )
-        # update is_vlm switch
-        is_vlm = is_vlm_update
-        # Check conversion success
-        for file in initial_files:
-            if not os.path.exists(file):
-                if IS_KAGGLE_ENVIRONMENT:
-                    raise RuntimeError(
-                        f"Unsloth: Conversion failed for {file}\n"
-                        "You are in a Kaggle environment with limited disk space (20GB).\n"
-                        "Try saving to /tmp for more space or use a smaller model.\n"
-                        "Alternatively, save the 16bit model first, then convert manually."
-                    )
-                else:
-                    raise RuntimeError(
-                        f"Unsloth: Conversion failed for {file}\n"
-                        "Please check disk space and try again."
+    # Quantized models must have f16 as the default argument
+    if   first_conversion == "f32"  : pass
+    elif first_conversion == "f16"  : pass
+    elif first_conversion == "bf16" : pass
+    elif first_conversion == "q8_0":
+        logger.warning_once(
+            "Unsloth: Using q8_0 for the `first_conversion` will lose a bit of accuracy, "\
+            "but saves disk space!"
                     )
+        # first_conversion = "f16"
+        pass
+    pass
+    pass

-        # Move initial GGUF files into a dedicated _gguf directory
-        gguf_directory = f"{model_directory}_gguf"
-        os.makedirs(gguf_directory, exist_ok = True)
-        moved_files = []
-        for fpath in initial_files:
-            dst = os.path.join(gguf_directory, os.path.basename(fpath))
-            shutil.move(fpath, dst)
-            moved_files.append(dst)
-        initial_files = moved_files
-        print(f"Unsloth: Initial conversion completed! Files: {initial_files}")
+    # If only q8_0:
+    if len(quantization_method) == 1 and quantization_method[0] == "q8_0":
+        strength = 0
+    pass

+    if   strength >= 3: first_conversion = "f32"
+    elif strength >= 2: first_conversion = "f16"
+    elif strength >= 1: first_conversion = "bf16"
+    else:               first_conversion = "q8_0"
-        # Step 4: Additional quantizations using llama-quantize
-        all_saved_locations = initial_files.copy()

+    # Non-Llama/Mistral models can only use f32 or f16
+    if not use_fast_convert and \
+        first_conversion not in ("f16", "bf16", "f32"):
+
+        pass
+        # Latest llama.cpp works for all models for q8_0!
+
+        # logger.warning_once("Unsloth: We must use f16 for non Llama and Mistral models.")
+        # first_conversion = "f16"
+    pass
+
+    # Check if bfloat16 is supported
+    if first_conversion == "bf16" and not torch.cuda.is_bf16_supported():
+        logger.warning(
+            "Unsloth: Cannot convert to bf16 GGUF since your computer doesn't support it.\n"\
+            "We shall switch instead to f16."
+        )
+        first_conversion = "f16"
+    pass

-        # Get CPU count for quantization
     n_cpus = psutil.cpu_count()
-    if n_cpus is None:
-        n_cpus = 1
+    if n_cpus is None: n_cpus = 1
     n_cpus *= 2
+    # Concurrency from https://rentry.org/llama-cpp-conversions#merging-loras-into-a-model

-        if not is_gpt_oss:
-            base_gguf = initial_files[0]
-            quants_created = False
-            for quant_method in quantization_method:
-                if quant_method != first_conversion:
-                    print(
-                        f"Unsloth: [2] Converting GGUF {first_conversion_dtype} into {quant_method}. This might take 10 minutes..."
+    final_location = str((Path(model_directory) / f"unsloth.{first_conversion.upper()}.gguf").absolute())
+
+    print(f"Unsloth: [1] Converting model at {model_directory} into {first_conversion} GGUF format.\n"\
+          f"The output location will be {final_location}\n"\
+          "This might take 3 minutes...")
+
+    # We first check if tokenizer.model exists in the model_directory
+    if os.path.exists(f"{model_directory}/tokenizer.model"):
+        vocab_type = "spm,hfft,bpe"
+        # Fix Sentencepiece model as well!
+        fix_sentencepiece_gguf(model_directory)
+    else:
+        vocab_type = "bpe"
+    pass
+
+    # convert.py is deprecated!
+    use_fast_convert = False
+    if use_fast_convert:
+        command = f"python llama.cpp/convert.py {model_directory} "\
+            f"--outfile {final_location} --vocab-type {vocab_type} "\
+            f"--outtype {first_conversion} --concurrency {n_cpus} --pad-vocab"
+    else:
+        command = f"python {convert_location} {model_directory} "\
+            f"--outfile {final_location} "\
+            f"--outtype {first_conversion}"
+    pass
+
+    try_execute([command,], force_complete = True)
+
+    # Check if quantization succeeded!
+    if not os.path.isfile(final_location):
+        if IS_KAGGLE_ENVIRONMENT:
+            if not Path(final_location).resolve().is_relative_to(Path('/tmp').resolve()):
+                raise RuntimeError(
+                    f"Unsloth: Quantization failed for {final_location}\n"\
+                    "You are in a Kaggle environment, which might be the reason this is failing.\n"\
+                    "Kaggle only provides 20GB of disk space in the working directory.\n"\
+                    "Merging to 16bit for 7b models uses 16GB of space.\n"\
+                    "This means using `model.{save_pretrained/push_to_hub}_merged` works, but\n"\
+                    "`model.{save_pretrained/push_to_hub}_gguf` will use too much disk space.\n"\
+                    "You can try saving it to the `/tmp` directory for larger disk space.\n"\
+                    "I suggest saving the 16bit model first, then using manual llama.cpp conversion."
                 )
-                    output_location = os.path.join(
-                        gguf_directory, f"{model_name}.{quant_method.upper()}.gguf"
-                    )
-                    try:
-                        # Use the quantize_gguf function we created
-                        quantized_file = quantize_gguf(
-                            input_gguf = base_gguf,
-                            output_gguf = output_location,
-                            quant_type = quant_method,
-                            quantizer_location = quantizer_location,
-                            print_output = print_output,
+            else:
+                raise RuntimeError(
+                    f"Unsloth: Quantization failed for {final_location}\n"\
+                    "You might have to compile llama.cpp yourself, then run this again.\n"\
+                    "You do not need to close this Python program. Run the following commands in a new terminal:\n"\
+                    "You must run this in the same folder as you're saving your model.\n"\
+                    "git clone --recursive https://github.com/ggerganov/llama.cpp\n"\
+                    "cd llama.cpp && make clean && make all -j\n"\
+                    "Once that's done, redo the quantization."
+                )
+        pass
+    pass
+    print(f"Unsloth: Conversion completed! Output location: {final_location}")
+
+    full_precision_location = final_location
+
+    all_saved_locations = [full_precision_location,]
+    # Convert each type!
+    for quant_method in quantization_method:
+        if quant_method != first_conversion:
+            print(f"Unsloth: [2] Converting GGUF 16bit into {quant_method}. This might take 20 minutes...")
+            final_location = str((Path(model_directory) / f"unsloth.{quant_method.upper()}.gguf").absolute())
+
+            command = f"./{quantize_location} {full_precision_location} "\
+                f"{final_location} {quant_method} {n_cpus}"
+
+            try_execute([command,], force_complete = True)
+
+            # Check if quantization succeeded!
+            if not os.path.isfile(final_location):
+                if IS_KAGGLE_ENVIRONMENT:
+                    if not Path(final_location).resolve().is_relative_to(Path('/tmp').resolve()):
+                        raise RuntimeError(
+                            f"Unsloth: Quantization failed for {final_location}\n"\
+                            "You are in a Kaggle environment, which might be the reason this is failing.\n"\
+                            "Kaggle only provides 20GB of disk space in the working directory.\n"\
+                            "Merging to 16bit for 7b models uses 16GB of space.\n"\
+                            "This means using `model.{save_pretrained/push_to_hub}_merged` works, but\n"\
+                            "`model.{save_pretrained/push_to_hub}_gguf` will use too much disk space.\n"\
+                            "You can try saving it to the `/tmp` directory for larger disk space.\n"\
+                            "I suggest saving the 16bit model first, then using manual llama.cpp conversion."
+                        )
+                else:
+                    raise RuntimeError(
+                        "Unsloth: Quantization failed! You might have to compile llama.cpp yourself, then run this again.\n"\
+                        "You do not need to close this Python program. Run the following commands in a new terminal:\n"\
+                        "You must run this in the same folder as you're saving your model.\n"\
+                        "git clone --recursive https://github.com/ggerganov/llama.cpp\n"\
+                        "cd llama.cpp && make clean && make all -j\n"\
+                        "Once that's done, redo the quantization."
                     )
-                    all_saved_locations.append(quantized_file)
-                    quants_created = True
-                except Exception as e:
-                    if IS_KAGGLE_ENVIRONMENT:
-                        raise RuntimeError(
-                            f"Unsloth: Quantization failed for {output_location}\n"
-                            "You are in a Kaggle environment, which might be the reason this is failing.\n"
-                            "Kaggle only provides 20GB of disk space in the working directory.\n"
-                            "Merging to 16bit for 7b models use 16GB of space.\n"
-                            "This means using `model.{save_pretrained/push_to_hub}_merged` works, but\n"
-                            "`model.{save_pretrained/push_to_hub}_gguf will use too much disk space.\n"
-                            "You can try saving it to the `/tmp` directory for larger disk space.\n"
-                            "I suggest you to save the 16bit model first, then use manual llama.cpp conversion.\n"
-                            f"Error: {e}"
-                        )
-                    else:
-                        if IS_WINDOWS:
-                            build_instructions = (
-                                f'cd "{LLAMA_CPP_DEFAULT_DIR}"\n'
-                                f"cmake -S . -B build -DBUILD_SHARED_LIBS=OFF\n"
-                                f"cmake --build build --config Release"
-                            )
-                        else:
-                            build_instructions = f'cd "{LLAMA_CPP_DEFAULT_DIR}" && make clean && make all -j'
+                pass
+            pass
-                        raise RuntimeError(
-                            f"Unsloth: Quantization failed for {output_location}\n"
-                            "You might have to compile llama.cpp yourself, then run this again.\n"
-                            "You do not need to close this Python program. Run the following commands in a new terminal:\n"
-                            f'git clone --recursive https://github.com/ggerganov/llama.cpp "{LLAMA_CPP_DEFAULT_DIR}"\n'
-                            f"{build_instructions}\n"
-                            "Once that's done, redo the quantization.\n"
-                            f"Error: {e}"
-                        )
-            print("Unsloth: Model files cleanup...")
-            if quants_created:
-                all_saved_locations.remove(base_gguf)
-                Path(base_gguf).unlink(missing_ok = True)
+            print(f"Unsloth: Conversion completed! Output location: {final_location}")
+            all_saved_locations.append(final_location)
+        pass
+    pass
-            # flip the list to get [text_model, mmproj] order. for text models stays the same.
-            all_saved_locations.reverse()
-        else:
-            print("Unsloth: GPT-OSS model - skipping additional quantizations")

+    # Finally check if first_conversion (f16, bf16 etc) was in the list of actual quant methods
+    full_precision_seen = first_conversion in frozenset(quantization_method)
-        if is_gpt_oss:
-            want_full_precision = True
-        else:
-            want_full_precision = first_conversion in frozenset(quantization_method)
-
-        print(f"Unsloth: All GGUF conversions completed successfully!")
-        print(f"Generated files: {all_saved_locations}")
-
-        return all_saved_locations, want_full_precision, is_vlm
+    return all_saved_locations, full_precision_seen
+pass


 def unsloth_save_pretrained_merged(
     self,
-    save_directory: Union[str, os.PathLike],
-    tokenizer = None,
-    save_method: str = "merged_16bit",  # ["lora", "merged_16bit", "merged_4bit"]
-    push_to_hub: bool = False,
-    token: Optional[Union[str, bool]] = None,
-    is_main_process: bool = True,
-    state_dict: Optional[dict] = None,
-    save_function: Callable = torch.save,
-    max_shard_size: Union[int, str] = "5GB",
-    safe_serialization: bool = True,
-    variant: Optional[str] = None,
-    save_peft_format: bool = True,
-    tags: List[str] = None,
-    temporary_location: str = "_unsloth_temporary_saved_buffers",
-    maximum_memory_usage: float = 0.75,
-    datasets: Optional[List[str]] = None,
+    save_directory : Union[str, os.PathLike],
+    tokenizer = None,
+    save_method : str = "merged_16bit", # ["lora", "merged_16bit", "merged_4bit"]
+    push_to_hub : bool = False,
+    token : Optional[Union[str, bool]] = None,
+    is_main_process : bool = True,
+    state_dict : Optional[dict] = None,
+    save_function : Callable = torch.save,
+    max_shard_size : Union[int, str] = "5GB",
+    safe_serialization : bool = True,
+    variant : Optional[str] = None,
+    save_peft_format : bool = True,
+    tags : List[str] = None,
+    temporary_location : str = "_unsloth_temporary_saved_buffers",
+    maximum_memory_usage : float = 0.75,
 ):
     """
-    Same as .save_pretrained(...) except 4bit weights are auto
-    converted to float16 with as few overhead as possible.
+    Same as .save_pretrained(...) except 4bit weights are auto
+    converted to float16 with as little overhead as possible.

-    Choose for `save_method` to be either:
-    1. `16bit`: Merge LoRA into float16 weights. Useful for GGUF / llama.cpp.
-    2. `4bit`: Merge LoRA into int4 weights. Useful for DPO / HF inference.
-    3. `lora`: Save LoRA adapters with no merging. Useful for HF inference.
+    Choose `save_method` to be either:
+    1. `16bit`: Merge LoRA into float16 weights. Useful for GGUF / llama.cpp.
+    2. `4bit`: Merge LoRA into int4 weights. Useful for DPO / HF inference.
+    3. `lora`: Save LoRA adapters with no merging. Useful for HF inference.
     """
     if tokenizer is None:
         logger.warning_once(
-            "Unsloth: You're not saving a tokenizer as well?\n"
+            "Unsloth: You're not saving a tokenizer as well?\n"\
             "You can do it separately via `tokenizer.save_pretrained(...)`"
         )
+    pass

     arguments = dict(locals())
     arguments["model"] = self
@@ -1408,54 +1236,57 @@ def unsloth_save_pretrained_merged(
     unsloth_save_model(**arguments)
     for _ in range(3):
         gc.collect()
+pass


 def unsloth_push_to_hub_merged(
     self,
-    repo_id: str,
-    tokenizer = None,
-    save_method: str = "merged_16bit",  # ["lora", "merged_16bit", "merged_4bit"]
-    use_temp_dir: Optional[bool] = None,
-    commit_message: Optional[str] = "Trained with Unsloth",
-    private: Optional[bool] = None,
-    token: Union[bool, str, None] = None,
-    max_shard_size: Union[int, str, None] = "5GB",
-    create_pr: bool = False,
-    safe_serialization: bool = True,
-    revision: str = None,
-    commit_description: str = "Upload model trained with Unsloth 2x faster",
-    tags: Optional[List[str]] = None,
-    temporary_location: str = "_unsloth_temporary_saved_buffers",
-    maximum_memory_usage: float = 0.75,
-    datasets: Optional[List[str]] = None,
+    repo_id : str,
+    tokenizer = None,
+    save_method : str = "merged_16bit", # ["lora", "merged_16bit", "merged_4bit"]
+    use_temp_dir : Optional[bool] = None,
+    commit_message : Optional[str] = "Trained with Unsloth",
+    private : Optional[bool] = None,
+    token : Union[bool, str, None] = None,
+    max_shard_size : Union[int, str, None] = "5GB",
+    create_pr : bool = False,
+    safe_serialization : bool = True,
+    revision : str = None,
+    commit_description : str = "Upload model trained with Unsloth 2x faster",
+    tags : List[str] = None,
+    temporary_location : str = "_unsloth_temporary_saved_buffers",
+    maximum_memory_usage : float = 0.75,
 ):
     """
-    Same as .push_to_hub(...) except 4bit weights are auto
-    converted to float16 with as few overhead as possible.
+    Same as .push_to_hub(...) except 4bit weights are auto
+    converted to float16 with as little overhead as possible.

-    Choose for `save_method` to be either:
-    1. `16bit`: Merge LoRA into float16 weights. Useful for GGUF / llama.cpp.
-    2. `4bit`: Merge LoRA into int4 weights. Useful for DPO / HF inference.
-    3. `lora`: Save LoRA adapters with no merging. Useful for HF inference.
+    Choose `save_method` to be either:
+    1. `16bit`: Merge LoRA into float16 weights. Useful for GGUF / llama.cpp.
+    2. `4bit`: Merge LoRA into int4 weights. Useful for DPO / HF inference.
+    3. `lora`: Save LoRA adapters with no merging. Useful for HF inference.
""" if tokenizer is None: logger.warning_once( - "Unsloth: You're not saving a tokenizer as well?\n" + "Unsloth: You're not saving a tokenizer as well?\n"\ "You can do it separately via `tokenizer.push_to_hub(...)`" ) + pass arguments = dict(locals()) - arguments["model"] = self + arguments["model"] = self arguments["save_directory"] = repo_id - arguments["push_to_hub"] = True + arguments["push_to_hub"] = True del arguments["self"] del arguments["repo_id"] unsloth_save_model(**arguments) for _ in range(3): gc.collect() +pass -MODEL_CARD = """--- +MODEL_CARD = \ +"""--- base_model: {base_model} tags: - text-generation-inference @@ -1474,7 +1305,7 @@ language: - **License:** apache-2.0 - **Finetuned from model :** {base_model} -This {model_type} model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) +This {model_type} model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Huggingface's TRL library. [](https://github.com/unslothai/unsloth) """ @@ -1485,19 +1316,19 @@ def _determine_username(save_directory, old_username, token): save_directory = save_directory.lstrip("./") if "/" not in save_directory: from huggingface_hub import whoami - - try: + try: username = whoami(token = token)["name"] if type(old_username) is str and username != old_username: username = old_username + pass save_directory = f"{username}/{save_directory}" except: - raise RuntimeError( - f"Unsloth: {save_directory} is not a Huggingface directory." 
-            )
+            raise RuntimeError(f"Unsloth: {save_directory} is not a Huggingface directory.")
     else:
         username = save_directory.split("/")[0]
+    pass
     return save_directory, username
+pass


 def create_huggingface_repo(
@@ -1505,52 +1336,38 @@
     model,
     save_directory,
     token = None,
     private = False,
-    datasets = None,
 ):
-    if token is None:
+    if token is None :
         token = get_token()
-    save_directory, username = _determine_username(save_directory, None, token)
+    pass
+    save_directory, username = _determine_username(save_directory, "", token)
     from huggingface_hub import create_repo
-
     try:
         create_repo(
-            repo_id = save_directory,
-            token = token,
+            repo_id   = save_directory,
+            token     = token,
             repo_type = "model",
-            exist_ok = False,
-            private = private,
-        )
+            exist_ok  = False,
+            private   = private,
+        )

         # Create model card
         from huggingface_hub import ModelCard
-
         content = MODEL_CARD.format(
-            username = username,
+            username   = username,
             base_model = model.config._name_or_path,
             model_type = model.config.model_type,
-            method = "",
-            extra = "unsloth",
+            method     = "",
+            extra      = "unsloth",
         )
         card = ModelCard(content)
-        if datasets:
-            card.data.datasets = datasets
         card.push_to_hub(save_directory, token = token)
     except:
-        # Repo already exists — update datasets metadata separately
-        if datasets:
-            try:
-                from huggingface_hub import metadata_update
-
-                metadata_update(
-                    save_directory, {"datasets": datasets}, overwrite = True, token = token
-                )
-            except Exception as e:
-                logger.warning_once(
-                    f"Unsloth: Could not update datasets metadata for {save_directory}: {e}"
-                )
+        pass

     hf_api = HfApi(token = token)
     return save_directory, hf_api
+pass


 def upload_to_huggingface(
@@ -1563,901 +1380,539 @@
     old_username = None,
     private = None,
     create_config = True,
-    datasets = None,
 ):
     save_directory, username = _determine_username(save_directory, old_username, token)

     from huggingface_hub import create_repo
-
     try:
         create_repo(
-            repo_id = save_directory,
-            token = token,
+            repo_id   = save_directory,
+            token     = token,
             repo_type = "model",
-            exist_ok = False,
-            private = private,
-        )
+            exist_ok  = False,
+            private   = private,
+        )

         # Create model card
         from huggingface_hub import ModelCard
-
         content = MODEL_CARD.format(
-            username = username,
+            username   = username,
             base_model = model.config._name_or_path,
             model_type = model.config.model_type,
-            method = "",
-            extra = extra,
+            method     = "",
+            extra      = extra,
         )
         card = ModelCard(content)
-        if datasets:
-            card.data.datasets = datasets
         card.push_to_hub(save_directory, token = token)
     except:
-        # Repo already exists — update datasets metadata separately
-        if datasets:
-            try:
-                from huggingface_hub import metadata_update
-
-                metadata_update(
-                    save_directory, {"datasets": datasets}, overwrite = True, token = token
-                )
-            except Exception as e:
-                logger.warning_once(
-                    f"Unsloth: Could not update datasets metadata for {save_directory}: {e}"
-                )
+        pass

     if file_location is not None:
         # Now upload file
         hf_api = HfApi(token = token)

         if "/" in file_location:
-            uploaded_location = file_location[file_location.rfind("/") + 1 :]
+            uploaded_location = file_location[file_location.rfind("/")+1:]
         else:
             uploaded_location = file_location
+        pass

         # find ftevent file from tensorboard and upload it
         import glob
-
         ftevent_files = glob.glob("*out.tfevents*", recursive = True)
         if len(ftevent_files) > 0:
-            print(
-                "Unsloth: Uploading tensorboard files... Please wait...",
-                file_location + "*out.tfevents*",
-            )
+            print("Unsloth: Uploading tensorboard files... Please wait...", file_location + "*out.tfevents*")
             for ftevent_file in ftevent_files:
                 hf_api.upload_file(
                     path_or_fileobj = ftevent_file,
-                    path_in_repo = ftevent_file.replace(file_location, ""),
-                    repo_id = save_directory,
-                    repo_type = "model",
-                    commit_message = "(Trained with Unsloth)",
+                    path_in_repo    = ftevent_file.replace(file_location, ""),
+                    repo_id         = save_directory,
+                    repo_type       = "model",
+                    commit_message  = "(Trained with Unsloth)",
                 )
+            pass
+        pass

         hf_api.upload_file(
             path_or_fileobj = file_location,
-            path_in_repo = uploaded_location,
-            repo_id = save_directory,
-            repo_type = "model",
-            commit_message = "(Trained with Unsloth)",
+            path_in_repo    = uploaded_location,
+            repo_id         = save_directory,
+            repo_type       = "model",
+            commit_message  = "(Trained with Unsloth)",
         )

         # We also upload a config.json file
         if create_config:
             import json
-
-            with open("_temporary_unsloth_config.json", "w", encoding = "utf-8") as file:
-                json.dump({"model_type": model.config.model_type}, file, indent = 4)
+            with open("_temporary_unsloth_config.json", "w") as file:
+                json.dump({"model_type" : model.config.model_type}, file, indent = 4)
+            pass
             hf_api.upload_file(
                 path_or_fileobj = "_temporary_unsloth_config.json",
-                path_in_repo = "config.json",
-                repo_id = save_directory,
-                repo_type = "model",
-                commit_message = "(Trained with Unsloth)",
+                path_in_repo    = "config.json",
+                repo_id         = save_directory,
+                repo_type       = "model",
+                commit_message  = "(Trained with Unsloth)",
             )
             os.remove("_temporary_unsloth_config.json")
+        pass
+    pass
     return username
+pass


 def fix_tokenizer_bos_token(tokenizer):
     # Check if BOS added already, then warn
     fix_bos_token = False
     chat_template = getattr(tokenizer, "chat_template", None)
+
+    if (tokenizer("A").input_ids[0] == getattr(tokenizer, "bos_token_id", None)):
+        if chat_template is not None and \
+            (
+                tokenizer.bos_token in chat_template or \
+                "{bos_token}" in chat_template.replace(" ", "") or \
+                "{bos_token+" in chat_template.replace(" ", "")
+            ):
-    if tokenizer("A").input_ids[0] == getattr(tokenizer, "bos_token_id", None):
-        if chat_template is not None and (
-            tokenizer.bos_token in chat_template
-            or "{bos_token}" in chat_template.replace(" ", "")
-            or "{bos_token+" in chat_template.replace(" ", "")
-        ):
             fix_bos_token = True
             logger.warning(
-                "Unsloth: ##### The current model auto adds a BOS token.\n"
+                "Unsloth: ##### The current model auto adds a BOS token.\n"\
                 "Unsloth: ##### Your chat template has a BOS token. We shall remove it temporarily."
             )
             # Remove {{bos_token}}
-            new_chat_template = re.sub(
-                r"\{[\s]{0,}\{[\s]{0,}bos\_token[\s]{0,}\}[\s]{0,}\}", "", chat_template
-            )
+            new_chat_template = re.sub(r"\{[\s]{0,}\{[\s]{0,}bos\_token[\s]{0,}\}[\s]{0,}\}", "", chat_template)
             # Remove {{bos_token +
-            new_chat_template = re.sub(
-                r"\{[\s]{0,}\{[\s]{0,}bos\_token[\s]{0,}\+[\s]{0,}",
-                "",
-                new_chat_template,
-            )
-
+            new_chat_template = re.sub(r"\{[\s]{0,}\{[\s]{0,}bos\_token[\s]{0,}\+[\s]{0,}", "", new_chat_template)
+
             tokenizer.chat_template = new_chat_template
+        pass
+    pass
     return fix_bos_token, chat_template
+pass


-def create_ollama_modelfile(tokenizer, base_model_name, model_location):
+def create_ollama_modelfile(tokenizer, gguf_location):
     """
-    Creates an Ollama Modelfile.
-    Use ollama.create(model = "new_ollama_model", modelfile = modelfile)
+    Creates an Ollama Modelfile.
+    Use ollama.create(model = "new_ollama_model", modelfile = modelfile)
     """
-    ollama_template_name = MODEL_TO_OLLAMA_TEMPLATE_MAPPER.get(base_model_name)
-    if not ollama_template_name:
-        print(
-            f"Unsloth: No Ollama template mapping found for model '{base_model_name}'. Skipping Ollama Modelfile"
-        )
-        return None
-    ollama_modelfile = OLLAMA_TEMPLATES.get(ollama_template_name)
-    if not ollama_modelfile:
-        print(
-            f"Unsloth: No Ollama template mapping found for model '{base_model_name}'. Skipping Ollama Modelfile"
-        )
-        return None
-    tokenizer._ollama_modelfile = (
-        ollama_modelfile  # This comes from the unpacking above
-    )
-    modelfile = ollama_modelfile
+    modelfile = getattr(tokenizer, "_ollama_modelfile", None)
+    if modelfile is None: return None

     FILE_LOCATION_REPLACER = "⚫@✅#🦥__FILE_LOCATION__⚡@🦥#⛵"
-    EOS_TOKEN_REPLACER = "⚫@✅#🦥__EOS_TOKEN__⚡@🦥#⛵"
-    LEFT_BRACKET_REPLACER = "⚫@✅#🦥"
+    EOS_TOKEN_REPLACER     = "⚫@✅#🦥__EOS_TOKEN__⚡@🦥#⛵"
+    LEFT_BRACKET_REPLACER  = "⚫@✅#🦥"
     RIGHT_BRACKET_REPLACER = "⚡@🦥#⛵"

     # Fixes https://github.com/unslothai/unsloth/issues/1087
     # We must convert all {'s and }'s but keep {__FILE_LOCATION__} intact
-    modelfile = (
-        modelfile.replace("{__FILE_LOCATION__}", FILE_LOCATION_REPLACER)
-        .replace("{__EOS_TOKEN__}", EOS_TOKEN_REPLACER)
-        .replace("{", LEFT_BRACKET_REPLACER)
+    modelfile = modelfile\
+        .replace("{__FILE_LOCATION__}", FILE_LOCATION_REPLACER)\
+        .replace("{__EOS_TOKEN__}", EOS_TOKEN_REPLACER)\
+        .replace("{", LEFT_BRACKET_REPLACER)\
         .replace("}", RIGHT_BRACKET_REPLACER)
-    )

     # Revert {__FILE_LOCATION__} back
-    modelfile = modelfile.replace(
-        FILE_LOCATION_REPLACER, "{__FILE_LOCATION__}"
-    ).replace(EOS_TOKEN_REPLACER, "{__EOS_TOKEN__}")
-
+    modelfile = modelfile\
+        .replace(FILE_LOCATION_REPLACER, "{__FILE_LOCATION__}")\
+        .replace(EOS_TOKEN_REPLACER, "{__EOS_TOKEN__}")
+
     if "__EOS_TOKEN__" in modelfile:
         modelfile = modelfile.format(
-            __FILE_LOCATION__ = model_location,
-            __EOS_TOKEN__ = tokenizer.eos_token,
+            __FILE_LOCATION__ = gguf_location,
+            __EOS_TOKEN__     = tokenizer.eos_token,
         )
     else:
         modelfile = modelfile.format(
-            __FILE_LOCATION__ = model_location,
+            __FILE_LOCATION__ = gguf_location,
         )
-
-    modelfile = modelfile.replace("⚫@✅#🦥", "{").replace("⚡@🦥#⛵", "}").rstrip()
+    pass
+
+    modelfile = modelfile\
+        .replace("⚫@✅#🦥", "{")\
+        .replace("⚡@🦥#⛵", "}")\
+        .rstrip()

     return modelfile
-
-
-def create_ollama_model(username: str, model_name: str, tag: str, modelfile_path: str):
-    try:
-        init_check = subprocess.run(
-            ["curl", "http://localhost:11434"],
-            capture_output = True,
-            text = True,
-            timeout = 3,
-        )
-        if init_check.returncode == 0:
-            print(init_check.stdout.strip())
-        else:
-            print("Ollama Server is not Running")
-    except subprocess.TimeoutExpired:
-        return "Ollama Request Timeout"
-
-    process = subprocess.Popen(
-        [
-            "ollama",
-            "create",
-            f"{username}/{model_name}:{tag}",
-            "-f",
-            f"{modelfile_path}",
-        ],
-        stdout = subprocess.PIPE,
-        stderr = subprocess.STDOUT,
-        text = True,
-        bufsize = 1,
-        universal_newlines = True,
-    )
-
-    for line in iter(process.stdout.readline, ""):
-        print(line, end = "")
-        sys.stdout.flush()
-
-    return_code = process.wait()
-
-    if return_code != 0:
-        print(f"\nMODEL CREATED FAILED WITH RETURN CODE {return_code}")
-    else:
-        print("\nMODEL CREATED SUCCESSFULLY")
-
-
-def push_to_ollama_hub(username: str, model_name: str, tag: str):
-    try:
-        init_check = subprocess.run(
-            ["curl", "http://localhost:11434"],
-            capture_output = True,
-            text = True,
-            timeout = 3,
-        )
-        if init_check.returncode == 0:
-            print(init_check.stdout.strip())
-        else:
-            print("Ollama Server is not Running")
-    except subprocess.TimeoutExpired:
-        return "Ollama Request Timeout"
-
-    process = subprocess.Popen(
-        ["ollama", "push", f"{username}/{model_name}:{tag}"],
-        stdout = subprocess.PIPE,
-        stderr = subprocess.STDOUT,
-        text = True,
-        bufsize = 1,
-        universal_newlines = True,
-    )
-
-    for line in iter(process.stdout.readline, ""):
-        print(line, end = "")
-        sys.stdout.flush()
-
-    return_code = process.wait()
-
-    if return_code != 0:
-        print(f"\nMODEL PUBLISHED FAILED WITH RETURN CODE {return_code}")
-    else:
-        print("\nMODEL PUBLISHED SUCCESSFULLY")
-
-
-def push_to_ollama(tokenizer, gguf_location, username: str, model_name: str, tag: str):
-    model_file = create_ollama_modelfile(
-        tokenizer = tokenizer, gguf_location = gguf_location
-    )
-
-    with open(f"Modelfile_{model_name}", "w", encoding = "utf-8") as f:
-        f.write(model_file)
-        f.close()
-
-    create_ollama_model(
-        username = username,
-        model_name = model_name,
-        tag = tag,
-        modelfile_path = f"Modelfile_{model_name}",
-    )
-
-    push_to_ollama_hub(username = username, model_name = model_name, tag = tag)
-
-    print("Successfully pushed to ollama")
+pass


 def unsloth_save_pretrained_gguf(
     self,
-    save_directory: Union[str, os.PathLike],
-    tokenizer = None,
-    quantization_method = "fast_quantized",
-    first_conversion: str = None,
-    push_to_hub: bool = False,
-    token: Optional[Union[str, bool]] = None,
-    private: Optional[bool] = None,
-    is_main_process: bool = True,
-    state_dict: Optional[dict] = None,
-    save_function: Callable = torch.save,
-    max_shard_size: Union[int, str] = "5GB",
-    safe_serialization: bool = True,
-    variant: Optional[str] = None,
-    save_peft_format: bool = True,
-    tags: List[str] = None,
-    temporary_location: str = "_unsloth_temporary_saved_buffers",
-    maximum_memory_usage: float = 0.85,
+    save_directory : Union[str, os.PathLike],
+    tokenizer = None,
+    quantization_method : str = "fast_quantized",
+    first_conversion : str = None,
+    push_to_hub : bool = False,
+    token : Optional[Union[str, bool]] = None,
+    private : Optional[bool] = None,
+    is_main_process : bool = True,
+    state_dict : Optional[dict] = None,
+    save_function : Callable = torch.save,
+    max_shard_size : Union[int, str] = "5GB",
+    safe_serialization : bool = True,
+    variant : Optional[str] = None,
+    save_peft_format : bool = True,
+    tags : List[str] = None,
+    temporary_location : str = "_unsloth_temporary_saved_buffers",
+    maximum_memory_usage : float = 0.85,
 ):
     """
-    Same as .save_pretrained(...) except 4bit weights are auto
-    converted to float16 then converted to GGUF / llama.cpp format.
+    Same as .save_pretrained(...) except 4bit weights are auto
+    converted to float16 then converted to GGUF / llama.cpp format.

-    Choose for `quantization_method` to be:
-    "not_quantized"  : "Recommended. Fast conversion. Slow inference, big files.",
-    "fast_quantized" : "Recommended. Fast conversion. OK inference, OK file size.",
-    "quantized"      : "Recommended. Slow conversion. Fast inference, small files.",
-    "f32"     : "Not recommended. Retains 100% accuracy, but super slow and memory hungry.",
-    "f16"     : "Fastest conversion + retains 100% accuracy. Slow and memory hungry.",
-    "q8_0"    : "Fast conversion. High resource use, but generally acceptable.",
-    "q4_k_m"  : "Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K",
-    "q5_k_m"  : "Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K",
-    "q2_k"    : "Uses Q4_K for the attention.vw and feed_forward.w2 tensors, Q2_K for the other tensors.",
-    "q3_k_l"  : "Uses Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K",
-    "q3_k_m"  : "Uses Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K",
-    "q3_k_s"  : "Uses Q3_K for all tensors",
-    "q4_0"    : "Original quant method, 4-bit.",
-    "q4_1"    : "Higher accuracy than q4_0 but not as high as q5_0. However has quicker inference than q5 models.",
-    "q4_k_s"  : "Uses Q4_K for all tensors",
-    "q4_k"    : "alias for q4_k_m",
-    "q5_k"    : "alias for q5_k_m",
-    "q5_0"    : "Higher accuracy, higher resource usage and slower inference.",
-    "q5_1"    : "Even higher accuracy, resource usage and slower inference.",
-    "q5_k_s"  : "Uses Q5_K for all tensors",
-    "q6_k"    : "Uses Q8_K for all tensors",
-    "iq2_xxs" : "2.06 bpw quantization",
-    "iq2_xs"  : "2.31 bpw quantization",
-    "iq3_xxs" : "3.06 bpw quantization",
-    "q3_k_xs" : "3-bit extra small quantization",
+    Choose `quantization_method` to be:
+    "not_quantized"  : "Recommended. Fast conversion. Slow inference, big files.",
+    "fast_quantized" : "Recommended. Fast conversion. OK inference, OK file size.",
+    "quantized"      : "Recommended. Slow conversion. Fast inference, small files.",
+    "f32"     : "Not recommended. Retains 100% accuracy, but super slow and memory hungry.",
+    "f16"     : "Fastest conversion + retains 100% accuracy. Slow and memory hungry.",
+    "q8_0"    : "Fast conversion. High resource use, but generally acceptable.",
+    "q4_k_m"  : "Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K",
+    "q5_k_m"  : "Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K",
+    "q2_k"    : "Uses Q4_K for the attention.wv and feed_forward.w2 tensors, Q2_K for the other tensors.",
+    "q3_k_l"  : "Uses Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K",
+    "q3_k_m"  : "Uses Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K",
+    "q3_k_s"  : "Uses Q3_K for all tensors",
+    "q4_0"    : "Original quant method, 4-bit.",
+    "q4_1"    : "Higher accuracy than q4_0 but not as high as q5_0. However has quicker inference than q5 models.",
+    "q4_k_s"  : "Uses Q4_K for all tensors",
+    "q4_k"    : "alias for q4_k_m",
+    "q5_k"    : "alias for q5_k_m",
+    "q5_0"    : "Higher accuracy, higher resource usage and slower inference.",
+    "q5_1"    : "Even higher accuracy, resource usage and slower inference.",
+    "q5_k_s"  : "Uses Q5_K for all tensors",
+    "q6_k"    : "Uses Q8_K for all tensors",
+    "iq2_xxs" : "2.06 bpw quantization",
+    "iq2_xs"  : "2.31 bpw quantization",
+    "iq3_xxs" : "3.06 bpw quantization",
+    "q3_k_xs" : "3-bit extra small quantization",
     """
     if tokenizer is None:
         raise ValueError("Unsloth: Saving to GGUF must have a tokenizer.")

-    try:
-        base_model_name = get_model_name(self.config._name_or_path, load_in_4bit = False)
-        model_name = base_model_name.split("/")[-1]
-    except:
-        base_model_name = self.config._name_or_path
-        model_name = base_model_name.split("/")[-1]
-
-    # Check if push_to_hub is requested
-    if push_to_hub:
-        raise ValueError(
-            "Unsloth: Please use .push_to_hub_gguf() instead of .save_pretrained_gguf() with push_to_hub=True"
-        )
-
-    # Step 1: Check if this is a VLM (Vision-Language Model) and check if gpt-oss
-    is_vlm = False
-    if hasattr(self, "config") and hasattr(self.config, "architectures"):
-        is_vlm = any(
-            x.endswith(("ForConditionalGeneration", "ForVisionText2Text"))
-            for x in self.config.architectures
-        )
-        is_vlm = is_vlm or hasattr(self.config, "vision_config")
-
-    is_processor = is_vlm and isinstance(tokenizer, ProcessorMixin)
-
-    is_gpt_oss = (
-        True
-        if (
-            hasattr(self.config, "architectures")
-            and self.config.architectures == "GptOssForCausalLM"
-        )
-        or (
-            hasattr(self.config, "model_type")
-            and self.config.model_type in ["gpt-oss", "gpt_oss"]
-        )
-        else False
-    )

-    # Step 2: Prepare arguments for model saving
     arguments = dict(locals())
-    arguments["model"] = self
-    arguments["tokenizer"] = tokenizer
-    arguments["push_to_hub"] = False  # We handle upload ourselves
-    # GPT-OSS needs mxfp4 save method
-    if is_gpt_oss:
-        if quantization_method is not None:
-            _qm = (
-                quantization_method
-                if isinstance(quantization_method, (list, tuple))
-                else [quantization_method]
-            )
-            _ignored = [q for q in _qm if str(q).lower() != "mxfp4"]
-            if _ignored:
-                logger.warning_once(
-                    f"Unsloth: GPT-OSS does not support GGUF quantization "
-                    f"(requested: {', '.join(str(q) for q in _ignored)}). "
-                    f"Overriding to MXFP4 format. "
-                    f"Pass quantization_method=None to suppress this warning."
- ) - arguments["save_method"] = "mxfp4" - else: - arguments["save_method"] = "merged_16bit" + arguments["model"] = self + arguments["tokenizer"] = tokenizer + arguments["push_to_hub"] = False # We save ourselves + arguments["save_method"] = "merged_16bit" # Must be 16bit del arguments["self"] del arguments["quantization_method"] del arguments["first_conversion"] - del arguments["is_vlm"] - del arguments["is_gpt_oss"] - del arguments["model_name"] - del arguments["base_model_name"] - del arguments["is_processor"] - # Step 3: Fix tokenizer BOS token if needed - if is_processor: - fix_bos_token, old_chat_template = fix_tokenizer_bos_token(tokenizer.tokenizer) + # Fix tokenizer adding an extra BOS token at the front + fix_bos_token, old_chat_template = fix_tokenizer_bos_token(tokenizer) + + # Non blocking install GGUF first + if not os.path.exists("llama.cpp"): + + if IS_KAGGLE_ENVIRONMENT: + # Kaggle is weird - no blocking installs, and no CUDA? + python_install = install_python_non_blocking(["gguf", "protobuf"]) + python_install.wait() + install_llama_cpp_blocking(use_cuda = False) + new_save_directory, old_username = unsloth_save_model(**arguments) + makefile = None + else: + git_clone = install_llama_cpp_clone_non_blocking() + python_install = install_python_non_blocking(["gguf", "protobuf"]) + git_clone.wait() + makefile = install_llama_cpp_make_non_blocking() + new_save_directory, old_username = unsloth_save_model(**arguments) + python_install.wait() + pass else: - fix_bos_token, old_chat_template = fix_tokenizer_bos_token(tokenizer) - - # Step 4: Save/merge model to 16-bit format - print( - f'Unsloth: Merging model weights to {"mxfp4" if is_gpt_oss else "16-bit"} format...' 
- ) - try: - # Call unsloth_generic_save directly (it's in the same file) - unsloth_generic_save(**arguments) - - except Exception as e: - raise RuntimeError(f"Failed to save/merge model: {e}") - - if is_processor: - tokenizer = tokenizer.tokenizer + try: + new_save_directory, old_username = unsloth_save_model(**arguments) + makefile = None + except: + # Retry by recloning llama.cpp + if IS_KAGGLE_ENVIRONMENT: + # Kaggle is weird - no blocking installs, and no CUDA? + python_install = install_python_non_blocking(["gguf", "protobuf"]) + python_install.wait() + install_llama_cpp_blocking(use_cuda = False) + new_save_directory, old_username = unsloth_save_model(**arguments) + makefile = None + else: + git_clone = install_llama_cpp_clone_non_blocking() + python_install = install_python_non_blocking(["gguf", "protobuf"]) + git_clone.wait() + makefile = install_llama_cpp_make_non_blocking() + new_save_directory, old_username = unsloth_save_model(**arguments) + python_install.wait() + pass + pass + pass # Use old chat template if the bos is removed if fix_bos_token: tokenizer.chat_template = old_chat_template + pass - # Step 6: Clean up memory for _ in range(3): - import gc - gc.collect() - if torch.cuda.is_available(): - torch.cuda.empty_cache() - # Step 7: Get model dtype and type - try: - model_dtype = dtype_from_config(self.config) - model_type = self.config.model_type - if type(model_dtype) is str: - assert model_dtype == "float16" or model_dtype == "bfloat16" - elif model_dtype == torch.float16: - model_dtype = "float16" - elif model_dtype == torch.bfloat16: - model_dtype = "bfloat16" - else: - raise TypeError("Unsloth: Model dtype can only be float16 or bfloat16") - except Exception as e: - # Fallback if dtype_from_config fails - print(f"Unsloth: Could not determine dtype ({e}), defaulting to float16") + model_dtype = self.config.torch_dtype + model_type = self.config.model_type + if type(model_dtype) is str: + assert(model_dtype == "float16" or model_dtype == 
"bfloat16") + elif model_dtype == torch.float16: model_dtype = "float16" + elif model_dtype == torch.bfloat16: + model_dtype = "bfloat16" + else: + raise TypeError("Unsloth: Model dtype can only be float16 or bfloat16") + pass - # Step 8: Convert to GGUF format - print("Unsloth: Converting to GGUF format...") + is_sentencepiece_model = check_if_sentencepiece_model(self) - # Convert quantization_method to list if string - # Use old style quantization_method - quantization_methods = [] - if quantization_method is not None: - # Convert quantization_method to list - if isinstance(quantization_method, list): - pass - elif isinstance(quantization_method, str): - quantization_method = [ - quantization_method, - ] - elif isinstance(quantization_method, tuple): - quantization_method = list(quantization_method) - else: - raise TypeError( - "Unsloth: quantization_method can only be a string or a list of strings" - ) - for i, quant_method in enumerate(quantization_method): - quant_method = quant_method.lower() - if quant_method == "not_quantized": - quant_method = "f16" - elif quant_method == "fast_quantized": - quant_method = "q8_0" - elif quant_method == "quantized": - quant_method = "q4_k_m" - elif quant_method is None: - quant_method = "q8_0" - quantization_methods.append(quant_method.lower()) + # Save to GGUF + all_file_locations, want_full_precision = save_to_gguf( + model_type, model_dtype, is_sentencepiece_model, + new_save_directory, quantization_method, first_conversion, makefile, + ) - try: - all_file_locations, want_full_precision, is_vlm_update = save_to_gguf( - model_name = model_name, - model_type = model_type, - model_dtype = model_dtype, - is_sentencepiece = False, - model_directory = save_directory, - quantization_method = quantization_methods, - first_conversion = first_conversion, - is_vlm = is_vlm, # Pass VLM flag - is_gpt_oss = is_gpt_oss, # Pass gpt_oss Flag - ) - except Exception as e: - if IS_KAGGLE_ENVIRONMENT: - raise RuntimeError( - f"Unsloth: GGUF 
conversion failed in Kaggle environment.\n" - f"This is likely due to the 20GB disk space limit.\n" - f"Try saving to /tmp directory or use a smaller model.\n" - f"Error: {e}" - ) - else: - raise RuntimeError(f"Unsloth: GGUF conversion failed: {e}") - - # Step 9: Create Ollama modelfile - gguf_directory = f"{save_directory}_gguf" + # Save Ollama modelfile + modelfile = create_ollama_modelfile(tokenizer, all_file_locations[0]) modelfile_location = None - ollama_success = False - if all_file_locations: - try: - if is_vlm_update: - modelfile = create_ollama_modelfile(tokenizer, base_model_name, ".") - else: - modelfile = create_ollama_modelfile( - tokenizer, - base_model_name, - os.path.basename(all_file_locations[0]), - ) - if modelfile is not None: - modelfile_location = os.path.join(gguf_directory, "Modelfile") - with open(modelfile_location, "w", encoding = "utf-8") as file: - file.write(modelfile) - ollama_success = True - except Exception as e: - print(f"Warning: Could not create Ollama modelfile: {e}") + if modelfile is not None: + modelfile_location = os.path.join(new_save_directory, "Modelfile") + with open(modelfile_location, "w") as file: + file.write(modelfile) + pass + print(f"Unsloth: Saved Ollama Modelfile to {modelfile_location}") + pass - # Step 10: Show BOS token warning if applicable if fix_bos_token: logger.warning( - "Unsloth: ##### The current model auto adds a BOS token.\n" + "Unsloth: ##### The current model auto adds a BOS token.\n"\ "Unsloth: ##### We removed it in GGUF's chat template for you." 
) + pass - _exe = ".exe" if IS_WINDOWS else "" - if IS_WINDOWS: - _bin_dir = os.path.join(LLAMA_CPP_DEFAULT_DIR, "build", "bin", "Release") - else: - _bin_dir = LLAMA_CPP_DEFAULT_DIR + if push_to_hub: + print("Unsloth: Uploading GGUF to Huggingface Hub...") - if is_vlm_update: - print("\n") - print( - f"Unsloth: example usage for Multimodal LLMs: {os.path.join(_bin_dir, 'llama-mtmd-cli' + _exe)} -m {all_file_locations[0]} --mmproj {all_file_locations[-1]}" - ) - print("Unsloth: load image inside llama.cpp runner: /image test_image.jpg") - print("Unsloth: Prompt model to describe the image") - else: - print( - f'Unsloth: example usage for text only LLMs: {os.path.join(_bin_dir, "llama-cli" + _exe)} --model {all_file_locations[0]} -p "why is the sky blue?"' - ) + # If not needing full precision, skip the first + if not want_full_precision: all_file_locations = all_file_locations[1:] - if ollama_success: - print(f"Unsloth: Saved Ollama Modelfile to {modelfile_location}") - print( - f"Unsloth: convert model to ollama format by running - ollama create model_name -f {modelfile_location}" - ) + for file_location in all_file_locations: + username = upload_to_huggingface( + self, save_directory, token, + "GGUF converted", "gguf", file_location, old_username, private, + ) + link = f"{username}/{new_save_directory.lstrip('/.')}" \ + if username not in new_save_directory else \ + new_save_directory.lstrip('/.') + print(f"Saved GGUF to https://huggingface.co/{link}") + pass - # Return a dict with all needed info for push_to_hub - return { - "save_directory": save_directory, - "gguf_directory": gguf_directory, - "gguf_files": all_file_locations, - "modelfile_location": modelfile_location, - "want_full_precision": want_full_precision, - "is_vlm": is_vlm_update, - "fix_bos_token": fix_bos_token, - } + # Save modelfile + if modelfile_location is not None: + username = upload_to_huggingface( + self, save_directory, token, + "GGUF converted", "gguf", modelfile_location, old_username, 
private, + ) + print(f"Saved Ollama Modelfile to https://huggingface.co/{link}") + pass + pass +pass def unsloth_push_to_hub_gguf( self, - repo_id: str, - tokenizer = None, - quantization_method = "fast_quantized", - first_conversion: str = None, - use_temp_dir: Optional[bool] = None, - commit_message: Optional[str] = "Trained with Unsloth", - private: Optional[bool] = None, - token: Union[bool, str, None] = None, - max_shard_size: Union[int, str, None] = "5GB", - create_pr: bool = False, - safe_serialization: bool = True, - revision: str = None, - commit_description: str = "Upload model trained with Unsloth 2x faster", - tags: Optional[List[str]] = None, - temporary_location: str = "_unsloth_temporary_saved_buffers", - maximum_memory_usage: float = 0.85, - datasets: Optional[List[str]] = None, + repo_id : str, + tokenizer = None, + quantization_method : str = "fast_quantized", + first_conversion : str = None, + use_temp_dir : Optional[bool] = None, + commit_message : Optional[str] = "Trained with Unsloth", + private : Optional[bool] = None, + token : Union[bool, str, None] = None, + max_shard_size : Union[int, str, None] = "5GB", + create_pr : bool = False, + safe_serialization : bool = True, + revision : str = None, + commit_description : str = "Upload model trained with Unsloth 2x faster", + tags : Optional[List[str]] = None, + temporary_location : str = "_unsloth_temporary_saved_buffers", + maximum_memory_usage : float = 0.85, ): """ - Same as .push_to_hub(...) except 4bit weights are auto - converted to float16 then converted to GGUF / llama.cpp format. + Same as .push_to_hub(...) except 4bit weights are auto + converted to float16 then converted to GGUF / llama.cpp format. - Choose for `quantization_method` to be: - "not_quantized" : "Recommended. Fast conversion. Slow inference, big files.", - "fast_quantized" : "Recommended. Fast conversion. OK inference, OK file size.", - "quantized" : "Recommended. Slow conversion. 
Fast inference, small files.", - "f32" : "Not recommended. Retains 100% accuracy, but super slow and memory hungry.", - "f16" : "Fastest conversion + retains 100% accuracy. Slow and memory hungry.", - "q8_0" : "Fast conversion. High resource use, but generally acceptable.", - "q4_k_m" : "Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K", - "q5_k_m" : "Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K", - "q2_k" : "Uses Q4_K for the attention.vw and feed_forward.w2 tensors, Q2_K for the other tensors.", - "q3_k_l" : "Uses Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K", - "q3_k_m" : "Uses Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K", - "q3_k_s" : "Uses Q3_K for all tensors", - "q4_0" : "Original quant method, 4-bit.", - "q4_1" : "Higher accuracy than q4_0 but not as high as q5_0. However has quicker inference than q5 models.", - "q4_k_s" : "Uses Q4_K for all tensors", - "q5_0" : "Higher accuracy, higher resource usage and slower inference.", - "q5_1" : "Even higher accuracy, resource usage and slower inference.", - "q5_k_s" : "Uses Q5_K for all tensors", - "q6_k" : "Uses Q8_K for all tensors", + Choose for `quantization_method` to be: + "not_quantized" : "Recommended. Fast conversion. Slow inference, big files.", + "fast_quantized" : "Recommended. Fast conversion. OK inference, OK file size.", + "quantized" : "Recommended. Slow conversion. Fast inference, small files.", + "f32" : "Not recommended. Retains 100% accuracy, but super slow and memory hungry.", + "f16" : "Fastest conversion + retains 100% accuracy. Slow and memory hungry.", + "q8_0" : "Fast conversion. High resource use, but generally acceptable.", + "q4_k_m" : "Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K", + "q5_k_m" : "Recommended. 
Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K", + "q2_k" : "Uses Q4_K for the attention.vw and feed_forward.w2 tensors, Q2_K for the other tensors.", + "q3_k_l" : "Uses Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K", + "q3_k_m" : "Uses Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K", + "q3_k_s" : "Uses Q3_K for all tensors", + "q4_0" : "Original quant method, 4-bit.", + "q4_1" : "Higher accuracy than q4_0 but not as high as q5_0. However has quicker inference than q5 models.", + "q4_k_s" : "Uses Q4_K for all tensors", + "q4_k" : "alias for q4_k_m", + "q5_k" : "alias for q5_k_m", + "q5_0" : "Higher accuracy, higher resource usage and slower inference.", + "q5_1" : "Even higher accuracy, resource usage and slower inference.", + "q5_k_s" : "Uses Q5_K for all tensors", + "q6_k" : "Uses Q8_K for all tensors", """ if tokenizer is None: raise ValueError("Unsloth: Saving to GGUF must have a tokenizer.") - # Step 1: Determine save directory - model_name = repo_id.split("/")[-1] if "/" in repo_id else repo_id + arguments = dict(locals()) + arguments["model"] = self + arguments["tokenizer"] = tokenizer + arguments["save_directory"] = repo_id + arguments["push_to_hub"] = False # We save ourselves + arguments["save_method"] = "merged_16bit" # Must be 16bit + del arguments["self"] + del arguments["repo_id"] + del arguments["quantization_method"] + del arguments["first_conversion"] - if use_temp_dir or use_temp_dir is None: - import tempfile + # Fix tokenizer adding an extra BOS token at the front + fix_bos_token, old_chat_template = fix_tokenizer_bos_token(tokenizer) - temp_dir = tempfile.mkdtemp(prefix = "unsloth_gguf_") - save_directory = temp_dir - cleanup_temp = True - else: - save_directory = model_name # Use model name, not repo_id - cleanup_temp = False + # Non blocking install GGUF first + if not os.path.exists("llama.cpp"): - # Step 2: Call save_pretrained_gguf to do the 
conversion - print(f"Unsloth: Converting model to GGUF format...") - - try: - # Call save_pretrained_gguf - it returns all the info we need - result = unsloth_save_pretrained_gguf( - self = self, - save_directory = save_directory, - tokenizer = tokenizer, - quantization_method = quantization_method, - first_conversion = first_conversion, - push_to_hub = False, # Never push from here - token = None, # Don't need token for local save - max_shard_size = max_shard_size, - safe_serialization = safe_serialization, - temporary_location = temporary_location, - maximum_memory_usage = maximum_memory_usage, - ) - - # Extract results - all_file_locations = result["gguf_files"] - modelfile_location = result["modelfile_location"] - want_full_precision = result["want_full_precision"] - is_vlm = result["is_vlm"] - fix_bos_token = result["fix_bos_token"] - actual_save_directory = result["save_directory"] - - except Exception as e: - if cleanup_temp: - import shutil - - for d in [save_directory, f"{save_directory}_gguf"]: - try: - shutil.rmtree(d) - except: - pass - raise RuntimeError(f"Failed to convert model to GGUF: {e}") - - # Step 3: Upload to HuggingFace Hub - print("Unsloth: Uploading GGUF to Huggingface Hub...") - - try: - from huggingface_hub import HfApi - - api = HfApi(token = token) - - # Get full repo id - if "/" not in repo_id: - username = api.whoami()["name"] - full_repo_id = f"{username}/{repo_id}" + if IS_KAGGLE_ENVIRONMENT: + # Kaggle is weird - no blocking installs, and no CUDA? 
+ python_install = install_python_non_blocking(["gguf", "protobuf"]) + python_install.wait() + install_llama_cpp_blocking(use_cuda = False) + new_save_directory, old_username = unsloth_save_model(**arguments) + makefile = None else: - full_repo_id = repo_id - - # Create repo - api.create_repo( - repo_id = full_repo_id, - repo_type = "model", - private = private, - exist_ok = True, - ) - - # Upload GGUF files - for file_location in all_file_locations: - original_name = os.path.basename(file_location) - # Replace temp directory name with proper model name - if cleanup_temp and "unsloth_gguf_" in original_name: - # Extract the quantization part (e.g., ".Q8_0.gguf" or ".Q8_0-mmproj.gguf") - quant_suffix = ( - original_name.split(".", 1)[1] - if "." in original_name - else original_name - ) - proper_name = f"{model_name}.{quant_suffix}" - else: - proper_name = original_name.replace( - os.path.basename(save_directory), model_name - ) - - print(f"Uploading {proper_name}...") - - api.upload_file( - path_or_fileobj = file_location, - path_in_repo = proper_name, - repo_id = full_repo_id, - repo_type = "model", - commit_message = commit_message, - commit_description = commit_description, - create_pr = create_pr, - revision = revision, - ) - - # Upload config.json if exists - config_path = os.path.join(actual_save_directory, "config.json") - if os.path.exists(config_path): - print("Uploading config.json...") - api.upload_file( - path_or_fileobj = config_path, - path_in_repo = "config.json", - repo_id = full_repo_id, - repo_type = "model", - commit_message = f"{commit_message} - config", - create_pr = create_pr, - revision = revision, - ) - - # Upload Modelfile if exists - if modelfile_location and os.path.exists(modelfile_location): - print("Uploading Ollama Modelfile...") - api.upload_file( - path_or_fileobj = modelfile_location, - path_in_repo = "Modelfile", - repo_id = full_repo_id, - repo_type = "model", - commit_message = f"{commit_message} - Ollama Modelfile", - 
create_pr = create_pr, - revision = revision, - ) - - # Create and upload README - readme_content = f"""--- -tags: -- gguf -- llama.cpp -- unsloth -{"- vision-language-model" if is_vlm else ""} ---- - -# {repo_id.split("/")[-1]} : GGUF - -This model was finetuned and converted to GGUF format using [Unsloth](https://github.com/unslothai/unsloth). - -**Example usage**: -- For text only LLMs: `llama-cli -hf {repo_id} --jinja` -- For multimodal models: `llama-mtmd-cli -hf {repo_id} --jinja` - -## Available Model files: -""" - for file in all_file_locations: - # Fix filename in README too - original_name = os.path.basename(file) - if cleanup_temp and "unsloth_gguf_" in original_name: - quant_suffix = ( - original_name.split(".", 1)[1] - if "." in original_name - else original_name - ) - proper_name = f"{model_name}.{quant_suffix}" - else: - proper_name = original_name.replace( - os.path.basename(save_directory), model_name - ) - readme_content += f"- `{proper_name}`\n" - - # Special note for VLM with Modelfile - if is_vlm and modelfile_location: - readme_content += "\n## ⚠️ Ollama Note for Vision Models\n" - readme_content += "**Important:** Ollama currently does not support separate mmproj files for vision models.\n\n" - readme_content += "To create an Ollama model from this vision model:\n" - readme_content += "1. Place the `Modelfile` in the same directory as the finetuned bf16 merged model\n" - readme_content += "3. 
Run: `ollama create model_name -f ./Modelfile`\n" - readme_content += " (Replace `model_name` with your desired name)\n\n" - readme_content += ( - "This will create a unified bf16 model that Ollama can use.\n" - ) - elif modelfile_location: - readme_content += "\n## Ollama\n" - readme_content += "An Ollama Modelfile is included for easy deployment.\n" - - if fix_bos_token: - readme_content += "\n## Note\n" - readme_content += ( - "The model's BOS token behavior was adjusted for GGUF compatibility.\n" - ) - - readme_content += ( - "This was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth)\n" - '[](https://github.com/unslothai/unsloth)\n' - ) - - readme_path = os.path.join(actual_save_directory, "README.md") - with open(readme_path, "w") as f: - f.write(readme_content) - - api.upload_file( - path_or_fileobj = readme_path, - path_in_repo = "README.md", - repo_id = full_repo_id, - repo_type = "model", - commit_message = "Add README", - create_pr = create_pr, - revision = revision, - ) - - print( - f"Unsloth: Successfully uploaded GGUF to https://huggingface.co/{full_repo_id}" - ) - - # Add tags - if tags is None: - tags = [] - tags.extend(["gguf", "llama-cpp", "unsloth"]) - if is_vlm: - tags.append("vision-language-model") - + git_clone = install_llama_cpp_clone_non_blocking() + python_install = install_python_non_blocking(["gguf", "protobuf"]) + git_clone.wait() + makefile = install_llama_cpp_make_non_blocking() + new_save_directory, old_username = unsloth_save_model(**arguments) + python_install.wait() + pass + else: try: - api.add_tags( - repo_id = full_repo_id, - tags = tags, - repo_type = "model", - ) + new_save_directory, old_username = unsloth_save_model(**arguments) + makefile = None except: + # Retry by recloning llama.cpp + if IS_KAGGLE_ENVIRONMENT: + # Kaggle is weird - no blocking installs, and no CUDA? 
+ python_install = install_python_non_blocking(["gguf", "protobuf"]) + python_install.wait() + install_llama_cpp_blocking(use_cuda = False) + new_save_directory, old_username = unsloth_save_model(**arguments) + makefile = None + else: + git_clone = install_llama_cpp_clone_non_blocking() + python_install = install_python_non_blocking(["gguf", "protobuf"]) + git_clone.wait() + makefile = install_llama_cpp_make_non_blocking() + new_save_directory, old_username = unsloth_save_model(**arguments) + python_install.wait() pass + pass + pass - if datasets: - try: - from huggingface_hub import metadata_update + # Use old chat template if the bos is removed + if fix_bos_token: + tokenizer.chat_template = old_chat_template + pass - metadata_update( - full_repo_id, {"datasets": datasets}, overwrite = True, token = token - ) - except Exception as e: - logger.warning_once( - f"Unsloth: Could not update datasets metadata for {full_repo_id}: {e}" - ) + for _ in range(3): + gc.collect() - except Exception as e: - raise RuntimeError(f"Failed to upload to Hugging Face Hub: {e}") + model_dtype = self.config.torch_dtype + model_type = self.config.model_type + if type(model_dtype) is str: + assert(model_dtype == "float16" or model_dtype == "bfloat16") + elif model_dtype == torch.float16: + model_dtype = "float16" + elif model_dtype == torch.bfloat16: + model_dtype = "bfloat16" + else: + raise TypeError("Unsloth: Model dtype can only be float16 or bfloat16") + pass - finally: - # Clean up temporary directory - if cleanup_temp: - print("Unsloth: Cleaning up temporary files...") - import shutil + is_sentencepiece_model = check_if_sentencepiece_model(self) - for d in [save_directory, f"{save_directory}_gguf"]: - if os.path.exists(d): - try: - shutil.rmtree(d) - except: - pass + # Save to GGUF + all_file_locations, want_full_precision = save_to_gguf( + model_type, model_dtype, is_sentencepiece_model, + new_save_directory, quantization_method, first_conversion, makefile, + ) - return 
full_repo_id + # Save Ollama modelfile + modelfile = create_ollama_modelfile(tokenizer, all_file_locations[0]) + modelfile_location = None + if modelfile is not None: + modelfile_location = os.path.join(new_save_directory, "Modelfile") + with open(modelfile_location, "w") as file: + file.write(modelfile) + pass + print(f"Unsloth: Saved Ollama Modelfile to {modelfile_location}") + pass + # If not needing full precision, skip the first + if not want_full_precision: all_file_locations = all_file_locations[1:] + + for file_location in all_file_locations: + print("Unsloth: Uploading GGUF to Huggingface Hub...") + username = upload_to_huggingface( + self, repo_id, token, + "GGUF converted", "gguf", file_location, old_username, private, + ) + link = f"{username}/{new_save_directory.lstrip('/.')}" \ + if username not in new_save_directory else \ + new_save_directory.lstrip('/.') + + print(f"Saved GGUF to https://huggingface.co/{link}") + pass + + # Save modelfile + if modelfile_location is not None: + username = upload_to_huggingface( + self, repo_id, token, + "GGUF converted", "gguf", modelfile_location, old_username, private, + ) + print(f"Saved Ollama Modelfile to https://huggingface.co/{link}") + pass + + if fix_bos_token: + logger.warning( + "Unsloth: ##### The current model auto adds a BOS token.\n"\ + "Unsloth: ##### We removed it in GGUF's chat template for you." 
+ ) + pass +pass # Corrected function to save LoRA to a custom directory def save_lora_to_custom_dir(model, tokenizer, save_directory): # Create the custom directory if it doesn't exist - os.makedirs(save_directory, exist_ok = True) + os.makedirs(save_directory, exist_ok=True) # Call the unsloth_save_model function with the custom directory unsloth_save_model( model, tokenizer, - save_directory = save_directory, - save_method = "lora", - push_to_hub = False, + save_directory=save_directory, + save_method="lora", + push_to_hub=False, ) - # Corrected method within the model class to convert LoRA to GGML and push to Hugging Face Hub def unsloth_convert_lora_to_ggml_and_push_to_hub( self, @@ -2477,7 +1932,7 @@ def unsloth_convert_lora_to_ggml_and_push_to_hub( if IS_KAGGLE_ENVIRONMENT: python_install = install_python_non_blocking(["protobuf"]) python_install.wait() - install_llama_cpp_blocking(use_cuda = False) + install_llama_cpp_blocking(use_cuda=False) makefile = None else: git_clone = install_llama_cpp_clone_non_blocking() @@ -2497,26 +1952,17 @@ def unsloth_convert_lora_to_ggml_and_push_to_hub( model_type = self.config.model_type output_file = os.path.join(lora_directory_push, "ggml-adapter-model.bin") - print( - f"Unsloth: Converting auto-saved LoRA adapters at {lora_directory_push} to GGML format." 
- ) + print(f"Unsloth: Converting auto-saved LoRA adapters at {lora_directory_push} to GGML format.") print(f"The output file will be {output_file}") command = f"python3 llama.cpp/convert-lora-to-ggml.py {lora_directory_push} {output_file} llama" try: - with subprocess.Popen( - command, - shell = True, - stdout = subprocess.PIPE, - stderr = subprocess.PIPE, - bufsize = 1, - universal_newlines = True, - ) as sp: + with subprocess.Popen(command, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE, bufsize=1, universal_newlines=True) as sp: for line in sp.stdout: - print(line, end = "", flush = True) + print(line, end="", flush=True) for line in sp.stderr: - print(line, end = "", flush = True) + print(line, end="", flush=True) sp.wait() if sp.returncode != 0: raise subprocess.CalledProcessError(sp.returncode, command) @@ -2528,27 +1974,18 @@ def unsloth_convert_lora_to_ggml_and_push_to_hub( print("Unsloth: Uploading GGML file to Hugging Face Hub...") username = upload_to_huggingface( - self, - repo_id, - token, - "GGML converted LoRA", - "ggml", - output_file, - None, - private, + self, repo_id, token, + "GGML converted LoRA", "ggml", output_file, None, private, ) link = f"{repo_id.lstrip('/')}" print("Unsloth: Done.") print(f"Converted LoRA to GGML and uploaded to https://huggingface.co/{link}") - print( - "\nThis GGML making function was made by Maheswar. Ping him @Maheswar on the Unsloth Discord or on HuggingFace (@mahiatlinux) if you like this!" - ) - + print("\nThis GGML making function was made by Maheswar. 
Ping him @Maheswar on the Unsloth Discord or on HuggingFace (@mahiatlinux) if you like this!") def unsloth_convert_lora_to_ggml_and_save_locally( self, - save_directory: str, # Added parameter for the folder name - tokenizer, + save_directory: str, # Added parameter for the folder name + tokenizer, temporary_location: str = "_unsloth_temporary_saved_buffers", maximum_memory_usage: float = 0.85, ): @@ -2556,7 +1993,7 @@ def unsloth_convert_lora_to_ggml_and_save_locally( if IS_KAGGLE_ENVIRONMENT: python_install = install_python_non_blocking(["protobuf"]) python_install.wait() - install_llama_cpp_blocking(use_cuda = False) + install_llama_cpp_blocking(use_cuda=False) makefile = None else: git_clone = install_llama_cpp_clone_non_blocking() @@ -2576,26 +2013,17 @@ def unsloth_convert_lora_to_ggml_and_save_locally( model_type = self.config.model_type output_file = os.path.join(save_directory, "ggml-adapter-model.bin") - print( - f"Unsloth: Converting auto-saved LoRA adapters at {save_directory} to GGML format." - ) + print(f"Unsloth: Converting auto-saved LoRA adapters at {save_directory} to GGML format.") print(f"The output file will be {output_file}") command = f"python3 llama.cpp/convert-lora-to-ggml.py {save_directory} {output_file} llama" try: - with subprocess.Popen( - command, - shell = True, - stdout = subprocess.PIPE, - stderr = subprocess.PIPE, - bufsize = 1, - universal_newlines = True, - ) as sp: + with subprocess.Popen(command, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE, bufsize=1, universal_newlines=True) as sp: for line in sp.stdout: - print(line, end = "", flush = True) + print(line, end="", flush=True) for line in sp.stderr: - print(line, end = "", flush = True) + print(line, end="", flush=True) sp.wait() if sp.returncode != 0: raise subprocess.CalledProcessError(sp.returncode, command) @@ -2604,211 +2032,92 @@ def unsloth_convert_lora_to_ggml_and_save_locally( return print("Unsloth: Done.") print(f"Unsloth: Conversion completed! 
Output file: {output_file}") - print( - "\nThis GGML making function was made by Maheswar. Ping him @Maheswar on the Unsloth Discord or on HuggingFace (@mahiatlinux) if you like this!" - ) + print("\nThis GGML making function was made by Maheswar. Ping him @Maheswar on the Unsloth Discord or on HuggingFace (@mahiatlinux) if you like this!") +pass from .models.loader_utils import get_model_name -from unsloth_zoo.saving_utils import ( - merge_and_overwrite_lora, - prepare_saving, -) -from unsloth_zoo.llama_cpp import ( - install_llama_cpp, - convert_to_gguf as _convert_to_gguf, -) - - -@torch.inference_mode -def save_to_gguf_generic( - model, - save_directory, - tokenizer, - quantization_method = None, - quantization_type = "Q8_0", - repo_id = None, - token = None, -): - if token is None and repo_id is not None: - token = get_token() - if repo_id is not None and token is None: - raise RuntimeError("Unsloth: Please specify a token for uploading!") - - if not os.path.exists(os.path.join("llama.cpp", "unsloth_convert_hf_to_gguf.py")): - install_llama_cpp(just_clone_repo = True) - - # Use old style quantization_method - new_quantization_methods = [] - if quantization_method is not None: - # Convert quantization_method to list - if isinstance(quantization_method, list): - pass - elif isinstance(quantization_method, str): - quantization_method = [ - quantization_method, - ] - elif isinstance(quantization_method, tuple): - quantization_method = list(quantization_method) - else: - raise TypeError( - "Unsloth: quantization_method can only be a string or a list of strings" - ) - for i, quant_method in enumerate(quantization_method): - quant_method = quant_method.lower() - if quant_method == "not_quantized": - quant_method = "f16" - elif quant_method == "fast_quantized": - quant_method = "q8_0" - elif quant_method == "quantized": - quant_method = "q4_k_m" - elif quant_method is None: - quant_method = "q8_0" - new_quantization_methods.append(quant_method.lower()) - else: - 
new_quantization_methods.append(quantization_type.lower()) - # Check if wrong method - for quant_method in new_quantization_methods: - if quant_method not in ALLOWED_QUANTS.keys(): - error = f"Unsloth: Quant method = [{quant_method}] not supported. Choose from below:\n" - for key, value in ALLOWED_QUANTS.items(): - error += f"[{key}] => {value}\n" - raise RuntimeError(error) - - # Go through all types and save individually - somewhat inefficient - # since we save F16 / BF16 multiple times - for quantization_type in new_quantization_methods: - metadata = _convert_to_gguf( - save_directory, - print_output = True, - quantization_type = quantization_type, - ) - if repo_id is not None: - prepare_saving( - model, - repo_id, - push_to_hub = True, - max_shard_size = "50GB", - private = True, - token = token, - ) - - from huggingface_hub import HfApi - - api = HfApi(token = token) - api.upload_folder( - folder_path = save_directory, - repo_id = repo_id, - repo_type = "model", - allow_patterns = ["*.gguf"], - ) - return metadata - +from unsloth_zoo.saving_utils import merge_and_overwrite_lora @torch.inference_mode def unsloth_generic_save( model, tokenizer, - save_directory: Union[str, os.PathLike] = "unsloth_finetuned_merge", - save_method: str = "lora", # ["lora", "merged_16bit", "merged_4bit"] - push_to_hub: bool = False, - token: Optional[Union[str, bool]] = None, - is_main_process: bool = True, - state_dict: Optional[dict] = None, - save_function: Callable = torch.save, - max_shard_size: Union[int, str] = "5GB", - safe_serialization: bool = True, - variant: Optional[str] = None, - save_peft_format: bool = True, + save_directory : Union[str, os.PathLike] = "unsloth_finetuned_merge", + save_method : str = "lora", # ["lora", "merged_16bit", "merged_4bit"] + push_to_hub : bool = False, + token : Optional[Union[str, bool]] = None, + is_main_process : bool = True, + state_dict : Optional[dict] = None, + save_function : Callable = torch.save, + max_shard_size : Union[int, str] 
= "5GB", + safe_serialization : bool = True, + variant : Optional[str] = None, + save_peft_format : bool = True, + # Push to hub - use_temp_dir: Optional[bool] = None, - commit_message: Optional[str] = "Trained with Unsloth", - private: Optional[bool] = None, - create_pr: bool = False, - revision: str = None, - commit_description: str = "Upload model trained with Unsloth 2x faster", - tags: List[str] = None, + use_temp_dir : Optional[bool] = None, + commit_message : Optional[str] = "Trained with Unsloth", + private : Optional[bool] = None, + create_pr : bool = False, + revision : str = None, + commit_description : str = "Upload model trained with Unsloth 2x faster", + tags : List[str] = None, + # Our functions - temporary_location: str = "_unsloth_temporary_saved_buffers", - maximum_memory_usage: float = 0.9, - datasets: Optional[List[str]] = None, + temporary_location : str = "_unsloth_temporary_saved_buffers", + maximum_memory_usage : float = 0.9, ): - if token is None and push_to_hub: - token = get_token() - - if save_method == "merged_4bit": - raise RuntimeError( - "Unsloth: Merging into 4bit will cause your model to lose accuracy if you plan\n" - "to merge to GGUF or others later on. I suggest you to do this as a final step\n" - "if you're planning to do multiple saves.\n" - "If you are certain, change `save_method` to `merged_4bit_forced`." 
- ) - elif save_method == "merged_4bit_forced": - save_method = "merged_4bit" - + if token is None and push_to_hub: token = get_token() merge_and_overwrite_lora( get_model_name, - model = model, - tokenizer = tokenizer, - save_directory = save_directory, - push_to_hub = push_to_hub, - private = private, - token = token, - save_method = save_method, - output_dtype = None, - low_disk_space_usage = True, - use_temp_file = False, + model = model, + tokenizer = tokenizer, + save_directory = save_directory, + push_to_hub = push_to_hub, + private = private, + token = token, + output_dtype = None, + low_disk_space_usage = False, + use_temp_file = False, ) - - if push_to_hub and datasets: - try: - from huggingface_hub import metadata_update - - save_dir, _ = _determine_username(save_directory, None, token) - metadata_update( - save_dir, {"datasets": datasets}, overwrite = True, token = token - ) - except Exception as e: - logger.warning_once( - f"Unsloth: Could not update datasets metadata for {save_directory}: {e}" - ) - return +pass def unsloth_generic_save_pretrained_merged( self, - save_directory: Union[str, os.PathLike], - tokenizer = None, - save_method: str = "merged_16bit", # ["lora", "merged_16bit", "merged_4bit"] - push_to_hub: bool = False, - token: Optional[Union[str, bool]] = None, - is_main_process: bool = True, - state_dict: Optional[dict] = None, - save_function: Callable = torch.save, - max_shard_size: Union[int, str] = "5GB", - safe_serialization: bool = True, - variant: Optional[str] = None, - save_peft_format: bool = True, - tags: List[str] = None, - temporary_location: str = "_unsloth_temporary_saved_buffers", - maximum_memory_usage: float = 0.75, - datasets: Optional[List[str]] = None, -): + save_directory : Union[str, os.PathLike], + tokenizer = None, + save_method : str = "merged_16bit", # ["lora", "merged_16bit", "merged_4bit"] + push_to_hub : bool = False, + token : Optional[Union[str, bool]] = None, + is_main_process : bool = True, + state_dict : 
Optional[dict] = None,
+ save_function : Callable = torch.save,
+ max_shard_size : Union[int, str] = "5GB",
+ safe_serialization : bool = True,
+ variant : Optional[str] = None,
+ save_peft_format : bool = True,
+ tags : List[str] = None,
+ temporary_location : str = "_unsloth_temporary_saved_buffers",
+ maximum_memory_usage : float = 0.75,
+):
    """
- Same as .push_to_hub(...) except 4bit weights are auto
- converted to float16 with as few overhead as possible.
+ Same as .push_to_hub(...) except 4bit weights are auto
+ converted to float16 with as little overhead as possible.
- Choose for `save_method` to be either:
- 1. `16bit`: Merge LoRA into float16 weights. Useful for GGUF / llama.cpp.
- 2. `4bit`: Merge LoRA into int4 weights. Useful for DPO / HF inference.
- 3. `lora`: Save LoRA adapters with no merging. Useful for HF inference.
+ Choose `save_method` to be one of:
+ 1. `16bit`: Merge LoRA into float16 weights. Useful for GGUF / llama.cpp.
+ 2. `4bit`: Merge LoRA into int4 weights. Useful for DPO / HF inference.
+ 3. `lora`: Save LoRA adapters with no merging. Useful for HF inference.
""" if tokenizer is None: logger.warning_once( - "Unsloth: You're not saving a tokenizer as well?\n" + "Unsloth: You're not saving a tokenizer as well?\n"\ "You can do it separately via `tokenizer.save_pretrained(...)`" ) + pass arguments = dict(locals()) arguments["model"] = self @@ -2816,266 +2125,58 @@ def unsloth_generic_save_pretrained_merged( unsloth_generic_save(**arguments) for _ in range(3): gc.collect() +pass def unsloth_generic_push_to_hub_merged( self, - repo_id: str, - tokenizer = None, - save_method: str = "merged_16bit", # ["lora", "merged_16bit", "merged_4bit"] - use_temp_dir: Optional[bool] = None, - commit_message: Optional[str] = "Trained with Unsloth", - private: Optional[bool] = None, - token: Union[bool, str, None] = None, - max_shard_size: Union[int, str, None] = "5GB", - create_pr: bool = False, - safe_serialization: bool = True, - revision: str = None, - commit_description: str = "Upload model trained with Unsloth 2x faster", - tags: Optional[List[str]] = None, - temporary_location: str = "_unsloth_temporary_saved_buffers", - maximum_memory_usage: float = 0.75, - datasets: Optional[List[str]] = None, + repo_id : str, + tokenizer = None, + save_method : str = "merged_16bit", # ["lora", "merged_16bit", "merged_4bit"] + use_temp_dir : Optional[bool] = None, + commit_message : Optional[str] = "Trained with Unsloth", + private : Optional[bool] = None, + token : Union[bool, str, None] = None, + max_shard_size : Union[int, str, None] = "5GB", + create_pr : bool = False, + safe_serialization : bool = True, + revision : str = None, + commit_description : str = "Upload model trained with Unsloth 2x faster", + tags : Optional[List[str]] = None, + temporary_location : str = "_unsloth_temporary_saved_buffers", + maximum_memory_usage : float = 0.75, ): """ - Same as .push_to_hub(...) except 4bit weights are auto - converted to float16 with as few overhead as possible. + Same as .push_to_hub(...) 
except 4bit weights are auto
+ converted to float16 with as little overhead as possible.
- Choose for `save_method` to be either:
- 1. `16bit`: Merge LoRA into float16 weights. Useful for GGUF / llama.cpp.
- 2. `4bit`: Merge LoRA into int4 weights. Useful for DPO / HF inference.
- 3. `lora`: Save LoRA adapters with no merging. Useful for HF inference.
+ Choose `save_method` to be one of:
+ 1. `16bit`: Merge LoRA into float16 weights. Useful for GGUF / llama.cpp.
+ 2. `4bit`: Merge LoRA into int4 weights. Useful for DPO / HF inference.
+ 3. `lora`: Save LoRA adapters with no merging. Useful for HF inference.
    """
    if tokenizer is None:
        logger.warning_once(
- "Unsloth: You're not saving a tokenizer as well?\n"
+ "Unsloth: You're not saving a tokenizer as well?\n"
            "You can do it separately via `tokenizer.push_to_hub(...)`"
        )
+ pass
    arguments = dict(locals())
- arguments["model"] = self
+ arguments["model"] = self
    arguments["save_directory"] = repo_id
- arguments["push_to_hub"] = True
+ arguments["push_to_hub"] = True
    del arguments["self"]
    del arguments["repo_id"]
    unsloth_generic_save(**arguments)
    for _ in range(3):
        gc.collect()
-
-
-def _unsloth_save_torchao_with_attached_config(
-    model,
-    save_directory: Union[str, os.PathLike],
-    tokenizer,
-    push_to_hub: bool = False,
-    token: Optional[Union[str, bool]] = None,
-):
-    """Save a QAT-trained model by converting fake-quantized weights to real quantized weights."""
-    # Convert QAT fake-quantized weights to real quantized weights
-    _convert_torchao_model(model)
-    # PEFT models also might come here, so parse it
-    if isinstance(model, PeftModelForCausalLM):
-        _unsloth_save_torchao_with_given_config(
-            model = model,
-            save_directory = save_directory,
-            tokenizer = tokenizer,
-            torchao_config = model.config.quantization_config,
-            push_to_hub = push_to_hub,
-            token = token,
-        )
-        return
-
-    # TorchAO does not support safe_serialization reliably
-    safe_serialization = False
-
-    if push_to_hub:
-        model.push_to_hub(
save_directory, safe_serialization = safe_serialization, token = token - ) - tokenizer.push_to_hub(save_directory, token = token) - else: - model.save_pretrained(save_directory, safe_serialization = safe_serialization) - tokenizer.save_pretrained(save_directory) - - -def _unsloth_save_torchao_with_given_config( - model, - save_directory: Union[str, os.PathLike], - tokenizer, - torchao_config, - push_to_hub: bool = False, - token: Optional[Union[str, bool]] = None, -): - """Quantizes the model with torchao and saves a torchao quantized checkpoint - - Args - `save_directory`: local folder path or huggingface hub ID when `push_to_hub` is set to True, e.g. `my_model` - `torchao_config` (TorchAOBaseConfig): configuration for torchao quantization, full list: https://docs.pytorch.org/ao/main/api_ref_quantization.html#inference-apis-for-quantize - `push_to_hub` (bool): whether to push the checkpoint to huggingface hub or save locally - """ - - if push_to_hub: - assert token is not None, "Unsloth: Please specify a token for uploading!" - - assert ( - torchao_config is not None - ), "Unsloth: Please specify a torchao_config for post-training quantization!" 
- - # first merge the lora weights - arguments = dict(locals()) - arguments["push_to_hub"] = False # We save ourselves - arguments["save_method"] = "merged_16bit" # Must be 16bit - del arguments["torchao_config"] - - if not isinstance(model, PeftModelForCausalLM) and not isinstance(model, PeftModel): - model.save_pretrained(save_directory) - tokenizer.save_pretrained(save_directory) - else: - unsloth_generic_save(**arguments) - - for _ in range(3): - gc.collect() - - from transformers import ( - AutoModelForCausalLM, - AutoTokenizer, - TorchAoConfig, - AutoModelForImageTextToText, - AutoProcessor, - ) - from torchao import quantize_ - - if isinstance(torchao_config, TorchAoConfig): - quantization_config = torchao_config - else: - quantization_config = TorchAoConfig(quant_type = torchao_config) - - # Determine if this is a VLM - is_vlm = False - if hasattr(model, "config") and hasattr(model.config, "architectures"): - is_vlm = any( - x.endswith(("ForConditionalGeneration", "ForVisionText2Text")) - for x in model.config.architectures - ) - is_vlm = is_vlm or hasattr(model.config, "vision_config") - auto_model = AutoModelForImageTextToText if is_vlm else AutoModelForCausalLM - auto_processor = AutoProcessor if is_vlm else AutoTokenizer - - tokenizer = auto_processor.from_pretrained(save_directory) - - # TorchAO must only use bfloat16 for loading (float16 fails) - if HAS_TORCH_DTYPE: - kwargs = {"torch_dtype": torch.bfloat16} - else: - kwargs = {"dtype": torch.bfloat16} - - # Reload with quantization applied - quantized_model = auto_model.from_pretrained( - save_directory, - device_map = "auto", - quantization_config = quantization_config, - **kwargs, - ) - - torchao_save_directory = save_directory + "-torchao" - - # TorchAO does not support safe_serialization right now 0.14.0 seems broken! 
- safe_serialization = Version(importlib_version("torchao")) > Version("0.14.0") - safe_serialization = False - - if push_to_hub: - quantized_model.push_to_hub( - torchao_save_directory, safe_serialization = safe_serialization, token = token - ) - tokenizer.push_to_hub(torchao_save_directory, token = token) - else: - quantized_model.save_pretrained( - torchao_save_directory, safe_serialization = safe_serialization - ) - tokenizer.save_pretrained(torchao_save_directory) - - # Clean up the intermediate unquantized model - if os.path.exists(save_directory): - try: - shutil.rmtree(save_directory) - except: - pass - - -def unsloth_save_pretrained_torchao( - self, - save_directory: Union[str, os.PathLike], - tokenizer = None, - torchao_config = None, - push_to_hub: bool = False, - token: Optional[Union[str, bool]] = None, -): - """Saves a torchao quantized model checkpoint. - - This function handles two mutually exclusive workflows: - - 1. **QAT (Quantization-Aware Training)**: If the model was trained with `qat_scheme` - parameter, do NOT pass `torchao_config`. The function will convert the QAT - fake-quantized weights to real quantized weights and save directly. - - 2. **PTQ (Post-Training Quantization)**: If you want to apply quantization to a - regular model, pass a `torchao_config`. The model must NOT have been trained - with `qat_scheme`. - - Args: - `save_directory`: local folder path or huggingface hub ID when `push_to_hub` is True - `tokenizer`: the tokenizer to save alongside the model - `torchao_config` (TorchAOBaseConfig): configuration for torchao quantization. - Required for PTQ, must be None for QAT models. 
- Options: https://docs.pytorch.org/ao/main/api_ref_quantization.html#inference-apis-for-quantize - `push_to_hub` (bool): whether to push to huggingface hub or save locally - `token`: HuggingFace token for pushing to hub - """ - if token is None and push_to_hub: - token = get_token() - - has_qat_config = ( - hasattr(self, "_torchao_config") and self._torchao_config is not None - ) - - if torchao_config is not None: - # PTQ path: user provided a config, model must NOT have QAT config unless PEFT - assert not has_qat_config, ( - "Unsloth: You passed `torchao_config` but this model was trained with `qat_scheme`. " - "For QAT models, do not pass `torchao_config` - the quantization config is already " - "attached to the model from training." - ) - _unsloth_save_torchao_with_given_config( - model = self, - save_directory = save_directory, - tokenizer = tokenizer, - torchao_config = torchao_config, - push_to_hub = push_to_hub, - token = token, - ) - else: - # QAT path: no config provided, model must have QAT config - assert has_qat_config, ( - "Unsloth: No `torchao_config` provided and model was not trained with `qat_scheme`. " - "Either train with `qat_scheme` parameter, or provide a `torchao_config` for " - "post-training quantization." - ) - _unsloth_save_torchao_with_attached_config( - model = self, - save_directory = save_directory, - tokenizer = tokenizer, - push_to_hub = push_to_hub, - token = token, - ) - - for _ in range(3): - gc.collect() +pass def not_implemented_save(*args, **kwargs): - raise NotImplementedError( - "Unsloth: Sorry GGUF is currently not supported for vision models!" 
- ) + raise NotImplementedError("Unsloth: Sorry GGUF is currently not supported for vision models!") +pass def patch_saving_functions(model, vision = False): @@ -3088,6 +2189,7 @@ def patch_saving_functions(model, vision = False): original_push_to_hub = model.original_push_to_hub else: original_push_to_hub = model.push_to_hub + pass signature = str(inspect.signature(original_push_to_hub)).replace("NoneType", "None") signature = signature[1:] @@ -3152,63 +2254,60 @@ def patch_saving_functions(model, vision = False): original_model = model while True: - # Check if push_to_hub exists before accessing its __name__ - if ( - hasattr(original_model, "push_to_hub") - and original_model.push_to_hub.__name__ != "unsloth_push_to_hub" - ): - original_model.original_push_to_hub = original_model.push_to_hub - original_model.push_to_hub = types.MethodType( - unsloth_push_to_hub, original_model - ) - if hasattr(original_model, "add_model_tags"): - original_model.add_model_tags( - [ - "unsloth", - ] - ) - if hasattr(original_model, "model"): - original_model = original_model.model - else: - break + if original_model.push_to_hub.__name__ != "unsloth_push_to_hub": + original_model.original_push_to_hub = original_model.push_to_hub + original_model.push_to_hub = types.MethodType(unsloth_push_to_hub, original_model) + if hasattr(original_model, "add_model_tags"): + original_model.add_model_tags(["unsloth",]) + pass + pass + + if hasattr(original_model, "model"): original_model = original_model.model + else: break + pass # Add saving methods to top level model if not vision: if hasattr(model, "config"): # Counteract tokenizers - model.push_to_hub_merged = types.MethodType( - unsloth_generic_push_to_hub_merged, model - ) - model.save_pretrained_merged = types.MethodType( - unsloth_generic_save_pretrained_merged, model - ) - model.push_to_hub_gguf = types.MethodType(unsloth_push_to_hub_gguf, model) - model.save_pretrained_gguf = types.MethodType( - unsloth_save_pretrained_gguf, model - ) - 
model.save_pretrained_torchao = types.MethodType( - unsloth_save_pretrained_torchao, model - ) - model.push_to_hub_ggml = types.MethodType( - unsloth_convert_lora_to_ggml_and_push_to_hub, model - ) - model.save_pretrained_ggml = types.MethodType( - unsloth_convert_lora_to_ggml_and_save_locally, model - ) + model.push_to_hub_merged = types.MethodType(unsloth_push_to_hub_merged, model) + model.save_pretrained_merged = types.MethodType(unsloth_save_pretrained_merged, model) + model.push_to_hub_gguf = types.MethodType(unsloth_push_to_hub_gguf, model) + model.save_pretrained_gguf = types.MethodType(unsloth_save_pretrained_gguf, model) + model.push_to_hub_ggml = types.MethodType(unsloth_convert_lora_to_ggml_and_push_to_hub, model) + model.save_pretrained_ggml = types.MethodType(unsloth_convert_lora_to_ggml_and_save_locally, model) + pass else: # Vision only 1 option - model.push_to_hub_merged = types.MethodType( - unsloth_generic_push_to_hub_merged, model - ) - model.save_pretrained_merged = types.MethodType( - unsloth_generic_save_pretrained_merged, model - ) - model.push_to_hub_gguf = types.MethodType(unsloth_push_to_hub_gguf, model) - model.save_pretrained_gguf = types.MethodType( - unsloth_save_pretrained_gguf, model - ) - model.save_pretrained_torchao = types.MethodType( - unsloth_save_pretrained_torchao, model - ) + model.push_to_hub_merged = types.MethodType(unsloth_generic_push_to_hub_merged, model) + model.save_pretrained_merged = types.MethodType(unsloth_generic_save_pretrained_merged, model) + model.push_to_hub_gguf = types.MethodType(not_implemented_save, model) + model.save_pretrained_gguf = types.MethodType(not_implemented_save, model) + pass return model +pass + +def export_model_to_local(model, tokenizer, save_directory, drive_directory): + """ + Export a fine-tuned model from Colab to your local machine. + + Args: + model: The fine-tuned model to be exported. + tokenizer: The tokenizer associated with the model. 
+ save_directory: The directory where the model will be saved in Colab.
+ drive_directory: The directory in Google Drive where the model will be saved.
+ """
+ # Save the model and tokenizer in the Colab filesystem
+ model.save_pretrained(save_directory)
+ tokenizer.save_pretrained(save_directory)
+
+ # Mount Google Drive (prompts for authorization in the Colab UI)
+ from google.colab import drive
+ drive.mount('/content/drive')
+
+ # Copy the model files to Google Drive
+ # dirs_exist_ok = True (Python 3.8+) lets the copy succeed if the folder already exists on a re-run
+ import shutil
+ shutil.copytree(save_directory, drive_directory, dirs_exist_ok = True)
+
+ print(f"Model saved to {drive_directory} in Google Drive. You can now download it to your local machine.")
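The save-then-copy flow of `export_model_to_local` can be sketched outside Colab as follows. This is a minimal stand-in: `google.colab` only exists inside Colab, so temporary directories here play the role of the Colab filesystem and the mounted Drive folder, and the placeholder `adapter_model.bin` / `adapter_config.json` files stand in for a real `save_pretrained` output.

```python
import shutil
import tempfile
from pathlib import Path

# Stand-ins for the Colab save directory and the mounted Drive folder.
save_directory  = Path(tempfile.mkdtemp()) / "lora_model"
drive_directory = Path(tempfile.mkdtemp()) / "drive" / "MyDrive" / "lora_model"

# Pretend model.save_pretrained(...) / tokenizer.save_pretrained(...) wrote these files.
save_directory.mkdir(parents = True)
(save_directory / "adapter_model.bin").write_bytes(b"\x00" * 16)
(save_directory / "adapter_config.json").write_text("{}")

# Copy the whole folder to "Drive". With dirs_exist_ok = True (Python 3.8+),
# running the export twice does not raise FileExistsError.
shutil.copytree(save_directory, drive_directory, dirs_exist_ok = True)
shutil.copytree(save_directory, drive_directory, dirs_exist_ok = True)  # re-run is safe

print(sorted(p.name for p in drive_directory.iterdir()))
# → ['adapter_config.json', 'adapter_model.bin']
```

Without `dirs_exist_ok = True`, the second `copytree` call (e.g. re-running the export cell) would fail because the destination folder already exists on Drive.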