diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index abbaedd5b..eb60a5a20 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -3,30 +3,27 @@
 Thank you for not only using Unsloth but also for being interested in helping out! We value all contributions, whether they come in the form of code, ideas, support for others or just by simply spreading the word of Unsloth! 💕
 - **[Support the Community](https://github.com/unslothai/unsloth/issues)**: Answer questions, review pull requests, or assist others in discussions.
-- **Fix Bugs**: Identify and resolve issues with the existing codebase. 
-- **Submit Ideas**: Request new features or share enhancements you'd like to see. 
+- **Fix Bugs**: Identify and resolve issues with the existing codebase.
+- **Submit Ideas**: Request new features or share enhancements you'd like to see.
 - **Develop Features**: Implement new functionality or improve existing tools which can be done via PRs.
 - **[Improve Documentation](https://docs.unsloth.ai/)**: Help by creating guides, FAQs, or enhancing clarity.
 
 One of the best ways to support us is by spreading the word about Unsloth! Share how it’s powering your amazing projects in blog posts or social media, and inspire others to explore its potential. Even a simple star on our repo goes a long way in showing your support and helping the community grow. 🌟
 
-## Submitting Issues 
-If you find a bug or have a feature idea, we’d love to hear from you! Here’s how to make your submission stand out: 
+## Submitting Issues
+If you find a bug or have a feature idea, we’d love to hear from you! Here’s how to make your submission stand out:
 
-### Reporting Bugs 
-1. **Search First**: Check if the issue has already been reported using GitHub’s search bar under Issues. 
-2. **Details Matter**: Is this on Google Colab, Kaggle, or on another platform service? Are you using Unsloth's official notebook? Include your OS, Python version, and other relevant details. For bugs, a concise code snippet that reproduces the issue is incredibly helpful. 
+### Reporting Bugs
+1. **Search First**: Check if the issue has already been reported using GitHub’s search bar under Issues.
+2. **Details Matter**: Is this on Google Colab, Kaggle, or on another platform service? Are you using Unsloth's official notebook? Include your OS, Python version, and other relevant details. For bugs, a concise code snippet that reproduces the issue is incredibly helpful.
 3. **Be Thorough**: Attach screenshots, traceback logs, or any additional information that might speed up resolution.
 
 ## Spread the Word
-Your support extends beyond code: 
-- Spread the word by writing about Unsloth in blogs or social media. 
-- Share how Unsloth powers your projects. 
-- Star our repository to show your appreciation. 
+Your support extends beyond code:
+- Spread the word by writing about Unsloth in blogs or social media.
+- Share how Unsloth powers your projects.
+- Star our repository to show your appreciation.
 
-## Note
-We have added a new section in the `README.md` under "✨ Finetune for Free" titled "Exporting Models from Colab to Local Machine" with detailed steps. Please refer to it for guidance on exporting models from Colab to your local machine.
-
-Finally, please be mindful of our [Code of Conduct](https://github.com/unslothai/unsloth/tree/main/unsloth/CODE_OF_CONDUCT.md) to ensure a welcoming and inclusive environment for everyone.
+Finally, please be mindful of our [Code of Conduct](https://github.com/unslothai/unsloth/blob/main/CODE_OF_CONDUCT.md) to ensure a welcoming and inclusive environment for everyone.
 
 Thank you so much for reading and we hope you have lots of fun using Unsloth! 🦥
diff --git a/README.md b/README.md
index 83eb45ad1..1314cb1c5 100644
--- a/README.md
+++ b/README.md
@@ -1,152 +1,191 @@
-
-
+
+
+
-### Finetune Llama 3.3, Mistral, Phi-4, Qwen 2.5 & Gemma 2x faster with 80% less memory!
+### Train gpt-oss, DeepSeek, Gemma, Qwen & Llama 2x faster with 70% less VRAM!

+* Supports **full-finetuning**, pretraining, 4-bit, 16-bit and **FP8** training
+* Supports **all models** including [TTS](https://unsloth.ai/docs/basics/text-to-speech-tts-fine-tuning), multimodal, [embedding](https://unsloth.ai/docs/new/embedding-finetuning) and more! Any model that works in transformers, works in Unsloth.
+* The most efficient library for [Reinforcement Learning (RL)](https://unsloth.ai/docs/get-started/reinforcement-learning-rl-guide), using 80% less VRAM. Supports GRPO, GSPO, DrGRPO, DAPO etc.
+* **0% loss in accuracy** - no approximation methods - all exact.
+* Export and [deploy your model](https://unsloth.ai/docs/basics/inference-and-deployment) to [GGUF](https://unsloth.ai/docs/basics/inference-and-deployment/saving-to-gguf) llama.cpp, [vLLM](https://unsloth.ai/docs/basics/inference-and-deployment/vllm-guide), [SGLang](https://unsloth.ai/docs/basics/inference-and-deployment/sglang-guide) and Hugging Face.
+* Supports NVIDIA (since 2018), [AMD](https://unsloth.ai/docs/get-started/install/amd) and [Intel](https://unsloth.ai/docs/get-started/install/intel) GPUs. Minimum CUDA Capability 7.0 (V100, T4, Titan V, RTX 20, 30 and 40 series, A100, H100, L40 etc.)
+* Works on **Linux**, WSL and **[Windows](https://unsloth.ai/docs/get-started/install/windows-installation)**
+* All kernels written in OpenAI's Triton language. Manual backprop engine.
+* If you trained a model with 🦥Unsloth, you can use this cool sticker!
-## 🥇 Performance Benchmarking
-- For our most detailed benchmarks, read our [Llama 3.3 Blog](https://unsloth.ai/blog/llama3-3).
-- Benchmarking of Unsloth was also conducted by [🤗Hugging Face](https://huggingface.co/blog/unsloth-trl).
+## 💾 Install Unsloth
+You can also see our docs for more detailed installation and updating instructions [here](https://unsloth.ai/docs/get-started/install).
-We tested using the Alpaca Dataset, a batch size of 2, gradient accumulation steps of 4, rank = 32, and applied QLoRA on all linear layers (q, k, v, o, gate, up, down):
-
-| Model | VRAM | 🦥 Unsloth speed | 🦥 VRAM reduction | 🦥 Longer context | 😊 Hugging Face + FA2 |
-|----------------|-------|-----------------|----------------|----------------|--------------------|
-| Llama 3.3 (70B)| 80GB | 2x | >75% | 13x longer | 1x |
-| Llama 3.1 (8B) | 80GB | 2x | >70% | 12x longer | 1x |
+Unsloth supports Python 3.13 or lower.
-
](https://github.com/unslothai/unsloth)
"""
@@ -1316,19 +1485,19 @@ def _determine_username(save_directory, old_username, token):
save_directory = save_directory.lstrip("./")
if "/" not in save_directory:
from huggingface_hub import whoami
- try:
+
+ try:
username = whoami(token = token)["name"]
if type(old_username) is str and username != old_username:
username = old_username
- pass
save_directory = f"{username}/{save_directory}"
except:
- raise RuntimeError(f"Unsloth: {save_directory} is not a Huggingface directory.")
+ raise RuntimeError(
+ f"Unsloth: {save_directory} is not a Huggingface directory."
+ )
else:
username = save_directory.split("/")[0]
- pass
return save_directory, username
-pass
def create_huggingface_repo(
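The `_determine_username` hunk above changes the `old_username` sentinel from `""` to `None`. A minimal runnable sketch of the prefixing rule shows why; `resolve_repo_id` and `lookup_username` are hypothetical stand-ins (the real code calls `huggingface_hub.whoami(token = token)["name"]`), and the last call demonstrates the empty-string bug: `""` passes the `isinstance(..., str)` check and wipes out the real username.

```python
# Hypothetical sketch of _determine_username's repo-id prefixing rule.
# `lookup_username` stands in for the real huggingface_hub.whoami() call.
def resolve_repo_id(save_directory, old_username, lookup_username):
    save_directory = save_directory.lstrip("./")
    if "/" not in save_directory:
        username = lookup_username()
        # A previously used username (any real str) wins over the token's owner
        if isinstance(old_username, str) and username != old_username:
            username = old_username
        save_directory = f"{username}/{save_directory}"
    else:
        username = save_directory.split("/")[0]
    return save_directory, username

print(resolve_repo_id("my-model", None, lambda: "alice"))      # ('alice/my-model', 'alice')
print(resolve_repo_id("bob/my-model", None, lambda: "alice"))  # ('bob/my-model', 'bob')
print(resolve_repo_id("my-model", "", lambda: "alice"))        # ('/my-model', '') - the old "" bug
```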
@@ -1336,38 +1505,52 @@ def create_huggingface_repo(
save_directory,
token = None,
private = False,
+ datasets = None,
):
- if token is None :
+ if token is None:
token = get_token()
- pass
- save_directory, username = _determine_username(save_directory, "", token)
+ save_directory, username = _determine_username(save_directory, None, token)
from huggingface_hub import create_repo
+
try:
create_repo(
- repo_id = save_directory,
- token = token,
+ repo_id = save_directory,
+ token = token,
repo_type = "model",
- exist_ok = False,
- private = private,
- )
+ exist_ok = False,
+ private = private,
+ )
# Create model card
from huggingface_hub import ModelCard
+
content = MODEL_CARD.format(
- username = username,
+ username = username,
base_model = model.config._name_or_path,
model_type = model.config.model_type,
- method = "",
- extra = "unsloth",
+ method = "",
+ extra = "unsloth",
)
card = ModelCard(content)
+ if datasets:
+ card.data.datasets = datasets
card.push_to_hub(save_directory, token = token)
except:
- pass
+        # Repo likely already exists; update datasets metadata separately
+ if datasets:
+ try:
+ from huggingface_hub import metadata_update
+
+ metadata_update(
+ save_directory, {"datasets": datasets}, overwrite = True, token = token
+ )
+ except Exception as e:
+ logger.warning_once(
+ f"Unsloth: Could not update datasets metadata for {save_directory}: {e}"
+ )
hf_api = HfApi(token = token)
return save_directory, hf_api
-pass
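The try/except fallback above (create the repo and push a full model card; if creation fails, patch only the `datasets` metadata) can be exercised offline with stand-ins for the Hub calls. `RepoExists`, `existing_repos` and `publish` below are illustrative names, not part of `huggingface_hub`.

```python
# Offline sketch of the create-or-patch fallback used above.
class RepoExists(Exception):
    pass

existing_repos = {"alice/model": {}}  # pretend this repo is already on the Hub

def create_repo(repo_id, exist_ok = False):
    if repo_id in existing_repos and not exist_ok:
        raise RepoExists(repo_id)
    existing_repos[repo_id] = {}

def metadata_update(repo_id, metadata, overwrite = False):
    existing_repos[repo_id].update(metadata)

def publish(repo_id, datasets = None):
    try:
        create_repo(repo_id)           # full card would be pushed here
        if datasets:
            existing_repos[repo_id]["datasets"] = datasets
    except RepoExists:
        # Repo already exists: only patch the datasets metadata
        if datasets:
            metadata_update(repo_id, {"datasets": datasets}, overwrite = True)

publish("alice/model", datasets = ["alice/my-sft-data"])
print(existing_repos["alice/model"])
```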
def upload_to_huggingface(
@@ -1380,539 +1563,901 @@ def upload_to_huggingface(
old_username = None,
private = None,
create_config = True,
+ datasets = None,
):
save_directory, username = _determine_username(save_directory, old_username, token)
from huggingface_hub import create_repo
+
try:
create_repo(
- repo_id = save_directory,
- token = token,
+ repo_id = save_directory,
+ token = token,
repo_type = "model",
- exist_ok = False,
- private = private,
- )
+ exist_ok = False,
+ private = private,
+ )
# Create model card
from huggingface_hub import ModelCard
+
content = MODEL_CARD.format(
- username = username,
+ username = username,
base_model = model.config._name_or_path,
model_type = model.config.model_type,
- method = "",
- extra = extra,
+ method = "",
+ extra = extra,
)
card = ModelCard(content)
+ if datasets:
+ card.data.datasets = datasets
card.push_to_hub(save_directory, token = token)
except:
- pass
+        # Repo likely already exists; update datasets metadata separately
+ if datasets:
+ try:
+ from huggingface_hub import metadata_update
+
+ metadata_update(
+ save_directory, {"datasets": datasets}, overwrite = True, token = token
+ )
+ except Exception as e:
+ logger.warning_once(
+ f"Unsloth: Could not update datasets metadata for {save_directory}: {e}"
+ )
if file_location is not None:
# Now upload file
hf_api = HfApi(token = token)
if "/" in file_location:
- uploaded_location = file_location[file_location.rfind("/")+1:]
+ uploaded_location = file_location[file_location.rfind("/") + 1 :]
else:
uploaded_location = file_location
- pass
# find ftevent file from tensorboard and upload it
import glob
+
ftevent_files = glob.glob("*out.tfevents*", recursive = True)
if len(ftevent_files) > 0:
- print("Unsloth: Uploading tensorboard files... Please wait...", file_location + "*out.tfevents*")
+ print(
+ "Unsloth: Uploading tensorboard files... Please wait...",
+ file_location + "*out.tfevents*",
+ )
for ftevent_file in ftevent_files:
hf_api.upload_file(
path_or_fileobj = ftevent_file,
- path_in_repo = ftevent_file.replace(file_location, ""),
- repo_id = save_directory,
- repo_type = "model",
- commit_message = "(Trained with Unsloth)",
+ path_in_repo = ftevent_file.replace(file_location, ""),
+ repo_id = save_directory,
+ repo_type = "model",
+ commit_message = "(Trained with Unsloth)",
)
- pass
- pass
hf_api.upload_file(
path_or_fileobj = file_location,
- path_in_repo = uploaded_location,
- repo_id = save_directory,
- repo_type = "model",
- commit_message = "(Trained with Unsloth)",
+ path_in_repo = uploaded_location,
+ repo_id = save_directory,
+ repo_type = "model",
+ commit_message = "(Trained with Unsloth)",
)
# We also upload a config.json file
if create_config:
import json
- with open("_temporary_unsloth_config.json", "w") as file:
- json.dump({"model_type" : model.config.model_type}, file, indent = 4)
- pass
+
+ with open("_temporary_unsloth_config.json", "w", encoding = "utf-8") as file:
+ json.dump({"model_type": model.config.model_type}, file, indent = 4)
hf_api.upload_file(
path_or_fileobj = "_temporary_unsloth_config.json",
- path_in_repo = "config.json",
- repo_id = save_directory,
- repo_type = "model",
- commit_message = "(Trained with Unsloth)",
+ path_in_repo = "config.json",
+ repo_id = save_directory,
+ repo_type = "model",
+ commit_message = "(Trained with Unsloth)",
)
os.remove("_temporary_unsloth_config.json")
- pass
- pass
return username
-pass
def fix_tokenizer_bos_token(tokenizer):
# Check if BOS added already, then warn
fix_bos_token = False
chat_template = getattr(tokenizer, "chat_template", None)
-
- if (tokenizer("A").input_ids[0] == getattr(tokenizer, "bos_token_id", None)):
- if chat_template is not None and \
- (
- tokenizer.bos_token in chat_template or \
- "{bos_token}" in chat_template.replace(" ", "") or \
- "{bos_token+" in chat_template.replace(" ", "")
- ):
+ if tokenizer("A").input_ids[0] == getattr(tokenizer, "bos_token_id", None):
+ if chat_template is not None and (
+ tokenizer.bos_token in chat_template
+ or "{bos_token}" in chat_template.replace(" ", "")
+ or "{bos_token+" in chat_template.replace(" ", "")
+ ):
fix_bos_token = True
logger.warning(
- "Unsloth: ##### The current model auto adds a BOS token.\n"\
+ "Unsloth: ##### The current model auto adds a BOS token.\n"
"Unsloth: ##### Your chat template has a BOS token. We shall remove it temporarily."
)
# Remove {{bos_token}}
- new_chat_template = re.sub(r"\{[\s]{0,}\{[\s]{0,}bos\_token[\s]{0,}\}[\s]{0,}\}", "", chat_template)
+ new_chat_template = re.sub(
+ r"\{[\s]{0,}\{[\s]{0,}bos\_token[\s]{0,}\}[\s]{0,}\}", "", chat_template
+ )
# Remove {{bos_token +
- new_chat_template = re.sub(r"\{[\s]{0,}\{[\s]{0,}bos\_token[\s]{0,}\+[\s]{0,}", "", new_chat_template)
-
+ new_chat_template = re.sub(
+ r"\{[\s]{0,}\{[\s]{0,}bos\_token[\s]{0,}\+[\s]{0,}",
+ "",
+ new_chat_template,
+ )
+
tokenizer.chat_template = new_chat_template
- pass
- pass
return fix_bos_token, chat_template
-pass
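The two substitutions in `fix_tokenizer_bos_token` can be checked in isolation. `\s*` below is equivalent to the original `[\s]{0,}` spelling; the templates are toy examples.

```python
import re

# Strip a literal "{{ bos_token }}" expression (any internal spacing) from a
# Jinja chat template, mirroring the first re.sub above.
template = "{{ bos_token }}{% for m in messages %}{{ m['content'] }}{% endfor %}"
cleaned = re.sub(r"\{\s*\{\s*bos_token\s*\}\s*\}", "", template)
print(cleaned)

# "{{ bos_token + ..." forms only lose the leading "{{ bos_token +" part,
# mirroring the second re.sub above.
template2 = "{{ bos_token + '<s>' }}rest"
cleaned2 = re.sub(r"\{\s*\{\s*bos_token\s*\+\s*", "", template2)
print(cleaned2)
```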
-def create_ollama_modelfile(tokenizer, gguf_location):
+def create_ollama_modelfile(tokenizer, base_model_name, model_location):
"""
- Creates an Ollama Modelfile.
- Use ollama.create(model = "new_ollama_model", modelfile = modelfile)
+ Creates an Ollama Modelfile.
+ Use ollama.create(model = "new_ollama_model", modelfile = modelfile)
"""
- modelfile = getattr(tokenizer, "_ollama_modelfile", None)
- if modelfile is None: return None
+ ollama_template_name = MODEL_TO_OLLAMA_TEMPLATE_MAPPER.get(base_model_name)
+ if not ollama_template_name:
+ print(
+ f"Unsloth: No Ollama template mapping found for model '{base_model_name}'. Skipping Ollama Modelfile"
+ )
+ return None
+ ollama_modelfile = OLLAMA_TEMPLATES.get(ollama_template_name)
+ if not ollama_modelfile:
+        print(
+            f"Unsloth: No Ollama template named '{ollama_template_name}' found. Skipping Ollama Modelfile"
+        )
+        return None
+    # Cache the resolved template on the tokenizer for later reuse
+    tokenizer._ollama_modelfile = ollama_modelfile
+    modelfile = ollama_modelfile
FILE_LOCATION_REPLACER = "⚫@✅#🦥__FILE_LOCATION__⚡@🦥#⛵"
- EOS_TOKEN_REPLACER = "⚫@✅#🦥__EOS_TOKEN__⚡@🦥#⛵"
- LEFT_BRACKET_REPLACER = "⚫@✅#🦥"
+ EOS_TOKEN_REPLACER = "⚫@✅#🦥__EOS_TOKEN__⚡@🦥#⛵"
+ LEFT_BRACKET_REPLACER = "⚫@✅#🦥"
RIGHT_BRACKET_REPLACER = "⚡@🦥#⛵"
# Fixes https://github.com/unslothai/unsloth/issues/1087
# We must convert all {'s and }'s but keep {__FILE_LOCATION__} intact
- modelfile = modelfile\
- .replace("{__FILE_LOCATION__}", FILE_LOCATION_REPLACER)\
- .replace("{__EOS_TOKEN__}", EOS_TOKEN_REPLACER)\
- .replace("{", LEFT_BRACKET_REPLACER)\
+ modelfile = (
+ modelfile.replace("{__FILE_LOCATION__}", FILE_LOCATION_REPLACER)
+ .replace("{__EOS_TOKEN__}", EOS_TOKEN_REPLACER)
+ .replace("{", LEFT_BRACKET_REPLACER)
.replace("}", RIGHT_BRACKET_REPLACER)
+ )
# Revert {__FILE_LOCATION__} back
- modelfile = modelfile\
- .replace(FILE_LOCATION_REPLACER, "{__FILE_LOCATION__}")\
- .replace(EOS_TOKEN_REPLACER, "{__EOS_TOKEN__}")
-
+ modelfile = modelfile.replace(
+ FILE_LOCATION_REPLACER, "{__FILE_LOCATION__}"
+ ).replace(EOS_TOKEN_REPLACER, "{__EOS_TOKEN__}")
+
if "__EOS_TOKEN__" in modelfile:
modelfile = modelfile.format(
- __FILE_LOCATION__ = gguf_location,
- __EOS_TOKEN__ = tokenizer.eos_token,
+ __FILE_LOCATION__ = model_location,
+ __EOS_TOKEN__ = tokenizer.eos_token,
)
else:
modelfile = modelfile.format(
- __FILE_LOCATION__ = gguf_location,
+ __FILE_LOCATION__ = model_location,
)
- pass
-
- modelfile = modelfile\
- .replace("⚫@✅#🦥", "{")\
- .replace("⚡@🦥#⛵", "}")\
- .rstrip()
+
+ modelfile = modelfile.replace("⚫@✅#🦥", "{").replace("⚡@🦥#⛵", "}").rstrip()
return modelfile
-pass
+
+
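The sentinel dance above exists because an Ollama Modelfile contains literal `{`s and `}`s (Go template syntax) that `str.format()` would treat as placeholders. A self-contained demo with a toy template; `@FILE@` and the two sentinel strings are arbitrary markers assumed not to occur in any real template.

```python
# Protect literal braces with sentinels so that format() only fills the
# named placeholder, then restore the braces afterwards.
LEFT, RIGHT = "\u2780L", "\u2781R"  # sentinels that never appear in a template
template = 'FROM {__FILE_LOCATION__}\nTEMPLATE """{{ .Prompt }}"""'

protected = (
    template.replace("{__FILE_LOCATION__}", "@FILE@")  # shield the placeholder
    .replace("{", LEFT)                                # escape every other brace
    .replace("}", RIGHT)
    .replace("@FILE@", "{__FILE_LOCATION__}")          # restore the placeholder
)
filled = protected.format(__FILE_LOCATION__ = "model.gguf")
result = filled.replace(LEFT, "{").replace(RIGHT, "}")
print(result)
```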
+def create_ollama_model(username: str, model_name: str, tag: str, modelfile_path: str):
+ try:
+ init_check = subprocess.run(
+ ["curl", "http://localhost:11434"],
+ capture_output = True,
+ text = True,
+ timeout = 3,
+ )
+ if init_check.returncode == 0:
+ print(init_check.stdout.strip())
+        else:
+            print("Ollama Server is not Running")
+            return "Ollama Server is not Running"
+ except subprocess.TimeoutExpired:
+ return "Ollama Request Timeout"
+
+ process = subprocess.Popen(
+ [
+ "ollama",
+ "create",
+ f"{username}/{model_name}:{tag}",
+ "-f",
+ f"{modelfile_path}",
+ ],
+ stdout = subprocess.PIPE,
+ stderr = subprocess.STDOUT,
+ text = True,
+ bufsize = 1,
+ universal_newlines = True,
+ )
+
+ for line in iter(process.stdout.readline, ""):
+ print(line, end = "")
+ sys.stdout.flush()
+
+ return_code = process.wait()
+
+ if return_code != 0:
+ print(f"\nMODEL CREATED FAILED WITH RETURN CODE {return_code}")
+ else:
+ print("\nMODEL CREATED SUCCESSFULLY")
+
+
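`create_ollama_model` streams the child's merged stdout/stderr line by line instead of waiting for it to exit. The same `Popen` pattern, demonstrated with a harmless Python child process in place of `ollama create`:

```python
import subprocess
import sys

# Merge stderr into stdout and echo each line as it arrives.
process = subprocess.Popen(
    [sys.executable, "-c", "print('step 1'); print('step 2')"],
    stdout = subprocess.PIPE,
    stderr = subprocess.STDOUT,
    text = True,
    bufsize = 1,  # line-buffered in text mode
)

lines = []
for line in iter(process.stdout.readline, ""):
    print(line, end = "")
    lines.append(line.rstrip("\n"))

return_code = process.wait()
print(f"exit code: {return_code}")
```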
+def push_to_ollama_hub(username: str, model_name: str, tag: str):
+ try:
+ init_check = subprocess.run(
+ ["curl", "http://localhost:11434"],
+ capture_output = True,
+ text = True,
+ timeout = 3,
+ )
+ if init_check.returncode == 0:
+ print(init_check.stdout.strip())
+        else:
+            print("Ollama Server is not Running")
+            return "Ollama Server is not Running"
+ except subprocess.TimeoutExpired:
+ return "Ollama Request Timeout"
+
+ process = subprocess.Popen(
+ ["ollama", "push", f"{username}/{model_name}:{tag}"],
+ stdout = subprocess.PIPE,
+ stderr = subprocess.STDOUT,
+ text = True,
+ bufsize = 1,
+ universal_newlines = True,
+ )
+
+ for line in iter(process.stdout.readline, ""):
+ print(line, end = "")
+ sys.stdout.flush()
+
+ return_code = process.wait()
+
+ if return_code != 0:
+ print(f"\nMODEL PUBLISHED FAILED WITH RETURN CODE {return_code}")
+ else:
+ print("\nMODEL PUBLISHED SUCCESSFULLY")
+
+
+def push_to_ollama(tokenizer, gguf_location, username: str, model_name: str, tag: str):
+    # create_ollama_modelfile now takes the base model name and model location
+    model_file = create_ollama_modelfile(
+        tokenizer = tokenizer,
+        base_model_name = model_name,
+        model_location = gguf_location,
+    )
+    if model_file is None:
+        print("Unsloth: Could not create an Ollama Modelfile. Skipping Ollama push")
+        return
+
+    with open(f"Modelfile_{model_name}", "w", encoding = "utf-8") as f:
+        f.write(model_file)
+
+ create_ollama_model(
+ username = username,
+ model_name = model_name,
+ tag = tag,
+ modelfile_path = f"Modelfile_{model_name}",
+ )
+
+ push_to_ollama_hub(username = username, model_name = model_name, tag = tag)
+
+ print("Successfully pushed to ollama")
def unsloth_save_pretrained_gguf(
self,
- save_directory : Union[str, os.PathLike],
- tokenizer = None,
- quantization_method : str = "fast_quantized",
- first_conversion : str = None,
- push_to_hub : bool = False,
- token : Optional[Union[str, bool]] = None,
- private : Optional[bool] = None,
- is_main_process : bool = True,
- state_dict : Optional[dict] = None,
- save_function : Callable = torch.save,
- max_shard_size : Union[int, str] = "5GB",
- safe_serialization : bool = True,
- variant : Optional[str] = None,
- save_peft_format : bool = True,
- tags : List[str] = None,
- temporary_location : str = "_unsloth_temporary_saved_buffers",
- maximum_memory_usage : float = 0.85,
+ save_directory: Union[str, os.PathLike],
+ tokenizer = None,
+ quantization_method = "fast_quantized",
+ first_conversion: str = None,
+ push_to_hub: bool = False,
+ token: Optional[Union[str, bool]] = None,
+ private: Optional[bool] = None,
+ is_main_process: bool = True,
+ state_dict: Optional[dict] = None,
+ save_function: Callable = torch.save,
+ max_shard_size: Union[int, str] = "5GB",
+ safe_serialization: bool = True,
+ variant: Optional[str] = None,
+ save_peft_format: bool = True,
+ tags: List[str] = None,
+ temporary_location: str = "_unsloth_temporary_saved_buffers",
+ maximum_memory_usage: float = 0.85,
):
"""
- Same as .save_pretrained(...) except 4bit weights are auto
- converted to float16 then converted to GGUF / llama.cpp format.
+ Same as .save_pretrained(...) except 4bit weights are auto
+ converted to float16 then converted to GGUF / llama.cpp format.
- Choose for `quantization_method` to be:
- "not_quantized" : "Recommended. Fast conversion. Slow inference, big files.",
- "fast_quantized" : "Recommended. Fast conversion. OK inference, OK file size.",
- "quantized" : "Recommended. Slow conversion. Fast inference, small files.",
- "f32" : "Not recommended. Retains 100% accuracy, but super slow and memory hungry.",
- "f16" : "Fastest conversion + retains 100% accuracy. Slow and memory hungry.",
- "q8_0" : "Fast conversion. High resource use, but generally acceptable.",
- "q4_k_m" : "Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K",
- "q5_k_m" : "Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K",
- "q2_k" : "Uses Q4_K for the attention.vw and feed_forward.w2 tensors, Q2_K for the other tensors.",
- "q3_k_l" : "Uses Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K",
- "q3_k_m" : "Uses Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K",
- "q3_k_s" : "Uses Q3_K for all tensors",
- "q4_0" : "Original quant method, 4-bit.",
- "q4_1" : "Higher accuracy than q4_0 but not as high as q5_0. However has quicker inference than q5 models.",
- "q4_k_s" : "Uses Q4_K for all tensors",
- "q4_k" : "alias for q4_k_m",
- "q5_k" : "alias for q5_k_m",
- "q5_0" : "Higher accuracy, higher resource usage and slower inference.",
- "q5_1" : "Even higher accuracy, resource usage and slower inference.",
- "q5_k_s" : "Uses Q5_K for all tensors",
- "q6_k" : "Uses Q8_K for all tensors",
- "iq2_xxs" : "2.06 bpw quantization",
- "iq2_xs" : "2.31 bpw quantization",
- "iq3_xxs" : "3.06 bpw quantization",
- "q3_k_xs" : "3-bit extra small quantization",
+        Choose `quantization_method` from:
+ "not_quantized" : "Recommended. Fast conversion. Slow inference, big files.",
+ "fast_quantized" : "Recommended. Fast conversion. OK inference, OK file size.",
+ "quantized" : "Recommended. Slow conversion. Fast inference, small files.",
+ "f32" : "Not recommended. Retains 100% accuracy, but super slow and memory hungry.",
+ "f16" : "Fastest conversion + retains 100% accuracy. Slow and memory hungry.",
+ "q8_0" : "Fast conversion. High resource use, but generally acceptable.",
+ "q4_k_m" : "Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K",
+ "q5_k_m" : "Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K",
+ "q2_k" : "Uses Q4_K for the attention.vw and feed_forward.w2 tensors, Q2_K for the other tensors.",
+ "q3_k_l" : "Uses Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K",
+ "q3_k_m" : "Uses Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K",
+ "q3_k_s" : "Uses Q3_K for all tensors",
+ "q4_0" : "Original quant method, 4-bit.",
+ "q4_1" : "Higher accuracy than q4_0 but not as high as q5_0. However has quicker inference than q5 models.",
+ "q4_k_s" : "Uses Q4_K for all tensors",
+ "q4_k" : "alias for q4_k_m",
+ "q5_k" : "alias for q5_k_m",
+ "q5_0" : "Higher accuracy, higher resource usage and slower inference.",
+ "q5_1" : "Even higher accuracy, resource usage and slower inference.",
+ "q5_k_s" : "Uses Q5_K for all tensors",
+ "q6_k" : "Uses Q8_K for all tensors",
+ "iq2_xxs" : "2.06 bpw quantization",
+ "iq2_xs" : "2.31 bpw quantization",
+ "iq3_xxs" : "3.06 bpw quantization",
+ "q3_k_xs" : "3-bit extra small quantization",
"""
if tokenizer is None:
raise ValueError("Unsloth: Saving to GGUF must have a tokenizer.")
+ try:
+ base_model_name = get_model_name(self.config._name_or_path, load_in_4bit = False)
+ model_name = base_model_name.split("/")[-1]
+ except:
+ base_model_name = self.config._name_or_path
+ model_name = base_model_name.split("/")[-1]
+
+ # Check if push_to_hub is requested
+ if push_to_hub:
+ raise ValueError(
+ "Unsloth: Please use .push_to_hub_gguf() instead of .save_pretrained_gguf() with push_to_hub=True"
+ )
+
+ # Step 1: Check if this is a VLM (Vision-Language Model) and check if gpt-oss
+ is_vlm = False
+ if hasattr(self, "config") and hasattr(self.config, "architectures"):
+ is_vlm = any(
+ x.endswith(("ForConditionalGeneration", "ForVisionText2Text"))
+ for x in self.config.architectures
+ )
+ is_vlm = is_vlm or hasattr(self.config, "vision_config")
+
+ is_processor = is_vlm and isinstance(tokenizer, ProcessorMixin)
+
+    # config.architectures is a list of class names, so test membership
+    is_gpt_oss = (
+        getattr(self.config, "architectures", None) is not None
+        and "GptOssForCausalLM" in self.config.architectures
+    ) or (
+        hasattr(self.config, "model_type")
+        and self.config.model_type in ["gpt-oss", "gpt_oss"]
+    )
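Since `config.architectures` is a list of class names, membership (not string equality) is the reliable test. A small sketch with hypothetical stub configs:

```python
# Stub configs standing in for transformers PretrainedConfig objects.
class GptOssConfig:
    architectures = ["GptOssForCausalLM"]
    model_type = "gpt_oss"

class LlamaConfig:
    architectures = ["LlamaForCausalLM"]
    model_type = "llama"

def detect_gpt_oss(config):
    return (
        getattr(config, "architectures", None) is not None
        and "GptOssForCausalLM" in config.architectures
    ) or getattr(config, "model_type", None) in ("gpt-oss", "gpt_oss")

print(detect_gpt_oss(GptOssConfig()))  # True
print(detect_gpt_oss(LlamaConfig()))   # False
```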
+ # Step 2: Prepare arguments for model saving
arguments = dict(locals())
- arguments["model"] = self
- arguments["tokenizer"] = tokenizer
- arguments["push_to_hub"] = False # We save ourselves
- arguments["save_method"] = "merged_16bit" # Must be 16bit
+ arguments["model"] = self
+ arguments["tokenizer"] = tokenizer
+ arguments["push_to_hub"] = False # We handle upload ourselves
+ # GPT-OSS needs mxfp4 save method
+ if is_gpt_oss:
+ if quantization_method is not None:
+ _qm = (
+ quantization_method
+ if isinstance(quantization_method, (list, tuple))
+ else [quantization_method]
+ )
+ _ignored = [q for q in _qm if str(q).lower() != "mxfp4"]
+ if _ignored:
+ logger.warning_once(
+ f"Unsloth: GPT-OSS does not support GGUF quantization "
+ f"(requested: {', '.join(str(q) for q in _ignored)}). "
+ f"Overriding to MXFP4 format. "
+ f"Pass quantization_method=None to suppress this warning."
+ )
+ arguments["save_method"] = "mxfp4"
+ else:
+ arguments["save_method"] = "merged_16bit"
del arguments["self"]
del arguments["quantization_method"]
del arguments["first_conversion"]
+ del arguments["is_vlm"]
+ del arguments["is_gpt_oss"]
+ del arguments["model_name"]
+ del arguments["base_model_name"]
+ del arguments["is_processor"]
- # Fix tokenizer adding an extra BOS token at the front
- fix_bos_token, old_chat_template = fix_tokenizer_bos_token(tokenizer)
-
- # Non blocking install GGUF first
- if not os.path.exists("llama.cpp"):
-
- if IS_KAGGLE_ENVIRONMENT:
- # Kaggle is weird - no blocking installs, and no CUDA?
- python_install = install_python_non_blocking(["gguf", "protobuf"])
- python_install.wait()
- install_llama_cpp_blocking(use_cuda = False)
- new_save_directory, old_username = unsloth_save_model(**arguments)
- makefile = None
- else:
- git_clone = install_llama_cpp_clone_non_blocking()
- python_install = install_python_non_blocking(["gguf", "protobuf"])
- git_clone.wait()
- makefile = install_llama_cpp_make_non_blocking()
- new_save_directory, old_username = unsloth_save_model(**arguments)
- python_install.wait()
- pass
+ # Step 3: Fix tokenizer BOS token if needed
+ if is_processor:
+ fix_bos_token, old_chat_template = fix_tokenizer_bos_token(tokenizer.tokenizer)
else:
- try:
- new_save_directory, old_username = unsloth_save_model(**arguments)
- makefile = None
- except:
- # Retry by recloning llama.cpp
- if IS_KAGGLE_ENVIRONMENT:
- # Kaggle is weird - no blocking installs, and no CUDA?
- python_install = install_python_non_blocking(["gguf", "protobuf"])
- python_install.wait()
- install_llama_cpp_blocking(use_cuda = False)
- new_save_directory, old_username = unsloth_save_model(**arguments)
- makefile = None
- else:
- git_clone = install_llama_cpp_clone_non_blocking()
- python_install = install_python_non_blocking(["gguf", "protobuf"])
- git_clone.wait()
- makefile = install_llama_cpp_make_non_blocking()
- new_save_directory, old_username = unsloth_save_model(**arguments)
- python_install.wait()
- pass
- pass
- pass
+ fix_bos_token, old_chat_template = fix_tokenizer_bos_token(tokenizer)
+
+ # Step 4: Save/merge model to 16-bit format
+ print(
+ f'Unsloth: Merging model weights to {"mxfp4" if is_gpt_oss else "16-bit"} format...'
+ )
+ try:
+ # Call unsloth_generic_save directly (it's in the same file)
+ unsloth_generic_save(**arguments)
+
+ except Exception as e:
+ raise RuntimeError(f"Failed to save/merge model: {e}")
+
+ if is_processor:
+ tokenizer = tokenizer.tokenizer
# Use old chat template if the bos is removed
if fix_bos_token:
tokenizer.chat_template = old_chat_template
- pass
+    # Step 6: Clean up memory
+    import gc
+
     for _ in range(3):
gc.collect()
+ if torch.cuda.is_available():
+ torch.cuda.empty_cache()
- model_dtype = self.config.torch_dtype
- model_type = self.config.model_type
- if type(model_dtype) is str:
- assert(model_dtype == "float16" or model_dtype == "bfloat16")
- elif model_dtype == torch.float16:
+ # Step 7: Get model dtype and type
+ try:
+ model_dtype = dtype_from_config(self.config)
+ model_type = self.config.model_type
+ if type(model_dtype) is str:
+ assert model_dtype == "float16" or model_dtype == "bfloat16"
+ elif model_dtype == torch.float16:
+ model_dtype = "float16"
+ elif model_dtype == torch.bfloat16:
+ model_dtype = "bfloat16"
+ else:
+ raise TypeError("Unsloth: Model dtype can only be float16 or bfloat16")
+ except Exception as e:
+ # Fallback if dtype_from_config fails
+ print(f"Unsloth: Could not determine dtype ({e}), defaulting to float16")
model_dtype = "float16"
- elif model_dtype == torch.bfloat16:
- model_dtype = "bfloat16"
- else:
- raise TypeError("Unsloth: Model dtype can only be float16 or bfloat16")
- pass
- is_sentencepiece_model = check_if_sentencepiece_model(self)
+ # Step 8: Convert to GGUF format
+ print("Unsloth: Converting to GGUF format...")
- # Save to GGUF
- all_file_locations, want_full_precision = save_to_gguf(
- model_type, model_dtype, is_sentencepiece_model,
- new_save_directory, quantization_method, first_conversion, makefile,
- )
+ # Convert quantization_method to list if string
+ # Use old style quantization_method
+ quantization_methods = []
+ if quantization_method is not None:
+ # Convert quantization_method to list
+ if isinstance(quantization_method, list):
+ pass
+ elif isinstance(quantization_method, str):
+ quantization_method = [
+ quantization_method,
+ ]
+ elif isinstance(quantization_method, tuple):
+ quantization_method = list(quantization_method)
+ else:
+ raise TypeError(
+ "Unsloth: quantization_method can only be a string or a list of strings"
+ )
+        for quant_method in quantization_method:
+            # None must be handled before .lower(), which would raise on None
+            if quant_method is None:
+                quant_method = "q8_0"
+            quant_method = quant_method.lower()
+            if quant_method == "not_quantized":
+                quant_method = "f16"
+            elif quant_method == "fast_quantized":
+                quant_method = "q8_0"
+            elif quant_method == "quantized":
+                quant_method = "q4_k_m"
+            quantization_methods.append(quant_method)
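The alias handling above maps the friendly names onto concrete llama.cpp quantization types and normalizes a bare string into a list. A compact runnable equivalent (`normalize` and `ALIASES` are illustrative names; note the `None` check must run before `.lower()`):

```python
# Map Unsloth's friendly quantization names to llama.cpp types.
ALIASES = {"not_quantized": "f16", "fast_quantized": "q8_0", "quantized": "q4_k_m"}

def normalize(quantization_method):
    if isinstance(quantization_method, str):
        quantization_method = [quantization_method]
    elif isinstance(quantization_method, tuple):
        quantization_method = list(quantization_method)
    methods = []
    for method in quantization_method:
        method = "q8_0" if method is None else str(method).lower()
        methods.append(ALIASES.get(method, method))
    return methods

print(normalize("fast_quantized"))  # ['q8_0']
print(normalize(["Q4_K_M", None]))  # ['q4_k_m', 'q8_0']
```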
- # Save Ollama modelfile
- modelfile = create_ollama_modelfile(tokenizer, all_file_locations[0])
+ try:
+ all_file_locations, want_full_precision, is_vlm_update = save_to_gguf(
+ model_name = model_name,
+ model_type = model_type,
+ model_dtype = model_dtype,
+ is_sentencepiece = False,
+ model_directory = save_directory,
+ quantization_method = quantization_methods,
+ first_conversion = first_conversion,
+ is_vlm = is_vlm, # Pass VLM flag
+ is_gpt_oss = is_gpt_oss, # Pass gpt_oss Flag
+ )
+ except Exception as e:
+ if IS_KAGGLE_ENVIRONMENT:
+ raise RuntimeError(
+ f"Unsloth: GGUF conversion failed in Kaggle environment.\n"
+ f"This is likely due to the 20GB disk space limit.\n"
+ f"Try saving to /tmp directory or use a smaller model.\n"
+ f"Error: {e}"
+ )
+ else:
+ raise RuntimeError(f"Unsloth: GGUF conversion failed: {e}")
+
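The Kaggle-specific handler above re-raises with a friendlier message. Chaining the re-raise with `raise ... from e` (shown here with stub names, as a sketch of the idea rather than the PR's exact code) keeps the original traceback attached:

```python
# Wrap a low-level failure in a RuntimeError while preserving the cause.
IS_KAGGLE_ENVIRONMENT = False  # assumption for this demo

def convert_to_gguf():
    raise OSError("No space left on device")

caught = None
try:
    try:
        convert_to_gguf()
    except Exception as e:
        if IS_KAGGLE_ENVIRONMENT:
            raise RuntimeError(
                "GGUF conversion failed in Kaggle environment; "
                "the 20GB disk limit is the usual cause."
            ) from e
        else:
            raise RuntimeError(f"GGUF conversion failed: {e}") from e
except RuntimeError as err:
    caught = err

print(caught)                          # wrapped message
print(type(caught.__cause__).__name__) # original exception is still chained
```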
+ # Step 9: Create Ollama modelfile
+ gguf_directory = f"{save_directory}_gguf"
modelfile_location = None
- if modelfile is not None:
- modelfile_location = os.path.join(new_save_directory, "Modelfile")
- with open(modelfile_location, "w") as file:
- file.write(modelfile)
- pass
- print(f"Unsloth: Saved Ollama Modelfile to {modelfile_location}")
- pass
+ ollama_success = False
+ if all_file_locations:
+ try:
+ if is_vlm_update:
+ modelfile = create_ollama_modelfile(tokenizer, base_model_name, ".")
+ else:
+ modelfile = create_ollama_modelfile(
+ tokenizer,
+ base_model_name,
+ os.path.basename(all_file_locations[0]),
+ )
+ if modelfile is not None:
+ modelfile_location = os.path.join(gguf_directory, "Modelfile")
+ with open(modelfile_location, "w", encoding = "utf-8") as file:
+ file.write(modelfile)
+ ollama_success = True
+ except Exception as e:
+ print(f"Warning: Could not create Ollama modelfile: {e}")
+ # Step 10: Show BOS token warning if applicable
if fix_bos_token:
logger.warning(
- "Unsloth: ##### The current model auto adds a BOS token.\n"\
+ "Unsloth: ##### The current model auto adds a BOS token.\n"
"Unsloth: ##### We removed it in GGUF's chat template for you."
)
- pass
- if push_to_hub:
- print("Unsloth: Uploading GGUF to Huggingface Hub...")
+ _exe = ".exe" if IS_WINDOWS else ""
+ if IS_WINDOWS:
+ _bin_dir = os.path.join(LLAMA_CPP_DEFAULT_DIR, "build", "bin", "Release")
+ else:
+ _bin_dir = LLAMA_CPP_DEFAULT_DIR
- # If not needing full precision, skip the first
- if not want_full_precision: all_file_locations = all_file_locations[1:]
+ if is_vlm_update:
+ print("\n")
+ print(
+ f"Unsloth: example usage for Multimodal LLMs: {os.path.join(_bin_dir, 'llama-mtmd-cli' + _exe)} -m {all_file_locations[0]} --mmproj {all_file_locations[-1]}"
+ )
+ print("Unsloth: load image inside llama.cpp runner: /image test_image.jpg")
+ print("Unsloth: Prompt model to describe the image")
+ else:
+ print(
+ f'Unsloth: example usage for text only LLMs: {os.path.join(_bin_dir, "llama-cli" + _exe)} --model {all_file_locations[0]} -p "why is the sky blue?"'
+ )
- for file_location in all_file_locations:
- username = upload_to_huggingface(
- self, save_directory, token,
- "GGUF converted", "gguf", file_location, old_username, private,
- )
- link = f"{username}/{new_save_directory.lstrip('/.')}" \
- if username not in new_save_directory else \
- new_save_directory.lstrip('/.')
- print(f"Saved GGUF to https://huggingface.co/{link}")
- pass
+ if ollama_success:
+ print(f"Unsloth: Saved Ollama Modelfile to {modelfile_location}")
+ print(
+ f"Unsloth: convert model to ollama format by running - ollama create model_name -f {modelfile_location}"
+ )
- # Save modelfile
- if modelfile_location is not None:
- username = upload_to_huggingface(
- self, save_directory, token,
- "GGUF converted", "gguf", modelfile_location, old_username, private,
- )
- print(f"Saved Ollama Modelfile to https://huggingface.co/{link}")
- pass
- pass
-pass
+ # Return a dict with all needed info for push_to_hub
+ return {
+ "save_directory": save_directory,
+ "gguf_directory": gguf_directory,
+ "gguf_files": all_file_locations,
+ "modelfile_location": modelfile_location,
+ "want_full_precision": want_full_precision,
+ "is_vlm": is_vlm_update,
+ "fix_bos_token": fix_bos_token,
+ }
def unsloth_push_to_hub_gguf(
self,
- repo_id : str,
- tokenizer = None,
- quantization_method : str = "fast_quantized",
- first_conversion : str = None,
- use_temp_dir : Optional[bool] = None,
- commit_message : Optional[str] = "Trained with Unsloth",
- private : Optional[bool] = None,
- token : Union[bool, str, None] = None,
- max_shard_size : Union[int, str, None] = "5GB",
- create_pr : bool = False,
- safe_serialization : bool = True,
- revision : str = None,
- commit_description : str = "Upload model trained with Unsloth 2x faster",
- tags : Optional[List[str]] = None,
- temporary_location : str = "_unsloth_temporary_saved_buffers",
- maximum_memory_usage : float = 0.85,
+ repo_id: str,
+ tokenizer = None,
+ quantization_method = "fast_quantized",
+ first_conversion: str = None,
+ use_temp_dir: Optional[bool] = None,
+ commit_message: Optional[str] = "Trained with Unsloth",
+ private: Optional[bool] = None,
+ token: Union[bool, str, None] = None,
+ max_shard_size: Union[int, str, None] = "5GB",
+ create_pr: bool = False,
+ safe_serialization: bool = True,
+ revision: str = None,
+ commit_description: str = "Upload model trained with Unsloth 2x faster",
+ tags: Optional[List[str]] = None,
+ temporary_location: str = "_unsloth_temporary_saved_buffers",
+ maximum_memory_usage: float = 0.85,
+ datasets: Optional[List[str]] = None,
):
"""
- Same as .push_to_hub(...) except 4bit weights are auto
- converted to float16 then converted to GGUF / llama.cpp format.
+ Same as .push_to_hub(...) except 4bit weights are auto
+ converted to float16 then converted to GGUF / llama.cpp format.
- Choose for `quantization_method` to be:
- "not_quantized" : "Recommended. Fast conversion. Slow inference, big files.",
- "fast_quantized" : "Recommended. Fast conversion. OK inference, OK file size.",
- "quantized" : "Recommended. Slow conversion. Fast inference, small files.",
- "f32" : "Not recommended. Retains 100% accuracy, but super slow and memory hungry.",
- "f16" : "Fastest conversion + retains 100% accuracy. Slow and memory hungry.",
- "q8_0" : "Fast conversion. High resource use, but generally acceptable.",
- "q4_k_m" : "Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K",
- "q5_k_m" : "Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K",
- "q2_k" : "Uses Q4_K for the attention.vw and feed_forward.w2 tensors, Q2_K for the other tensors.",
- "q3_k_l" : "Uses Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K",
- "q3_k_m" : "Uses Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K",
- "q3_k_s" : "Uses Q3_K for all tensors",
- "q4_0" : "Original quant method, 4-bit.",
- "q4_1" : "Higher accuracy than q4_0 but not as high as q5_0. However has quicker inference than q5 models.",
- "q4_k_s" : "Uses Q4_K for all tensors",
- "q4_k" : "alias for q4_k_m",
- "q5_k" : "alias for q5_k_m",
- "q5_0" : "Higher accuracy, higher resource usage and slower inference.",
- "q5_1" : "Even higher accuracy, resource usage and slower inference.",
- "q5_k_s" : "Uses Q5_K for all tensors",
- "q6_k" : "Uses Q8_K for all tensors",
+    Set `quantization_method` to one of:
+ "not_quantized" : "Recommended. Fast conversion. Slow inference, big files.",
+ "fast_quantized" : "Recommended. Fast conversion. OK inference, OK file size.",
+ "quantized" : "Recommended. Slow conversion. Fast inference, small files.",
+ "f32" : "Not recommended. Retains 100% accuracy, but super slow and memory hungry.",
+ "f16" : "Fastest conversion + retains 100% accuracy. Slow and memory hungry.",
+ "q8_0" : "Fast conversion. High resource use, but generally acceptable.",
+ "q4_k_m" : "Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K",
+ "q5_k_m" : "Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K",
+        "q2_k" : "Uses Q4_K for the attention.wv and feed_forward.w2 tensors, Q2_K for the other tensors.",
+ "q3_k_l" : "Uses Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K",
+ "q3_k_m" : "Uses Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K",
+ "q3_k_s" : "Uses Q3_K for all tensors",
+ "q4_0" : "Original quant method, 4-bit.",
+ "q4_1" : "Higher accuracy than q4_0 but not as high as q5_0. However has quicker inference than q5 models.",
+ "q4_k_s" : "Uses Q4_K for all tensors",
+ "q5_0" : "Higher accuracy, higher resource usage and slower inference.",
+ "q5_1" : "Even higher accuracy, resource usage and slower inference.",
+ "q5_k_s" : "Uses Q5_K for all tensors",
+ "q6_k" : "Uses Q8_K for all tensors",
"""
if tokenizer is None:
raise ValueError("Unsloth: Saving to GGUF must have a tokenizer.")
- arguments = dict(locals())
- arguments["model"] = self
- arguments["tokenizer"] = tokenizer
- arguments["save_directory"] = repo_id
- arguments["push_to_hub"] = False # We save ourselves
- arguments["save_method"] = "merged_16bit" # Must be 16bit
- del arguments["self"]
- del arguments["repo_id"]
- del arguments["quantization_method"]
- del arguments["first_conversion"]
+ # Step 1: Determine save directory
+ model_name = repo_id.split("/")[-1] if "/" in repo_id else repo_id
- # Fix tokenizer adding an extra BOS token at the front
- fix_bos_token, old_chat_template = fix_tokenizer_bos_token(tokenizer)
+ if use_temp_dir or use_temp_dir is None:
+ import tempfile
- # Non blocking install GGUF first
- if not os.path.exists("llama.cpp"):
+ temp_dir = tempfile.mkdtemp(prefix = "unsloth_gguf_")
+ save_directory = temp_dir
+ cleanup_temp = True
+ else:
+ save_directory = model_name # Use model name, not repo_id
+ cleanup_temp = False
- if IS_KAGGLE_ENVIRONMENT:
- # Kaggle is weird - no blocking installs, and no CUDA?
- python_install = install_python_non_blocking(["gguf", "protobuf"])
- python_install.wait()
- install_llama_cpp_blocking(use_cuda = False)
- new_save_directory, old_username = unsloth_save_model(**arguments)
- makefile = None
+ # Step 2: Call save_pretrained_gguf to do the conversion
+    print("Unsloth: Converting model to GGUF format...")
+
+ try:
+ # Call save_pretrained_gguf - it returns all the info we need
+ result = unsloth_save_pretrained_gguf(
+ self = self,
+ save_directory = save_directory,
+ tokenizer = tokenizer,
+ quantization_method = quantization_method,
+ first_conversion = first_conversion,
+ push_to_hub = False, # Never push from here
+ token = None, # Don't need token for local save
+ max_shard_size = max_shard_size,
+ safe_serialization = safe_serialization,
+ temporary_location = temporary_location,
+ maximum_memory_usage = maximum_memory_usage,
+ )
+
+ # Extract results
+ all_file_locations = result["gguf_files"]
+ modelfile_location = result["modelfile_location"]
+ want_full_precision = result["want_full_precision"]
+ is_vlm = result["is_vlm"]
+ fix_bos_token = result["fix_bos_token"]
+ actual_save_directory = result["save_directory"]
+
+ except Exception as e:
+ if cleanup_temp:
+ import shutil
+
+ for d in [save_directory, f"{save_directory}_gguf"]:
+ try:
+ shutil.rmtree(d)
+                except OSError:
+ pass
+ raise RuntimeError(f"Failed to convert model to GGUF: {e}")
+
+ # Step 3: Upload to HuggingFace Hub
+ print("Unsloth: Uploading GGUF to Huggingface Hub...")
+
+ try:
+ from huggingface_hub import HfApi
+
+ api = HfApi(token = token)
+
+ # Get full repo id
+ if "/" not in repo_id:
+ username = api.whoami()["name"]
+ full_repo_id = f"{username}/{repo_id}"
else:
- git_clone = install_llama_cpp_clone_non_blocking()
- python_install = install_python_non_blocking(["gguf", "protobuf"])
- git_clone.wait()
- makefile = install_llama_cpp_make_non_blocking()
- new_save_directory, old_username = unsloth_save_model(**arguments)
- python_install.wait()
- pass
- else:
- try:
- new_save_directory, old_username = unsloth_save_model(**arguments)
- makefile = None
- except:
- # Retry by recloning llama.cpp
- if IS_KAGGLE_ENVIRONMENT:
- # Kaggle is weird - no blocking installs, and no CUDA?
- python_install = install_python_non_blocking(["gguf", "protobuf"])
- python_install.wait()
- install_llama_cpp_blocking(use_cuda = False)
- new_save_directory, old_username = unsloth_save_model(**arguments)
- makefile = None
+ full_repo_id = repo_id
+
+ # Create repo
+ api.create_repo(
+ repo_id = full_repo_id,
+ repo_type = "model",
+ private = private,
+ exist_ok = True,
+ )
+
+ # Upload GGUF files
+ for file_location in all_file_locations:
+ original_name = os.path.basename(file_location)
+ # Replace temp directory name with proper model name
+ if cleanup_temp and "unsloth_gguf_" in original_name:
+ # Extract the quantization part (e.g., ".Q8_0.gguf" or ".Q8_0-mmproj.gguf")
+ quant_suffix = (
+ original_name.split(".", 1)[1]
+ if "." in original_name
+ else original_name
+ )
+ proper_name = f"{model_name}.{quant_suffix}"
else:
- git_clone = install_llama_cpp_clone_non_blocking()
- python_install = install_python_non_blocking(["gguf", "protobuf"])
- git_clone.wait()
- makefile = install_llama_cpp_make_non_blocking()
- new_save_directory, old_username = unsloth_save_model(**arguments)
- python_install.wait()
+ proper_name = original_name.replace(
+ os.path.basename(save_directory), model_name
+ )
+
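The renaming step above can be isolated into a pure helper, which makes the two branches easy to check. The helper name is hypothetical and only mirrors the logic in the loop:

```python
import os

def gguf_upload_name(file_location, save_directory, model_name, cleanup_temp):
    """Return the filename a GGUF file should carry on the Hub: temp-dir
    prefixes like 'unsloth_gguf_xxxx' are swapped for the model's name."""
    original_name = os.path.basename(file_location)
    if cleanup_temp and "unsloth_gguf_" in original_name:
        # Keep only the part after the first dot, e.g. "Q8_0.gguf"
        quant_suffix = (
            original_name.split(".", 1)[1] if "." in original_name else original_name
        )
        return f"{model_name}.{quant_suffix}"
    # Local saves: replace the directory-derived prefix with the model name
    return original_name.replace(os.path.basename(save_directory), model_name)
```

So a temp file `unsloth_gguf_ab12.Q8_0.gguf` for model `llama-3` would be uploaded as `llama-3.Q8_0.gguf`.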
+ print(f"Uploading {proper_name}...")
+
+ api.upload_file(
+ path_or_fileobj = file_location,
+ path_in_repo = proper_name,
+ repo_id = full_repo_id,
+ repo_type = "model",
+ commit_message = commit_message,
+ commit_description = commit_description,
+ create_pr = create_pr,
+ revision = revision,
+ )
+
+ # Upload config.json if exists
+ config_path = os.path.join(actual_save_directory, "config.json")
+ if os.path.exists(config_path):
+ print("Uploading config.json...")
+ api.upload_file(
+ path_or_fileobj = config_path,
+ path_in_repo = "config.json",
+ repo_id = full_repo_id,
+ repo_type = "model",
+ commit_message = f"{commit_message} - config",
+ create_pr = create_pr,
+ revision = revision,
+ )
+
+ # Upload Modelfile if exists
+ if modelfile_location and os.path.exists(modelfile_location):
+ print("Uploading Ollama Modelfile...")
+ api.upload_file(
+ path_or_fileobj = modelfile_location,
+ path_in_repo = "Modelfile",
+ repo_id = full_repo_id,
+ repo_type = "model",
+ commit_message = f"{commit_message} - Ollama Modelfile",
+ create_pr = create_pr,
+ revision = revision,
+ )
+
+ # Create and upload README
+ readme_content = f"""---
+tags:
+- gguf
+- llama.cpp
+- unsloth
+{"- vision-language-model" if is_vlm else ""}
+---
+
+# {repo_id.split("/")[-1]} : GGUF
+
+This model was finetuned and converted to GGUF format using [Unsloth](https://github.com/unslothai/unsloth).
+
+**Example usage**:
+- For text only LLMs: `llama-cli -hf {repo_id} --jinja`
+- For multimodal models: `llama-mtmd-cli -hf {repo_id} --jinja`
+
+## Available Model files:
+"""
+ for file in all_file_locations:
+ # Fix filename in README too
+ original_name = os.path.basename(file)
+ if cleanup_temp and "unsloth_gguf_" in original_name:
+ quant_suffix = (
+ original_name.split(".", 1)[1]
+ if "." in original_name
+ else original_name
+ )
+ proper_name = f"{model_name}.{quant_suffix}"
+ else:
+ proper_name = original_name.replace(
+ os.path.basename(save_directory), model_name
+ )
+ readme_content += f"- `{proper_name}`\n"
+
+ # Special note for VLM with Modelfile
+ if is_vlm and modelfile_location:
+ readme_content += "\n## ⚠️ Ollama Note for Vision Models\n"
+ readme_content += "**Important:** Ollama currently does not support separate mmproj files for vision models.\n\n"
+ readme_content += "To create an Ollama model from this vision model:\n"
+ readme_content += "1. Place the `Modelfile` in the same directory as the finetuned bf16 merged model\n"
+        readme_content += "2. Run: `ollama create model_name -f ./Modelfile`\n"
+ readme_content += " (Replace `model_name` with your desired name)\n\n"
+ readme_content += (
+ "This will create a unified bf16 model that Ollama can use.\n"
+ )
+ elif modelfile_location:
+ readme_content += "\n## Ollama\n"
+ readme_content += "An Ollama Modelfile is included for easy deployment.\n"
+
+ if fix_bos_token:
+ readme_content += "\n## Note\n"
+ readme_content += (
+ "The model's BOS token behavior was adjusted for GGUF compatibility.\n"
+ )
+
+    readme_content += (
+        "This was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth)\n"
+    )
+
+ readme_path = os.path.join(actual_save_directory, "README.md")
+    with open(readme_path, "w", encoding = "utf-8") as f:
+ f.write(readme_content)
+
+ api.upload_file(
+ path_or_fileobj = readme_path,
+ path_in_repo = "README.md",
+ repo_id = full_repo_id,
+ repo_type = "model",
+ commit_message = "Add README",
+ create_pr = create_pr,
+ revision = revision,
+ )
+
+ print(
+ f"Unsloth: Successfully uploaded GGUF to https://huggingface.co/{full_repo_id}"
+ )
+
+ # Add tags
+ if tags is None:
+ tags = []
+ tags.extend(["gguf", "llama-cpp", "unsloth"])
+ if is_vlm:
+ tags.append("vision-language-model")
+
+ try:
+ api.add_tags(
+ repo_id = full_repo_id,
+ tags = tags,
+ repo_type = "model",
+ )
+    except Exception:
pass
- pass
- pass
- # Use old chat template if the bos is removed
- if fix_bos_token:
- tokenizer.chat_template = old_chat_template
- pass
+ if datasets:
+ try:
+ from huggingface_hub import metadata_update
- for _ in range(3):
- gc.collect()
+ metadata_update(
+ full_repo_id, {"datasets": datasets}, overwrite = True, token = token
+ )
+ except Exception as e:
+ logger.warning_once(
+ f"Unsloth: Could not update datasets metadata for {full_repo_id}: {e}"
+ )
- model_dtype = self.config.torch_dtype
- model_type = self.config.model_type
- if type(model_dtype) is str:
- assert(model_dtype == "float16" or model_dtype == "bfloat16")
- elif model_dtype == torch.float16:
- model_dtype = "float16"
- elif model_dtype == torch.bfloat16:
- model_dtype = "bfloat16"
- else:
- raise TypeError("Unsloth: Model dtype can only be float16 or bfloat16")
- pass
+ except Exception as e:
+ raise RuntimeError(f"Failed to upload to Hugging Face Hub: {e}")
- is_sentencepiece_model = check_if_sentencepiece_model(self)
+ finally:
+ # Clean up temporary directory
+ if cleanup_temp:
+ print("Unsloth: Cleaning up temporary files...")
+ import shutil
- # Save to GGUF
- all_file_locations, want_full_precision = save_to_gguf(
- model_type, model_dtype, is_sentencepiece_model,
- new_save_directory, quantization_method, first_conversion, makefile,
- )
+ for d in [save_directory, f"{save_directory}_gguf"]:
+ if os.path.exists(d):
+ try:
+ shutil.rmtree(d)
+                except OSError:
+ pass
- # Save Ollama modelfile
- modelfile = create_ollama_modelfile(tokenizer, all_file_locations[0])
- modelfile_location = None
- if modelfile is not None:
- modelfile_location = os.path.join(new_save_directory, "Modelfile")
- with open(modelfile_location, "w") as file:
- file.write(modelfile)
- pass
- print(f"Unsloth: Saved Ollama Modelfile to {modelfile_location}")
- pass
+ return full_repo_id
- # If not needing full precision, skip the first
- if not want_full_precision: all_file_locations = all_file_locations[1:]
-
- for file_location in all_file_locations:
- print("Unsloth: Uploading GGUF to Huggingface Hub...")
- username = upload_to_huggingface(
- self, repo_id, token,
- "GGUF converted", "gguf", file_location, old_username, private,
- )
- link = f"{username}/{new_save_directory.lstrip('/.')}" \
- if username not in new_save_directory else \
- new_save_directory.lstrip('/.')
-
- print(f"Saved GGUF to https://huggingface.co/{link}")
- pass
-
- # Save modelfile
- if modelfile_location is not None:
- username = upload_to_huggingface(
- self, repo_id, token,
- "GGUF converted", "gguf", modelfile_location, old_username, private,
- )
- print(f"Saved Ollama Modelfile to https://huggingface.co/{link}")
- pass
-
- if fix_bos_token:
- logger.warning(
- "Unsloth: ##### The current model auto adds a BOS token.\n"\
- "Unsloth: ##### We removed it in GGUF's chat template for you."
- )
- pass
-pass
# Corrected function to save LoRA to a custom directory
def save_lora_to_custom_dir(model, tokenizer, save_directory):
# Create the custom directory if it doesn't exist
- os.makedirs(save_directory, exist_ok=True)
+ os.makedirs(save_directory, exist_ok = True)
# Call the unsloth_save_model function with the custom directory
unsloth_save_model(
model,
tokenizer,
- save_directory=save_directory,
- save_method="lora",
- push_to_hub=False,
+ save_directory = save_directory,
+ save_method = "lora",
+ push_to_hub = False,
)
+
# Corrected method within the model class to convert LoRA to GGML and push to Hugging Face Hub
def unsloth_convert_lora_to_ggml_and_push_to_hub(
self,
@@ -1932,7 +2477,7 @@ def unsloth_convert_lora_to_ggml_and_push_to_hub(
if IS_KAGGLE_ENVIRONMENT:
python_install = install_python_non_blocking(["protobuf"])
python_install.wait()
- install_llama_cpp_blocking(use_cuda=False)
+ install_llama_cpp_blocking(use_cuda = False)
makefile = None
else:
git_clone = install_llama_cpp_clone_non_blocking()
@@ -1952,17 +2497,26 @@ def unsloth_convert_lora_to_ggml_and_push_to_hub(
model_type = self.config.model_type
output_file = os.path.join(lora_directory_push, "ggml-adapter-model.bin")
- print(f"Unsloth: Converting auto-saved LoRA adapters at {lora_directory_push} to GGML format.")
+ print(
+ f"Unsloth: Converting auto-saved LoRA adapters at {lora_directory_push} to GGML format."
+ )
print(f"The output file will be {output_file}")
command = f"python3 llama.cpp/convert-lora-to-ggml.py {lora_directory_push} {output_file} llama"
try:
- with subprocess.Popen(command, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE, bufsize=1, universal_newlines=True) as sp:
+ with subprocess.Popen(
+ command,
+ shell = True,
+ stdout = subprocess.PIPE,
+ stderr = subprocess.PIPE,
+ bufsize = 1,
+ universal_newlines = True,
+ ) as sp:
for line in sp.stdout:
- print(line, end="", flush=True)
+ print(line, end = "", flush = True)
for line in sp.stderr:
- print(line, end="", flush=True)
+ print(line, end = "", flush = True)
sp.wait()
if sp.returncode != 0:
raise subprocess.CalledProcessError(sp.returncode, command)
@@ -1974,18 +2528,27 @@ def unsloth_convert_lora_to_ggml_and_push_to_hub(
print("Unsloth: Uploading GGML file to Hugging Face Hub...")
username = upload_to_huggingface(
- self, repo_id, token,
- "GGML converted LoRA", "ggml", output_file, None, private,
+ self,
+ repo_id,
+ token,
+ "GGML converted LoRA",
+ "ggml",
+ output_file,
+ None,
+ private,
)
link = f"{repo_id.lstrip('/')}"
print("Unsloth: Done.")
print(f"Converted LoRA to GGML and uploaded to https://huggingface.co/{link}")
- print("\nThis GGML making function was made by Maheswar. Ping him @Maheswar on the Unsloth Discord or on HuggingFace (@mahiatlinux) if you like this!")
+ print(
+ "\nThis GGML making function was made by Maheswar. Ping him @Maheswar on the Unsloth Discord or on HuggingFace (@mahiatlinux) if you like this!"
+ )
+
def unsloth_convert_lora_to_ggml_and_save_locally(
self,
- save_directory: str, # Added parameter for the folder name
- tokenizer,
+ save_directory: str, # Added parameter for the folder name
+ tokenizer,
temporary_location: str = "_unsloth_temporary_saved_buffers",
maximum_memory_usage: float = 0.85,
):
@@ -1993,7 +2556,7 @@ def unsloth_convert_lora_to_ggml_and_save_locally(
if IS_KAGGLE_ENVIRONMENT:
python_install = install_python_non_blocking(["protobuf"])
python_install.wait()
- install_llama_cpp_blocking(use_cuda=False)
+ install_llama_cpp_blocking(use_cuda = False)
makefile = None
else:
git_clone = install_llama_cpp_clone_non_blocking()
@@ -2013,17 +2576,26 @@ def unsloth_convert_lora_to_ggml_and_save_locally(
model_type = self.config.model_type
output_file = os.path.join(save_directory, "ggml-adapter-model.bin")
- print(f"Unsloth: Converting auto-saved LoRA adapters at {save_directory} to GGML format.")
+ print(
+ f"Unsloth: Converting auto-saved LoRA adapters at {save_directory} to GGML format."
+ )
print(f"The output file will be {output_file}")
command = f"python3 llama.cpp/convert-lora-to-ggml.py {save_directory} {output_file} llama"
try:
- with subprocess.Popen(command, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE, bufsize=1, universal_newlines=True) as sp:
+ with subprocess.Popen(
+ command,
+ shell = True,
+ stdout = subprocess.PIPE,
+ stderr = subprocess.PIPE,
+ bufsize = 1,
+ universal_newlines = True,
+ ) as sp:
for line in sp.stdout:
- print(line, end="", flush=True)
+ print(line, end = "", flush = True)
for line in sp.stderr:
- print(line, end="", flush=True)
+ print(line, end = "", flush = True)
sp.wait()
if sp.returncode != 0:
raise subprocess.CalledProcessError(sp.returncode, command)
@@ -2032,92 +2604,211 @@ def unsloth_convert_lora_to_ggml_and_save_locally(
return
print("Unsloth: Done.")
print(f"Unsloth: Conversion completed! Output file: {output_file}")
- print("\nThis GGML making function was made by Maheswar. Ping him @Maheswar on the Unsloth Discord or on HuggingFace (@mahiatlinux) if you like this!")
-pass
+ print(
+ "\nThis GGML making function was made by Maheswar. Ping him @Maheswar on the Unsloth Discord or on HuggingFace (@mahiatlinux) if you like this!"
+ )
from .models.loader_utils import get_model_name
-from unsloth_zoo.saving_utils import merge_and_overwrite_lora
+from unsloth_zoo.saving_utils import (
+ merge_and_overwrite_lora,
+ prepare_saving,
+)
+from unsloth_zoo.llama_cpp import (
+ install_llama_cpp,
+ convert_to_gguf as _convert_to_gguf,
+)
+
+
+@torch.inference_mode
+def save_to_gguf_generic(
+ model,
+ save_directory,
+ tokenizer,
+ quantization_method = None,
+ quantization_type = "Q8_0",
+ repo_id = None,
+ token = None,
+):
+ if token is None and repo_id is not None:
+ token = get_token()
+ if repo_id is not None and token is None:
+ raise RuntimeError("Unsloth: Please specify a token for uploading!")
+
+ if not os.path.exists(os.path.join("llama.cpp", "unsloth_convert_hf_to_gguf.py")):
+ install_llama_cpp(just_clone_repo = True)
+
+ # Use old style quantization_method
+ new_quantization_methods = []
+ if quantization_method is not None:
+ # Convert quantization_method to list
+ if isinstance(quantization_method, list):
+ pass
+ elif isinstance(quantization_method, str):
+ quantization_method = [
+ quantization_method,
+ ]
+ elif isinstance(quantization_method, tuple):
+ quantization_method = list(quantization_method)
+ else:
+ raise TypeError(
+ "Unsloth: quantization_method can only be a string or a list of strings"
+ )
+        for quant_method in quantization_method:
+            # Handle None before lowercasing - calling .lower() on None would raise
+            if quant_method is None:
+                quant_method = "q8_0"
+            else:
+                quant_method = quant_method.lower()
+            if quant_method == "not_quantized":
+                quant_method = "f16"
+            elif quant_method == "fast_quantized":
+                quant_method = "q8_0"
+            elif quant_method == "quantized":
+                quant_method = "q4_k_m"
+            new_quantization_methods.append(quant_method)
+ else:
+ new_quantization_methods.append(quantization_type.lower())
+ # Check if wrong method
+ for quant_method in new_quantization_methods:
+ if quant_method not in ALLOWED_QUANTS.keys():
+ error = f"Unsloth: Quant method = [{quant_method}] not supported. Choose from below:\n"
+ for key, value in ALLOWED_QUANTS.items():
+ error += f"[{key}] => {value}\n"
+ raise RuntimeError(error)
+
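The validation loop above can be exercised in isolation. `ALLOWED_QUANTS` here is a stand-in subset for illustration; the real mapping lives elsewhere in the module:

```python
# Stand-in subset of ALLOWED_QUANTS, for illustration only
ALLOWED_QUANTS = {
    "f16"    : "Fastest conversion, retains full accuracy.",
    "q8_0"   : "Fast conversion, generally acceptable quality.",
    "q4_k_m" : "Recommended small-file quantization.",
}

def validate_quant_methods(methods):
    """Reject unknown quantization names with a message listing every option."""
    for quant_method in methods:
        if quant_method not in ALLOWED_QUANTS:
            error = f"Unsloth: Quant method = [{quant_method}] not supported. Choose from below:\n"
            for key, value in ALLOWED_QUANTS.items():
                error += f"[{key}] => {value}\n"
            raise RuntimeError(error)
```

Listing every valid option in the error message lets users correct a typo without digging through the docstring.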
+ # Go through all types and save individually - somewhat inefficient
+ # since we save F16 / BF16 multiple times
+ for quantization_type in new_quantization_methods:
+ metadata = _convert_to_gguf(
+ save_directory,
+ print_output = True,
+ quantization_type = quantization_type,
+ )
+ if repo_id is not None:
+ prepare_saving(
+ model,
+ repo_id,
+ push_to_hub = True,
+ max_shard_size = "50GB",
+ private = True,
+ token = token,
+ )
+
+ from huggingface_hub import HfApi
+
+ api = HfApi(token = token)
+ api.upload_folder(
+ folder_path = save_directory,
+ repo_id = repo_id,
+ repo_type = "model",
+ allow_patterns = ["*.gguf"],
+ )
+ return metadata
+
@torch.inference_mode
def unsloth_generic_save(
model,
tokenizer,
- save_directory : Union[str, os.PathLike] = "unsloth_finetuned_merge",
- save_method : str = "lora", # ["lora", "merged_16bit", "merged_4bit"]
- push_to_hub : bool = False,
- token : Optional[Union[str, bool]] = None,
- is_main_process : bool = True,
- state_dict : Optional[dict] = None,
- save_function : Callable = torch.save,
- max_shard_size : Union[int, str] = "5GB",
- safe_serialization : bool = True,
- variant : Optional[str] = None,
- save_peft_format : bool = True,
-
+ save_directory: Union[str, os.PathLike] = "unsloth_finetuned_merge",
+ save_method: str = "lora", # ["lora", "merged_16bit", "merged_4bit"]
+ push_to_hub: bool = False,
+ token: Optional[Union[str, bool]] = None,
+ is_main_process: bool = True,
+ state_dict: Optional[dict] = None,
+ save_function: Callable = torch.save,
+ max_shard_size: Union[int, str] = "5GB",
+ safe_serialization: bool = True,
+ variant: Optional[str] = None,
+ save_peft_format: bool = True,
# Push to hub
- use_temp_dir : Optional[bool] = None,
- commit_message : Optional[str] = "Trained with Unsloth",
- private : Optional[bool] = None,
- create_pr : bool = False,
- revision : str = None,
- commit_description : str = "Upload model trained with Unsloth 2x faster",
- tags : List[str] = None,
-
+ use_temp_dir: Optional[bool] = None,
+ commit_message: Optional[str] = "Trained with Unsloth",
+ private: Optional[bool] = None,
+ create_pr: bool = False,
+ revision: str = None,
+ commit_description: str = "Upload model trained with Unsloth 2x faster",
+ tags: List[str] = None,
# Our functions
- temporary_location : str = "_unsloth_temporary_saved_buffers",
- maximum_memory_usage : float = 0.9,
+ temporary_location: str = "_unsloth_temporary_saved_buffers",
+ maximum_memory_usage: float = 0.9,
+ datasets: Optional[List[str]] = None,
):
- if token is None and push_to_hub: token = get_token()
+ if token is None and push_to_hub:
+ token = get_token()
+
+ if save_method == "merged_4bit":
+ raise RuntimeError(
+ "Unsloth: Merging into 4bit will cause your model to lose accuracy if you plan\n"
+            "to merge to GGUF or others later on. I suggest you do this as a final step\n"
+ "if you're planning to do multiple saves.\n"
+ "If you are certain, change `save_method` to `merged_4bit_forced`."
+ )
+ elif save_method == "merged_4bit_forced":
+ save_method = "merged_4bit"
+
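The guard above follows a small, testable pattern: refuse the lossy option by default, but honor an explicit `_forced` opt-in. A minimal sketch (hypothetical helper name, mirroring the branch logic):

```python
def resolve_save_method(save_method):
    """Block accidental 4bit merges, but let callers opt in explicitly
    via the `_forced` suffix, which maps back to the real method name."""
    if save_method == "merged_4bit":
        raise RuntimeError(
            "Merging into 4bit will cause your model to lose accuracy. "
            "If you are certain, use `merged_4bit_forced`."
        )
    if save_method == "merged_4bit_forced":
        return "merged_4bit"
    return save_method
```

This keeps the downstream save path unaware of the `_forced` spelling: it only ever sees canonical method names.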
merge_and_overwrite_lora(
get_model_name,
- model = model,
- tokenizer = tokenizer,
- save_directory = save_directory,
- push_to_hub = push_to_hub,
- private = private,
- token = token,
- output_dtype = None,
- low_disk_space_usage = False,
- use_temp_file = False,
+ model = model,
+ tokenizer = tokenizer,
+ save_directory = save_directory,
+ push_to_hub = push_to_hub,
+ private = private,
+ token = token,
+ save_method = save_method,
+ output_dtype = None,
+ low_disk_space_usage = True,
+ use_temp_file = False,
)
+
+ if push_to_hub and datasets:
+ try:
+ from huggingface_hub import metadata_update
+
+ save_dir, _ = _determine_username(save_directory, None, token)
+ metadata_update(
+ save_dir, {"datasets": datasets}, overwrite = True, token = token
+ )
+ except Exception as e:
+ logger.warning_once(
+ f"Unsloth: Could not update datasets metadata for {save_directory}: {e}"
+ )
+
return
-pass
def unsloth_generic_save_pretrained_merged(
self,
- save_directory : Union[str, os.PathLike],
- tokenizer = None,
- save_method : str = "merged_16bit", # ["lora", "merged_16bit", "merged_4bit"]
- push_to_hub : bool = False,
- token : Optional[Union[str, bool]] = None,
- is_main_process : bool = True,
- state_dict : Optional[dict] = None,
- save_function : Callable = torch.save,
- max_shard_size : Union[int, str] = "5GB",
- safe_serialization : bool = True,
- variant : Optional[str] = None,
- save_peft_format : bool = True,
- tags : List[str] = None,
- temporary_location : str = "_unsloth_temporary_saved_buffers",
- maximum_memory_usage : float = 0.75,
-):
+ save_directory: Union[str, os.PathLike],
+ tokenizer = None,
+ save_method: str = "merged_16bit", # ["lora", "merged_16bit", "merged_4bit"]
+ push_to_hub: bool = False,
+ token: Optional[Union[str, bool]] = None,
+ is_main_process: bool = True,
+ state_dict: Optional[dict] = None,
+ save_function: Callable = torch.save,
+ max_shard_size: Union[int, str] = "5GB",
+ safe_serialization: bool = True,
+ variant: Optional[str] = None,
+ save_peft_format: bool = True,
+ tags: List[str] = None,
+ temporary_location: str = "_unsloth_temporary_saved_buffers",
+ maximum_memory_usage: float = 0.75,
+ datasets: Optional[List[str]] = None,
+):
"""
- Same as .push_to_hub(...) except 4bit weights are auto
- converted to float16 with as few overhead as possible.
+ Same as .push_to_hub(...) except 4bit weights are auto
+    converted to float16 with as little overhead as possible.
- Choose for `save_method` to be either:
- 1. `16bit`: Merge LoRA into float16 weights. Useful for GGUF / llama.cpp.
- 2. `4bit`: Merge LoRA into int4 weights. Useful for DPO / HF inference.
- 3. `lora`: Save LoRA adapters with no merging. Useful for HF inference.
+    Set `save_method` to one of:
+ 1. `16bit`: Merge LoRA into float16 weights. Useful for GGUF / llama.cpp.
+ 2. `4bit`: Merge LoRA into int4 weights. Useful for DPO / HF inference.
+ 3. `lora`: Save LoRA adapters with no merging. Useful for HF inference.
"""
if tokenizer is None:
logger.warning_once(
- "Unsloth: You're not saving a tokenizer as well?\n"\
+ "Unsloth: You're not saving a tokenizer as well?\n"
"You can do it separately via `tokenizer.save_pretrained(...)`"
)
- pass
arguments = dict(locals())
arguments["model"] = self
@@ -2125,58 +2816,266 @@ def unsloth_generic_save_pretrained_merged(
unsloth_generic_save(**arguments)
for _ in range(3):
gc.collect()
-pass
def unsloth_generic_push_to_hub_merged(
self,
- repo_id : str,
- tokenizer = None,
- save_method : str = "merged_16bit", # ["lora", "merged_16bit", "merged_4bit"]
- use_temp_dir : Optional[bool] = None,
- commit_message : Optional[str] = "Trained with Unsloth",
- private : Optional[bool] = None,
- token : Union[bool, str, None] = None,
- max_shard_size : Union[int, str, None] = "5GB",
- create_pr : bool = False,
- safe_serialization : bool = True,
- revision : str = None,
- commit_description : str = "Upload model trained with Unsloth 2x faster",
- tags : Optional[List[str]] = None,
- temporary_location : str = "_unsloth_temporary_saved_buffers",
- maximum_memory_usage : float = 0.75,
+ repo_id: str,
+ tokenizer = None,
+ save_method: str = "merged_16bit", # ["lora", "merged_16bit", "merged_4bit"]
+ use_temp_dir: Optional[bool] = None,
+ commit_message: Optional[str] = "Trained with Unsloth",
+ private: Optional[bool] = None,
+ token: Union[bool, str, None] = None,
+ max_shard_size: Union[int, str, None] = "5GB",
+ create_pr: bool = False,
+ safe_serialization: bool = True,
+    revision: Optional[str] = None,
+ commit_description: str = "Upload model trained with Unsloth 2x faster",
+ tags: Optional[List[str]] = None,
+ temporary_location: str = "_unsloth_temporary_saved_buffers",
+ maximum_memory_usage: float = 0.75,
+ datasets: Optional[List[str]] = None,
):
"""
- Same as .push_to_hub(...) except 4bit weights are auto
- converted to float16 with as few overhead as possible.
+    Same as .push_to_hub(...) except 4bit weights are automatically
+    converted to float16 with as little overhead as possible.
- Choose for `save_method` to be either:
- 1. `16bit`: Merge LoRA into float16 weights. Useful for GGUF / llama.cpp.
- 2. `4bit`: Merge LoRA into int4 weights. Useful for DPO / HF inference.
- 3. `lora`: Save LoRA adapters with no merging. Useful for HF inference.
+    Set `save_method` to one of:
+ 1. `16bit`: Merge LoRA into float16 weights. Useful for GGUF / llama.cpp.
+ 2. `4bit`: Merge LoRA into int4 weights. Useful for DPO / HF inference.
+ 3. `lora`: Save LoRA adapters with no merging. Useful for HF inference.
"""
if tokenizer is None:
logger.warning_once(
- "Unsloth: You're not saving a tokenizer as well?\n"\
+ "Unsloth: You're not saving a tokenizer as well?\n"
"You can do it separately via `tokenizer.push_to_hub(...)`"
)
- pass
arguments = dict(locals())
- arguments["model"] = self
+ arguments["model"] = self
arguments["save_directory"] = repo_id
- arguments["push_to_hub"] = True
+ arguments["push_to_hub"] = True
del arguments["self"]
del arguments["repo_id"]
unsloth_generic_save(**arguments)
for _ in range(3):
gc.collect()
-pass
+
+
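Both wrappers above forward their keyword arguments by snapshotting `locals()` and rewriting a few keys before delegating to `unsloth_generic_save`. A minimal, self-contained sketch of that forwarding pattern, where `save_backend` is a hypothetical stand-in for the real save function:

```python
def save_backend(**kwargs):
    # Stand-in for unsloth_generic_save: just echo what it received.
    return kwargs

def push_to_hub_merged(self, repo_id, save_method = "merged_16bit"):
    # Snapshot every parameter by name, then rewrite a few keys,
    # exactly as the wrappers above do.
    arguments = dict(locals())
    arguments["model"] = self            # rename `self` -> `model`
    arguments["save_directory"] = repo_id
    arguments["push_to_hub"] = True
    del arguments["self"]
    del arguments["repo_id"]
    return save_backend(**arguments)

received = push_to_hub_merged("my_model", "user/repo")
assert received["model"] == "my_model"
assert received["save_directory"] == "user/repo"
assert received["push_to_hub"] is True
assert "self" not in received and "repo_id" not in received
```

Taking `dict(locals())` at the very top of the function is what makes this safe: only the parameters exist in the local scope at that point.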
+def _unsloth_save_torchao_with_attached_config(
+ model,
+ save_directory: Union[str, os.PathLike],
+ tokenizer,
+ push_to_hub: bool = False,
+ token: Optional[Union[str, bool]] = None,
+):
+ """Save a QAT-trained model by converting fake-quantized weights to real quantized weights."""
+ # Convert QAT fake-quantized weights to real quantized weights
+ _convert_torchao_model(model)
+    # PEFT models may also arrive here, so dispatch them through the attached config
+ if isinstance(model, PeftModelForCausalLM):
+ _unsloth_save_torchao_with_given_config(
+ model = model,
+ save_directory = save_directory,
+ tokenizer = tokenizer,
+ torchao_config = model.config.quantization_config,
+ push_to_hub = push_to_hub,
+ token = token,
+ )
+ return
+
+ # TorchAO does not support safe_serialization reliably
+ safe_serialization = False
+
+ if push_to_hub:
+ model.push_to_hub(
+ save_directory, safe_serialization = safe_serialization, token = token
+ )
+ tokenizer.push_to_hub(save_directory, token = token)
+ else:
+ model.save_pretrained(save_directory, safe_serialization = safe_serialization)
+ tokenizer.save_pretrained(save_directory)
+
+
+def _unsloth_save_torchao_with_given_config(
+ model,
+ save_directory: Union[str, os.PathLike],
+ tokenizer,
+ torchao_config,
+ push_to_hub: bool = False,
+ token: Optional[Union[str, bool]] = None,
+):
+ """Quantizes the model with torchao and saves a torchao quantized checkpoint
+
+    Args:
+        `save_directory`: local folder path, or Hugging Face Hub repo ID when `push_to_hub` is set to True, e.g. `my_model`
+        `torchao_config` (TorchAOBaseConfig): configuration for torchao quantization. Full list: https://docs.pytorch.org/ao/main/api_ref_quantization.html#inference-apis-for-quantize
+        `push_to_hub` (bool): whether to push the checkpoint to the Hugging Face Hub or save locally
+ """
+
+ if push_to_hub:
+ assert token is not None, "Unsloth: Please specify a token for uploading!"
+
+ assert (
+ torchao_config is not None
+ ), "Unsloth: Please specify a torchao_config for post-training quantization!"
+
+ # first merge the lora weights
+ arguments = dict(locals())
+ arguments["push_to_hub"] = False # We save ourselves
+ arguments["save_method"] = "merged_16bit" # Must be 16bit
+ del arguments["torchao_config"]
+
+ if not isinstance(model, PeftModelForCausalLM) and not isinstance(model, PeftModel):
+ model.save_pretrained(save_directory)
+ tokenizer.save_pretrained(save_directory)
+ else:
+ unsloth_generic_save(**arguments)
+
+ for _ in range(3):
+ gc.collect()
+
+ from transformers import (
+ AutoModelForCausalLM,
+ AutoTokenizer,
+ TorchAoConfig,
+ AutoModelForImageTextToText,
+ AutoProcessor,
+ )
+ from torchao import quantize_
+
+ if isinstance(torchao_config, TorchAoConfig):
+ quantization_config = torchao_config
+ else:
+ quantization_config = TorchAoConfig(quant_type = torchao_config)
+
+ # Determine if this is a VLM
+ is_vlm = False
+ if hasattr(model, "config") and hasattr(model.config, "architectures"):
+ is_vlm = any(
+ x.endswith(("ForConditionalGeneration", "ForVisionText2Text"))
+ for x in model.config.architectures
+ )
+ is_vlm = is_vlm or hasattr(model.config, "vision_config")
+ auto_model = AutoModelForImageTextToText if is_vlm else AutoModelForCausalLM
+ auto_processor = AutoProcessor if is_vlm else AutoTokenizer
+
+ tokenizer = auto_processor.from_pretrained(save_directory)
+
+ # TorchAO must only use bfloat16 for loading (float16 fails)
+ if HAS_TORCH_DTYPE:
+ kwargs = {"torch_dtype": torch.bfloat16}
+ else:
+ kwargs = {"dtype": torch.bfloat16}
+
+ # Reload with quantization applied
+ quantized_model = auto_model.from_pretrained(
+ save_directory,
+ device_map = "auto",
+ quantization_config = quantization_config,
+ **kwargs,
+ )
+
+ torchao_save_directory = save_directory + "-torchao"
+
+    # TorchAO does not support safe_serialization right now; 0.14.0 seems broken!
+    safe_serialization = False
+
+ if push_to_hub:
+ quantized_model.push_to_hub(
+ torchao_save_directory, safe_serialization = safe_serialization, token = token
+ )
+ tokenizer.push_to_hub(torchao_save_directory, token = token)
+ else:
+ quantized_model.save_pretrained(
+ torchao_save_directory, safe_serialization = safe_serialization
+ )
+ tokenizer.save_pretrained(torchao_save_directory)
+
+ # Clean up the intermediate unquantized model
+ if os.path.exists(save_directory):
+ try:
+ shutil.rmtree(save_directory)
+        except OSError:
+            # Best-effort cleanup; ignore failures removing the intermediate checkpoint
+            pass
+
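The VLM check in the function above relies only on architecture-name suffixes plus the presence of a `vision_config` on the model config. That heuristic can be exercised in isolation; `looks_like_vlm` is a hypothetical helper name for this sketch:

```python
def looks_like_vlm(architectures, has_vision_config = False):
    # Mirrors the heuristic above: suffix match on architecture names,
    # or an explicit vision_config attribute on the model config.
    is_vlm = any(
        name.endswith(("ForConditionalGeneration", "ForVisionText2Text"))
        for name in architectures
    )
    return is_vlm or has_vision_config

assert looks_like_vlm(["LlavaForConditionalGeneration"]) is True
assert looks_like_vlm(["LlamaForCausalLM"]) is False
assert looks_like_vlm(["LlamaForCausalLM"], has_vision_config = True) is True
```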
+
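The `HAS_TORCH_DTYPE` branch above exists because transformers renamed the `torch_dtype` argument of `from_pretrained` to `dtype` in newer releases; the function builds whichever kwarg the installed version understands. A sketch of that selection, with `dtype_kwargs` as a hypothetical helper:

```python
def dtype_kwargs(has_torch_dtype, dtype):
    # Older transformers versions accept `torch_dtype`; newer ones use `dtype`.
    # The HAS_TORCH_DTYPE flag (assumed to be probed at import time) picks one.
    return {"torch_dtype": dtype} if has_torch_dtype else {"dtype": dtype}

assert dtype_kwargs(True, "bfloat16") == {"torch_dtype": "bfloat16"}
assert dtype_kwargs(False, "bfloat16") == {"dtype": "bfloat16"}
```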
+def unsloth_save_pretrained_torchao(
+ self,
+ save_directory: Union[str, os.PathLike],
+ tokenizer = None,
+ torchao_config = None,
+ push_to_hub: bool = False,
+ token: Optional[Union[str, bool]] = None,
+):
+ """Saves a torchao quantized model checkpoint.
+
+ This function handles two mutually exclusive workflows:
+
+ 1. **QAT (Quantization-Aware Training)**: If the model was trained with `qat_scheme`
+ parameter, do NOT pass `torchao_config`. The function will convert the QAT
+ fake-quantized weights to real quantized weights and save directly.
+
+ 2. **PTQ (Post-Training Quantization)**: If you want to apply quantization to a
+ regular model, pass a `torchao_config`. The model must NOT have been trained
+ with `qat_scheme`.
+
+ Args:
+ `save_directory`: local folder path or huggingface hub ID when `push_to_hub` is True
+ `tokenizer`: the tokenizer to save alongside the model
+ `torchao_config` (TorchAOBaseConfig): configuration for torchao quantization.
+ Required for PTQ, must be None for QAT models.
+ Options: https://docs.pytorch.org/ao/main/api_ref_quantization.html#inference-apis-for-quantize
+ `push_to_hub` (bool): whether to push to huggingface hub or save locally
+ `token`: HuggingFace token for pushing to hub
+ """
+ if token is None and push_to_hub:
+ token = get_token()
+
+ has_qat_config = (
+ hasattr(self, "_torchao_config") and self._torchao_config is not None
+ )
+
+ if torchao_config is not None:
+ # PTQ path: user provided a config, model must NOT have QAT config unless PEFT
+ assert not has_qat_config, (
+ "Unsloth: You passed `torchao_config` but this model was trained with `qat_scheme`. "
+ "For QAT models, do not pass `torchao_config` - the quantization config is already "
+ "attached to the model from training."
+ )
+ _unsloth_save_torchao_with_given_config(
+ model = self,
+ save_directory = save_directory,
+ tokenizer = tokenizer,
+ torchao_config = torchao_config,
+ push_to_hub = push_to_hub,
+ token = token,
+ )
+ else:
+ # QAT path: no config provided, model must have QAT config
+ assert has_qat_config, (
+ "Unsloth: No `torchao_config` provided and model was not trained with `qat_scheme`. "
+ "Either train with `qat_scheme` parameter, or provide a `torchao_config` for "
+ "post-training quantization."
+ )
+ _unsloth_save_torchao_with_attached_config(
+ model = self,
+ save_directory = save_directory,
+ tokenizer = tokenizer,
+ push_to_hub = push_to_hub,
+ token = token,
+ )
+
+ for _ in range(3):
+ gc.collect()
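The QAT/PTQ dispatch above is mutually exclusive: passing `torchao_config` on a QAT-trained model, or omitting it on a non-QAT model, is an error. That decision table can be mirrored as a tiny pure-Python sketch (`choose_torchao_path` is a hypothetical name; the config argument is a stand-in value):

```python
def choose_torchao_path(has_qat_config, torchao_config):
    # Mirrors the dispatch above: exactly one of the two workflows may apply.
    if torchao_config is not None:
        if has_qat_config:
            raise ValueError("QAT model: do not pass torchao_config")
        return "ptq"
    if not has_qat_config:
        raise ValueError("no torchao_config and no qat_scheme training")
    return "qat"

assert choose_torchao_path(False, "int4_weight_only") == "ptq"
assert choose_torchao_path(True, None) == "qat"
```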
def not_implemented_save(*args, **kwargs):
- raise NotImplementedError("Unsloth: Sorry GGUF is currently not supported for vision models!")
-pass
+ raise NotImplementedError(
+ "Unsloth: Sorry GGUF is currently not supported for vision models!"
+ )
def patch_saving_functions(model, vision = False):
@@ -2189,7 +3088,6 @@ def patch_saving_functions(model, vision = False):
original_push_to_hub = model.original_push_to_hub
else:
original_push_to_hub = model.push_to_hub
- pass
signature = str(inspect.signature(original_push_to_hub)).replace("NoneType", "None")
signature = signature[1:]
@@ -2254,60 +3152,63 @@ def patch_saving_functions(model, vision = False):
original_model = model
while True:
-
- if original_model.push_to_hub.__name__ != "unsloth_push_to_hub":
+ # Check if push_to_hub exists before accessing its __name__
+ if (
+ hasattr(original_model, "push_to_hub")
+ and original_model.push_to_hub.__name__ != "unsloth_push_to_hub"
+ ):
original_model.original_push_to_hub = original_model.push_to_hub
- original_model.push_to_hub = types.MethodType(unsloth_push_to_hub, original_model)
+ original_model.push_to_hub = types.MethodType(
+ unsloth_push_to_hub, original_model
+ )
if hasattr(original_model, "add_model_tags"):
- original_model.add_model_tags(["unsloth",])
- pass
- pass
+ original_model.add_model_tags(
+ [
+ "unsloth",
+ ]
+ )
- if hasattr(original_model, "model"): original_model = original_model.model
- else: break
- pass
+ if hasattr(original_model, "model"):
+ original_model = original_model.model
+ else:
+ break
# Add saving methods to top level model
if not vision:
if hasattr(model, "config"):
# Counteract tokenizers
- model.push_to_hub_merged = types.MethodType(unsloth_push_to_hub_merged, model)
- model.save_pretrained_merged = types.MethodType(unsloth_save_pretrained_merged, model)
- model.push_to_hub_gguf = types.MethodType(unsloth_push_to_hub_gguf, model)
- model.save_pretrained_gguf = types.MethodType(unsloth_save_pretrained_gguf, model)
- model.push_to_hub_ggml = types.MethodType(unsloth_convert_lora_to_ggml_and_push_to_hub, model)
- model.save_pretrained_ggml = types.MethodType(unsloth_convert_lora_to_ggml_and_save_locally, model)
- pass
+ model.push_to_hub_merged = types.MethodType(
+ unsloth_generic_push_to_hub_merged, model
+ )
+ model.save_pretrained_merged = types.MethodType(
+ unsloth_generic_save_pretrained_merged, model
+ )
+ model.push_to_hub_gguf = types.MethodType(unsloth_push_to_hub_gguf, model)
+ model.save_pretrained_gguf = types.MethodType(
+ unsloth_save_pretrained_gguf, model
+ )
+ model.save_pretrained_torchao = types.MethodType(
+ unsloth_save_pretrained_torchao, model
+ )
+ model.push_to_hub_ggml = types.MethodType(
+ unsloth_convert_lora_to_ggml_and_push_to_hub, model
+ )
+ model.save_pretrained_ggml = types.MethodType(
+ unsloth_convert_lora_to_ggml_and_save_locally, model
+ )
else:
# Vision only 1 option
- model.push_to_hub_merged = types.MethodType(unsloth_generic_push_to_hub_merged, model)
- model.save_pretrained_merged = types.MethodType(unsloth_generic_save_pretrained_merged, model)
- model.push_to_hub_gguf = types.MethodType(not_implemented_save, model)
- model.save_pretrained_gguf = types.MethodType(not_implemented_save, model)
- pass
+ model.push_to_hub_merged = types.MethodType(
+ unsloth_generic_push_to_hub_merged, model
+ )
+ model.save_pretrained_merged = types.MethodType(
+ unsloth_generic_save_pretrained_merged, model
+ )
+ model.push_to_hub_gguf = types.MethodType(unsloth_push_to_hub_gguf, model)
+ model.save_pretrained_gguf = types.MethodType(
+ unsloth_save_pretrained_gguf, model
+ )
+ model.save_pretrained_torchao = types.MethodType(
+ unsloth_save_pretrained_torchao, model
+ )
return model
-pass
-
-def export_model_to_local(model, tokenizer, save_directory, drive_directory):
- """
- Export a fine-tuned model from Colab to your local machine.
-
- Args:
- model: The fine-tuned model to be exported.
- tokenizer: The tokenizer associated with the model.
- save_directory: The directory where the model will be saved in Colab.
- drive_directory: The directory in Google Drive where the model will be saved.
- """
- # Save the model in Colab
- model.save_pretrained(save_directory)
- tokenizer.save_pretrained(save_directory)
-
- # Mount Google Drive
- from google.colab import drive
- drive.mount('/content/drive')
-
- # Copy the model files to Google Drive
- import shutil
- shutil.copytree(save_directory, drive_directory)
-
- print(f"Model saved to {drive_directory} in Google Drive. You can now download it to your local machine.")
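Throughout `patch_saving_functions`, each saving helper is attached to the model instance with `types.MethodType`, which binds a plain function so that the instance is passed as `self`. A self-contained sketch of that binding pattern (`DummyModel` and `unsloth_demo_save` are hypothetical names):

```python
import types

class DummyModel:
    pass

def unsloth_demo_save(self, path):
    # Bound like the real saving methods: `self` is the model instance.
    return f"saving {type(self).__name__} to {path}"

model = DummyModel()
# Same binding pattern as patch_saving_functions above:
model.save_pretrained_merged = types.MethodType(unsloth_demo_save, model)
assert model.save_pretrained_merged("out/") == "saving DummyModel to out/"
```

Binding on the instance (rather than the class) is what lets the patcher walk nested `model.model` wrappers and attach the methods at every level.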