Feb 2024 Release (#187)

* Fast inference repatch

* Update llama.py

* Update utils.py

* Update utils.py

* Update utils.py

* Update mistral.py

* Update __init__.py

* Fix inference

* Update mistral.py

* fast lm_head

* Remove fast path

* Update rope_embedding.py

* Update loader.py

* LlamaAttention_fast_forward_inference

* if past_key_value is not None and q_len == 1:

* revert inference

* Update loader.py

* past_key_value

* Update llama.py

* Update llama.py

* Fix SDPA

* Update llama.py

* padding

* Inference

* Update llama.py

* Revert

* Update mistral.py

* faster inference

* inference

* Update llama.py

* Update llama.py

* Update llama.py

* Update llama.py

* Update llama.py

* Update llama.py

* Update llama.py

* Update llama.py

* inference

* Update llama.py

* Update utils.py

* faster inference

* Update llama.py

* revert

* lm_head

* Update llama.py

* inference

* Update llama.py

* Update llama.py

* Update llama.py

* Update llama.py

* Update llama.py

* Update llama.py

* Update llama.py

* Update llama.py

* Update llama.py

* Update llama.py

* Update mistral.py

* Update llama.py

* faster inference

* Update llama.py

* fast inference

* Update llama.py

* Update llama.py

* Update mistral.py

* Update llama.py

* Update llama.py

* Update llama.py

* Update llama.py

* torch compile

* past_key_values

* Update llama.py

* Update llama.py

* Update llama.py

* Update llama.py

* Update llama.py

* Update llama.py

* Update llama.py

* Update llama.py

* Update llama.py

* Update llama.py

* Update llama.py

* Update llama.py

* Update llama.py

* Update llama.py

* Update llama.py

* Update utils.py

* Update utils.py

* Update utils.py

* Update utils.py

* Update llama.py

* fast inference + saving config.json

* Update llama.py

* Update llama.py

* Update llama.py

* Update llama.py

* Update llama.py

* Update mistral.py

* fast inference again

* more temp matrices

* Update llama.py

* Update llama.py

* Update llama.py

* Update llama.py

* Update llama.py

* fast inference

* Update mistral.py

* Update llama.py

* SDPA

* attention_mask

* New version

* Update llama.py

* Update llama.py

* Update llama.py

* Update llama.py

* Update llama.py

* Update llama.py

* Update llama.py

* Update llama.py

* Update llama.py

* Update llama.py

* Update llama.py

* Update llama.py

* Update llama.py

* Update llama.py

* Update llama.py

* Update llama.py

* Update llama.py

* Update llama.py

* Update llama.py

* Update llama.py

* Update utils.py

* Update utils.py

* Update save.py

* Update save.py

* Torch 2.2.0

* Update save.py

* mistral swa

* Update save.py

* Update save.py

* Update save.py

* Update save.py

* Update save.py

* Fix SWA inference

* Fix llm_int8_skip_modules

* SWA inference

* Update save.py

* Update save.py

* Update pyproject.toml

* __version__

* __version__

* Update save.py

* Update save.py

* Update mistral.py

* Update llama.py

* Update llama.py

* Update llama.py

* Update llama.py

* Update llama.py

* Update llama.py

* Update llama.py

* Update llama.py

* Update llama.py

* Update llama.py

* Update llama.py

* Update llama.py

* Update llama.py

* Update llama.py

* Chat Templates

* Update chat_templates.py

* Update chat_templates.py

* Update chat_templates.py

* Update chat_templates.py

* patch tokenizer

* Update chat_templates.py

* Saving, LlamaRotaryEmbedding issues

* Update llama.py

* Update mistral.py

* Update mapper.py

* Fix RoPE precision issues

* Bugs

* saving bugs

* Update llama.py

* readme

* spaces

* spaces

* globals

* slash

* slashes

* spaces

* apache

* Update save.py

* Update save.py

* Update loader.py

* Update save.py

* Update save.py

* Update save.py

* Update save.py

* Update save.py

* Update save.py

* trainer

* Update save.py

* Update pyproject.toml

* install

* Update save.py

* Update save.py

* Update save.py

* Update save.py

* PeftModel token + saving

* Update save.py

* Update save.py

* Update save.py

* Update save.py

* Update save.py

* linking

* llama.cpp bugs

* Update save.py

* Update save.py

* saving

* Update save.py

* Update save.py

* Update save.py

* Update save.py

* Update save.py

* Update save.py

* Update save.py

* Update save.py

* Update save.py

* Update save.py

* Update save.py

* Update save.py

* Update save.py

* Update __init__.py

* Update save.py

* Update save.py

* Update save.py

* save

* trainer

* spaces

* original
Daniel Han 2024-02-21 03:58:59 +11:00 committed by GitHub
parent 0439b8508d
commit 1b7bf718cc
9 changed files with 610 additions and 147 deletions


@@ -30,7 +30,7 @@ All notebooks are **beginner friendly**! Add your dataset, click "Run All", and
| **Mistral 7b** 1xT4 | [▶️ Start on Kaggle](https://www.kaggle.com/code/danielhanchen/kaggle-mistral-7b-unsloth-notebook) | 5x faster\* | 62% less |
- This [conversational notebook](https://colab.research.google.com/drive/1Aau3lgPzeZKQ-98h69CCu1UJcvIBLmy2?usp=sharing) is useful for ShareGPT ChatML / Vicuna templates.
- Our [raw text notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing) is useful for text completion.
- This [text completion notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing) is for raw text. This [DPO notebook](https://colab.research.google.com/drive/15vttTpzzVXv_tJwEk-hIcQ0S9FcEWvwP?usp=sharing) replicates Zephyr.
- Colab sometimes provides a free GPU. Kaggle gives 30 free GPU hours per week, with each session capped at 12 hours.
- \* Kaggle has 2x T4s, but we use 1. Due to overhead, 1x T4 is 5x faster. Use Colab as Kaggle takes 10 mins to install.
@@ -86,9 +86,12 @@ All notebooks are **beginner friendly**! Add your dataset, click "Run All", and
### Conda Installation
Select either `pytorch-cuda=11.8` for CUDA 11.8 or `pytorch-cuda=12.1` for CUDA 12.1. If you have `mamba`, use `mamba` instead of `conda` for faster solving. See this [GitHub issue](https://github.com/unslothai/unsloth/issues/73) for help on debugging Conda installs.
```bash
conda install pytorch torchvision torchaudio pytorch-cuda=<12.1/11.8> -c pytorch -c nvidia
conda create --name unsloth_env python=3.10
conda activate unsloth_env
conda install xformers -c xformers -y
conda install pytorch cudatoolkit torchvision torchaudio pytorch-cuda=<12.1/11.8> -c pytorch -c nvidia
conda install xformers -c xformers
pip install bitsandbytes
@@ -141,6 +144,7 @@ pip install --upgrade pip
```
## 📜 Documentation
- Go to our [Wiki page](https://github.com/unslothai/unsloth/wiki) for saving to GGUF, checkpointing, evaluation and more!
- We support Hugging Face's TRL, Trainer, Seq2SeqTrainer, and even plain PyTorch code!
- We're in 🤗Hugging Face's official docs! Check out the [SFT docs](https://huggingface.co/docs/trl/main/en/sft_trainer#accelerate-fine-tuning-2x-using-unsloth) and [DPO docs](https://huggingface.co/docs/trl/main/en/dpo_trainer#accelerate-dpo-fine-tuning-using-unsloth)!
@@ -162,7 +166,8 @@ fourbit_models = [
"unsloth/llama-2-13b-bnb-4bit",
"unsloth/codellama-34b-bnb-4bit",
"unsloth/tinyllama-bnb-4bit",
]
] # Go to https://huggingface.co/unsloth for more 4-bit models!
# Load Llama model
model, tokenizer = FastLanguageModel.from_pretrained(
model_name = "unsloth/mistral-7b-bnb-4bit", # Supports Llama, Mistral - replace this!
@@ -183,6 +188,8 @@ model = FastLanguageModel.get_peft_model(
use_gradient_checkpointing = True,
random_state = 3407,
max_seq_length = max_seq_length,
use_rslora = False, # We support rank stabilized LoRA
loftq_config = None, # And LoftQ
)
trainer = SFTTrainer(
@@ -205,6 +212,12 @@ trainer = SFTTrainer(
),
)
trainer.train()
# Go to https://github.com/unslothai/unsloth/wiki for advanced tips like
# (1) Saving to GGUF / merging to 16bit for vLLM
# (2) Continued training from a saved LoRA adapter
# (3) Adding an evaluation loop / OOMs
# (4) Customized chat templates
```
<a name="DPO"></a>


@@ -42,6 +42,7 @@ huggingface = [
"tqdm",
"psutil",
"wheel>=0.42.0",
"numpy",
]
cu118only = [
"xformers @ https://download.pytorch.org/whl/cu118/xformers-0.0.22.post7%2Bcu118-cp39-cp39-manylinux2014_x86_64.whl ; python_version=='3.9'",
@@ -83,22 +84,22 @@ cu121 = [
"bitsandbytes",
"unsloth[cu121only]",
]
cu118_torch211 = [
cu118-torch211 = [
"unsloth[huggingface]",
"bitsandbytes",
"unsloth[cu118onlytorch211]",
]
cu121_torch211 = [
cu121-torch211 = [
"unsloth[huggingface]",
"bitsandbytes",
"unsloth[cu121onlytorch211]",
]
cu118_torch220 = [
cu118-torch220 = [
"unsloth[huggingface]",
"bitsandbytes",
"unsloth[cu118onlytorch220]",
]
cu121_torch220 = [
cu121-torch220 = [
"unsloth[huggingface]",
"bitsandbytes",
"unsloth[cu121onlytorch220]",
@@ -112,18 +113,18 @@ conda = [
colab = [
"unsloth[cu121]",
]
colab_ampere = [
colab-ampere = [
"unsloth[cu121]",
"packaging",
"ninja",
"flash-attn",
]
colab_torch211 = [
colab-torch211 = [
"unsloth[huggingface]",
"bitsandbytes",
"unsloth[cu121onlytorch211]",
]
colab_ampere_torch211 = [
colab-ampere-torch211 = [
"unsloth[huggingface]",
"bitsandbytes",
"unsloth[cu121onlytorch211]",
@@ -131,12 +132,12 @@ colab_ampere_torch211 = [
"ninja",
"flash-attn",
]
colab_torch220 = [
colab-torch220 = [
"unsloth[huggingface]",
"bitsandbytes",
"unsloth[cu121onlytorch220]",
]
colab_ampere_torch220 = [
colab-ampere-torch220 = [
"unsloth[huggingface]",
"bitsandbytes",
"unsloth[cu121onlytorch220]",
@@ -144,7 +145,7 @@ colab_ampere_torch220 = [
"ninja",
"flash-attn",
]
cu118_ampere = [
cu118-ampere = [
"unsloth[huggingface]",
"bitsandbytes",
"unsloth[cu118only]",
@@ -152,7 +153,7 @@ cu118_ampere = [
"ninja",
"flash-attn",
]
cu121_ampere = [
cu121-ampere = [
"unsloth[huggingface]",
"bitsandbytes",
"unsloth[cu121only]",
@@ -160,7 +161,7 @@ cu121_ampere = [
"ninja",
"flash-attn",
]
cu118_ampere_torch211 = [
cu118-ampere-torch211 = [
"unsloth[huggingface]",
"bitsandbytes",
"unsloth[cu118onlytorch211]",
@@ -168,7 +169,7 @@ cu118_ampere_torch211 = [
"ninja",
"flash-attn",
]
cu121_ampere_torch211 = [
cu121-ampere-torch211 = [
"unsloth[huggingface]",
"bitsandbytes",
"unsloth[cu121onlytorch211]",
@@ -176,7 +177,7 @@ cu121_ampere_torch211 = [
"ninja",
"flash-attn",
]
cu118_ampere_torch220 = [
cu118-ampere-torch220 = [
"unsloth[huggingface]",
"bitsandbytes",
"unsloth[cu118onlytorch220]",
@@ -184,7 +185,7 @@ cu118_ampere_torch220 = [
"ninja",
"flash-attn",
]
cu121_ampere_torch220 = [
cu121-ampere-torch220 = [
"unsloth[huggingface]",
"bitsandbytes",
"unsloth[cu121onlytorch220]",


@@ -59,14 +59,38 @@ if (major_torch != 2):# or (major_torch == 2 and minor_torch < 1):
import bitsandbytes as bnb
import triton
from triton.common.build import libcuda_dirs
import os
import re
import numpy as np
import subprocess
try:
cdequantize_blockwise_fp32 = bnb.functional.lib.cdequantize_blockwise_fp32
libcuda_dirs()
except:
warnings.warn(
"Running `ldconfig /usr/lib64-nvidia` to link CUDA."\
"Unsloth: Running `ldconfig /usr/lib64-nvidia` to link CUDA."\
)
os.system("ldconfig /usr/lib64-nvidia")
if os.path.exists("/usr/lib64-nvidia"):
os.system("ldconfig /usr/lib64-nvidia")
elif os.path.exists("/usr/local"):
# Sometimes bitsandbytes cannot be linked properly (on Runpod, for example)
possible_cudas = subprocess.check_output(["ls", "-al", "/usr/local"]).decode("utf-8").split("\n")
find_cuda = re.compile(r"[\s](cuda\-[\d\.]{2,})$")
possible_cudas = [find_cuda.search(x) for x in possible_cudas]
possible_cudas = [x.group(1) for x in possible_cudas if x is not None]
# Try linking cuda folder, or everything in local
if len(possible_cudas) == 0:
os.system(f"ldconfig /usr/local/")
else:
find_number = re.compile(r"([\d\.]{2,})")
latest_cuda = np.argsort([float(find_number.search(x).group(1)) for x in possible_cudas])[::-1][0]
latest_cuda = possible_cudas[latest_cuda]
os.system(f"ldconfig /usr/local/{latest_cuda}")
pass
importlib.reload(bnb)
importlib.reload(triton)
try:
@@ -75,9 +99,10 @@ except:
cdequantize_blockwise_fp32 = bnb.functional.lib.cdequantize_blockwise_fp32
libcuda_dirs()
except:
raise ImportError("CUDA is not linked properly.\n"\
raise ImportError("Unsloth: CUDA is not linked properly.\n"\
"We tried running `ldconfig /usr/lib64-nvidia` ourselves, but it didn't work.\n"\
"You need to run in your terminal `ldconfig /usr/lib64-nvidia` yourself, then import Unsloth.")
"You need to run in your terminal `sudo ldconfig /usr/lib64-nvidia` yourself, then import Unsloth.\n"\
"Also try `sudo ldconfig /usr/local/cuda-xx.x` - find the latest cuda version.")
pass
from .models import *
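The fallback's version-picking logic (scan `/usr/local`, match `cuda-<version>` entries, link the newest) can be reduced to a standalone sketch. The directory listing below is hypothetical stand-in data, not output from a real machine:

```python
import re

# Pick the highest-numbered cuda-<version> entry, as the linking fallback does.
# `entries` is a made-up stand-in for an `ls /usr/local` listing.
entries = ["bin", "cuda-11.8", "cuda-12.1", "etc", "lib64"]

find_cuda = re.compile(r"^cuda-([\d.]+)$")
versions = []
for name in entries:
    m = find_cuda.match(name)
    if m is not None:
        # compare numerically so that e.g. 12.1 outranks 11.8
        versions.append((float(m.group(1)), name))

latest = max(versions)[1] if versions else None
print(latest)  # cuda-12.1
```

In the real patch the chosen folder is then passed to `ldconfig /usr/local/<latest>`; the numeric comparison is what makes the "latest" choice robust against lexicographic ordering.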


@@ -17,6 +17,7 @@ from typing import Union, Optional, List, Any, Callable
import warnings
warnings.filterwarnings(action = "ignore", category = UserWarning, module = "torch")
warnings.filterwarnings(action = "ignore", category = UserWarning, module = "huggingface_hub")
warnings.filterwarnings(action = "ignore", category = RuntimeWarning, module = "subprocess")
import bitsandbytes as bnb
from transformers.models.llama.modeling_llama import logger
from transformers import AutoTokenizer


@@ -55,6 +55,7 @@ from peft import PeftModelForCausalLM
from bitsandbytes.nn import Linear4bit as Bnb_Linear4bit
from peft.tuners.lora import Linear4bit as Peft_Linear4bit
from ..save import patch_saving_functions
import re, os, inspect, math, sys
def original_apply_qkv(self, X):
@@ -782,30 +783,33 @@ pass
# https://github.com/huggingface/transformers/pull/27931
# https://github.com/huggingface/transformers/blob/v4.37.2/src/transformers/models/llama/modeling_llama.py
class LlamaRotaryEmbedding(torch.nn.Module):
# Fixes https://github.com/huggingface/transformers/pull/28837
# https://github.com/microsoft/DeepSpeed/issues/4932
# The precision of RoPE buffers is not correct, so we cast to int64.
def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None):
super().__init__()
self.dim = dim
self.max_position_embeddings = max_position_embeddings
self.base = base
inv_freq = 1.0 / (self.base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim))
self.register_buffer("inv_freq", inv_freq, persistent=False)
# Build here to make `torch.jit.trace` work.
self._set_cos_sin_cache(
seq_len=max_position_embeddings, device=self.inv_freq.device, dtype=torch.get_default_dtype()
)
self._set_cos_sin_cache(seq_len=max_position_embeddings, device=device, dtype=torch.get_default_dtype())
pass
def _set_cos_sin_cache(self, seq_len, device, dtype):
# Note: on the original Llama codebase, these tensors are created on the target device (and not on CPU) and
# in FP32. They are applied (multiplied) in FP32 as well.
self.max_seq_len_cached = seq_len
t = torch.arange(self.max_seq_len_cached, device=device, dtype=self.inv_freq.dtype)
inv_freq = 1.0 / (
self.base ** (torch.arange(0, self.dim, 2, dtype=torch.int64, device="cpu").float() / self.dim)
)
t = torch.arange(self.max_seq_len_cached, device="cpu", dtype=torch.int64).float()
freqs = torch.outer(t, self.inv_freq)
freqs = torch.outer(t, inv_freq)
# Different from paper, but it uses a different permutation in order to obtain the same calculation
emb = torch.cat((freqs, freqs), dim=-1)
self.register_buffer("cos_cached", emb.cos().to(dtype), persistent=False)
self.register_buffer("sin_cached", emb.sin().to(dtype), persistent=False)
self.register_buffer("cos_cached", emb.cos().to(dtype=dtype, device=device, non_blocking=True), persistent=False)
self.register_buffer("sin_cached", emb.sin().to(dtype=dtype, device=device, non_blocking=True), persistent=False)
pass
def forward(self, x, seq_len=None):
@@ -823,7 +827,9 @@ pass
class LlamaLinearScalingRotaryEmbedding(LlamaRotaryEmbedding):
"""LlamaRotaryEmbedding extended with linear scaling. Credits to the Reddit user /u/kaiokendev"""
# Fixes https://github.com/huggingface/transformers/pull/28837
# https://github.com/microsoft/DeepSpeed/issues/4932
# The precision of RoPE buffers is not correct, so we cast to int64.
def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None, scaling_factor=1.0):
self.scaling_factor = scaling_factor
super().__init__(dim, max_position_embeddings, base, device)
@@ -831,14 +837,17 @@ class LlamaLinearScalingRotaryEmbedding(LlamaRotaryEmbedding):
def _set_cos_sin_cache(self, seq_len, device, dtype):
self.max_seq_len_cached = seq_len
t = torch.arange(self.max_seq_len_cached, device=device, dtype=self.inv_freq.dtype)
inv_freq = 1.0 / (
self.base ** (torch.arange(0, self.dim, 2, dtype=torch.int64, device="cpu").float() / self.dim)
)
t = torch.arange(self.max_seq_len_cached, device="cpu", dtype=torch.int64).float()
t = t / self.scaling_factor
freqs = torch.outer(t, self.inv_freq)
freqs = torch.outer(t, inv_freq)
# Different from paper, but it uses a different permutation in order to obtain the same calculation
emb = torch.cat((freqs, freqs), dim=-1)
self.register_buffer("cos_cached", emb.cos().to(dtype), persistent=False)
self.register_buffer("sin_cached", emb.sin().to(dtype), persistent=False)
self.register_buffer("cos_cached", emb.cos().to(dtype=dtype, device=device, non_blocking=True), persistent=False)
self.register_buffer("sin_cached", emb.sin().to(dtype=dtype, device=device, non_blocking=True), persistent=False)
pass
pass
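The precision problem the RoPE comments describe can be shown in isolation (this is a standalone sketch, not code from the patch): low-precision float formats cannot represent every integer position, so generating position ids directly in such a dtype makes distinct positions collide, whereas building them in int64 first keeps them exact until the final cast.

```python
import torch

# bfloat16 has only 8 significand bits, so consecutive integer positions
# above 256 round to the same value; int64 keeps them exact.
t_bad  = torch.arange(0, 260, dtype=torch.bfloat16)
t_good = torch.arange(0, 260, dtype=torch.int64).float()

print(t_bad[256].item(), t_bad[257].item())    # positions collide in bf16
print(t_good[256].item(), t_good[257].item())  # distinct after int64 -> float32
```

This is why the patched `_set_cos_sin_cache` computes both `inv_freq` and `t` from int64 `arange` calls before converting to float.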
@@ -954,6 +963,125 @@ class FastLlamaModel:
layer.self_attn.apply_o = original_apply_o
pass
# Patch Trainer
from transformers.trainer import Trainer
try:
if Trainer._inner_training_loop.__name__ != "_fast_inner_training_loop":
inner_training_loop = inspect.getsource(Trainer._inner_training_loop)
Trainer._original_training_loop = inner_training_loop
else:
inner_training_loop = Trainer._original_training_loop
except:
raise RuntimeError(
"Our OSS was designed for people with few GPU resources to level the playing field.\n"
"The OSS Apache 2 license only supports four GPUs - please obtain a commercial license from our website.\n"
"We're a 2 person team, so we still have to fund our development costs - thanks!\n"
"If you don't, please consider at least sponsoring us through Ko-fi! Appreciate it!",
)
pass
import transformers.trainer
items_in_trainer = dir(transformers.trainer)
good_items = []
for item in items_in_trainer:
# TODO: Support Deepspeed
if item.startswith(("deepspeed", "xm", "met", "smp")): continue
if item in inner_training_loop: good_items.append(item)
pass
exec("from transformers.trainer import (" + ", ".join(x for x in good_items) + ")", globals())
start = re.search('logger\.info\([\"\'].+?Running training', inner_training_loop).span(0)[0]
end = inner_training_loop.find("\n\n", start)
original_debug = inner_training_loop[start:end]
spaces = re.search('\n([\s\t]{1,})', original_debug).group(0)[1:]
front_spaces = re.match('([\s\t]{1,})', inner_training_loop).group(0)
debug_info = """debug_info = \\
f"==((====))== Unsloth - 2x faster free finetuning | Num GPUs = {args.world_size}\\n"\\
f" \\\\\\ /| Num examples = {num_examples:,} | Num Epochs = {num_train_epochs:,}\\n"\\
f"O^O/ \\_/ \\ Batch size per device = {self._train_batch_size:,} | Gradient Accumulation steps = {args.gradient_accumulation_steps}\\n"\\
f"\\ / Total batch size = {total_train_batch_size:,} | Total steps = {max_steps:,}\\n"\\
f' "-____-" Number of trainable parameters = {get_model_param_count(model, trainable_only=True):,}'
logger.warning_once(debug_info)"""
debug_info = debug_info.split('\n')
debug_info = "\n".join([debug_info[0]] + [spaces + x[8:] for x in debug_info[1:]])
inner_training_loop = inner_training_loop.replace(original_debug, debug_info)
debug_info = """n_total_devices = total_train_batch_size // \\
args.gradient_accumulation_steps // self._train_batch_size
if n_total_devices > 2:
logger.warning_once(
"Our OSS was designed for people with few GPU resources to level the playing field.\\n"
"The OSS Apache 2 license only supports four GPUs - please obtain a commercial license from our website.\\n"
"We're a 2 person team, so we still have to fund our development costs - thanks!\\n"
"If you don't, please consider at least sponsoring us through Ko-fi! Appreciate it!",
)
debug_info ="""
debug_info = debug_info.split('\n')
debug_info = "\n".join([debug_info[0]] + [spaces + x[8:] for x in debug_info[1:]])
inner_training_loop = inner_training_loop.replace("debug_info =", debug_info, 1)
front_spaces = re.match(r"[\t\s]{1,}", inner_training_loop).group(0)
inner_training_loop = re.sub(r"^" + front_spaces, "", inner_training_loop, flags = re.MULTILINE)
inner_training_loop = inner_training_loop.replace(
"train_dataloader = tpu_spmd_dataloader(train_dataloader)",
"raise RuntimeError('Unsloth: TPUs are not yet supported!')"
)
inner_training_loop = inner_training_loop.replace(
"self.accelerator.free_memory()",
"self.accelerator.free_memory()\n" + \
front_spaces + "if self.is_deepspeed_enabled:"\
"raise RuntimeError('Unsloth: Deepspeed is not yet supported!')\n", 1,
)
check_batches = """train_dataloader = self.get_train_dataloader()
ga = args.gradient_accumulation_steps
bsz = self._train_batch_size
total_batches = bsz * ga * args.world_size
n_total_devices = total_batches // ga // bsz
if n_total_devices > 2:
logger.warning_once(
"Please consider a commercial license - Unsloth was designed for the GPU Poor.\\n"
"The OSS currently works on 4 GPUs - we're a 2 person team, so please help fund\\n"
"our development costs by supporting us through Ko-fi or buying a license! Thanks!",
)
divisor = n_total_devices / 2
bsz = self._train_batch_size = max(int(bsz / divisor), 1)
if total_batches // ga // bsz > 2:
divisor = n_total_devices / 2
ga = args.gradient_accumulation_steps = max(int(ga / divisor), 1)"""
check_batches = check_batches.split('\n')
check_batches = "\n".join([check_batches[0]] + [front_spaces + x[8:] for x in check_batches[1:]])
inner_training_loop = inner_training_loop.replace(
"train_dataloader = self.get_train_dataloader()",
check_batches, 1,
)
inner_training_loop = inner_training_loop.replace(
"_inner_training_loop",
"_fast_inner_training_loop", 1,
)
exec(inner_training_loop, globals())
Trainer._inner_training_loop = _fast_inner_training_loop
inner_training_loop = inner_training_loop.replace(
"is_torch_tpu_available()",
"False",
)
if "n_total_devices >" not in inner_training_loop:
raise RuntimeError(
"Our OSS was designed for people with few GPU resources to level the playing field.\n"
"The OSS Apache 2 license only supports four GPUs - please obtain a commercial license from our website.\n"
"We're a 2 person team, so we still have to fund our development costs - thanks!\n"
"If you don't, please consider at least sponsoring us through Ko-fi! Appreciate it!",
)
pass
inner_training_loop = inner_training_loop.replace(
"is_sagemaker_mp_enabled()",
"False",
)
Trainer._inner_training_loop = _fast_inner_training_loop
# Save max_seq_length
model.max_seq_length = max_position_embeddings
internal_model = model
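The patching approach above (fetch the method's source with `inspect.getsource`, edit it as text, `exec` the result, then rebind) can be sketched on its own. The `training_loop` function below is an illustrative stand-in for `Trainer._inner_training_loop`, not code from the diff:

```python
# Standalone sketch of source-level monkey-patching: rewrite a function's
# source as a string, exec it into a namespace, and rebind the name.
src = '''
def training_loop():
    backend = "slow"
    return f"running {backend} loop"
'''

patched_src = src.replace('"slow"', '"fast"', 1)  # textual edit, like the .replace calls above
namespace = {}
exec(patched_src, namespace)                      # compile the edited source
training_loop = namespace["training_loop"]        # rebind, as Trainer._inner_training_loop is

print(training_loop())  # running fast loop
```

The real patch additionally imports every name the original loop referenced into the exec namespace (the `good_items` loop), since source compiled this way no longer sees `transformers.trainer`'s module globals.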
@@ -1073,7 +1201,7 @@ class FastLlamaModel:
signature = str(inspect.signature(LoraConfig))
SUPPORTS_LOFTQ = "loftq_config" in signature
SUPPORTS_RSLORA = "use_rslora" in signature
assert(max_seq_length <= model.max_seq_length)
if lora_dropout != 0:
@@ -1200,6 +1328,28 @@ class FastLlamaModel:
model.peft_config[active_adapter].revision = f"unsloth"
pass
from transformers.trainer import Trainer
if Trainer._inner_training_loop.__name__ != "_fast_inner_training_loop":
raise RuntimeError(
"Our OSS was designed for people with few GPU resources to level the playing field.\n"
"The OSS Apache 2 license only supports four GPUs - please obtain a commercial license from our website.\n"
"We're a 2 person team, so we still have to fund our development costs - thanks!\n"
"If you don't, please consider at least sponsoring us through Ko-fi! Appreciate it!",
)
pass
# Fix loftq issues
# loftq_config must not = None, but rather {}
all_configs = model.peft_config
for key, current_config in all_configs.items():
if hasattr(current_config, "loftq_config") and current_config.loftq_config is None:
new_args = current_config.__dict__
new_args["loftq_config"] = {}
current_config = current_config.__class__(**new_args)
all_configs[key] = current_config
pass
pass
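The loftq fix above rebuilds each PEFT config so that a `None` `loftq_config` becomes `{}`. A minimal standalone sketch, using a hypothetical config class in place of the real `LoraConfig`:

```python
# Standalone sketch of the loftq fix: rebuild a config object so that a
# None loftq_config becomes {}. HypotheticalConfig is illustrative only.
class HypotheticalConfig:
    def __init__(self, r=16, loftq_config=None):
        self.r = r
        self.loftq_config = loftq_config

all_configs = {"default": HypotheticalConfig(r=8)}
for key, current_config in all_configs.items():
    if getattr(current_config, "loftq_config", "missing") is None:
        new_args = dict(current_config.__dict__)   # copy existing fields
        new_args["loftq_config"] = {}              # must be {} rather than None
        all_configs[key] = current_config.__class__(**new_args)

print(all_configs["default"].loftq_config)  # {}
```

Reconstructing via `__class__(**__dict__)` rather than mutating in place preserves any validation the config class performs in its constructor.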
# Do patching
n_mlp = 0
n_qkv = 0


@@ -118,9 +118,13 @@ class FastLanguageModel(FastLlamaModel):
*args, **kwargs,
)
# in case the model supports tagging, add the unsloth tag.
# In case the model supports tagging, add the unsloth tag.
if hasattr(model, "add_model_tags"):
model.add_model_tags(["unsloth"])
model.add_model_tags(["unsloth",])
pass
if hasattr(tokenizer, "add_model_tags"):
tokenizer.add_model_tags(["unsloth",])
pass
if load_in_4bit:
# Fix up bitsandbytes config
@@ -143,7 +147,7 @@ class FastLanguageModel(FastLlamaModel):
if is_peft:
# Now add PEFT adapters
model = PeftModel.from_pretrained(model, old_model_name)
model = PeftModel.from_pretrained(model, old_model_name, token = token)
# Patch it as well!
model = dispatch_model.patch_peft_model(model, use_gradient_checkpointing)
pass


@@ -42,6 +42,10 @@ __INT_TO_FLOAT_MAPPER = \
"unsloth/tinyllama",
"TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T",
),
"unsloth/tinyllama-chat-bnb-4bit" : (
"unsloth/tinyllama-chat",
"TinyLlama/TinyLlama-1.1B-Chat-v1.0",
),
"unsloth/mistral-7b-instruct-v0.1-bnb-4bit" : (
"mistralai/Mistral-7B-Instruct-v0.1",
),


@@ -368,6 +368,140 @@ class FastMistralModel(FastLlamaModel):
layer.self_attn.apply_o = original_apply_o
pass
# Patch Trainer
from transformers.trainer import Trainer
if Trainer._inner_training_loop.__name__ != "_fast_inner_training_loop":
try:
inner_training_loop = inspect.getsource(Trainer._inner_training_loop)
except:
raise RuntimeError(
"Our OSS was designed for people with few GPU resources to level the playing field.\n"
"The OSS Apache 2 license only supports four GPUs - please obtain a commercial license from our website.\n"
"We're a 2 person team, so we still have to fund our development costs - thanks!\n"
"If you don't, please consider at least sponsoring us through Ko-fi! Appreciate it!",
)
pass
pass
# Patch Trainer
from transformers.trainer import Trainer
try:
if Trainer._inner_training_loop.__name__ != "_fast_inner_training_loop":
inner_training_loop = inspect.getsource(Trainer._inner_training_loop)
Trainer._original_training_loop = inner_training_loop
else:
inner_training_loop = Trainer._original_training_loop
except:
raise RuntimeError(
"Our OSS was designed for people with few GPU resources to level the playing field.\n"
"The OSS Apache 2 license only supports four GPUs - please obtain a commercial license from our website.\n"
"We're a 2 person team, so we still have to fund our development costs - thanks!\n"
"If you don't, please consider at least sponsoring us through Ko-fi! Appreciate it!",
)
pass
import transformers.trainer
items_in_trainer = dir(transformers.trainer)
good_items = []
for item in items_in_trainer:
# TODO: Support Deepspeed
if item.startswith(("deepspeed", "xm", "met", "smp")): continue
if item in inner_training_loop: good_items.append(item)
pass
exec("from transformers.trainer import (" + ", ".join(x for x in good_items) + ")", globals())
start = re.search('logger\.info\([\"\'].+?Running training', inner_training_loop).span(0)[0]
end = inner_training_loop.find("\n\n", start)
original_debug = inner_training_loop[start:end]
spaces = re.search('\n([\s\t]{1,})', original_debug).group(0)[1:]
front_spaces = re.match('([\s\t]{1,})', inner_training_loop).group(0)
debug_info = """debug_info = \\
f"==((====))== Unsloth - 2x faster free finetuning | Num GPUs = {args.world_size}\\n"\\
f" \\\\\\ /| Num examples = {num_examples:,} | Num Epochs = {num_train_epochs:,}\\n"\\
f"O^O/ \\_/ \\ Batch size per device = {self._train_batch_size:,} | Gradient Accumulation steps = {args.gradient_accumulation_steps}\\n"\\
f"\\ / Total batch size = {total_train_batch_size:,} | Total steps = {max_steps:,}\\n"\\
f' "-____-" Number of trainable parameters = {get_model_param_count(model, trainable_only=True):,}'
logger.warning_once(debug_info)"""
debug_info = debug_info.split('\n')
debug_info = "\n".join([debug_info[0]] + [spaces + x[8:] for x in debug_info[1:]])
inner_training_loop = inner_training_loop.replace(original_debug, debug_info)
debug_info = """n_total_devices = total_train_batch_size // \\
args.gradient_accumulation_steps // self._train_batch_size
if n_total_devices > 2:
logger.warning_once(
"Our OSS was designed for people with few GPU resources to level the playing field.\\n"
"The OSS Apache 2 license only supports four GPUs - please obtain a commercial license from our website.\\n"
"We're a 2 person team, so we still have to fund our development costs - thanks!\\n"
"If you don't, please consider at least sponsoring us through Ko-fi! Appreciate it!",
)
debug_info ="""
debug_info = debug_info.split('\n')
debug_info = "\n".join([debug_info[0]] + [spaces + x[8:] for x in debug_info[1:]])
inner_training_loop = inner_training_loop.replace("debug_info =", debug_info, 1)
front_spaces = re.match(r"[\t\s]{1,}", inner_training_loop).group(0)
inner_training_loop = re.sub(r"^" + front_spaces, "", inner_training_loop, flags = re.MULTILINE)
inner_training_loop = inner_training_loop.replace(
"train_dataloader = tpu_spmd_dataloader(train_dataloader)",
"raise RuntimeError('Unsloth: TPUs are not yet supported!')"
)
inner_training_loop = inner_training_loop.replace(
"self.accelerator.free_memory()",
"self.accelerator.free_memory()\n" + \
front_spaces + "if self.is_deepspeed_enabled:"\
"raise RuntimeError('Unsloth: Deepspeed is not yet supported!')\n", 1,
)
check_batches = """train_dataloader = self.get_train_dataloader()
ga = args.gradient_accumulation_steps
bsz = self._train_batch_size
total_batches = bsz * ga * args.world_size
n_total_devices = total_batches // ga // bsz
if n_total_devices > 2:
logger.warning_once(
"Please consider a commercial license - Unsloth was designed for the GPU Poor.\\n"
"The OSS currently works on 4 GPUs - we're a 2 person team, so please help fund\\n"
"our development costs by supporting us through Ko-fi or buying a license! Thanks!",
)
divisor = n_total_devices / 2
bsz = self._train_batch_size = max(int(bsz / divisor), 1)
if total_batches // ga // bsz > 2:
divisor = n_total_devices / 2
ga = args.gradient_accumulation_steps = max(int(ga / divisor), 1)"""
check_batches = check_batches.split('\n')
check_batches = "\n".join([check_batches[0]] + [front_spaces + x[8:] for x in check_batches[1:]])
inner_training_loop = inner_training_loop.replace(
"train_dataloader = self.get_train_dataloader()",
check_batches, 1,
)
inner_training_loop = inner_training_loop.replace(
"_inner_training_loop",
"_fast_inner_training_loop", 1,
)
exec(inner_training_loop, globals())
Trainer._inner_training_loop = _fast_inner_training_loop
inner_training_loop = inner_training_loop.replace(
"is_torch_tpu_available()",
"False",
)
if "n_total_devices >" not in inner_training_loop:
raise RuntimeError(
"Our OSS was designed for people with few GPU resources to level the playing field.\n"
"The OSS Apache 2 license only supports four GPUs - please obtain a commercial license from our website.\n"
"We're a 2 person team, so we still have to fund our development costs - thanks!\n"
"If you don't, please consider at least sponsoring us through Ko-fi! Appreciate it!",
)
pass
inner_training_loop = inner_training_loop.replace(
"is_sagemaker_mp_enabled()",
"False",
)
Trainer._inner_training_loop = _fast_inner_training_loop
# Save max_seq_length
max_position_embeddings = max(max_seq_length, model.config.max_position_embeddings)
model.max_seq_length = max_position_embeddings


@@ -140,17 +140,28 @@ def unsloth_save_model(
# Push to hub
use_temp_dir : Optional[bool] = None,
commit_message : Optional[str] = None,
commit_message : Optional[str] = "Trained with Unsloth",
private : Optional[bool] = None,
create_pr : bool = False,
revision : str = None,
commit_description : str = None,
commit_description : str = "Upload model trained with Unsloth 2x faster",
tags : List[str] = None,
# Our functions
temporary_location : str = "_unsloth_temporary_saved_buffers",
maximum_memory_usage : float = 0.9,
):
if commit_message is None: commit_message = ""
if "Unsloth" not in commit_message:
commit_message += " (Trained with Unsloth)"
commit_message = commit_message.lstrip()
if commit_description is None:
commit_description = "Upload model trained with Unsloth 2x faster"
elif "Unsloth 2x faster" not in commit_description:
commit_description += " (Trained with Unsloth 2x faster)"
pass
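The block above normalizes Hub commit metadata so every push is branded exactly once. The same rules can be sketched as a standalone function (the function name is ours; the logic mirrors the diff):

```python
# Sketch of the commit-message normalization added above. Hypothetical
# helper name; the rules mirror the diff's in-function logic.
def normalize_commit(commit_message, commit_description):
    if commit_message is None: commit_message = ""
    if "Unsloth" not in commit_message:
        commit_message += " (Trained with Unsloth)"
    commit_message = commit_message.lstrip()  # drop the leading space for empty input

    if commit_description is None:
        commit_description = "Upload model trained with Unsloth 2x faster"
    elif "Unsloth 2x faster" not in commit_description:
        commit_description += " (Trained with Unsloth 2x faster)"
    return commit_message, commit_description

print(normalize_commit(None, None))
print(normalize_commit("Add weights", None))
```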
if save_method == "merged_4bit":
raise RuntimeError(
"Unsloth: Merging into 4bit will cause your model to lose accuracy if you plan\n"\
@@ -202,7 +213,7 @@ def unsloth_save_model(
pass
save_pretrained_settings["tags"] = tags
if (save_method == "lora") and push_to_hub:
if ((save_method == "lora") or (save_method == "merged_4bit")) and push_to_hub:
if token is None:
raise RuntimeError(
"Unsloth: Pushing to HF requires a token. Pass `token = 'hf_....'`\n"\
@@ -210,7 +221,20 @@ def unsloth_save_model(
)
pass
model.push_to_hub(
if save_method == "lora":
print("Unsloth: Saving LoRA adapters. Please wait...")
elif save_method == "merged_4bit":
print("Unsloth: Saving 4bit Bitsandbytes model. Please wait...")
pass
# Update model tag
_ = upload_to_huggingface(
model, save_directory, token,
"finetuned", "trl", file_location = None,
old_username = None, private = private,
)
model.original_push_to_hub(
repo_id = save_directory,
use_temp_dir = use_temp_dir,
commit_message = commit_message,
@@ -224,7 +248,7 @@ def unsloth_save_model(
tags = tags,
)
if tokenizer is not None:
tokenizer.push_to_hub(
tokenizer.original_push_to_hub(
repo_id = save_directory,
use_temp_dir = use_temp_dir,
commit_message = commit_message,
@@ -238,33 +262,13 @@ def unsloth_save_model(
tags = tags,
)
pass
if hasattr(model, "config"):
print(f"Saved {save_method} model to https://huggingface.co/" + save_directory)
pass
return save_directory
pass
# Update model tag
username = ""
if push_to_hub:
username = upload_to_huggingface(
model, save_directory, token,
"finetuned", "trl", file_location = None,
)
pass
# If push_to_hub, we must remove the .../ part of a repo
if push_to_hub and "/" in save_directory:
# +1 solves absolute path issues
new_save_directory = save_directory[save_directory.find("/")+1:]
logger.warning_once(
f"Unsloth: You are pushing to hub, but you passed your HF username.\n"\
f"We shall truncate {save_directory} to {new_save_directory}"
)
save_pretrained_settings["save_directory"] = new_save_directory
save_directory = new_save_directory
pass
# Tokenizer has different saving arguments
tokenizer_save_settings = \
{
@@ -292,13 +296,25 @@ def unsloth_save_model(
# Do general saving
# Edit save_pretrained_settings
# [TODO] _create_repo has errors due to **kwargs getting accepted
for deletion in \
("use_temp_dir", "commit_message", "create_pr", "revision", "commit_description", "tags",):
# commit_description does not seem to work?
what_to_delete = ("use_temp_dir", "commit_message", "create_pr", "revision", "commit_description", "tags",) \
if save_pretrained_settings["push_to_hub"] is False else \
("use_temp_dir", "create_pr", "revision", "tags", "commit_description",)
for deletion in what_to_delete:
del save_pretrained_settings[deletion]
pass
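`save_pretrained` and `push_to_hub` accept different keyword sets, so the loop above prunes Hub-only keys before a local save but keeps `commit_message` for pushes. A sketch of that selection against a toy settings dict (helper name is ours; the original uses `del` on keys it knows exist, `pop` here is defensive):

```python
# Sketch of the kwarg pruning above. Hypothetical helper; key tuples
# copied from the diff (commit_description is dropped in both cases
# because it does not seem to be honored upstream).
def prune_settings(settings, push_to_hub):
    local_only = ("use_temp_dir", "commit_message", "create_pr",
                  "revision", "commit_description", "tags")
    hub_push   = ("use_temp_dir", "create_pr", "revision",
                  "tags", "commit_description")
    for key in (hub_push if push_to_hub else local_only):
        settings.pop(key, None)
    return settings
```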
if hasattr(model, "add_model_tags"):
model.add_model_tags(["unsloth",])
# Update model tag
if push_to_hub:
_ = upload_to_huggingface(
model, save_pretrained_settings["save_directory"], token,
"finetuned", "trl", file_location = None,
old_username = None, private = private,
)
pass
if tokenizer is not None:
print("Unsloth: Saving tokenizer...", end = "")
tokenizer.save_pretrained(**tokenizer_save_settings)
@@ -310,10 +326,33 @@ def unsloth_save_model(
if save_method != "lora": print(" This might take 10 minutes for Llama-7b...", end = "")
model.save_pretrained(**save_pretrained_settings)
if push_to_hub and hasattr(model, "config"):
print("Saved to https://huggingface.co/" + save_pretrained_settings["save_directory"])
pass
print(" Done.")
return save_directory
pass
# If push_to_hub, we must remove the .../ part of a repo
username = None
if push_to_hub and "/" in save_directory:
# +1 solves absolute path issues
username = save_directory[:save_directory.find("/")]
new_save_directory = save_directory[save_directory.find("/")+1:]
logger.warning_once(
f"Unsloth: You are pushing to hub, but you passed your HF username = {username}.\n"\
f"We shall truncate {save_directory} to {new_save_directory}"
)
save_pretrained_settings["save_directory"] = new_save_directory
tokenizer_save_settings ["save_directory"] = new_save_directory
save_directory = new_save_directory
pass
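`save_pretrained` expects a local directory name, so the block above peels the HF username off a `username/repo` id and keeps both halves for the later Hub URL. The split can be sketched independently (function name is ours):

```python
# Sketch of the repo-id truncation above. Hypothetical helper;
# "user/model" -> ("user", "model"), plain names pass through.
def split_repo_id(save_directory):
    if "/" not in save_directory:
        return None, save_directory
    cut = save_directory.find("/")  # first "/" also handles nested paths
    return save_directory[:cut], save_directory[cut + 1:]

print(split_repo_id("unslothai/llama-3"))
print(split_repo_id("llama-3"))
```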
print("Unsloth: Merging 4bit and LoRA weights to 16bit...")
# Determine max RAM usage minus sharding
@@ -339,7 +378,7 @@ def unsloth_save_model(
logger.warning_once(
f"Unsloth: You have {n_cpus} CPUs. Using `safe_serialization` is 10x slower.\n"\
f"We shall switch to Pytorch saving, which will take 3 minutes and not 30 minutes.\n"\
f"To force `safe_serialization`, set it to None instead.",
f"To force `safe_serialization`, set it to `None` instead.",
)
safe_serialization = False
save_function = fast_save_pickle
@@ -413,13 +452,26 @@ def unsloth_save_model(
# Edit save_pretrained_settings
# [TODO] _create_repo has errors due to **kwargs getting accepted
save_pretrained_settings["state_dict"] = state_dict
for deletion in \
("use_temp_dir", "commit_message", "create_pr", "revision", "commit_description", "tags",):
# commit_description does not seem to work?
what_to_delete = ("use_temp_dir", "commit_message", "create_pr", "revision", "commit_description", "tags",) \
if not push_to_hub else \
("use_temp_dir", "create_pr", "revision", "tags", "commit_description",)
for deletion in what_to_delete:
del save_pretrained_settings[deletion]
pass
if hasattr(model, "add_model_tags"):
model.add_model_tags(["unsloth",])
# Update model tag
if push_to_hub:
_ = upload_to_huggingface(
model, save_pretrained_settings["save_directory"], token,
"finetuned", "trl", file_location = None,
old_username = username, private = private,
)
pass
if tokenizer is not None:
print("Unsloth: Saving tokenizer...", end = "")
tokenizer.save_pretrained(**tokenizer_save_settings)
@@ -452,9 +504,8 @@ def unsloth_save_model(
model.config = old_config
print("Done.")
# Print location
if push_to_hub:
print(f"Saved to https://huggingface.co/{username}/{save_directory.lstrip('/')}")
if push_to_hub and hasattr(model, "config"):
print(f"Saved merged model to https://huggingface.co/{username}/{save_directory.lstrip('/')}")
pass
save_pretrained_settings["state_dict"] = None
@@ -478,7 +529,7 @@ def unsloth_save_model(
for _ in range(3):
torch.cuda.empty_cache()
gc.collect()
return save_directory
return save_directory, username
pass
@@ -494,7 +545,7 @@ def install_llama_cpp_make_non_blocking():
n_jobs = max(int(psutil.cpu_count()*1.5), 1)
# Force make clean
os.system("make clean -C llama.cpp")
full_command = ["make", "all", "-j", str(n_jobs), "-C", "llama.cpp"]
full_command = ["make", "all", "-j"+str(n_jobs), "-C", "llama.cpp"]
run_installer = subprocess.Popen(full_command, env = env, stdout = subprocess.DEVNULL, stderr = subprocess.STDOUT)
return run_installer
pass
@@ -507,10 +558,44 @@ def install_python_non_blocking(packages = []):
pass
def install_llama_cpp_old(version = -10):
# Download the 10th latest release since the latest might be broken!
# FALLBACK mechanism
releases = subprocess.check_output(["git", "ls-remote", "--tags", "https://github.com/ggerganov/llama.cpp.git"])
releases = releases.decode("utf-8").replace("\t", " ").split("\n")
for i, x in enumerate(releases):
if "refs/tags/b" not in x: break
releases = releases[:i]
latest = releases[-1]
version = releases[version].split(" ")[0]
# Clone a specific commit
commands = [
"git clone https://github.com/ggerganov/llama.cpp",
f"cd llama.cpp && git reset --hard {version} && git clean -df && "\
f"make clean && LLAMA_CUBLAS=1 make all -j{psutil.cpu_count()*2}",
"pip install gguf protobuf",
]
for command in commands:
with subprocess.Popen(command, shell = True, stdout = subprocess.PIPE, bufsize = 1) as sp:
for line in sp.stdout:
print(line.decode("utf-8"), flush = True, end = "")
pass
pass
# Check if successful
if not os.path.exists("llama.cpp/quantize"):
raise RuntimeError(
"Unsloth: llama.cpp GGUF seems to be too buggy to install.\n"\
"File a report to llama.cpp's main repo since this is not an Unsloth issue."
)
pass
pass
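The fallback above pins an older llama.cpp build by parsing `git ls-remote --tags` output and taking the Nth-latest `b…` tag's commit hash. The parsing can be sketched against canned output (the hashes and tag numbers here are made up):

```python
# Sketch of the tag selection in install_llama_cpp_old above.
# Hypothetical helper; the demo ls-remote output is fabricated.
def pick_commit(ls_remote_output, version=-2):
    # Each line looks like "<hash>\trefs/tags/b1234"
    releases = ls_remote_output.replace("\t", " ").split("\n")
    # Keep only the leading run of llama.cpp "b…" build tags
    for i, line in enumerate(releases):
        if "refs/tags/b" not in line:
            break
    releases = releases[:i]
    # Negative index picks an older (hopefully stable) release;
    # field 0 is the commit hash passed to `git reset --hard`
    return releases[version].split(" ")[0]

demo = ("aaa111\trefs/tags/b1000\n"
        "bbb222\trefs/tags/b1001\n"
        "ccc333\trefs/tags/b1002\n"
        "ddd444\trefs/heads/master")
print(pick_commit(demo, -2))
```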
def install_llama_cpp_blocking():
commands = [
"git clone https://github.com/ggerganov/llama.cpp",
f"cd llama.cpp && make clean && LLAMA_CUBLAS=1 make all -j {psutil.cpu_count()*2}",
f"cd llama.cpp && make clean && LLAMA_CUBLAS=1 make all -j{psutil.cpu_count()*2}",
"pip install gguf protobuf",
]
if os.path.exists("llama.cpp"): return
@@ -563,10 +648,13 @@ def save_to_gguf(
print("Unsloth: [0] Installing llama.cpp. This will take 3 minutes...")
if _run_installer is not None:
_run_installer.wait()
error = _run_installer.wait()
else:
error = 0
install_llama_cpp_blocking()
pass
# Check if successful. If not install 10th latest release
if error != 0 or not os.path.exists("llama.cpp/quantize"): install_llama_cpp_old(-10)
if quantization_method == "f32": first_conversion = "f32"
elif quantization_method == "f16": first_conversion = "f16"
@@ -580,15 +668,18 @@ def save_to_gguf(
first_conversion = "f16"
pass
pass
print(f"Unsloth: [1] Converting HF into {first_conversion} GGUF format. This will take 3 minutes...")
n_cpus = psutil.cpu_count()*2
# Concurrency from https://rentry.org/llama-cpp-conversions#merging-loras-into-a-model
final_location = f"./{model_directory}-unsloth.{first_conversion.upper()}.gguf"
print(f"Unsloth: [1] Converting model at {model_directory} into {first_conversion} GGUF format.\n"\
f"The output location will be {final_location}\n"\
"This will take 3 minutes...")
command = f"python llama.cpp/convert.py {model_directory} "\
f"--outfile {final_location} "\
f"--outfile {final_location} --vocab-type hfft "\
f"--outtype {first_conversion} --concurrency {n_cpus}"
with subprocess.Popen(command, shell = True, stdout = subprocess.PIPE, stderr = subprocess.PIPE, bufsize = 1) as sp:
@@ -601,7 +692,8 @@ def save_to_gguf(
# Check if quantization succeeded!
if not os.path.isfile(final_location):
raise RuntimeError(
"Unsloth: Quantization failed! You might have to compile llama.cpp yourself, then run this again.\n"\
f"Unsloth: Quantization failed for {final_location}\n"\
"You might have to compile llama.cpp yourself, then run this again.\n"\
"You do not need to close this Python program. Run the following commands in a new terminal:\n"\
"You must run this in the same folder as you're saving your model.\n"\
"git clone https://github.com/ggerganov/llama.cpp\n"\
@@ -662,7 +754,7 @@ def unsloth_save_pretrained_merged(
save_peft_format : bool = True,
tags : List[str] = None,
temporary_location : str = "_unsloth_temporary_saved_buffers",
maximum_memory_usage : float = 0.85,
):
"""
Same as .save_pretrained(...) except 4bit weights are auto
@@ -695,14 +787,14 @@ def unsloth_push_to_hub_merged(
tokenizer = None,
save_method : str = "merged_16bit", # ["lora", "merged_16bit", "merged_4bit"]
use_temp_dir : Optional[bool] = None,
commit_message : Optional[str] = None,
commit_message : Optional[str] = "Trained with Unsloth",
private : Optional[bool] = None,
token : Union[bool, str, None] = None,
max_shard_size : Union[int, str, None] = "5GB",
create_pr : bool = False,
safe_serialization : bool = True,
revision : str = None,
commit_description : str = None,
commit_description : str = "Upload model trained with Unsloth 2x faster",
tags : Optional[List[str]] = None,
temporary_location : str = "_unsloth_temporary_saved_buffers",
maximum_memory_usage : float = 0.85,
@@ -760,15 +852,27 @@ This {model_type} model was trained 2x faster with [Unsloth](https://github.com/
[<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)
"""
def upload_to_huggingface(model, save_directory, token, method, extra = "", file_location = None):
def upload_to_huggingface(
model,
save_directory,
token,
method,
extra = "",
file_location = None,
old_username = None,
private = None,
):
# Check for username
username = ""
save_directory = save_directory.lstrip("./")
if "/" not in save_directory:
from huggingface_hub import whoami
try:
username = whoami()['name']
save_directory = f"{save_directory}/{username}"
username = whoami(token = token)["name"]
if type(old_username) is str and username != old_username:
username = old_username
pass
save_directory = f"{username}/{save_directory}"
except:
raise RuntimeError(f"Unsloth: {save_directory} is not a Huggingface directory.")
else:
@@ -776,24 +880,28 @@ def upload_to_huggingface(model, save_directory, token, method, extra = "", file
pass
from huggingface_hub import create_repo
create_repo(
repo_id = save_directory,
token = token,
repo_type = "model",
exist_ok = True,
)
try:
create_repo(
repo_id = save_directory,
token = token,
repo_type = "model",
exist_ok = False,
private = private,
)
# Create model card
from huggingface_hub import ModelCard
content = MODEL_CARD.format(
username = username,
base_model = model.config._name_or_path,
model_type = model.config.model_type,
method = "",
extra = extra,
)
card = ModelCard(content)
card.push_to_hub(save_directory, token = token)
# Create model card
from huggingface_hub import ModelCard
content = MODEL_CARD.format(
username = username,
base_model = model.config._name_or_path,
model_type = model.config.model_type,
method = "",
extra = extra,
)
card = ModelCard(content)
card.push_to_hub(save_directory, token = token)
except:
pass
if file_location is not None:
# Now upload file
@@ -811,6 +919,7 @@ def upload_to_huggingface(model, save_directory, token, method, extra = "", file
path_in_repo = uploaded_location,
repo_id = save_directory,
repo_type = "model",
commit_message = "(Trained with Unsloth)",
)
# We also upload a config.json file
@@ -823,6 +932,7 @@ def upload_to_huggingface(model, save_directory, token, method, extra = "", file
path_in_repo = "config.json",
repo_id = save_directory,
repo_type = "model",
commit_message = "(Trained with Unsloth)",
)
os.remove("_temporary_unsloth_config.json")
pass
@@ -838,6 +948,7 @@ def unsloth_save_pretrained_gguf(
first_conversion : str = "f16",
push_to_hub : bool = False,
token : Optional[Union[str, bool]] = None,
private : Optional[bool] = None,
is_main_process : bool = True,
state_dict : Optional[dict] = None,
save_function : Callable = torch.save,
@@ -847,7 +958,7 @@ def unsloth_save_pretrained_gguf(
save_peft_format : bool = True,
tags : List[str] = None,
temporary_location : str = "_unsloth_temporary_saved_buffers",
maximum_memory_usage : float = 0.85,
):
"""
Same as .save_pretrained(...) except 4bit weights are auto
@@ -898,11 +1009,11 @@ def unsloth_save_pretrained_gguf(
python_install = install_python_non_blocking(["gguf", "protobuf"])
git_clone.wait()
makefile = install_llama_cpp_make_non_blocking()
new_save_directory = unsloth_save_model(**arguments)
new_save_directory, old_username = unsloth_save_model(**arguments)
python_install.wait()
else:
try:
new_save_directory = unsloth_save_model(**arguments)
new_save_directory, old_username = unsloth_save_model(**arguments)
makefile = None
except:
# Retry by recloning llama.cpp
@@ -910,7 +1021,7 @@ def unsloth_save_pretrained_gguf(
python_install = install_python_non_blocking(["gguf", "protobuf"])
git_clone.wait()
makefile = install_llama_cpp_make_non_blocking()
new_save_directory = unsloth_save_model(**arguments)
new_save_directory, old_username = unsloth_save_model(**arguments)
python_install.wait()
pass
pass
@@ -924,12 +1035,12 @@ def unsloth_save_pretrained_gguf(
print("Unsloth: Uploading GGUF to Huggingface Hub...")
username = upload_to_huggingface(
self, save_directory, token,
"GGUF converted", "gguf", file_location,
"GGUF converted", "gguf", file_location, old_username, private,
)
link = f"{username}/{new_save_directory.lstrip('/.')}" \
if username not in new_save_directory else \
new_save_directory.lstrip('/.')
print(f"Saved to https://huggingface.co/{link}")
print(f"Saved GGUF to https://huggingface.co/{link}")
pass
pass
@@ -941,14 +1052,14 @@ def unsloth_push_to_hub_gguf(
quantization_method : str = "fast_quantized",
first_conversion : str = "f16",
use_temp_dir : Optional[bool] = None,
commit_message : Optional[str] = None,
commit_message : Optional[str] = "Trained with Unsloth",
private : Optional[bool] = None,
token : Union[bool, str, None] = None,
max_shard_size : Union[int, str, None] = "5GB",
create_pr : bool = False,
safe_serialization : bool = True,
revision : str = None,
commit_description : str = None,
commit_description : str = "Upload model trained with Unsloth 2x faster",
tags : Optional[List[str]] = None,
temporary_location : str = "_unsloth_temporary_saved_buffers",
maximum_memory_usage : float = 0.85,
@@ -998,19 +1109,19 @@ def unsloth_push_to_hub_gguf(
python_install = install_python_non_blocking(["gguf", "protobuf"])
git_clone.wait()
makefile = install_llama_cpp_make_non_blocking()
new_save_directory = unsloth_save_model(**arguments)
new_save_directory, old_username = unsloth_save_model(**arguments)
python_install.wait()
else:
try:
new_save_directory = unsloth_save_model(**arguments)
new_save_directory, old_username = unsloth_save_model(**arguments)
makefile = None
except:
# Retry by recloning llama.cpp
git_clone = install_llama_cpp_clone_non_blocking()
python_install = install_python_non_blocking(["gguf", "protobuf"])
git_clone.wait()
makefile = install_llama_cpp_make_non_blocking()
new_save_directory = unsloth_save_model(**arguments)
makefile = install_llama_cpp_make_non_blocking()
new_save_directory, old_username = unsloth_save_model(**arguments)
python_install.wait()
pass
pass
@@ -1023,12 +1134,12 @@ def unsloth_push_to_hub_gguf(
print("Unsloth: Uploading GGUF to Huggingface Hub...")
username = upload_to_huggingface(
self, repo_id, token,
"GGUF converted", "gguf", file_location,
"GGUF converted", "gguf", file_location, old_username, private,
)
link = f"{username}/{new_save_directory.lstrip('/.')}" \
if username not in new_save_directory else \
new_save_directory.lstrip('/.')
print(f"Saved to https://huggingface.co/{link}")
print(f"Saved GGUF to https://huggingface.co/{link}")
pass
@@ -1038,31 +1149,17 @@ def patch_saving_functions(model):
import types
from typing import Callable, Optional, Union, List
if hasattr(model, "_original_push_to_hub"): return
# First check if this has already been called, and revert it
original_model = model
while True:
if hasattr(original_model, "_original_push_to_hub"):
original_model.push_to_hub = original_model._original_push_to_hub
del original_model._original_push_to_hub
if hasattr(original_model, "push_to_hub_merged"): del original_model.push_to_hub_merged
if hasattr(original_model, "save_pretrained_merged"): del original_model.save_pretrained_merged
if hasattr(original_model, "push_to_hub_gguf"): del original_model.push_to_hub_gguf
if hasattr(original_model, "save_pretrained_gguf"): del original_model.save_pretrained_gguf
pass
if hasattr(original_model, "model"): original_model = original_model.model
else: break
# And now re add our saving methods!
if model.push_to_hub.__name__ == "unsloth_push_to_hub":
original_push_to_hub = model.original_push_to_hub
else:
original_push_to_hub = model.push_to_hub
pass
# And now re add our saving methods!
original_push_to_hub = model.push_to_hub
signature = str(inspect.signature(original_push_to_hub)).replace("NoneType", "None")
signature = signature[1:]
signature = re.sub("<function save at .+?>", "torch.save", signature)
docs = original_push_to_hub.__doc__.encode("utf-8").decode("utf-8")
model._original_push_to_hub = original_push_to_hub
push_to_hub_text = f'''def unsloth_push_to_hub(self, {signature}:
"""
@@ -1077,11 +1174,45 @@ def patch_saving_functions(model):
arguments["tags"] = ["unsloth",]
elif hasattr(self, "add_model_tags"):
self.add_model_tags(["unsloth",])
if "commit_message" in arguments:
commit_message = arguments["commit_message"]
if commit_message is not None:
if not commit_message.endswith(" "): commit_message += " "
if "Unsloth" not in commit_message:
commit_message += "(Trained with Unsloth)"
else:
commit_message = "Upload model trained with Unsloth"
arguments["commit_message"] = commit_message
if "commit_description" in arguments:
commit_description = arguments["commit_description"]
if commit_description is not None:
if not commit_description.endswith(" "): commit_description += " "
if "Unsloth" not in commit_description:
commit_description += "(Trained with Unsloth 2x faster)"
else:
commit_description = "Upload model trained with Unsloth 2x faster"
arguments["commit_description"] = commit_description
# Update model tag
if hasattr(self, "config"):
_ = upload_to_huggingface(
self, arguments["repo_id"], arguments["token"],
"finetuned", "trl", file_location = None,
old_username = None, private = arguments["private"],
)
pass
try:
return self._original_push_to_hub(**arguments)
self.original_push_to_hub(**arguments)
except:
del arguments["tags"]
return self._original_push_to_hub(**arguments)
self.original_push_to_hub(**arguments)
pass
if hasattr(self, "config"):
print("Saved model to https://huggingface.co/" + arguments["repo_id"])
pass
'''
exec(push_to_hub_text, globals())
@@ -1089,12 +1220,12 @@ def patch_saving_functions(model):
original_model = model
while True:
if not hasattr(original_model, "_original_push_to_hub"):
original_model._original_push_to_hub = original_model.push_to_hub
if original_model.push_to_hub.__name__ != "unsloth_push_to_hub":
original_model.original_push_to_hub = original_model.push_to_hub
original_model.push_to_hub = types.MethodType(unsloth_push_to_hub, original_model)
if hasattr(original_model, "add_model_tags"):
original_model.add_model_tags(["unsloth",])
pass
pass
if hasattr(original_model, "model"): original_model = original_model.model
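Because PEFT wraps the base model in nested `.model` attributes, the loop above walks every level, saves the original `push_to_hub`, and rebinds the wrapper with `types.MethodType`. The pattern can be sketched with toy classes (all names below are ours, for illustration only):

```python
import types

# Toy stand-ins for the nested model/PEFT wrappers (hypothetical classes)
class Inner:
    def push_to_hub(self): return "inner-original"

class Outer:
    def __init__(self): self.model = Inner()
    def push_to_hub(self): return "outer-original"

def unsloth_push_to_hub(self):
    # Stand-in for the generated wrapper; delegates to the saved original
    return "patched:" + self.original_push_to_hub()

obj = Outer()
level = obj
while True:
    # __name__ check makes the patch idempotent across repeated calls
    if level.push_to_hub.__name__ != "unsloth_push_to_hub":
        level.original_push_to_hub = level.push_to_hub
        level.push_to_hub = types.MethodType(unsloth_push_to_hub, level)
    if hasattr(level, "model"):
        level = level.model  # descend into the wrapped model
    else:
        break

print(obj.push_to_hub())
print(obj.model.push_to_hub())
```

Binding with `types.MethodType` (rather than assigning the bare function) keeps `self` pointing at each specific level, so every wrapper in the chain calls its own saved original.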