diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index abbaedd5b..eb60a5a20 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -3,30 +3,27 @@
 Thank you for not only using Unsloth but also for being interested in helping out! We value all contributions, whether they come in the form of code, ideas, support for others or just by simply spreading the word of Unsloth! 💕
 - **[Support the Community](https://github.com/unslothai/unsloth/issues)**: Answer questions, review pull requests, or assist others in discussions.
-- **Fix Bugs**: Identify and resolve issues with the existing codebase. 
-- **Submit Ideas**: Request new features or share enhancements you'd like to see. 
+- **Fix Bugs**: Identify and resolve issues with the existing codebase.
+- **Submit Ideas**: Request new features or share enhancements you'd like to see.
 - **Develop Features**: Implement new functionality or improve existing tools which can be done via PRs.
 - **[Improve Documentation](https://docs.unsloth.ai/)**: Help by creating guides, FAQs, or enhancing clarity.
 
 One of the best ways to support us is by spreading the word about Unsloth! Share how it’s powering your amazing projects in blog posts or social media, and inspire others to explore its potential. Even a simple star on our repo goes a long way in showing your support and helping the community grow. 🌟
 
-## Submitting Issues 
-If you find a bug or have a feature idea, we’d love to hear from you! Here’s how to make your submission stand out: 
+## Submitting Issues
+If you find a bug or have a feature idea, we’d love to hear from you! Here’s how to make your submission stand out:
 
-### Reporting Bugs 
-1. **Search First**: Check if the issue has already been reported using GitHub’s search bar under Issues. 
-2. **Details Matter**: Is this on Google Colab, Kaggle, or on another platform service? Are you using Unsloth's official notebook? Include your OS, Python version, and other relevant details. For bugs, a concise code snippet that reproduces the issue is incredibly helpful. 
+### Reporting Bugs
+1. **Search First**: Check if the issue has already been reported using GitHub’s search bar under Issues.
+2. **Details Matter**: Is this on Google Colab, Kaggle, or on another platform service? Are you using Unsloth's official notebook? Include your OS, Python version, and other relevant details. For bugs, a concise code snippet that reproduces the issue is incredibly helpful.
 3. **Be Thorough**: Attach screenshots, traceback logs, or any additional information that might speed up resolution.
 
 ## Spread the Word
-Your support extends beyond code: 
-- Spread the word by writing about Unsloth in blogs or social media. 
-- Share how Unsloth powers your projects. 
-- Star our repository to show your appreciation. 
+Your support extends beyond code:
+- Spread the word by writing about Unsloth in blogs or social media.
+- Share how Unsloth powers your projects.
+- Star our repository to show your appreciation.
 
-## Note
-We have added a new section in the `README.md` under "✨ Finetune for Free" titled "Exporting Models from Colab to Local Machine" with detailed steps. Please refer to it for guidance on exporting models from Colab to your local machine.
-
-Finally, please be mindful of our [Code of Conduct](https://github.com/unslothai/unsloth/tree/main/unsloth/CODE_OF_CONDUCT.md) to ensure a welcoming and inclusive environment for everyone.
+Finally, please be mindful of our [Code of Conduct](https://github.com/unslothai/unsloth/blob/main/CODE_OF_CONDUCT.md) to ensure a welcoming and inclusive environment for everyone.
 
 Thank you so much for reading and we hope you have lots of fun using Unsloth! 🦥
diff --git a/README.md b/README.md
index 83eb45ad1..1314cb1c5 100644
--- a/README.md
+++ b/README.md
@@ -1,152 +1,191 @@
-
-
+
+
+
-### Finetune Llama 3.3, Mistral, Phi-4, Qwen 2.5 & Gemma 2x faster with 80% less memory!
+### Train gpt-oss, DeepSeek, Gemma, Qwen & Llama 2x faster with 70% less VRAM!

+* Supports **full-finetuning**, pretraining, 4-bit, 16-bit and **FP8** training
+* Supports **all models** including [TTS](https://unsloth.ai/docs/basics/text-to-speech-tts-fine-tuning), multimodal, [embedding](https://unsloth.ai/docs/new/embedding-finetuning) and more! Any model that works in transformers, works in Unsloth.
+* The most efficient library for [Reinforcement Learning (RL)](https://unsloth.ai/docs/get-started/reinforcement-learning-rl-guide), using 80% less VRAM. Supports GRPO, GSPO, DrGRPO, DAPO etc.
+* **0% loss in accuracy** - no approximation methods - all exact.
+* Export and [deploy your model](https://unsloth.ai/docs/basics/inference-and-deployment) to [GGUF](https://unsloth.ai/docs/basics/inference-and-deployment/saving-to-gguf) llama.cpp, [vLLM](https://unsloth.ai/docs/basics/inference-and-deployment/vllm-guide), [SGLang](https://unsloth.ai/docs/basics/inference-and-deployment/sglang-guide) and Hugging Face.
+* Supports NVIDIA (since 2018), [AMD](https://unsloth.ai/docs/get-started/install/amd) and [Intel](https://unsloth.ai/docs/get-started/install/intel) GPUs. Minimum CUDA Capability 7.0 (V100, T4, Titan V, RTX 20, 30 and 40 series, A100, H100, L40 etc.)
+* Works on **Linux**, WSL and **[Windows](https://unsloth.ai/docs/get-started/install/windows-installation)**
+* All kernels written in OpenAI's Triton language. Manual backprop engine.
+* If you trained a model with 🦥Unsloth, you can use this cool sticker!
-## 🥇 Performance Benchmarking
-- For our most detailed benchmarks, read our [Llama 3.3 Blog](https://unsloth.ai/blog/llama3-3).
-- Benchmarking of Unsloth was also conducted by [🤗Hugging Face](https://huggingface.co/blog/unsloth-trl).
+## 💾 Install Unsloth
+You can also see our docs for more detailed installation and updating instructions [here](https://unsloth.ai/docs/get-started/install).
-We tested using the Alpaca Dataset, a batch size of 2, gradient accumulation steps of 4, rank = 32, and applied QLoRA on all linear layers (q, k, v, o, gate, up, down):
-
-| Model | VRAM | 🦥 Unsloth speed | 🦥 VRAM reduction | 🦥 Longer context | 😊 Hugging Face + FA2 |
-|----------------|-------|-----------------|----------------|----------------|--------------------|
-| Llama 3.3 (70B)| 80GB | 2x | >75% | 13x longer | 1x |
-| Llama 3.1 (8B) | 80GB | 2x | >70% | 12x longer | 1x |
+Unsloth supports Python 3.13 or lower.
-
](https://github.com/unslothai/unsloth)
"""
@@ -1316,19 +1485,19 @@ def _determine_username(save_directory, old_username, token):
save_directory = save_directory.lstrip("./")
if "/" not in save_directory:
from huggingface_hub import whoami
- try:
+
+ try:
username = whoami(token = token)["name"]
if type(old_username) is str and username != old_username:
username = old_username
- pass
save_directory = f"{username}/{save_directory}"
except:
- raise RuntimeError(f"Unsloth: {save_directory} is not a Huggingface directory.")
+ raise RuntimeError(
+ f"Unsloth: {save_directory} is not a Huggingface directory."
+ )
else:
username = save_directory.split("/")[0]
- pass
return save_directory, username
-pass
def create_huggingface_repo(
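The `_determine_username` hunk above changes the `old_username` sentinel from `""` to `None`. A minimal runnable sketch of the prefixing rule shows why; `resolve_repo_id` and `lookup_username` are hypothetical stand-ins (the real code calls `huggingface_hub.whoami(token = token)["name"]`), and the last call demonstrates the empty-string bug: `""` passes the `isinstance(..., str)` check and wipes out the real username.

```python
# Hypothetical sketch of _determine_username's repo-id prefixing rule.
# `lookup_username` stands in for the real huggingface_hub.whoami() call.
def resolve_repo_id(save_directory, old_username, lookup_username):
    save_directory = save_directory.lstrip("./")
    if "/" not in save_directory:
        username = lookup_username()
        # A previously used username (any real str) wins over the token's owner
        if isinstance(old_username, str) and username != old_username:
            username = old_username
        save_directory = f"{username}/{save_directory}"
    else:
        username = save_directory.split("/")[0]
    return save_directory, username

print(resolve_repo_id("my-model", None, lambda: "alice"))      # ('alice/my-model', 'alice')
print(resolve_repo_id("bob/my-model", None, lambda: "alice"))  # ('bob/my-model', 'bob')
print(resolve_repo_id("my-model", "", lambda: "alice"))        # ('/my-model', '') - the old "" bug
```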
@@ -1336,38 +1505,52 @@ def create_huggingface_repo(
save_directory,
token = None,
private = False,
+ datasets = None,
):
- if token is None :
+ if token is None:
token = get_token()
- pass
- save_directory, username = _determine_username(save_directory, "", token)
+ save_directory, username = _determine_username(save_directory, None, token)
from huggingface_hub import create_repo
+
try:
create_repo(
- repo_id = save_directory,
- token = token,
+ repo_id = save_directory,
+ token = token,
repo_type = "model",
- exist_ok = False,
- private = private,
- )
+ exist_ok = False,
+ private = private,
+ )
# Create model card
from huggingface_hub import ModelCard
+
content = MODEL_CARD.format(
- username = username,
+ username = username,
base_model = model.config._name_or_path,
model_type = model.config.model_type,
- method = "",
- extra = "unsloth",
+ method = "",
+ extra = "unsloth",
)
card = ModelCard(content)
+ if datasets:
+ card.data.datasets = datasets
card.push_to_hub(save_directory, token = token)
except:
- pass
+        # Repo likely already exists; update datasets metadata separately
+ if datasets:
+ try:
+ from huggingface_hub import metadata_update
+
+ metadata_update(
+ save_directory, {"datasets": datasets}, overwrite = True, token = token
+ )
+ except Exception as e:
+ logger.warning_once(
+ f"Unsloth: Could not update datasets metadata for {save_directory}: {e}"
+ )
hf_api = HfApi(token = token)
return save_directory, hf_api
-pass
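The try/except fallback above (create the repo and push a full model card; if creation fails, patch only the `datasets` metadata) can be exercised offline with stand-ins for the Hub calls. `RepoExists`, `existing_repos` and `publish` below are illustrative names, not part of `huggingface_hub`.

```python
# Offline sketch of the create-or-patch fallback used above.
class RepoExists(Exception):
    pass

existing_repos = {"alice/model": {}}  # pretend this repo is already on the Hub

def create_repo(repo_id, exist_ok = False):
    if repo_id in existing_repos and not exist_ok:
        raise RepoExists(repo_id)
    existing_repos[repo_id] = {}

def metadata_update(repo_id, metadata, overwrite = False):
    existing_repos[repo_id].update(metadata)

def publish(repo_id, datasets = None):
    try:
        create_repo(repo_id)           # full card would be pushed here
        if datasets:
            existing_repos[repo_id]["datasets"] = datasets
    except RepoExists:
        # Repo already exists: only patch the datasets metadata
        if datasets:
            metadata_update(repo_id, {"datasets": datasets}, overwrite = True)

publish("alice/model", datasets = ["alice/my-sft-data"])
print(existing_repos["alice/model"])
```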
def upload_to_huggingface(
@@ -1380,539 +1563,901 @@ def upload_to_huggingface(
old_username = None,
private = None,
create_config = True,
+ datasets = None,
):
save_directory, username = _determine_username(save_directory, old_username, token)
from huggingface_hub import create_repo
+
try:
create_repo(
- repo_id = save_directory,
- token = token,
+ repo_id = save_directory,
+ token = token,
repo_type = "model",
- exist_ok = False,
- private = private,
- )
+ exist_ok = False,
+ private = private,
+ )
# Create model card
from huggingface_hub import ModelCard
+
content = MODEL_CARD.format(
- username = username,
+ username = username,
base_model = model.config._name_or_path,
model_type = model.config.model_type,
- method = "",
- extra = extra,
+ method = "",
+ extra = extra,
)
card = ModelCard(content)
+ if datasets:
+ card.data.datasets = datasets
card.push_to_hub(save_directory, token = token)
except:
- pass
+        # Repo likely already exists; update datasets metadata separately
+ if datasets:
+ try:
+ from huggingface_hub import metadata_update
+
+ metadata_update(
+ save_directory, {"datasets": datasets}, overwrite = True, token = token
+ )
+ except Exception as e:
+ logger.warning_once(
+ f"Unsloth: Could not update datasets metadata for {save_directory}: {e}"
+ )
if file_location is not None:
# Now upload file
hf_api = HfApi(token = token)
if "/" in file_location:
- uploaded_location = file_location[file_location.rfind("/")+1:]
+ uploaded_location = file_location[file_location.rfind("/") + 1 :]
else:
uploaded_location = file_location
- pass
# find ftevent file from tensorboard and upload it
import glob
+
ftevent_files = glob.glob("*out.tfevents*", recursive = True)
if len(ftevent_files) > 0:
- print("Unsloth: Uploading tensorboard files... Please wait...", file_location + "*out.tfevents*")
+ print(
+ "Unsloth: Uploading tensorboard files... Please wait...",
+ file_location + "*out.tfevents*",
+ )
for ftevent_file in ftevent_files:
hf_api.upload_file(
path_or_fileobj = ftevent_file,
- path_in_repo = ftevent_file.replace(file_location, ""),
- repo_id = save_directory,
- repo_type = "model",
- commit_message = "(Trained with Unsloth)",
+ path_in_repo = ftevent_file.replace(file_location, ""),
+ repo_id = save_directory,
+ repo_type = "model",
+ commit_message = "(Trained with Unsloth)",
)
- pass
- pass
hf_api.upload_file(
path_or_fileobj = file_location,
- path_in_repo = uploaded_location,
- repo_id = save_directory,
- repo_type = "model",
- commit_message = "(Trained with Unsloth)",
+ path_in_repo = uploaded_location,
+ repo_id = save_directory,
+ repo_type = "model",
+ commit_message = "(Trained with Unsloth)",
)
# We also upload a config.json file
if create_config:
import json
- with open("_temporary_unsloth_config.json", "w") as file:
- json.dump({"model_type" : model.config.model_type}, file, indent = 4)
- pass
+
+ with open("_temporary_unsloth_config.json", "w", encoding = "utf-8") as file:
+ json.dump({"model_type": model.config.model_type}, file, indent = 4)
hf_api.upload_file(
path_or_fileobj = "_temporary_unsloth_config.json",
- path_in_repo = "config.json",
- repo_id = save_directory,
- repo_type = "model",
- commit_message = "(Trained with Unsloth)",
+ path_in_repo = "config.json",
+ repo_id = save_directory,
+ repo_type = "model",
+ commit_message = "(Trained with Unsloth)",
)
os.remove("_temporary_unsloth_config.json")
- pass
- pass
return username
-pass
def fix_tokenizer_bos_token(tokenizer):
# Check if BOS added already, then warn
fix_bos_token = False
chat_template = getattr(tokenizer, "chat_template", None)
-
- if (tokenizer("A").input_ids[0] == getattr(tokenizer, "bos_token_id", None)):
- if chat_template is not None and \
- (
- tokenizer.bos_token in chat_template or \
- "{bos_token}" in chat_template.replace(" ", "") or \
- "{bos_token+" in chat_template.replace(" ", "")
- ):
+ if tokenizer("A").input_ids[0] == getattr(tokenizer, "bos_token_id", None):
+ if chat_template is not None and (
+ tokenizer.bos_token in chat_template
+ or "{bos_token}" in chat_template.replace(" ", "")
+ or "{bos_token+" in chat_template.replace(" ", "")
+ ):
fix_bos_token = True
logger.warning(
- "Unsloth: ##### The current model auto adds a BOS token.\n"\
+ "Unsloth: ##### The current model auto adds a BOS token.\n"
"Unsloth: ##### Your chat template has a BOS token. We shall remove it temporarily."
)
# Remove {{bos_token}}
- new_chat_template = re.sub(r"\{[\s]{0,}\{[\s]{0,}bos\_token[\s]{0,}\}[\s]{0,}\}", "", chat_template)
+ new_chat_template = re.sub(
+ r"\{[\s]{0,}\{[\s]{0,}bos\_token[\s]{0,}\}[\s]{0,}\}", "", chat_template
+ )
# Remove {{bos_token +
- new_chat_template = re.sub(r"\{[\s]{0,}\{[\s]{0,}bos\_token[\s]{0,}\+[\s]{0,}", "", new_chat_template)
-
+ new_chat_template = re.sub(
+ r"\{[\s]{0,}\{[\s]{0,}bos\_token[\s]{0,}\+[\s]{0,}",
+ "",
+ new_chat_template,
+ )
+
tokenizer.chat_template = new_chat_template
- pass
- pass
return fix_bos_token, chat_template
-pass
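The two substitutions in `fix_tokenizer_bos_token` can be checked in isolation. `\s*` below is equivalent to the original `[\s]{0,}` spelling; the templates are toy examples.

```python
import re

# Strip a literal "{{ bos_token }}" expression (any internal spacing) from a
# Jinja chat template, mirroring the first re.sub above.
template = "{{ bos_token }}{% for m in messages %}{{ m['content'] }}{% endfor %}"
cleaned = re.sub(r"\{\s*\{\s*bos_token\s*\}\s*\}", "", template)
print(cleaned)

# "{{ bos_token + ..." forms only lose the leading "{{ bos_token +" part,
# mirroring the second re.sub above.
template2 = "{{ bos_token + '<s>' }}rest"
cleaned2 = re.sub(r"\{\s*\{\s*bos_token\s*\+\s*", "", template2)
print(cleaned2)
```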
-def create_ollama_modelfile(tokenizer, gguf_location):
+def create_ollama_modelfile(tokenizer, base_model_name, model_location):
"""
- Creates an Ollama Modelfile.
- Use ollama.create(model = "new_ollama_model", modelfile = modelfile)
+ Creates an Ollama Modelfile.
+ Use ollama.create(model = "new_ollama_model", modelfile = modelfile)
"""
- modelfile = getattr(tokenizer, "_ollama_modelfile", None)
- if modelfile is None: return None
+ ollama_template_name = MODEL_TO_OLLAMA_TEMPLATE_MAPPER.get(base_model_name)
+ if not ollama_template_name:
+ print(
+ f"Unsloth: No Ollama template mapping found for model '{base_model_name}'. Skipping Ollama Modelfile"
+ )
+ return None
+ ollama_modelfile = OLLAMA_TEMPLATES.get(ollama_template_name)
+ if not ollama_modelfile:
+        print(
+            f"Unsloth: No Ollama template named '{ollama_template_name}' found. Skipping Ollama Modelfile"
+        )
+        return None
+    # Cache the resolved template on the tokenizer for later reuse
+    tokenizer._ollama_modelfile = ollama_modelfile
+    modelfile = ollama_modelfile
FILE_LOCATION_REPLACER = "⚫@✅#🦥__FILE_LOCATION__⚡@🦥#⛵"
- EOS_TOKEN_REPLACER = "⚫@✅#🦥__EOS_TOKEN__⚡@🦥#⛵"
- LEFT_BRACKET_REPLACER = "⚫@✅#🦥"
+ EOS_TOKEN_REPLACER = "⚫@✅#🦥__EOS_TOKEN__⚡@🦥#⛵"
+ LEFT_BRACKET_REPLACER = "⚫@✅#🦥"
RIGHT_BRACKET_REPLACER = "⚡@🦥#⛵"
# Fixes https://github.com/unslothai/unsloth/issues/1087
# We must convert all {'s and }'s but keep {__FILE_LOCATION__} intact
- modelfile = modelfile\
- .replace("{__FILE_LOCATION__}", FILE_LOCATION_REPLACER)\
- .replace("{__EOS_TOKEN__}", EOS_TOKEN_REPLACER)\
- .replace("{", LEFT_BRACKET_REPLACER)\
+ modelfile = (
+ modelfile.replace("{__FILE_LOCATION__}", FILE_LOCATION_REPLACER)
+ .replace("{__EOS_TOKEN__}", EOS_TOKEN_REPLACER)
+ .replace("{", LEFT_BRACKET_REPLACER)
.replace("}", RIGHT_BRACKET_REPLACER)
+ )
# Revert {__FILE_LOCATION__} back
- modelfile = modelfile\
- .replace(FILE_LOCATION_REPLACER, "{__FILE_LOCATION__}")\
- .replace(EOS_TOKEN_REPLACER, "{__EOS_TOKEN__}")
-
+ modelfile = modelfile.replace(
+ FILE_LOCATION_REPLACER, "{__FILE_LOCATION__}"
+ ).replace(EOS_TOKEN_REPLACER, "{__EOS_TOKEN__}")
+
if "__EOS_TOKEN__" in modelfile:
modelfile = modelfile.format(
- __FILE_LOCATION__ = gguf_location,
- __EOS_TOKEN__ = tokenizer.eos_token,
+ __FILE_LOCATION__ = model_location,
+ __EOS_TOKEN__ = tokenizer.eos_token,
)
else:
modelfile = modelfile.format(
- __FILE_LOCATION__ = gguf_location,
+ __FILE_LOCATION__ = model_location,
)
- pass
-
- modelfile = modelfile\
- .replace("⚫@✅#🦥", "{")\
- .replace("⚡@🦥#⛵", "}")\
- .rstrip()
+
+ modelfile = modelfile.replace("⚫@✅#🦥", "{").replace("⚡@🦥#⛵", "}").rstrip()
return modelfile
-pass
+
+
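The sentinel dance above exists because an Ollama Modelfile contains literal `{`s and `}`s (Go template syntax) that `str.format()` would treat as placeholders. A self-contained demo with a toy template; `@FILE@` and the two sentinel strings are arbitrary markers assumed not to occur in any real template.

```python
# Protect literal braces with sentinels so that format() only fills the
# named placeholder, then restore the braces afterwards.
LEFT, RIGHT = "\u2780L", "\u2781R"  # sentinels that never appear in a template
template = 'FROM {__FILE_LOCATION__}\nTEMPLATE """{{ .Prompt }}"""'

protected = (
    template.replace("{__FILE_LOCATION__}", "@FILE@")  # shield the placeholder
    .replace("{", LEFT)                                # escape every other brace
    .replace("}", RIGHT)
    .replace("@FILE@", "{__FILE_LOCATION__}")          # restore the placeholder
)
filled = protected.format(__FILE_LOCATION__ = "model.gguf")
result = filled.replace(LEFT, "{").replace(RIGHT, "}")
print(result)
```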
+def create_ollama_model(username: str, model_name: str, tag: str, modelfile_path: str):
+ try:
+ init_check = subprocess.run(
+ ["curl", "http://localhost:11434"],
+ capture_output = True,
+ text = True,
+ timeout = 3,
+ )
+ if init_check.returncode == 0:
+ print(init_check.stdout.strip())
+        else:
+            print("Ollama Server is not Running")
+            return "Ollama Server is not Running"
+ except subprocess.TimeoutExpired:
+ return "Ollama Request Timeout"
+
+ process = subprocess.Popen(
+ [
+ "ollama",
+ "create",
+ f"{username}/{model_name}:{tag}",
+ "-f",
+ f"{modelfile_path}",
+ ],
+ stdout = subprocess.PIPE,
+ stderr = subprocess.STDOUT,
+ text = True,
+ bufsize = 1,
+ universal_newlines = True,
+ )
+
+ for line in iter(process.stdout.readline, ""):
+ print(line, end = "")
+ sys.stdout.flush()
+
+ return_code = process.wait()
+
+ if return_code != 0:
+ print(f"\nMODEL CREATED FAILED WITH RETURN CODE {return_code}")
+ else:
+ print("\nMODEL CREATED SUCCESSFULLY")
+
+
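`create_ollama_model` streams the child's merged stdout/stderr line by line instead of waiting for it to exit. The same `Popen` pattern, demonstrated with a harmless Python child process in place of `ollama create`:

```python
import subprocess
import sys

# Merge stderr into stdout and echo each line as it arrives.
process = subprocess.Popen(
    [sys.executable, "-c", "print('step 1'); print('step 2')"],
    stdout = subprocess.PIPE,
    stderr = subprocess.STDOUT,
    text = True,
    bufsize = 1,  # line-buffered in text mode
)

lines = []
for line in iter(process.stdout.readline, ""):
    print(line, end = "")
    lines.append(line.rstrip("\n"))

return_code = process.wait()
print(f"exit code: {return_code}")
```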
+def push_to_ollama_hub(username: str, model_name: str, tag: str):
+ try:
+ init_check = subprocess.run(
+ ["curl", "http://localhost:11434"],
+ capture_output = True,
+ text = True,
+ timeout = 3,
+ )
+ if init_check.returncode == 0:
+ print(init_check.stdout.strip())
+        else:
+            print("Ollama Server is not Running")
+            return "Ollama Server is not Running"
+ except subprocess.TimeoutExpired:
+ return "Ollama Request Timeout"
+
+ process = subprocess.Popen(
+ ["ollama", "push", f"{username}/{model_name}:{tag}"],
+ stdout = subprocess.PIPE,
+ stderr = subprocess.STDOUT,
+ text = True,
+ bufsize = 1,
+ universal_newlines = True,
+ )
+
+ for line in iter(process.stdout.readline, ""):
+ print(line, end = "")
+ sys.stdout.flush()
+
+ return_code = process.wait()
+
+ if return_code != 0:
+ print(f"\nMODEL PUBLISHED FAILED WITH RETURN CODE {return_code}")
+ else:
+ print("\nMODEL PUBLISHED SUCCESSFULLY")
+
+
+def push_to_ollama(tokenizer, gguf_location, username: str, model_name: str, tag: str):
+    # create_ollama_modelfile now takes the base model name and model location
+    model_file = create_ollama_modelfile(
+        tokenizer = tokenizer,
+        base_model_name = model_name,
+        model_location = gguf_location,
+    )
+    if model_file is None:
+        print("Unsloth: Could not create an Ollama Modelfile. Skipping Ollama push")
+        return
+
+    with open(f"Modelfile_{model_name}", "w", encoding = "utf-8") as f:
+        f.write(model_file)
+
+ create_ollama_model(
+ username = username,
+ model_name = model_name,
+ tag = tag,
+ modelfile_path = f"Modelfile_{model_name}",
+ )
+
+ push_to_ollama_hub(username = username, model_name = model_name, tag = tag)
+
+ print("Successfully pushed to ollama")
def unsloth_save_pretrained_gguf(
self,
- save_directory : Union[str, os.PathLike],
- tokenizer = None,
- quantization_method : str = "fast_quantized",
- first_conversion : str = None,
- push_to_hub : bool = False,
- token : Optional[Union[str, bool]] = None,
- private : Optional[bool] = None,
- is_main_process : bool = True,
- state_dict : Optional[dict] = None,
- save_function : Callable = torch.save,
- max_shard_size : Union[int, str] = "5GB",
- safe_serialization : bool = True,
- variant : Optional[str] = None,
- save_peft_format : bool = True,
- tags : List[str] = None,
- temporary_location : str = "_unsloth_temporary_saved_buffers",
- maximum_memory_usage : float = 0.85,
+ save_directory: Union[str, os.PathLike],
+ tokenizer = None,
+ quantization_method = "fast_quantized",
+ first_conversion: str = None,
+ push_to_hub: bool = False,
+ token: Optional[Union[str, bool]] = None,
+ private: Optional[bool] = None,
+ is_main_process: bool = True,
+ state_dict: Optional[dict] = None,
+ save_function: Callable = torch.save,
+ max_shard_size: Union[int, str] = "5GB",
+ safe_serialization: bool = True,
+ variant: Optional[str] = None,
+ save_peft_format: bool = True,
+ tags: List[str] = None,
+ temporary_location: str = "_unsloth_temporary_saved_buffers",
+ maximum_memory_usage: float = 0.85,
):
"""
- Same as .save_pretrained(...) except 4bit weights are auto
- converted to float16 then converted to GGUF / llama.cpp format.
+ Same as .save_pretrained(...) except 4bit weights are auto
+ converted to float16 then converted to GGUF / llama.cpp format.
- Choose for `quantization_method` to be:
- "not_quantized" : "Recommended. Fast conversion. Slow inference, big files.",
- "fast_quantized" : "Recommended. Fast conversion. OK inference, OK file size.",
- "quantized" : "Recommended. Slow conversion. Fast inference, small files.",
- "f32" : "Not recommended. Retains 100% accuracy, but super slow and memory hungry.",
- "f16" : "Fastest conversion + retains 100% accuracy. Slow and memory hungry.",
- "q8_0" : "Fast conversion. High resource use, but generally acceptable.",
- "q4_k_m" : "Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K",
- "q5_k_m" : "Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K",
- "q2_k" : "Uses Q4_K for the attention.vw and feed_forward.w2 tensors, Q2_K for the other tensors.",
- "q3_k_l" : "Uses Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K",
- "q3_k_m" : "Uses Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K",
- "q3_k_s" : "Uses Q3_K for all tensors",
- "q4_0" : "Original quant method, 4-bit.",
- "q4_1" : "Higher accuracy than q4_0 but not as high as q5_0. However has quicker inference than q5 models.",
- "q4_k_s" : "Uses Q4_K for all tensors",
- "q4_k" : "alias for q4_k_m",
- "q5_k" : "alias for q5_k_m",
- "q5_0" : "Higher accuracy, higher resource usage and slower inference.",
- "q5_1" : "Even higher accuracy, resource usage and slower inference.",
- "q5_k_s" : "Uses Q5_K for all tensors",
- "q6_k" : "Uses Q8_K for all tensors",
- "iq2_xxs" : "2.06 bpw quantization",
- "iq2_xs" : "2.31 bpw quantization",
- "iq3_xxs" : "3.06 bpw quantization",
- "q3_k_xs" : "3-bit extra small quantization",
+        Choose `quantization_method` from:
+ "not_quantized" : "Recommended. Fast conversion. Slow inference, big files.",
+ "fast_quantized" : "Recommended. Fast conversion. OK inference, OK file size.",
+ "quantized" : "Recommended. Slow conversion. Fast inference, small files.",
+ "f32" : "Not recommended. Retains 100% accuracy, but super slow and memory hungry.",
+ "f16" : "Fastest conversion + retains 100% accuracy. Slow and memory hungry.",
+ "q8_0" : "Fast conversion. High resource use, but generally acceptable.",
+ "q4_k_m" : "Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K",
+ "q5_k_m" : "Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K",
+ "q2_k" : "Uses Q4_K for the attention.vw and feed_forward.w2 tensors, Q2_K for the other tensors.",
+ "q3_k_l" : "Uses Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K",
+ "q3_k_m" : "Uses Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K",
+ "q3_k_s" : "Uses Q3_K for all tensors",
+ "q4_0" : "Original quant method, 4-bit.",
+ "q4_1" : "Higher accuracy than q4_0 but not as high as q5_0. However has quicker inference than q5 models.",
+ "q4_k_s" : "Uses Q4_K for all tensors",
+ "q4_k" : "alias for q4_k_m",
+ "q5_k" : "alias for q5_k_m",
+ "q5_0" : "Higher accuracy, higher resource usage and slower inference.",
+ "q5_1" : "Even higher accuracy, resource usage and slower inference.",
+ "q5_k_s" : "Uses Q5_K for all tensors",
+ "q6_k" : "Uses Q8_K for all tensors",
+ "iq2_xxs" : "2.06 bpw quantization",
+ "iq2_xs" : "2.31 bpw quantization",
+ "iq3_xxs" : "3.06 bpw quantization",
+ "q3_k_xs" : "3-bit extra small quantization",
"""
if tokenizer is None:
raise ValueError("Unsloth: Saving to GGUF must have a tokenizer.")
+ try:
+ base_model_name = get_model_name(self.config._name_or_path, load_in_4bit = False)
+ model_name = base_model_name.split("/")[-1]
+ except:
+ base_model_name = self.config._name_or_path
+ model_name = base_model_name.split("/")[-1]
+
+ # Check if push_to_hub is requested
+ if push_to_hub:
+ raise ValueError(
+ "Unsloth: Please use .push_to_hub_gguf() instead of .save_pretrained_gguf() with push_to_hub=True"
+ )
+
+ # Step 1: Check if this is a VLM (Vision-Language Model) and check if gpt-oss
+ is_vlm = False
+ if hasattr(self, "config") and hasattr(self.config, "architectures"):
+ is_vlm = any(
+ x.endswith(("ForConditionalGeneration", "ForVisionText2Text"))
+ for x in self.config.architectures
+ )
+ is_vlm = is_vlm or hasattr(self.config, "vision_config")
+
+ is_processor = is_vlm and isinstance(tokenizer, ProcessorMixin)
+
+    # config.architectures is a list of class names, so test membership
+    is_gpt_oss = (
+        getattr(self.config, "architectures", None) is not None
+        and "GptOssForCausalLM" in self.config.architectures
+    ) or (
+        hasattr(self.config, "model_type")
+        and self.config.model_type in ["gpt-oss", "gpt_oss"]
+    )
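Since `config.architectures` is a list of class names, membership (not string equality) is the reliable test. A small sketch with hypothetical stub configs:

```python
# Stub configs standing in for transformers PretrainedConfig objects.
class GptOssConfig:
    architectures = ["GptOssForCausalLM"]
    model_type = "gpt_oss"

class LlamaConfig:
    architectures = ["LlamaForCausalLM"]
    model_type = "llama"

def detect_gpt_oss(config):
    return (
        getattr(config, "architectures", None) is not None
        and "GptOssForCausalLM" in config.architectures
    ) or getattr(config, "model_type", None) in ("gpt-oss", "gpt_oss")

print(detect_gpt_oss(GptOssConfig()))  # True
print(detect_gpt_oss(LlamaConfig()))   # False
```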
+ # Step 2: Prepare arguments for model saving
arguments = dict(locals())
- arguments["model"] = self
- arguments["tokenizer"] = tokenizer
- arguments["push_to_hub"] = False # We save ourselves
- arguments["save_method"] = "merged_16bit" # Must be 16bit
+ arguments["model"] = self
+ arguments["tokenizer"] = tokenizer
+ arguments["push_to_hub"] = False # We handle upload ourselves
+ # GPT-OSS needs mxfp4 save method
+ if is_gpt_oss:
+ if quantization_method is not None:
+ _qm = (
+ quantization_method
+ if isinstance(quantization_method, (list, tuple))
+ else [quantization_method]
+ )
+ _ignored = [q for q in _qm if str(q).lower() != "mxfp4"]
+ if _ignored:
+ logger.warning_once(
+ f"Unsloth: GPT-OSS does not support GGUF quantization "
+ f"(requested: {', '.join(str(q) for q in _ignored)}). "
+ f"Overriding to MXFP4 format. "
+ f"Pass quantization_method=None to suppress this warning."
+ )
+ arguments["save_method"] = "mxfp4"
+ else:
+ arguments["save_method"] = "merged_16bit"
del arguments["self"]
del arguments["quantization_method"]
del arguments["first_conversion"]
+ del arguments["is_vlm"]
+ del arguments["is_gpt_oss"]
+ del arguments["model_name"]
+ del arguments["base_model_name"]
+ del arguments["is_processor"]
- # Fix tokenizer adding an extra BOS token at the front
- fix_bos_token, old_chat_template = fix_tokenizer_bos_token(tokenizer)
-
- # Non blocking install GGUF first
- if not os.path.exists("llama.cpp"):
-
- if IS_KAGGLE_ENVIRONMENT:
- # Kaggle is weird - no blocking installs, and no CUDA?
- python_install = install_python_non_blocking(["gguf", "protobuf"])
- python_install.wait()
- install_llama_cpp_blocking(use_cuda = False)
- new_save_directory, old_username = unsloth_save_model(**arguments)
- makefile = None
- else:
- git_clone = install_llama_cpp_clone_non_blocking()
- python_install = install_python_non_blocking(["gguf", "protobuf"])
- git_clone.wait()
- makefile = install_llama_cpp_make_non_blocking()
- new_save_directory, old_username = unsloth_save_model(**arguments)
- python_install.wait()
- pass
+ # Step 3: Fix tokenizer BOS token if needed
+ if is_processor:
+ fix_bos_token, old_chat_template = fix_tokenizer_bos_token(tokenizer.tokenizer)
else:
- try:
- new_save_directory, old_username = unsloth_save_model(**arguments)
- makefile = None
- except:
- # Retry by recloning llama.cpp
- if IS_KAGGLE_ENVIRONMENT:
- # Kaggle is weird - no blocking installs, and no CUDA?
- python_install = install_python_non_blocking(["gguf", "protobuf"])
- python_install.wait()
- install_llama_cpp_blocking(use_cuda = False)
- new_save_directory, old_username = unsloth_save_model(**arguments)
- makefile = None
- else:
- git_clone = install_llama_cpp_clone_non_blocking()
- python_install = install_python_non_blocking(["gguf", "protobuf"])
- git_clone.wait()
- makefile = install_llama_cpp_make_non_blocking()
- new_save_directory, old_username = unsloth_save_model(**arguments)
- python_install.wait()
- pass
- pass
- pass
+ fix_bos_token, old_chat_template = fix_tokenizer_bos_token(tokenizer)
+
+ # Step 4: Save/merge model to 16-bit format
+ print(
+ f'Unsloth: Merging model weights to {"mxfp4" if is_gpt_oss else "16-bit"} format...'
+ )
+ try:
+ # Call unsloth_generic_save directly (it's in the same file)
+ unsloth_generic_save(**arguments)
+
+ except Exception as e:
+ raise RuntimeError(f"Failed to save/merge model: {e}")
+
+ if is_processor:
+ tokenizer = tokenizer.tokenizer
# Use old chat template if the bos is removed
if fix_bos_token:
tokenizer.chat_template = old_chat_template
- pass
+    # Step 6: Clean up memory
+    import gc
+
     for _ in range(3):
gc.collect()
+ if torch.cuda.is_available():
+ torch.cuda.empty_cache()
- model_dtype = self.config.torch_dtype
- model_type = self.config.model_type
- if type(model_dtype) is str:
- assert(model_dtype == "float16" or model_dtype == "bfloat16")
- elif model_dtype == torch.float16:
+ # Step 7: Get model dtype and type
+ try:
+ model_dtype = dtype_from_config(self.config)
+ model_type = self.config.model_type
+ if type(model_dtype) is str:
+ assert model_dtype == "float16" or model_dtype == "bfloat16"
+ elif model_dtype == torch.float16:
+ model_dtype = "float16"
+ elif model_dtype == torch.bfloat16:
+ model_dtype = "bfloat16"
+ else:
+ raise TypeError("Unsloth: Model dtype can only be float16 or bfloat16")
+ except Exception as e:
+ # Fallback if dtype_from_config fails
+ print(f"Unsloth: Could not determine dtype ({e}), defaulting to float16")
model_dtype = "float16"
- elif model_dtype == torch.bfloat16:
- model_dtype = "bfloat16"
- else:
- raise TypeError("Unsloth: Model dtype can only be float16 or bfloat16")
- pass
- is_sentencepiece_model = check_if_sentencepiece_model(self)
+ # Step 8: Convert to GGUF format
+ print("Unsloth: Converting to GGUF format...")
- # Save to GGUF
- all_file_locations, want_full_precision = save_to_gguf(
- model_type, model_dtype, is_sentencepiece_model,
- new_save_directory, quantization_method, first_conversion, makefile,
- )
+ # Convert quantization_method to list if string
+ # Use old style quantization_method
+ quantization_methods = []
+ if quantization_method is not None:
+ # Convert quantization_method to list
+ if isinstance(quantization_method, list):
+ pass
+ elif isinstance(quantization_method, str):
+ quantization_method = [
+ quantization_method,
+ ]
+ elif isinstance(quantization_method, tuple):
+ quantization_method = list(quantization_method)
+ else:
+ raise TypeError(
+ "Unsloth: quantization_method can only be a string or a list of strings"
+ )
+        for quant_method in quantization_method:
+            # None must be handled before .lower(), which would raise on None
+            if quant_method is None:
+                quant_method = "q8_0"
+            quant_method = quant_method.lower()
+            if quant_method == "not_quantized":
+                quant_method = "f16"
+            elif quant_method == "fast_quantized":
+                quant_method = "q8_0"
+            elif quant_method == "quantized":
+                quant_method = "q4_k_m"
+            quantization_methods.append(quant_method)
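The alias handling above maps the friendly names onto concrete llama.cpp quantization types and normalizes a bare string into a list. A compact runnable equivalent (`normalize` and `ALIASES` are illustrative names; note the `None` check must run before `.lower()`):

```python
# Map Unsloth's friendly quantization names to llama.cpp types.
ALIASES = {"not_quantized": "f16", "fast_quantized": "q8_0", "quantized": "q4_k_m"}

def normalize(quantization_method):
    if isinstance(quantization_method, str):
        quantization_method = [quantization_method]
    elif isinstance(quantization_method, tuple):
        quantization_method = list(quantization_method)
    methods = []
    for method in quantization_method:
        method = "q8_0" if method is None else str(method).lower()
        methods.append(ALIASES.get(method, method))
    return methods

print(normalize("fast_quantized"))  # ['q8_0']
print(normalize(["Q4_K_M", None]))  # ['q4_k_m', 'q8_0']
```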
- # Save Ollama modelfile
- modelfile = create_ollama_modelfile(tokenizer, all_file_locations[0])
+ try:
+ all_file_locations, want_full_precision, is_vlm_update = save_to_gguf(
+ model_name = model_name,
+ model_type = model_type,
+ model_dtype = model_dtype,
+ is_sentencepiece = False,
+ model_directory = save_directory,
+ quantization_method = quantization_methods,
+ first_conversion = first_conversion,
+ is_vlm = is_vlm, # Pass VLM flag
+ is_gpt_oss = is_gpt_oss, # Pass gpt_oss Flag
+ )
+ except Exception as e:
+ if IS_KAGGLE_ENVIRONMENT:
+ raise RuntimeError(
+ f"Unsloth: GGUF conversion failed in Kaggle environment.\n"
+ f"This is likely due to the 20GB disk space limit.\n"
+ f"Try saving to /tmp directory or use a smaller model.\n"
+ f"Error: {e}"
+ )
+ else:
+ raise RuntimeError(f"Unsloth: GGUF conversion failed: {e}")
+
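The Kaggle-specific handler above re-raises with a friendlier message. Chaining the re-raise with `raise ... from e` (shown here with stub names, as a sketch of the idea rather than the PR's exact code) keeps the original traceback attached:

```python
# Wrap a low-level failure in a RuntimeError while preserving the cause.
IS_KAGGLE_ENVIRONMENT = False  # assumption for this demo

def convert_to_gguf():
    raise OSError("No space left on device")

caught = None
try:
    try:
        convert_to_gguf()
    except Exception as e:
        if IS_KAGGLE_ENVIRONMENT:
            raise RuntimeError(
                "GGUF conversion failed in Kaggle environment; "
                "the 20GB disk limit is the usual cause."
            ) from e
        else:
            raise RuntimeError(f"GGUF conversion failed: {e}") from e
except RuntimeError as err:
    caught = err

print(caught)                          # wrapped message
print(type(caught.__cause__).__name__) # original exception is still chained
```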
+ # Step 9: Create Ollama modelfile
+ gguf_directory = f"{save_directory}_gguf"
modelfile_location = None
- if modelfile is not None:
- modelfile_location = os.path.join(new_save_directory, "Modelfile")
- with open(modelfile_location, "w") as file:
- file.write(modelfile)
- pass
- print(f"Unsloth: Saved Ollama Modelfile to {modelfile_location}")
- pass
+ ollama_success = False
+ if all_file_locations:
+ try:
+ if is_vlm_update:
+ modelfile = create_ollama_modelfile(tokenizer, base_model_name, ".")
+ else:
+ modelfile = create_ollama_modelfile(
+ tokenizer,
+ base_model_name,
+ os.path.basename(all_file_locations[0]),
+ )
+ if modelfile is not None:
+ modelfile_location = os.path.join(gguf_directory, "Modelfile")
+ with open(modelfile_location, "w", encoding = "utf-8") as file:
+ file.write(modelfile)
+ ollama_success = True
+ except Exception as e:
+ print(f"Warning: Could not create Ollama modelfile: {e}")
+ # Step 10: Show BOS token warning if applicable
if fix_bos_token:
logger.warning(
- "Unsloth: ##### The current model auto adds a BOS token.\n"\
+ "Unsloth: ##### The current model auto adds a BOS token.\n"
"Unsloth: ##### We removed it in GGUF's chat template for you."
)
- pass
- if push_to_hub:
- print("Unsloth: Uploading GGUF to Huggingface Hub...")
+ _exe = ".exe" if IS_WINDOWS else ""
+ if IS_WINDOWS:
+ _bin_dir = os.path.join(LLAMA_CPP_DEFAULT_DIR, "build", "bin", "Release")
+ else:
+ _bin_dir = LLAMA_CPP_DEFAULT_DIR
- # If not needing full precision, skip the first
- if not want_full_precision: all_file_locations = all_file_locations[1:]
+ if is_vlm_update:
+ print("\n")
+ print(
+ f"Unsloth: example usage for Multimodal LLMs: {os.path.join(_bin_dir, 'llama-mtmd-cli' + _exe)} -m {all_file_locations[0]} --mmproj {all_file_locations[-1]}"
+ )
+ print("Unsloth: load image inside llama.cpp runner: /image test_image.jpg")
+ print("Unsloth: Prompt model to describe the image")
+ else:
+ print(
+ f'Unsloth: example usage for text only LLMs: {os.path.join(_bin_dir, "llama-cli" + _exe)} --model {all_file_locations[0]} -p "why is the sky blue?"'
+ )
- for file_location in all_file_locations:
- username = upload_to_huggingface(
- self, save_directory, token,
- "GGUF converted", "gguf", file_location, old_username, private,
- )
- link = f"{username}/{new_save_directory.lstrip('/.')}" \
- if username not in new_save_directory else \
- new_save_directory.lstrip('/.')
- print(f"Saved GGUF to https://huggingface.co/{link}")
- pass
+ if ollama_success:
+ print(f"Unsloth: Saved Ollama Modelfile to {modelfile_location}")
+ print(
+ f"Unsloth: convert model to ollama format by running - ollama create model_name -f {modelfile_location}"
+ )
- # Save modelfile
- if modelfile_location is not None:
- username = upload_to_huggingface(
- self, save_directory, token,
- "GGUF converted", "gguf", modelfile_location, old_username, private,
- )
- print(f"Saved Ollama Modelfile to https://huggingface.co/{link}")
- pass
- pass
-pass
+ # Return a dict with all needed info for push_to_hub
+ return {
+ "save_directory": save_directory,
+ "gguf_directory": gguf_directory,
+ "gguf_files": all_file_locations,
+ "modelfile_location": modelfile_location,
+ "want_full_precision": want_full_precision,
+ "is_vlm": is_vlm_update,
+ "fix_bos_token": fix_bos_token,
+ }
def unsloth_push_to_hub_gguf(
self,
- repo_id : str,
- tokenizer = None,
- quantization_method : str = "fast_quantized",
- first_conversion : str = None,
- use_temp_dir : Optional[bool] = None,
- commit_message : Optional[str] = "Trained with Unsloth",
- private : Optional[bool] = None,
- token : Union[bool, str, None] = None,
- max_shard_size : Union[int, str, None] = "5GB",
- create_pr : bool = False,
- safe_serialization : bool = True,
- revision : str = None,
- commit_description : str = "Upload model trained with Unsloth 2x faster",
- tags : Optional[List[str]] = None,
- temporary_location : str = "_unsloth_temporary_saved_buffers",
- maximum_memory_usage : float = 0.85,
+ repo_id: str,
+ tokenizer = None,
+ quantization_method = "fast_quantized",
+ first_conversion: str = None,
+ use_temp_dir: Optional[bool] = None,
+ commit_message: Optional[str] = "Trained with Unsloth",
+ private: Optional[bool] = None,
+ token: Union[bool, str, None] = None,
+ max_shard_size: Union[int, str, None] = "5GB",
+ create_pr: bool = False,
+ safe_serialization: bool = True,
+ revision: str = None,
+ commit_description: str = "Upload model trained with Unsloth 2x faster",
+ tags: Optional[List[str]] = None,
+ temporary_location: str = "_unsloth_temporary_saved_buffers",
+ maximum_memory_usage: float = 0.85,
+ datasets: Optional[List[str]] = None,
):
"""
- Same as .push_to_hub(...) except 4bit weights are auto
- converted to float16 then converted to GGUF / llama.cpp format.
+ Same as .push_to_hub(...) except 4bit weights are auto
+ converted to float16 then converted to GGUF / llama.cpp format.
- Choose for `quantization_method` to be:
- "not_quantized" : "Recommended. Fast conversion. Slow inference, big files.",
- "fast_quantized" : "Recommended. Fast conversion. OK inference, OK file size.",
- "quantized" : "Recommended. Slow conversion. Fast inference, small files.",
- "f32" : "Not recommended. Retains 100% accuracy, but super slow and memory hungry.",
- "f16" : "Fastest conversion + retains 100% accuracy. Slow and memory hungry.",
- "q8_0" : "Fast conversion. High resource use, but generally acceptable.",
- "q4_k_m" : "Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K",
- "q5_k_m" : "Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K",
- "q2_k" : "Uses Q4_K for the attention.vw and feed_forward.w2 tensors, Q2_K for the other tensors.",
- "q3_k_l" : "Uses Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K",
- "q3_k_m" : "Uses Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K",
- "q3_k_s" : "Uses Q3_K for all tensors",
- "q4_0" : "Original quant method, 4-bit.",
- "q4_1" : "Higher accuracy than q4_0 but not as high as q5_0. However has quicker inference than q5 models.",
- "q4_k_s" : "Uses Q4_K for all tensors",
- "q4_k" : "alias for q4_k_m",
- "q5_k" : "alias for q5_k_m",
- "q5_0" : "Higher accuracy, higher resource usage and slower inference.",
- "q5_1" : "Even higher accuracy, resource usage and slower inference.",
- "q5_k_s" : "Uses Q5_K for all tensors",
- "q6_k" : "Uses Q8_K for all tensors",
+    Set `quantization_method` to one of:
+ "not_quantized" : "Recommended. Fast conversion. Slow inference, big files.",
+ "fast_quantized" : "Recommended. Fast conversion. OK inference, OK file size.",
+ "quantized" : "Recommended. Slow conversion. Fast inference, small files.",
+ "f32" : "Not recommended. Retains 100% accuracy, but super slow and memory hungry.",
+ "f16" : "Fastest conversion + retains 100% accuracy. Slow and memory hungry.",
+ "q8_0" : "Fast conversion. High resource use, but generally acceptable.",
+ "q4_k_m" : "Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K",
+ "q5_k_m" : "Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K",
+        "q2_k" : "Uses Q4_K for the attention.wv and feed_forward.w2 tensors, Q2_K for the other tensors.",
+ "q3_k_l" : "Uses Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K",
+ "q3_k_m" : "Uses Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K",
+ "q3_k_s" : "Uses Q3_K for all tensors",
+ "q4_0" : "Original quant method, 4-bit.",
+ "q4_1" : "Higher accuracy than q4_0 but not as high as q5_0. However has quicker inference than q5 models.",
+ "q4_k_s" : "Uses Q4_K for all tensors",
+ "q5_0" : "Higher accuracy, higher resource usage and slower inference.",
+ "q5_1" : "Even higher accuracy, resource usage and slower inference.",
+ "q5_k_s" : "Uses Q5_K for all tensors",
+ "q6_k" : "Uses Q8_K for all tensors",
"""
if tokenizer is None:
raise ValueError("Unsloth: Saving to GGUF must have a tokenizer.")
- arguments = dict(locals())
- arguments["model"] = self
- arguments["tokenizer"] = tokenizer
- arguments["save_directory"] = repo_id
- arguments["push_to_hub"] = False # We save ourselves
- arguments["save_method"] = "merged_16bit" # Must be 16bit
- del arguments["self"]
- del arguments["repo_id"]
- del arguments["quantization_method"]
- del arguments["first_conversion"]
+ # Step 1: Determine save directory
+ model_name = repo_id.split("/")[-1] if "/" in repo_id else repo_id
- # Fix tokenizer adding an extra BOS token at the front
- fix_bos_token, old_chat_template = fix_tokenizer_bos_token(tokenizer)
+ if use_temp_dir or use_temp_dir is None:
+ import tempfile
- # Non blocking install GGUF first
- if not os.path.exists("llama.cpp"):
+ temp_dir = tempfile.mkdtemp(prefix = "unsloth_gguf_")
+ save_directory = temp_dir
+ cleanup_temp = True
+ else:
+ save_directory = model_name # Use model name, not repo_id
+ cleanup_temp = False
- if IS_KAGGLE_ENVIRONMENT:
- # Kaggle is weird - no blocking installs, and no CUDA?
- python_install = install_python_non_blocking(["gguf", "protobuf"])
- python_install.wait()
- install_llama_cpp_blocking(use_cuda = False)
- new_save_directory, old_username = unsloth_save_model(**arguments)
- makefile = None
+ # Step 2: Call save_pretrained_gguf to do the conversion
+    print("Unsloth: Converting model to GGUF format...")
+
+ try:
+ # Call save_pretrained_gguf - it returns all the info we need
+ result = unsloth_save_pretrained_gguf(
+ self = self,
+ save_directory = save_directory,
+ tokenizer = tokenizer,
+ quantization_method = quantization_method,
+ first_conversion = first_conversion,
+ push_to_hub = False, # Never push from here
+ token = None, # Don't need token for local save
+ max_shard_size = max_shard_size,
+ safe_serialization = safe_serialization,
+ temporary_location = temporary_location,
+ maximum_memory_usage = maximum_memory_usage,
+ )
+
+ # Extract results
+ all_file_locations = result["gguf_files"]
+ modelfile_location = result["modelfile_location"]
+ want_full_precision = result["want_full_precision"]
+ is_vlm = result["is_vlm"]
+ fix_bos_token = result["fix_bos_token"]
+ actual_save_directory = result["save_directory"]
+
+ except Exception as e:
+ if cleanup_temp:
+ import shutil
+
+ for d in [save_directory, f"{save_directory}_gguf"]:
+ try:
+ shutil.rmtree(d)
+                except OSError:
+ pass
+ raise RuntimeError(f"Failed to convert model to GGUF: {e}")
+
+ # Step 3: Upload to HuggingFace Hub
+ print("Unsloth: Uploading GGUF to Huggingface Hub...")
+
+ try:
+ from huggingface_hub import HfApi
+
+ api = HfApi(token = token)
+
+ # Get full repo id
+ if "/" not in repo_id:
+ username = api.whoami()["name"]
+ full_repo_id = f"{username}/{repo_id}"
else:
- git_clone = install_llama_cpp_clone_non_blocking()
- python_install = install_python_non_blocking(["gguf", "protobuf"])
- git_clone.wait()
- makefile = install_llama_cpp_make_non_blocking()
- new_save_directory, old_username = unsloth_save_model(**arguments)
- python_install.wait()
- pass
- else:
- try:
- new_save_directory, old_username = unsloth_save_model(**arguments)
- makefile = None
- except:
- # Retry by recloning llama.cpp
- if IS_KAGGLE_ENVIRONMENT:
- # Kaggle is weird - no blocking installs, and no CUDA?
- python_install = install_python_non_blocking(["gguf", "protobuf"])
- python_install.wait()
- install_llama_cpp_blocking(use_cuda = False)
- new_save_directory, old_username = unsloth_save_model(**arguments)
- makefile = None
+ full_repo_id = repo_id
+
+ # Create repo
+ api.create_repo(
+ repo_id = full_repo_id,
+ repo_type = "model",
+ private = private,
+ exist_ok = True,
+ )
+
+ # Upload GGUF files
+ for file_location in all_file_locations:
+ original_name = os.path.basename(file_location)
+ # Replace temp directory name with proper model name
+ if cleanup_temp and "unsloth_gguf_" in original_name:
+ # Extract the quantization part (e.g., ".Q8_0.gguf" or ".Q8_0-mmproj.gguf")
+ quant_suffix = (
+ original_name.split(".", 1)[1]
+ if "." in original_name
+ else original_name
+ )
+ proper_name = f"{model_name}.{quant_suffix}"
else:
- git_clone = install_llama_cpp_clone_non_blocking()
- python_install = install_python_non_blocking(["gguf", "protobuf"])
- git_clone.wait()
- makefile = install_llama_cpp_make_non_blocking()
- new_save_directory, old_username = unsloth_save_model(**arguments)
- python_install.wait()
+ proper_name = original_name.replace(
+ os.path.basename(save_directory), model_name
+ )
+
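The renaming step above can be isolated into a pure helper, which makes the two branches easy to check. The helper name is hypothetical and only mirrors the logic in the loop:

```python
import os

def gguf_upload_name(file_location, save_directory, model_name, cleanup_temp):
    """Return the filename a GGUF file should carry on the Hub: temp-dir
    prefixes like 'unsloth_gguf_xxxx' are swapped for the model's name."""
    original_name = os.path.basename(file_location)
    if cleanup_temp and "unsloth_gguf_" in original_name:
        # Keep only the part after the first dot, e.g. "Q8_0.gguf"
        quant_suffix = (
            original_name.split(".", 1)[1] if "." in original_name else original_name
        )
        return f"{model_name}.{quant_suffix}"
    # Local saves: replace the directory-derived prefix with the model name
    return original_name.replace(os.path.basename(save_directory), model_name)
```

So a temp file `unsloth_gguf_ab12.Q8_0.gguf` for model `llama-3` would be uploaded as `llama-3.Q8_0.gguf`.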
+ print(f"Uploading {proper_name}...")
+
+ api.upload_file(
+ path_or_fileobj = file_location,
+ path_in_repo = proper_name,
+ repo_id = full_repo_id,
+ repo_type = "model",
+ commit_message = commit_message,
+ commit_description = commit_description,
+ create_pr = create_pr,
+ revision = revision,
+ )
+
+ # Upload config.json if exists
+ config_path = os.path.join(actual_save_directory, "config.json")
+ if os.path.exists(config_path):
+ print("Uploading config.json...")
+ api.upload_file(
+ path_or_fileobj = config_path,
+ path_in_repo = "config.json",
+ repo_id = full_repo_id,
+ repo_type = "model",
+ commit_message = f"{commit_message} - config",
+ create_pr = create_pr,
+ revision = revision,
+ )
+
+ # Upload Modelfile if exists
+ if modelfile_location and os.path.exists(modelfile_location):
+ print("Uploading Ollama Modelfile...")
+ api.upload_file(
+ path_or_fileobj = modelfile_location,
+ path_in_repo = "Modelfile",
+ repo_id = full_repo_id,
+ repo_type = "model",
+ commit_message = f"{commit_message} - Ollama Modelfile",
+ create_pr = create_pr,
+ revision = revision,
+ )
+
+ # Create and upload README
+ readme_content = f"""---
+tags:
+- gguf
+- llama.cpp
+- unsloth
+{"- vision-language-model" if is_vlm else ""}
+---
+
+# {repo_id.split("/")[-1]} : GGUF
+
+This model was finetuned and converted to GGUF format using [Unsloth](https://github.com/unslothai/unsloth).
+
+**Example usage**:
+- For text only LLMs: `llama-cli -hf {repo_id} --jinja`
+- For multimodal models: `llama-mtmd-cli -hf {repo_id} --jinja`
+
+## Available Model files:
+"""
+ for file in all_file_locations:
+ # Fix filename in README too
+ original_name = os.path.basename(file)
+ if cleanup_temp and "unsloth_gguf_" in original_name:
+ quant_suffix = (
+ original_name.split(".", 1)[1]
+ if "." in original_name
+ else original_name
+ )
+ proper_name = f"{model_name}.{quant_suffix}"
+ else:
+ proper_name = original_name.replace(
+ os.path.basename(save_directory), model_name
+ )
+ readme_content += f"- `{proper_name}`\n"
+
+ # Special note for VLM with Modelfile
+ if is_vlm and modelfile_location:
+ readme_content += "\n## ⚠️ Ollama Note for Vision Models\n"
+ readme_content += "**Important:** Ollama currently does not support separate mmproj files for vision models.\n\n"
+ readme_content += "To create an Ollama model from this vision model:\n"
+ readme_content += "1. Place the `Modelfile` in the same directory as the finetuned bf16 merged model\n"
+        readme_content += "2. Run: `ollama create model_name -f ./Modelfile`\n"
+ readme_content += " (Replace `model_name` with your desired name)\n\n"
+ readme_content += (
+ "This will create a unified bf16 model that Ollama can use.\n"
+ )
+ elif modelfile_location:
+ readme_content += "\n## Ollama\n"
+ readme_content += "An Ollama Modelfile is included for easy deployment.\n"
+
+ if fix_bos_token:
+ readme_content += "\n## Note\n"
+ readme_content += (
+ "The model's BOS token behavior was adjusted for GGUF compatibility.\n"
+ )
+
+    readme_content += (
+        "This was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth)\n"
+    )
+
+ readme_path = os.path.join(actual_save_directory, "README.md")
+    with open(readme_path, "w", encoding = "utf-8") as f:
+ f.write(readme_content)
+
+ api.upload_file(
+ path_or_fileobj = readme_path,
+ path_in_repo = "README.md",
+ repo_id = full_repo_id,
+ repo_type = "model",
+ commit_message = "Add README",
+ create_pr = create_pr,
+ revision = revision,
+ )
+
+ print(
+ f"Unsloth: Successfully uploaded GGUF to https://huggingface.co/{full_repo_id}"
+ )
+
+ # Add tags
+ if tags is None:
+ tags = []
+ tags.extend(["gguf", "llama-cpp", "unsloth"])
+ if is_vlm:
+ tags.append("vision-language-model")
+
+ try:
+ api.add_tags(
+ repo_id = full_repo_id,
+ tags = tags,
+ repo_type = "model",
+ )
+    except Exception:
pass
- pass
- pass
- # Use old chat template if the bos is removed
- if fix_bos_token:
- tokenizer.chat_template = old_chat_template
- pass
+ if datasets:
+ try:
+ from huggingface_hub import metadata_update
- for _ in range(3):
- gc.collect()
+ metadata_update(
+ full_repo_id, {"datasets": datasets}, overwrite = True, token = token
+ )
+ except Exception as e:
+ logger.warning_once(
+ f"Unsloth: Could not update datasets metadata for {full_repo_id}: {e}"
+ )
- model_dtype = self.config.torch_dtype
- model_type = self.config.model_type
- if type(model_dtype) is str:
- assert(model_dtype == "float16" or model_dtype == "bfloat16")
- elif model_dtype == torch.float16:
- model_dtype = "float16"
- elif model_dtype == torch.bfloat16:
- model_dtype = "bfloat16"
- else:
- raise TypeError("Unsloth: Model dtype can only be float16 or bfloat16")
- pass
+ except Exception as e:
+ raise RuntimeError(f"Failed to upload to Hugging Face Hub: {e}")
- is_sentencepiece_model = check_if_sentencepiece_model(self)
+ finally:
+ # Clean up temporary directory
+ if cleanup_temp:
+ print("Unsloth: Cleaning up temporary files...")
+ import shutil
- # Save to GGUF
- all_file_locations, want_full_precision = save_to_gguf(
- model_type, model_dtype, is_sentencepiece_model,
- new_save_directory, quantization_method, first_conversion, makefile,
- )
+ for d in [save_directory, f"{save_directory}_gguf"]:
+ if os.path.exists(d):
+ try:
+ shutil.rmtree(d)
+                except OSError:
+ pass
- # Save Ollama modelfile
- modelfile = create_ollama_modelfile(tokenizer, all_file_locations[0])
- modelfile_location = None
- if modelfile is not None:
- modelfile_location = os.path.join(new_save_directory, "Modelfile")
- with open(modelfile_location, "w") as file:
- file.write(modelfile)
- pass
- print(f"Unsloth: Saved Ollama Modelfile to {modelfile_location}")
- pass
+ return full_repo_id
- # If not needing full precision, skip the first
- if not want_full_precision: all_file_locations = all_file_locations[1:]
-
- for file_location in all_file_locations:
- print("Unsloth: Uploading GGUF to Huggingface Hub...")
- username = upload_to_huggingface(
- self, repo_id, token,
- "GGUF converted", "gguf", file_location, old_username, private,
- )
- link = f"{username}/{new_save_directory.lstrip('/.')}" \
- if username not in new_save_directory else \
- new_save_directory.lstrip('/.')
-
- print(f"Saved GGUF to https://huggingface.co/{link}")
- pass
-
- # Save modelfile
- if modelfile_location is not None:
- username = upload_to_huggingface(
- self, repo_id, token,
- "GGUF converted", "gguf", modelfile_location, old_username, private,
- )
- print(f"Saved Ollama Modelfile to https://huggingface.co/{link}")
- pass
-
- if fix_bos_token:
- logger.warning(
- "Unsloth: ##### The current model auto adds a BOS token.\n"\
- "Unsloth: ##### We removed it in GGUF's chat template for you."
- )
- pass
-pass
# Corrected function to save LoRA to a custom directory
def save_lora_to_custom_dir(model, tokenizer, save_directory):
# Create the custom directory if it doesn't exist
- os.makedirs(save_directory, exist_ok=True)
+ os.makedirs(save_directory, exist_ok = True)
# Call the unsloth_save_model function with the custom directory
unsloth_save_model(
model,
tokenizer,
- save_directory=save_directory,
- save_method="lora",
- push_to_hub=False,
+ save_directory = save_directory,
+ save_method = "lora",
+ push_to_hub = False,
)
+
# Corrected method within the model class to convert LoRA to GGML and push to Hugging Face Hub
def unsloth_convert_lora_to_ggml_and_push_to_hub(
self,
@@ -1932,7 +2477,7 @@ def unsloth_convert_lora_to_ggml_and_push_to_hub(
if IS_KAGGLE_ENVIRONMENT:
python_install = install_python_non_blocking(["protobuf"])
python_install.wait()
- install_llama_cpp_blocking(use_cuda=False)
+ install_llama_cpp_blocking(use_cuda = False)
makefile = None
else:
git_clone = install_llama_cpp_clone_non_blocking()
@@ -1952,17 +2497,26 @@ def unsloth_convert_lora_to_ggml_and_push_to_hub(
model_type = self.config.model_type
output_file = os.path.join(lora_directory_push, "ggml-adapter-model.bin")
- print(f"Unsloth: Converting auto-saved LoRA adapters at {lora_directory_push} to GGML format.")
+ print(
+ f"Unsloth: Converting auto-saved LoRA adapters at {lora_directory_push} to GGML format."
+ )
print(f"The output file will be {output_file}")
command = f"python3 llama.cpp/convert-lora-to-ggml.py {lora_directory_push} {output_file} llama"
try:
- with subprocess.Popen(command, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE, bufsize=1, universal_newlines=True) as sp:
+ with subprocess.Popen(
+ command,
+ shell = True,
+ stdout = subprocess.PIPE,
+ stderr = subprocess.PIPE,
+ bufsize = 1,
+ universal_newlines = True,
+ ) as sp:
for line in sp.stdout:
- print(line, end="", flush=True)
+ print(line, end = "", flush = True)
for line in sp.stderr:
- print(line, end="", flush=True)
+ print(line, end = "", flush = True)
sp.wait()
if sp.returncode != 0:
raise subprocess.CalledProcessError(sp.returncode, command)
@@ -1974,18 +2528,27 @@ def unsloth_convert_lora_to_ggml_and_push_to_hub(
print("Unsloth: Uploading GGML file to Hugging Face Hub...")
username = upload_to_huggingface(
- self, repo_id, token,
- "GGML converted LoRA", "ggml", output_file, None, private,
+ self,
+ repo_id,
+ token,
+ "GGML converted LoRA",
+ "ggml",
+ output_file,
+ None,
+ private,
)
link = f"{repo_id.lstrip('/')}"
print("Unsloth: Done.")
print(f"Converted LoRA to GGML and uploaded to https://huggingface.co/{link}")
- print("\nThis GGML making function was made by Maheswar. Ping him @Maheswar on the Unsloth Discord or on HuggingFace (@mahiatlinux) if you like this!")
+ print(
+ "\nThis GGML making function was made by Maheswar. Ping him @Maheswar on the Unsloth Discord or on HuggingFace (@mahiatlinux) if you like this!"
+ )
+
def unsloth_convert_lora_to_ggml_and_save_locally(
self,
- save_directory: str, # Added parameter for the folder name
- tokenizer,
+ save_directory: str, # Added parameter for the folder name
+ tokenizer,
temporary_location: str = "_unsloth_temporary_saved_buffers",
maximum_memory_usage: float = 0.85,
):
@@ -1993,7 +2556,7 @@ def unsloth_convert_lora_to_ggml_and_save_locally(
if IS_KAGGLE_ENVIRONMENT:
python_install = install_python_non_blocking(["protobuf"])
python_install.wait()
- install_llama_cpp_blocking(use_cuda=False)
+ install_llama_cpp_blocking(use_cuda = False)
makefile = None
else:
git_clone = install_llama_cpp_clone_non_blocking()
@@ -2013,17 +2576,26 @@ def unsloth_convert_lora_to_ggml_and_save_locally(
model_type = self.config.model_type
output_file = os.path.join(save_directory, "ggml-adapter-model.bin")
- print(f"Unsloth: Converting auto-saved LoRA adapters at {save_directory} to GGML format.")
+ print(
+ f"Unsloth: Converting auto-saved LoRA adapters at {save_directory} to GGML format."
+ )
print(f"The output file will be {output_file}")
command = f"python3 llama.cpp/convert-lora-to-ggml.py {save_directory} {output_file} llama"
try:
- with subprocess.Popen(command, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE, bufsize=1, universal_newlines=True) as sp:
+ with subprocess.Popen(
+ command,
+ shell = True,
+ stdout = subprocess.PIPE,
+ stderr = subprocess.PIPE,
+ bufsize = 1,
+ universal_newlines = True,
+ ) as sp:
for line in sp.stdout:
- print(line, end="", flush=True)
+ print(line, end = "", flush = True)
for line in sp.stderr:
- print(line, end="", flush=True)
+ print(line, end = "", flush = True)
sp.wait()
if sp.returncode != 0:
raise subprocess.CalledProcessError(sp.returncode, command)
@@ -2032,92 +2604,211 @@ def unsloth_convert_lora_to_ggml_and_save_locally(
return
print("Unsloth: Done.")
print(f"Unsloth: Conversion completed! Output file: {output_file}")
- print("\nThis GGML making function was made by Maheswar. Ping him @Maheswar on the Unsloth Discord or on HuggingFace (@mahiatlinux) if you like this!")
-pass
+ print(
+ "\nThis GGML making function was made by Maheswar. Ping him @Maheswar on the Unsloth Discord or on HuggingFace (@mahiatlinux) if you like this!"
+ )
from .models.loader_utils import get_model_name
-from unsloth_zoo.saving_utils import merge_and_overwrite_lora
+from unsloth_zoo.saving_utils import (
+ merge_and_overwrite_lora,
+ prepare_saving,
+)
+from unsloth_zoo.llama_cpp import (
+ install_llama_cpp,
+ convert_to_gguf as _convert_to_gguf,
+)
+
+
+@torch.inference_mode
+def save_to_gguf_generic(
+ model,
+ save_directory,
+ tokenizer,
+ quantization_method = None,
+ quantization_type = "Q8_0",
+ repo_id = None,
+ token = None,
+):
+ if token is None and repo_id is not None:
+ token = get_token()
+ if repo_id is not None and token is None:
+ raise RuntimeError("Unsloth: Please specify a token for uploading!")
+
+ if not os.path.exists(os.path.join("llama.cpp", "unsloth_convert_hf_to_gguf.py")):
+ install_llama_cpp(just_clone_repo = True)
+
+ # Use old style quantization_method
+ new_quantization_methods = []
+ if quantization_method is not None:
+ # Convert quantization_method to list
+ if isinstance(quantization_method, list):
+ pass
+ elif isinstance(quantization_method, str):
+ quantization_method = [
+ quantization_method,
+ ]
+ elif isinstance(quantization_method, tuple):
+ quantization_method = list(quantization_method)
+ else:
+ raise TypeError(
+ "Unsloth: quantization_method can only be a string or a list of strings"
+ )
+        for quant_method in quantization_method:
+            # Handle None before lowercasing - calling .lower() on None would raise
+            if quant_method is None:
+                quant_method = "q8_0"
+            else:
+                quant_method = quant_method.lower()
+            if quant_method == "not_quantized":
+                quant_method = "f16"
+            elif quant_method == "fast_quantized":
+                quant_method = "q8_0"
+            elif quant_method == "quantized":
+                quant_method = "q4_k_m"
+            new_quantization_methods.append(quant_method)
+ else:
+ new_quantization_methods.append(quantization_type.lower())
+ # Check if wrong method
+ for quant_method in new_quantization_methods:
+ if quant_method not in ALLOWED_QUANTS.keys():
+ error = f"Unsloth: Quant method = [{quant_method}] not supported. Choose from below:\n"
+ for key, value in ALLOWED_QUANTS.items():
+ error += f"[{key}] => {value}\n"
+ raise RuntimeError(error)
+
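The validation loop above can be exercised in isolation. `ALLOWED_QUANTS` here is a stand-in subset for illustration; the real mapping lives elsewhere in the module:

```python
# Stand-in subset of ALLOWED_QUANTS, for illustration only
ALLOWED_QUANTS = {
    "f16"    : "Fastest conversion, retains full accuracy.",
    "q8_0"   : "Fast conversion, generally acceptable quality.",
    "q4_k_m" : "Recommended small-file quantization.",
}

def validate_quant_methods(methods):
    """Reject unknown quantization names with a message listing every option."""
    for quant_method in methods:
        if quant_method not in ALLOWED_QUANTS:
            error = f"Unsloth: Quant method = [{quant_method}] not supported. Choose from below:\n"
            for key, value in ALLOWED_QUANTS.items():
                error += f"[{key}] => {value}\n"
            raise RuntimeError(error)
```

Listing every valid option in the error message lets users correct a typo without digging through the docstring.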
+ # Go through all types and save individually - somewhat inefficient
+ # since we save F16 / BF16 multiple times
+ for quantization_type in new_quantization_methods:
+ metadata = _convert_to_gguf(
+ save_directory,
+ print_output = True,
+ quantization_type = quantization_type,
+ )
+ if repo_id is not None:
+ prepare_saving(
+ model,
+ repo_id,
+ push_to_hub = True,
+ max_shard_size = "50GB",
+ private = True,
+ token = token,
+ )
+
+ from huggingface_hub import HfApi
+
+ api = HfApi(token = token)
+ api.upload_folder(
+ folder_path = save_directory,
+ repo_id = repo_id,
+ repo_type = "model",
+ allow_patterns = ["*.gguf"],
+ )
+ return metadata
+
@torch.inference_mode
def unsloth_generic_save(
model,
tokenizer,
- save_directory : Union[str, os.PathLike] = "unsloth_finetuned_merge",
- save_method : str = "lora", # ["lora", "merged_16bit", "merged_4bit"]
- push_to_hub : bool = False,
- token : Optional[Union[str, bool]] = None,
- is_main_process : bool = True,
- state_dict : Optional[dict] = None,
- save_function : Callable = torch.save,
- max_shard_size : Union[int, str] = "5GB",
- safe_serialization : bool = True,
- variant : Optional[str] = None,
- save_peft_format : bool = True,
-
+ save_directory: Union[str, os.PathLike] = "unsloth_finetuned_merge",
+ save_method: str = "lora", # ["lora", "merged_16bit", "merged_4bit"]
+ push_to_hub: bool = False,
+ token: Optional[Union[str, bool]] = None,
+ is_main_process: bool = True,
+ state_dict: Optional[dict] = None,
+ save_function: Callable = torch.save,
+ max_shard_size: Union[int, str] = "5GB",
+ safe_serialization: bool = True,
+ variant: Optional[str] = None,
+ save_peft_format: bool = True,
# Push to hub
- use_temp_dir : Optional[bool] = None,
- commit_message : Optional[str] = "Trained with Unsloth",
- private : Optional[bool] = None,
- create_pr : bool = False,
- revision : str = None,
- commit_description : str = "Upload model trained with Unsloth 2x faster",
- tags : List[str] = None,
-
+ use_temp_dir: Optional[bool] = None,
+ commit_message: Optional[str] = "Trained with Unsloth",
+ private: Optional[bool] = None,
+ create_pr: bool = False,
+ revision: str = None,
+ commit_description: str = "Upload model trained with Unsloth 2x faster",
+ tags: List[str] = None,
# Our functions
- temporary_location : str = "_unsloth_temporary_saved_buffers",
- maximum_memory_usage : float = 0.9,
+ temporary_location: str = "_unsloth_temporary_saved_buffers",
+ maximum_memory_usage: float = 0.9,
+ datasets: Optional[List[str]] = None,
):
- if token is None and push_to_hub: token = get_token()
+ if token is None and push_to_hub:
+ token = get_token()
+
+ if save_method == "merged_4bit":
+ raise RuntimeError(
+ "Unsloth: Merging into 4bit will cause your model to lose accuracy if you plan\n"
+            "to merge to GGUF or others later on. I suggest you do this as a final step\n"
+ "if you're planning to do multiple saves.\n"
+ "If you are certain, change `save_method` to `merged_4bit_forced`."
+ )
+ elif save_method == "merged_4bit_forced":
+ save_method = "merged_4bit"
+
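The guard above follows a small, testable pattern: refuse the lossy option by default, but honor an explicit `_forced` opt-in. A minimal sketch (hypothetical helper name, mirroring the branch logic):

```python
def resolve_save_method(save_method):
    """Block accidental 4bit merges, but let callers opt in explicitly
    via the `_forced` suffix, which maps back to the real method name."""
    if save_method == "merged_4bit":
        raise RuntimeError(
            "Merging into 4bit will cause your model to lose accuracy. "
            "If you are certain, use `merged_4bit_forced`."
        )
    if save_method == "merged_4bit_forced":
        return "merged_4bit"
    return save_method
```

This keeps the downstream save path unaware of the `_forced` spelling: it only ever sees canonical method names.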
merge_and_overwrite_lora(
get_model_name,
- model = model,
- tokenizer = tokenizer,
- save_directory = save_directory,
- push_to_hub = push_to_hub,
- private = private,
- token = token,
- output_dtype = None,
- low_disk_space_usage = False,
- use_temp_file = False,
+ model = model,
+ tokenizer = tokenizer,
+ save_directory = save_directory,
+ push_to_hub = push_to_hub,
+ private = private,
+ token = token,
+ save_method = save_method,
+ output_dtype = None,
+ low_disk_space_usage = True,
+ use_temp_file = False,
)
+
+ if push_to_hub and datasets:
+ try:
+ from huggingface_hub import metadata_update
+
+ save_dir, _ = _determine_username(save_directory, None, token)
+ metadata_update(
+ save_dir, {"datasets": datasets}, overwrite = True, token = token
+ )
+ except Exception as e:
+ logger.warning_once(
+ f"Unsloth: Could not update datasets metadata for {save_directory}: {e}"
+ )
+
return
-pass
def unsloth_generic_save_pretrained_merged(
self,
- save_directory : Union[str, os.PathLike],
- tokenizer = None,
- save_method : str = "merged_16bit", # ["lora", "merged_16bit", "merged_4bit"]
- push_to_hub : bool = False,
- token : Optional[Union[str, bool]] = None,
- is_main_process : bool = True,
- state_dict : Optional[dict] = None,
- save_function : Callable = torch.save,
- max_shard_size : Union[int, str] = "5GB",
- safe_serialization : bool = True,
- variant : Optional[str] = None,
- save_peft_format : bool = True,
- tags : List[str] = None,
- temporary_location : str = "_unsloth_temporary_saved_buffers",
- maximum_memory_usage : float = 0.75,
-):
+ save_directory: Union[str, os.PathLike],
+ tokenizer = None,
+ save_method: str = "merged_16bit", # ["lora", "merged_16bit", "merged_4bit"]
+ push_to_hub: bool = False,
+ token: Optional[Union[str, bool]] = None,
+ is_main_process: bool = True,
+ state_dict: Optional[dict] = None,
+ save_function: Callable = torch.save,
+ max_shard_size: Union[int, str] = "5GB",
+ safe_serialization: bool = True,
+ variant: Optional[str] = None,
+ save_peft_format: bool = True,
+ tags: List[str] = None,
+ temporary_location: str = "_unsloth_temporary_saved_buffers",
+ maximum_memory_usage: float = 0.75,
+ datasets: Optional[List[str]] = None,
+):
"""
- Same as .push_to_hub(...) except 4bit weights are auto
- converted to float16 with as few overhead as possible.
+ Same as .push_to_hub(...) except 4bit weights are auto
+    converted to float16 with as little overhead as possible.
- Choose for `save_method` to be either:
- 1. `16bit`: Merge LoRA into float16 weights. Useful for GGUF / llama.cpp.
- 2. `4bit`: Merge LoRA into int4 weights. Useful for DPO / HF inference.
- 3. `lora`: Save LoRA adapters with no merging. Useful for HF inference.
+    Set `save_method` to one of:
+ 1. `16bit`: Merge LoRA into float16 weights. Useful for GGUF / llama.cpp.
+ 2. `4bit`: Merge LoRA into int4 weights. Useful for DPO / HF inference.
+ 3. `lora`: Save LoRA adapters with no merging. Useful for HF inference.
"""
if tokenizer is None:
logger.warning_once(
- "Unsloth: You're not saving a tokenizer as well?\n"\
+ "Unsloth: You're not saving a tokenizer as well?\n"
"You can do it separately via `tokenizer.save_pretrained(...)`"
)
- pass
arguments = dict(locals())
arguments["model"] = self
@@ -2125,58 +2816,266 @@ def unsloth_generic_save_pretrained_merged(
unsloth_generic_save(**arguments)
for _ in range(3):
gc.collect()
-pass
def unsloth_generic_push_to_hub_merged(
self,
- repo_id : str,
- tokenizer = None,
- save_method : str = "merged_16bit", # ["lora", "merged_16bit", "merged_4bit"]
- use_temp_dir : Optional[bool] = None,
- commit_message : Optional[str] = "Trained with Unsloth",
- private : Optional[bool] = None,
- token : Union[bool, str, None] = None,
- max_shard_size : Union[int, str, None] = "5GB",
- create_pr : bool = False,
- safe_serialization : bool = True,
- revision : str = None,
- commit_description : str = "Upload model trained with Unsloth 2x faster",
- tags : Optional[List[str]] = None,
- temporary_location : str = "_unsloth_temporary_saved_buffers",
- maximum_memory_usage : float = 0.75,
+ repo_id: str,
+ tokenizer = None,
+ save_method: str = "merged_16bit", # ["lora", "merged_16bit", "merged_4bit"]
+ use_temp_dir: Optional[bool] = None,
+ commit_message: Optional[str] = "Trained with Unsloth",
+ private: Optional[bool] = None,
+ token: Union[bool, str, None] = None,
+ max_shard_size: Union[int, str, None] = "5GB",
+ create_pr: bool = False,
+ safe_serialization: bool = True,
+    revision: Optional[str] = None,
+ commit_description: str = "Upload model trained with Unsloth 2x faster",
+ tags: Optional[List[str]] = None,
+ temporary_location: str = "_unsloth_temporary_saved_buffers",
+ maximum_memory_usage: float = 0.75,
+ datasets: Optional[List[str]] = None,
):
"""
- Same as .push_to_hub(...) except 4bit weights are auto
- converted to float16 with as few overhead as possible.
+    Same as .push_to_hub(...) except 4bit weights are automatically
+    converted to float16 with as little overhead as possible.
- Choose for `save_method` to be either:
- 1. `16bit`: Merge LoRA into float16 weights. Useful for GGUF / llama.cpp.
- 2. `4bit`: Merge LoRA into int4 weights. Useful for DPO / HF inference.
- 3. `lora`: Save LoRA adapters with no merging. Useful for HF inference.
+    Set `save_method` to one of:
+ 1. `16bit`: Merge LoRA into float16 weights. Useful for GGUF / llama.cpp.
+ 2. `4bit`: Merge LoRA into int4 weights. Useful for DPO / HF inference.
+ 3. `lora`: Save LoRA adapters with no merging. Useful for HF inference.
"""
if tokenizer is None:
logger.warning_once(
- "Unsloth: You're not saving a tokenizer as well?\n"\
+ "Unsloth: You're not saving a tokenizer as well?\n"
"You can do it separately via `tokenizer.push_to_hub(...)`"
)
- pass
arguments = dict(locals())
- arguments["model"] = self
+ arguments["model"] = self
arguments["save_directory"] = repo_id
- arguments["push_to_hub"] = True
+ arguments["push_to_hub"] = True
del arguments["self"]
del arguments["repo_id"]
unsloth_generic_save(**arguments)
for _ in range(3):
gc.collect()
-pass
+
+
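Both wrappers above forward their keyword arguments by snapshotting `locals()` and rewriting a few keys before delegating to `unsloth_generic_save`. A minimal, self-contained sketch of that forwarding pattern, where `save_backend` is a hypothetical stand-in for the real save function:

```python
def save_backend(**kwargs):
    # Stand-in for unsloth_generic_save: just echo what it received.
    return kwargs

def push_to_hub_merged(self, repo_id, save_method = "merged_16bit"):
    # Snapshot every parameter by name, then rewrite a few keys,
    # exactly as the wrappers above do.
    arguments = dict(locals())
    arguments["model"] = self            # rename `self` -> `model`
    arguments["save_directory"] = repo_id
    arguments["push_to_hub"] = True
    del arguments["self"]
    del arguments["repo_id"]
    return save_backend(**arguments)

received = push_to_hub_merged("my_model", "user/repo")
assert received["model"] == "my_model"
assert received["save_directory"] == "user/repo"
assert received["push_to_hub"] is True
assert "self" not in received and "repo_id" not in received
```

Taking `dict(locals())` at the very top of the function is what makes this safe: only the parameters exist in the local scope at that point.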
+def _unsloth_save_torchao_with_attached_config(
+ model,
+ save_directory: Union[str, os.PathLike],
+ tokenizer,
+ push_to_hub: bool = False,
+ token: Optional[Union[str, bool]] = None,
+):
+ """Save a QAT-trained model by converting fake-quantized weights to real quantized weights."""
+ # Convert QAT fake-quantized weights to real quantized weights
+ _convert_torchao_model(model)
+    # PEFT models may also arrive here, so dispatch them through the attached config
+ if isinstance(model, PeftModelForCausalLM):
+ _unsloth_save_torchao_with_given_config(
+ model = model,
+ save_directory = save_directory,
+ tokenizer = tokenizer,
+ torchao_config = model.config.quantization_config,
+ push_to_hub = push_to_hub,
+ token = token,
+ )
+ return
+
+ # TorchAO does not support safe_serialization reliably
+ safe_serialization = False
+
+ if push_to_hub:
+ model.push_to_hub(
+ save_directory, safe_serialization = safe_serialization, token = token
+ )
+ tokenizer.push_to_hub(save_directory, token = token)
+ else:
+ model.save_pretrained(save_directory, safe_serialization = safe_serialization)
+ tokenizer.save_pretrained(save_directory)
+
+
+def _unsloth_save_torchao_with_given_config(
+ model,
+ save_directory: Union[str, os.PathLike],
+ tokenizer,
+ torchao_config,
+ push_to_hub: bool = False,
+ token: Optional[Union[str, bool]] = None,
+):
+ """Quantizes the model with torchao and saves a torchao quantized checkpoint
+
+    Args:
+        `save_directory`: local folder path, or Hugging Face Hub repo ID when `push_to_hub` is set to True, e.g. `my_model`
+        `torchao_config` (TorchAOBaseConfig): configuration for torchao quantization. Full list: https://docs.pytorch.org/ao/main/api_ref_quantization.html#inference-apis-for-quantize
+        `push_to_hub` (bool): whether to push the checkpoint to the Hugging Face Hub or save locally
+ """
+
+ if push_to_hub:
+ assert token is not None, "Unsloth: Please specify a token for uploading!"
+
+ assert (
+ torchao_config is not None
+ ), "Unsloth: Please specify a torchao_config for post-training quantization!"
+
+ # first merge the lora weights
+ arguments = dict(locals())
+ arguments["push_to_hub"] = False # We save ourselves
+ arguments["save_method"] = "merged_16bit" # Must be 16bit
+ del arguments["torchao_config"]
+
+ if not isinstance(model, PeftModelForCausalLM) and not isinstance(model, PeftModel):
+ model.save_pretrained(save_directory)
+ tokenizer.save_pretrained(save_directory)
+ else:
+ unsloth_generic_save(**arguments)
+
+ for _ in range(3):
+ gc.collect()
+
+ from transformers import (
+ AutoModelForCausalLM,
+ AutoTokenizer,
+ TorchAoConfig,
+ AutoModelForImageTextToText,
+ AutoProcessor,
+ )
+ from torchao import quantize_
+
+ if isinstance(torchao_config, TorchAoConfig):
+ quantization_config = torchao_config
+ else:
+ quantization_config = TorchAoConfig(quant_type = torchao_config)
+
+ # Determine if this is a VLM
+ is_vlm = False
+ if hasattr(model, "config") and hasattr(model.config, "architectures"):
+ is_vlm = any(
+ x.endswith(("ForConditionalGeneration", "ForVisionText2Text"))
+ for x in model.config.architectures
+ )
+ is_vlm = is_vlm or hasattr(model.config, "vision_config")
+ auto_model = AutoModelForImageTextToText if is_vlm else AutoModelForCausalLM
+ auto_processor = AutoProcessor if is_vlm else AutoTokenizer
+
+ tokenizer = auto_processor.from_pretrained(save_directory)
+
+ # TorchAO must only use bfloat16 for loading (float16 fails)
+ if HAS_TORCH_DTYPE:
+ kwargs = {"torch_dtype": torch.bfloat16}
+ else:
+ kwargs = {"dtype": torch.bfloat16}
+
+ # Reload with quantization applied
+ quantized_model = auto_model.from_pretrained(
+ save_directory,
+ device_map = "auto",
+ quantization_config = quantization_config,
+ **kwargs,
+ )
+
+ torchao_save_directory = save_directory + "-torchao"
+
+    # TorchAO does not support safe_serialization right now; 0.14.0 seems broken!
+    safe_serialization = False
+
+ if push_to_hub:
+ quantized_model.push_to_hub(
+ torchao_save_directory, safe_serialization = safe_serialization, token = token
+ )
+ tokenizer.push_to_hub(torchao_save_directory, token = token)
+ else:
+ quantized_model.save_pretrained(
+ torchao_save_directory, safe_serialization = safe_serialization
+ )
+ tokenizer.save_pretrained(torchao_save_directory)
+
+ # Clean up the intermediate unquantized model
+ if os.path.exists(save_directory):
+ try:
+ shutil.rmtree(save_directory)
+        except OSError:
+            # Best-effort cleanup; ignore failures removing the intermediate checkpoint
+            pass
+
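The VLM check in the function above relies only on architecture-name suffixes plus the presence of a `vision_config` on the model config. That heuristic can be exercised in isolation; `looks_like_vlm` is a hypothetical helper name for this sketch:

```python
def looks_like_vlm(architectures, has_vision_config = False):
    # Mirrors the heuristic above: suffix match on architecture names,
    # or an explicit vision_config attribute on the model config.
    is_vlm = any(
        name.endswith(("ForConditionalGeneration", "ForVisionText2Text"))
        for name in architectures
    )
    return is_vlm or has_vision_config

assert looks_like_vlm(["LlavaForConditionalGeneration"]) is True
assert looks_like_vlm(["LlamaForCausalLM"]) is False
assert looks_like_vlm(["LlamaForCausalLM"], has_vision_config = True) is True
```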
+
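The `HAS_TORCH_DTYPE` branch above exists because transformers renamed the `torch_dtype` argument of `from_pretrained` to `dtype` in newer releases; the function builds whichever kwarg the installed version understands. A sketch of that selection, with `dtype_kwargs` as a hypothetical helper:

```python
def dtype_kwargs(has_torch_dtype, dtype):
    # Older transformers versions accept `torch_dtype`; newer ones use `dtype`.
    # The HAS_TORCH_DTYPE flag (assumed to be probed at import time) picks one.
    return {"torch_dtype": dtype} if has_torch_dtype else {"dtype": dtype}

assert dtype_kwargs(True, "bfloat16") == {"torch_dtype": "bfloat16"}
assert dtype_kwargs(False, "bfloat16") == {"dtype": "bfloat16"}
```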
+def unsloth_save_pretrained_torchao(
+ self,
+ save_directory: Union[str, os.PathLike],
+ tokenizer = None,
+ torchao_config = None,
+ push_to_hub: bool = False,
+ token: Optional[Union[str, bool]] = None,
+):
+ """Saves a torchao quantized model checkpoint.
+
+ This function handles two mutually exclusive workflows:
+
+ 1. **QAT (Quantization-Aware Training)**: If the model was trained with `qat_scheme`
+ parameter, do NOT pass `torchao_config`. The function will convert the QAT
+ fake-quantized weights to real quantized weights and save directly.
+
+ 2. **PTQ (Post-Training Quantization)**: If you want to apply quantization to a
+ regular model, pass a `torchao_config`. The model must NOT have been trained
+ with `qat_scheme`.
+
+ Args:
+ `save_directory`: local folder path or huggingface hub ID when `push_to_hub` is True
+ `tokenizer`: the tokenizer to save alongside the model
+ `torchao_config` (TorchAOBaseConfig): configuration for torchao quantization.
+ Required for PTQ, must be None for QAT models.
+ Options: https://docs.pytorch.org/ao/main/api_ref_quantization.html#inference-apis-for-quantize
+ `push_to_hub` (bool): whether to push to huggingface hub or save locally
+ `token`: HuggingFace token for pushing to hub
+ """
+ if token is None and push_to_hub:
+ token = get_token()
+
+ has_qat_config = (
+ hasattr(self, "_torchao_config") and self._torchao_config is not None
+ )
+
+ if torchao_config is not None:
+ # PTQ path: user provided a config, model must NOT have QAT config unless PEFT
+ assert not has_qat_config, (
+ "Unsloth: You passed `torchao_config` but this model was trained with `qat_scheme`. "
+ "For QAT models, do not pass `torchao_config` - the quantization config is already "
+ "attached to the model from training."
+ )
+ _unsloth_save_torchao_with_given_config(
+ model = self,
+ save_directory = save_directory,
+ tokenizer = tokenizer,
+ torchao_config = torchao_config,
+ push_to_hub = push_to_hub,
+ token = token,
+ )
+ else:
+ # QAT path: no config provided, model must have QAT config
+ assert has_qat_config, (
+ "Unsloth: No `torchao_config` provided and model was not trained with `qat_scheme`. "
+ "Either train with `qat_scheme` parameter, or provide a `torchao_config` for "
+ "post-training quantization."
+ )
+ _unsloth_save_torchao_with_attached_config(
+ model = self,
+ save_directory = save_directory,
+ tokenizer = tokenizer,
+ push_to_hub = push_to_hub,
+ token = token,
+ )
+
+ for _ in range(3):
+ gc.collect()
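The QAT/PTQ dispatch above is mutually exclusive: passing `torchao_config` on a QAT-trained model, or omitting it on a non-QAT model, is an error. That decision table can be mirrored as a tiny pure-Python sketch (`choose_torchao_path` is a hypothetical name; the config argument is a stand-in value):

```python
def choose_torchao_path(has_qat_config, torchao_config):
    # Mirrors the dispatch above: exactly one of the two workflows may apply.
    if torchao_config is not None:
        if has_qat_config:
            raise ValueError("QAT model: do not pass torchao_config")
        return "ptq"
    if not has_qat_config:
        raise ValueError("no torchao_config and no qat_scheme training")
    return "qat"

assert choose_torchao_path(False, "int4_weight_only") == "ptq"
assert choose_torchao_path(True, None) == "qat"
```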
def not_implemented_save(*args, **kwargs):
- raise NotImplementedError("Unsloth: Sorry GGUF is currently not supported for vision models!")
-pass
+ raise NotImplementedError(
+ "Unsloth: Sorry GGUF is currently not supported for vision models!"
+ )
def patch_saving_functions(model, vision = False):
@@ -2189,7 +3088,6 @@ def patch_saving_functions(model, vision = False):
original_push_to_hub = model.original_push_to_hub
else:
original_push_to_hub = model.push_to_hub
- pass
signature = str(inspect.signature(original_push_to_hub)).replace("NoneType", "None")
signature = signature[1:]
@@ -2254,60 +3152,63 @@ def patch_saving_functions(model, vision = False):
original_model = model
while True:
-
- if original_model.push_to_hub.__name__ != "unsloth_push_to_hub":
+ # Check if push_to_hub exists before accessing its __name__
+ if (
+ hasattr(original_model, "push_to_hub")
+ and original_model.push_to_hub.__name__ != "unsloth_push_to_hub"
+ ):
original_model.original_push_to_hub = original_model.push_to_hub
- original_model.push_to_hub = types.MethodType(unsloth_push_to_hub, original_model)
+ original_model.push_to_hub = types.MethodType(
+ unsloth_push_to_hub, original_model
+ )
if hasattr(original_model, "add_model_tags"):
- original_model.add_model_tags(["unsloth",])
- pass
- pass
+ original_model.add_model_tags(
+ [
+ "unsloth",
+ ]
+ )
- if hasattr(original_model, "model"): original_model = original_model.model
- else: break
- pass
+ if hasattr(original_model, "model"):
+ original_model = original_model.model
+ else:
+ break
# Add saving methods to top level model
if not vision:
if hasattr(model, "config"):
# Counteract tokenizers
- model.push_to_hub_merged = types.MethodType(unsloth_push_to_hub_merged, model)
- model.save_pretrained_merged = types.MethodType(unsloth_save_pretrained_merged, model)
- model.push_to_hub_gguf = types.MethodType(unsloth_push_to_hub_gguf, model)
- model.save_pretrained_gguf = types.MethodType(unsloth_save_pretrained_gguf, model)
- model.push_to_hub_ggml = types.MethodType(unsloth_convert_lora_to_ggml_and_push_to_hub, model)
- model.save_pretrained_ggml = types.MethodType(unsloth_convert_lora_to_ggml_and_save_locally, model)
- pass
+ model.push_to_hub_merged = types.MethodType(
+ unsloth_generic_push_to_hub_merged, model
+ )
+ model.save_pretrained_merged = types.MethodType(
+ unsloth_generic_save_pretrained_merged, model
+ )
+ model.push_to_hub_gguf = types.MethodType(unsloth_push_to_hub_gguf, model)
+ model.save_pretrained_gguf = types.MethodType(
+ unsloth_save_pretrained_gguf, model
+ )
+ model.save_pretrained_torchao = types.MethodType(
+ unsloth_save_pretrained_torchao, model
+ )
+ model.push_to_hub_ggml = types.MethodType(
+ unsloth_convert_lora_to_ggml_and_push_to_hub, model
+ )
+ model.save_pretrained_ggml = types.MethodType(
+ unsloth_convert_lora_to_ggml_and_save_locally, model
+ )
else:
# Vision only 1 option
- model.push_to_hub_merged = types.MethodType(unsloth_generic_push_to_hub_merged, model)
- model.save_pretrained_merged = types.MethodType(unsloth_generic_save_pretrained_merged, model)
- model.push_to_hub_gguf = types.MethodType(not_implemented_save, model)
- model.save_pretrained_gguf = types.MethodType(not_implemented_save, model)
- pass
+ model.push_to_hub_merged = types.MethodType(
+ unsloth_generic_push_to_hub_merged, model
+ )
+ model.save_pretrained_merged = types.MethodType(
+ unsloth_generic_save_pretrained_merged, model
+ )
+ model.push_to_hub_gguf = types.MethodType(unsloth_push_to_hub_gguf, model)
+ model.save_pretrained_gguf = types.MethodType(
+ unsloth_save_pretrained_gguf, model
+ )
+ model.save_pretrained_torchao = types.MethodType(
+ unsloth_save_pretrained_torchao, model
+ )
return model
-pass
-
-def export_model_to_local(model, tokenizer, save_directory, drive_directory):
- """
- Export a fine-tuned model from Colab to your local machine.
-
- Args:
- model: The fine-tuned model to be exported.
- tokenizer: The tokenizer associated with the model.
- save_directory: The directory where the model will be saved in Colab.
- drive_directory: The directory in Google Drive where the model will be saved.
- """
- # Save the model in Colab
- model.save_pretrained(save_directory)
- tokenizer.save_pretrained(save_directory)
-
- # Mount Google Drive
- from google.colab import drive
- drive.mount('/content/drive')
-
- # Copy the model files to Google Drive
- import shutil
- shutil.copytree(save_directory, drive_directory)
-
- print(f"Model saved to {drive_directory} in Google Drive. You can now download it to your local machine.")
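Throughout `patch_saving_functions`, each saving helper is attached to the model instance with `types.MethodType`, which binds a plain function so that the instance is passed as `self`. A self-contained sketch of that binding pattern (`DummyModel` and `unsloth_demo_save` are hypothetical names):

```python
import types

class DummyModel:
    pass

def unsloth_demo_save(self, path):
    # Bound like the real saving methods: `self` is the model instance.
    return f"saving {type(self).__name__} to {path}"

model = DummyModel()
# Same binding pattern as patch_saving_functions above:
model.save_pretrained_merged = types.MethodType(unsloth_demo_save, model)
assert model.save_pretrained_merged("out/") == "saving DummyModel to out/"
```

Binding on the instance (rather than the class) is what lets the patcher walk nested `model.model` wrappers and attach the methods at every level.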