Studio: forward standard OpenAI tools / tool_choice on /v1/responses (Codex compat) (#5122)

* Studio: forward standard OpenAI tools / tool_choice on /v1/responses

  Mirrors the /v1/chat/completions client-side tool pass-through from #5099 so clients (OpenAI Codex CLI, OpenAI Python SDK, ...) that target the Responses API receive structured function_call output items instead of plain text with tool-call tokens leaking into content.

  - ResponsesRequest: type tools/tool_choice properly, add parallel_tool_calls; accept function_call and function_call_output input items for multi-turn
  - Translate flat Responses tool / tool_choice shape to the nested Chat Completions shape before forwarding to llama-server
  - _normalise_responses_input: map function_call_output -> role="tool", function_call -> assistant tool_calls (preserving call_id)
  - Non-streaming: map returned tool_calls -> top-level function_call output items keyed by call_id
  - Streaming: emit response.output_item.added (function_call), response.function_call_arguments.delta/.done, and response.output_item.done per tool call while keeping the text message at output_index 0
  - Pytest coverage: tools/tool_choice translation, multi-turn input mapping, non-streaming tool_calls mapping, response round-trip

* Studio: merge system messages and close inner stream on /v1/responses

  Fixes two issues surfacing when OpenAI Codex CLI drives /v1/responses against a GGUF with a strict chat template (gpt-oss harmony, Qwen3, ...).

  1. "System message must be at the beginning" upstream errors

     Codex sends `instructions` AND a `role:"developer"` message in `input`, producing two separate system-role messages. Strict templates raise when a second system message exists or when one appears after a user turn. _normalise_responses_input now hoists all instructions / system / developer content into a single merged system message at the top of the Chat Completions message list.

  2. "async generator ignored GeneratorExit" / "Attempted to exit cancel scope in a different task"

     _responses_stream consumed the inner chat-completions body_iterator without an explicit aclose() in a finally block. On client disconnect (Codex frequently cancels mid-stream), Python 3.13 finalized the inner async generator on a different task, tripping anyio's cancel-scope check. Mirrored the same try/finally + aclose pattern used by the /v1/messages, /v1/chat/completions, and /v1/completions passthroughs.

  Tests: hoisting of instructions + developer, developer mid-conversation, multiple system messages in input, no-system passthrough.

* Studio: accept Codex multi-turn shapes and fix cross-task stream close on /v1/responses

  Two issues observed driving /v1/responses from OpenAI Codex CLI against a GGUF backend.

  1. 422 on every turn after the first

     Codex replays prior assistant turns with `content:[{"type":"output_text","text":...,"annotations":[],"logprobs":[]}]` and carries forward `reasoning` items (o-series / gpt-5) between turns. Our `ResponsesContentPart` union only accepted input_text / input_image, and `ResponsesInputItem` only message / function_call / function_call_output, so Pydantic failed the whole list and FastAPI returned `"Input should be a valid string"` against the `str` branch of the outer union.

     - Add `ResponsesOutputTextPart` for assistant-replay content.
     - Add `ResponsesUnknownContentPart` and `ResponsesUnknownInputItem` as permissive catch-alls (drop during normalisation).
     - Wire an explicit `Discriminator` so dispatch is deterministic and the fallthrough reaches the catch-all instead of misreporting via the outer `Union[str, list[...]]`.
     - `_normalise_responses_input` now accepts output_text parts, flattens single-part assistant text to a plain string (keeps legacy chat templates happy), and silently drops reasoning / unknown items.

  2. "async generator ignored GeneratorExit" / cross-task cancel scope

     `_responses_stream` awaited `openai_chat_completions` in the parent route-handler task, which opens the httpx client for the inner passthrough on *that* task. The outer `StreamingResponse` then iterates in a child task, so the asyncgen GC finalises the inner httpcore byte stream on the child task, tripping anyio's "Attempted to exit cancel scope in a different task". Move the `await` inside `event_generator` so the httpx lifecycle stays within the single streaming child task, and surface any HTTPException as a `response.failed` SSE frame.

  Tests: assistant output_text replay, reasoning-item tolerance, unknown content-part tolerance, end-to-end Codex-shape payload (developer + user + reasoning + function_call + function_call_output + assistant output_text + user), and single-part assistant flattening to plain string.

* Studio: call llama-server directly from streaming /v1/responses

  The previous fix (running the inner await inside event_generator) was not enough. Wrapping the existing `openai_chat_completions` pass-through still stacks two async generators: when the outer generator is closed, the innermost `HTTP11ConnectionByteStream.__aiter__` in httpcore doesn't receive GeneratorExit before Python's asyncgen GC finalises it in a sibling task, tripping "Attempted to exit cancel scope in a different task" and "async generator ignored GeneratorExit", the same Python 3.13 + httpcore 1.0.x interaction already seen in PRs #4956, #4981, #5099.

  Cure both pass-throughs had: a single same-task httpx lifecycle with explicit `aiter_lines().aclose()` BEFORE `resp.aclose()` / `client.aclose()` in the generator's finally block. Apply it at the Responses layer by dropping the wrapper entirely for GGUF: open httpx, consume `resp.aiter_lines()`, parse `chat.completion.chunk`, emit Responses SSE events, close everything in finally, all in the single StreamingResponse child task.
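
The close-in-finally ordering described above can be illustrated without httpx. This is a minimal sketch, not the Studio implementation: `fake_upstream_lines` is a hypothetical stand-in for `resp.aiter_lines()`, the "response" and "client" closes are simulated by recording their order, and the event payloads are simplified assumptions.

```python
import asyncio
import json


async def fake_upstream_lines(closed):
    """Hypothetical stand-in for resp.aiter_lines() on a llama-server stream."""
    try:
        for piece in ("Hel", "lo"):
            chunk = {
                "object": "chat.completion.chunk",
                "choices": [{"delta": {"content": piece}}],
            }
            yield "data: " + json.dumps(chunk)
        yield "data: [DONE]"
    finally:
        # Runs when the iterator is exhausted or explicitly aclose()d.
        closed.append("lines")


async def responses_events(closed):
    """Translate chat.completion.chunk lines into simplified Responses events."""
    lines = fake_upstream_lines(closed)
    try:
        async for line in lines:
            if not line.startswith("data: "):
                continue
            payload = line[len("data: "):]
            if payload == "[DONE]":
                break
            delta = json.loads(payload)["choices"][0]["delta"]
            if delta.get("content"):
                yield {"type": "response.output_text.delta", "delta": delta["content"]}
        yield {"type": "response.completed"}
    finally:
        # Close the line iterator first, then the (simulated) response and
        # client, all on the task that is driving this generator.
        await lines.aclose()
        closed.append("response")
        closed.append("client")


async def collect():
    closed = []
    events = [event async for event in responses_events(closed)]
    return events, closed


events, closed = asyncio.run(collect())
```

Because a single generator owns both the iteration and the cleanup, no second async generator is left for the garbage collector to finalise on another task.
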

  Non-GGUF streaming is rejected with a 400 (wrapping the transformers backend would re-introduce the double-layer pattern and isn't a Codex-compatible path today anyway). Also surfaces upstream httpx.RequestError / non-200 as a `response.failed` SSE frame rather than a dropped stream now that the request is dispatched after SSE headers have gone out.

* Studio: silence benign httpcore asyncgen GC warnings on Python 3.13

  The streaming pass-throughs (/v1/chat/completions, /v1/messages, /v1/responses, /v1/completions) all use the proven #4981 / #5099 pattern: single-task httpx lifecycle with explicit aiter_lines().aclose() ahead of resp.aclose() / client.aclose() in the generator's finally block. That handles our own iterators correctly.

  The residual noise ("async generator ignored GeneratorExit" / "Attempted to exit cancel scope in a different task") comes from an innermost HTTP11ConnectionByteStream.__aiter__ that httpcore creates internally inside its pool. We hold no reference to it, so we cannot aclose it ourselves. Python 3.13's asyncgen GC hook finalises it on the finaliser task, its aclose path enters an anyio CancelScope shield, and Python flags the cross-task exit. The response has already been delivered with a 200 by then, so it is purely log noise, not a functional failure. Same interaction seen in modelcontextprotocol/python-sdk #831, agno #3556, chainlit #2361, langchain-mcp-adapters #254.

  Install a targeted sys.unraisablehook that swallows this specific tuple (RuntimeError mentioning "cancel scope" or "GeneratorExit" plus an object repr referencing HTTP11ConnectionByteStream) and defers to the default hook for every other unraisable. Idempotent; guarded by a sentinel attribute so repeated imports don't stack filters.
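
A narrow filter of the kind described in the last commit could look roughly like the sketch below. It is illustrative only: the function name `install_unraisable_filter` and the sentinel attribute name are hypothetical, and the sketch chains to whichever hook was installed before it (normally the default one) rather than reproducing the actual Studio code.

```python
import sys

# Hypothetical sentinel attribute name, used to detect a prior installation.
_SENTINEL = "_httpcore_asyncgen_filter_installed"


def install_unraisable_filter():
    """Install a narrow sys.unraisablehook filter; safe to call repeatedly."""
    current = sys.unraisablehook
    if getattr(current, _SENTINEL, False):
        return current  # already installed, do not stack another filter

    def hook(unraisable):
        exc = unraisable.exc_value
        message = str(exc) if exc is not None else ""
        is_known_noise = (
            isinstance(exc, RuntimeError)
            and ("cancel scope" in message or "GeneratorExit" in message)
            and "HTTP11ConnectionByteStream" in repr(unraisable.object)
        )
        if is_known_noise:
            return  # swallow only this specific combination
        current(unraisable)  # everything else goes to the previous hook

    setattr(hook, _SENTINEL, True)
    sys.unraisablehook = hook
    return hook
```

All three conditions must match before anything is suppressed, so an unrelated RuntimeError, or the same message raised from a different object, still reaches the previous hook.
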
Studio: Improve chat composition, fix scroll behaviour, and refine sidebar UX (#5089)
2026-04-21 13:37:39 +00:00 · 2026-04-21 13:17:20 +04:00 · 2026-04-21 02:20:45 +04:00 · 2026-04-20 23:18:18 +04:00 · 2026-04-20 23:14:49 +04:00 · 2026-04-20 22:28:02 +04:00
359 changed files with 67867 additions and 8384 deletions
--- a/.github/ISSUE_TEMPLATE/bug---issue.md
+++ b/.github/ISSUE_TEMPLATE/bug---issue.md
@@ -6,7 +6,7 @@ labels: bug
 assignees: ''

 ---
-
+Note: Please do not remove the questions. Answer beside them.
 1. Did you update? `pip install --upgrade unsloth unsloth_zoo`
 2. `Colab` or `Kaggle` or local / cloud
 3. Number GPUs used, use `nvidia-smi`
@@ -16,6 +16,7 @@ assignees: ''

 ```python
 Put Minimal code to reproduce error here ###Remove Hugging Face token###
+###Please make sure to check formatting properly, edit if needed.###
 ```

 🦥 You can also ask via our Reddit page: https://reddit.com/r/unsloth/
--- a/.github/dependabot.yml
+++ b/.github/dependabot.yml
@@ -0,0 +1,27 @@
+---
+version: 2
+updates:
+  - package-ecosystem: "github-actions"
+    directory: "/"
+    schedule:
+      interval: "weekly"
+    groups:
+      actions:
+        patterns: ["*"]
+
+  - package-ecosystem: "bun"
+    directory: "/studio/frontend"
+    schedule:
+      interval: "weekly"
+    groups:
+      bun-frontend:
+        patterns: ["*"]
+
+  - package-ecosystem: "npm"
+    directory: "/studio/backend/core/data_recipe/oxc-validator"
+    schedule:
+      interval: "weekly"
+    groups:
+      npm-oxc-validator:
+        patterns: ["*"]
+...
--- a/.pre-commit-config.yaml
+++ b/.pre-commit-config.yaml
@@ -1,6 +1,6 @@
 repos:
  - repo: https://github.com/astral-sh/ruff-pre-commit
-    rev: v0.15.6
+    rev: v0.15.10
    hooks:
      - id: ruff
        args:
--- a/README.md
+++ b/README.md
@@ -1,28 +1,43 @@
 <h1 align="center" style="margin:0;">
  <a href="https://unsloth.ai/docs"><picture>
-    <source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/unslothai/unsloth/main/images/STUDIO%20WHITE%20LOGO.png">
-    <source media="(prefers-color-scheme: light)" srcset="https://raw.githubusercontent.com/unslothai/unsloth/main/images/STUDIO%20BLACK%20LOGO.png">
-    <img alt="Unsloth logo" src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/STUDIO%20BLACK%20LOGO.png" height="60" style="max-width:100%;">
+    <source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20logo%20white%20text.png">
+    <source media="(prefers-color-scheme: light)" srcset="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20logo%20black%20text.png">
+    <img alt="Unsloth logo" src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20logo%20black%20text.png" height="80" style="max-width:100%;">
  </picture></a>
 </h1>
 <h3 align="center" style="margin: 0; margin-top: 0;">
-Run and train AI models with a unified local interface.
+Unsloth Studio lets you run and train models locally.
 </h3>

 <p align="center">
  <a href="#-features">Features</a> •
-  <a href="#-quickstart">Quickstart</a> •
+  <a href="#-install">Quickstart</a> •
  <a href="#-free-notebooks">Notebooks</a> •
-  <a href="https://unsloth.ai/docs">Documentation</a> •
-  <a href="https://discord.com/invite/unsloth">Discord</a>
+  <a href="https://unsloth.ai/docs">Documentation</a>
 </p>
- <a href="https://unsloth.ai/docs/new/studio">
-<img alt="unsloth studio ui homepage" src="https://raw.githubusercontent.com/unslothai/unsloth/main/studio/frontend/public/studio%20github%20landscape%20colab%20display.png" style="max-width: 100%; margin-bottom: 0;"></a>
+<br>
+<a href="https://unsloth.ai/docs/new/studio">
+<img alt="unsloth studio ui homepage" src="https://github.com/user-attachments/assets/53ae17a9-d975-44ef-9686-efb4ebd0454d" style="max-width: 100%; margin-bottom: 0;"></a>

-Unsloth Studio (Beta) lets you run and train text, [audio](https://unsloth.ai/docs/basics/text-to-speech-tts-fine-tuning), [embedding](https://unsloth.ai/docs/new/embedding-finetuning), [vision](https://unsloth.ai/docs/basics/vision-fine-tuning) models on Windows, Linux and macOS.
+## ⚡ Get started
+
+#### macOS, Linux, WSL:
+```bash
+curl -fsSL https://unsloth.ai/install.sh | sh
+```
+#### Windows:
+```powershell
+irm https://unsloth.ai/install.ps1 | iex
+```
+#### Community:
+
+- [Discord](https://discord.gg/unsloth)
+- [𝕏 (Twitter)](https://x.com/UnslothAI)
+- [Reddit](https://reddit.com/r/unsloth)

 ## ⭐ Features
-Unsloth provides several key features for both inference and training:
+Unsloth Studio (Beta) lets you run and train text, [audio](https://unsloth.ai/docs/basics/text-to-speech-tts-fine-tuning), [embedding](https://unsloth.ai/docs/new/embedding-finetuning), [vision](https://unsloth.ai/docs/basics/vision-fine-tuning) models on Windows, Linux and macOS.
+
 ### Inference
 * **Search + download + run models** including GGUF, LoRA adapters, safetensors
 * **Export models**: [Save or export](https://unsloth.ai/docs/new/studio/export) models to GGUF, 16-bit safetensors and other formats.
@@ -32,15 +47,15 @@ Unsloth provides several key features for both inference and training:
 * We work directly with teams behind [gpt-oss](https://docs.unsloth.ai/new/gpt-oss-how-to-run-and-fine-tune#unsloth-fixes-for-gpt-oss), [Qwen3](https://www.reddit.com/r/LocalLLaMA/comments/1kaodxu/qwen3_unsloth_dynamic_ggufs_128k_context_bug_fixes/), [Llama 4](https://github.com/ggml-org/llama.cpp/pull/12889), [Mistral](models/tutorials/devstral-how-to-run-and-fine-tune.md), [Gemma 1-3](https://news.ycombinator.com/item?id=39671146), and [Phi-4](https://unsloth.ai/blog/phi4), where we’ve fixed bugs that improve model accuracy.
 * Upload images, audio, PDFs, code, DOCX and more file types to chat with.
 ### Training
-* Train **500+ models** up to **2x faster** with up to **70% less VRAM**, with no accuracy loss.
+* Train and RL **500+ models** up to **2x faster** with up to **70% less VRAM**, with no accuracy loss.
 * Custom Triton and mathematical **kernels**. See some collabs we did with [PyTorch](https://unsloth.ai/docs/get-started/reinforcement-learning-rl-guide/fp8-reinforcement-learning) and [Hugging Face](https://unsloth.ai/docs/new/faster-moe).
 * **Data Recipes**: [Auto-create datasets](https://unsloth.ai/docs/new/studio/data-recipe) from **PDF, CSV, DOCX** etc. Edit data in a visual-node workflow.
-* Supports full fine-tuning, pretraining, 4-bit, 16-bit and, FP8 training.
+* **[Reinforcement Learning](https://unsloth.ai/docs/get-started/reinforcement-learning-rl-guide)** (RL): The most efficient [RL](https://unsloth.ai/docs/get-started/reinforcement-learning-rl-guide) library, using **80% less VRAM** for GRPO, [FP8](https://unsloth.ai/docs/get-started/reinforcement-learning-rl-guide/fp8-reinforcement-learning) etc.
+* Supports full fine-tuning, RL, pretraining, 4-bit, 16-bit and, FP8 training.
 * **Observability**: Monitor training live, track loss and GPU usage and customize graphs.
-* **Reinforcement Learning**: The most efficient [RL](https://unsloth.ai/docs/get-started/reinforcement-learning-rl-guide) library, using **80% less VRAM** for GRPO, [FP8](https://unsloth.ai/docs/get-started/reinforcement-learning-rl-guide/fp8-reinforcement-learning) etc.
 * [Multi-GPU](https://unsloth.ai/docs/basics/multi-gpu-training-with-unsloth) training is supported, with major improvements coming soon.

-## ⚡ Quickstart
+## 📥 Install
 Unsloth can be used in two ways: through **[Unsloth Studio](https://unsloth.ai/docs/new/studio/)**, the web UI, or through **Unsloth Core**, the code-based version. Each has different requirements.

 ### Unsloth Studio (web UI)
@@ -49,7 +64,7 @@ Unsloth Studio (Beta) works on **Windows, Linux, WSL** and **macOS**.
 * **CPU:** Supported for Chat and Data Recipes currently
 * **NVIDIA:** Training works on RTX 30/40/50, Blackwell, DGX Spark, Station and more
 * **macOS:** Currently supports chat and Data Recipes. **MLX training** is coming very soon
-* **AMD:** Chat works. Train with [Unsloth Core](#unsloth-core-code-based). Studio support is coming soon.
+* **AMD:** Chat + Data works. Train with [Unsloth Core](#unsloth-core-code-based). Studio support is out soon.
 * **Coming soon:** Training support for Apple MLX, AMD, and Intel.
 * **Multi-GPU:** Available now, with a major upgrade on the way

@@ -57,19 +72,20 @@ Unsloth Studio (Beta) works on **Windows, Linux, WSL** and **macOS**.
 ```bash
 curl -fsSL https://unsloth.ai/install.sh | sh
 ```
-If you don't have `curl`, use `wget`. Launch after setup via:
-```bash
-source unsloth_studio/bin/activate
-unsloth studio -H 0.0.0.0 -p 8888
-```
-
 #### Windows:
 ```powershell
 irm https://unsloth.ai/install.ps1 | iex
 ```
-Launch after setup via:
-```powershell
-& .\unsloth_studio\Scripts\unsloth.exe studio -H 0.0.0.0 -p 8888
+
+#### Launch
+```bash
+unsloth studio -H 0.0.0.0 -p 8888
+```
+
+#### Update
+To update, use the same install commands as above. Or run (does not work on Windows):
+```bash
+unsloth studio update
 ```

 #### Docker
@@ -82,64 +98,8 @@ docker run -d -e JUPYTER_PASSWORD="mypassword" \
  unsloth/unsloth
  ```

-#### macOS, Linux, WSL developer installs:
-```bash
-curl -LsSf https://astral.sh/uv/install.sh | sh
-uv venv unsloth_studio --python 3.13
-source unsloth_studio/bin/activate
-uv pip install unsloth --torch-backend=auto
-unsloth studio setup
-unsloth studio -H 0.0.0.0 -p 8888
-```
-
-#### Windows PowerShell developer installs:
-```powershell
-winget install -e --id Python.Python.3.13
-winget install --id=astral-sh.uv  -e
-uv venv unsloth_studio --python 3.13
-.\unsloth_studio\Scripts\activate
-uv pip install unsloth --torch-backend=auto
-unsloth studio setup
-unsloth studio -H 0.0.0.0 -p 8888
-```
-
-#### Nightly - MacOS, Linux, WSL:
-```bash
-curl -LsSf https://astral.sh/uv/install.sh | sh
-git clone --filter=blob:none https://github.com/unslothai/unsloth.git unsloth_studio
-cd unsloth_studio
-uv venv --python 3.13
-source .venv/bin/activate
-uv pip install -e . --torch-backend=auto
-unsloth studio setup
-unsloth studio -H 0.0.0.0 -p 8888
-```
-Then to launch every time:
-```bash
-cd unsloth_studio
-source .venv/bin/activate
-unsloth studio -H 0.0.0.0 -p 8888
-```
-
-#### Nightly - Windows:
-Run in Windows Powershell:
-```bash
-winget install -e --id Python.Python.3.13
-winget install --id=astral-sh.uv  -e
-git clone --filter=blob:none https://github.com/unslothai/unsloth.git unsloth_studio
-cd unsloth_studio
-uv venv --python 3.13
-.\.venv\Scripts\activate
-uv pip install -e . --torch-backend=auto
-unsloth studio setup
-unsloth studio -H 0.0.0.0 -p 8888
-```
-Then to launch every time:
-```bash
-cd unsloth_studio
-.\.venv\Scripts\activate
-unsloth studio -H 0.0.0.0 -p 8888
-```
+#### Developer, Nightly, Uninstall
+To see developer, nightly and uninstallation etc. instructions, see [advanced installation](#-advanced-installation).

 ### Unsloth Core (code-based)
 #### Linux, WSL:
@@ -164,17 +124,19 @@ You can use the same Docker image as Unsloth Studio.
 For RTX 50x, B200, 6000 GPUs: `uv pip install unsloth --torch-backend=auto`. Read our guides for: [Blackwell](https://unsloth.ai/docs/blog/fine-tuning-llms-with-blackwell-rtx-50-series-and-unsloth) and [DGX Spark](https://unsloth.ai/docs/blog/fine-tuning-llms-with-nvidia-dgx-spark-and-unsloth). <br>
 To install Unsloth on **AMD** and **Intel** GPUs, follow our [AMD Guide](https://unsloth.ai/docs/get-started/install/amd) and [Intel Guide](https://unsloth.ai/docs/get-started/install/intel).

-## ✨ Free Notebooks
+## 📒 Free Notebooks

-Train for free with our notebooks. Read our [guide](https://unsloth.ai/docs/get-started/fine-tuning-llms-guide). Add dataset, run, then deploy your trained model.
+Train for free with our notebooks. You can use our new [free Unsloth Studio notebook](https://colab.research.google.com/github/unslothai/unsloth/blob/main/studio/Unsloth_Studio_Colab.ipynb) to run and train models for free in a web UI.
+Read our [guide](https://unsloth.ai/docs/get-started/fine-tuning-llms-guide). Add dataset, run, then deploy your trained model.

 | Model | Free Notebooks | Performance | Memory use |
 |-----------|---------|--------|----------|
+| **Gemma 4 (E2B)**      | [▶️ Start for free](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Gemma4_(E2B)-Vision.ipynb)               | 1.5x faster | 50% less |
 | **Qwen3.5 (4B)**      | [▶️ Start for free](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen3_5_(4B)_Vision.ipynb)               | 1.5x faster | 60% less |
 | **gpt-oss (20B)**      | [▶️ Start for free](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/gpt-oss-(20B)-Fine-tuning.ipynb)               | 2x faster | 70% less |
+| **Qwen3.5 GSPO**      | [▶️ Start for free](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen3_5_(4B)_Vision_GRPO.ipynb)               | 2x faster | 70% less |
 | **gpt-oss (20B): GRPO**      | [▶️ Start for free](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/gpt-oss-(20B)-GRPO.ipynb)               | 2x faster | 80% less |
-| **Qwen3: Advanced GRPO**      | [▶️ Start for free](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen3_(4B)-GRPO.ipynb)               | 2x faster | 50% less |
-| **Gemma 3 (4B) Vision** | [▶️ Start for free](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Gemma3_(4B)-Vision.ipynb)               | 1.7x faster | 60% less |
+| **Qwen3: Advanced GRPO**      | [▶️ Start for free](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen3_(4B)-GRPO.ipynb)               | 2x faster | 70% less |
 | **embeddinggemma (300M)**    | [▶️ Start for free](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/EmbeddingGemma_(300M).ipynb)               | 2x faster | 20% less |
 | **Mistral Ministral 3 (3B)**      | [▶️ Start for free](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Ministral_3_VL_(3B)_Vision.ipynb)               | 1.5x faster | 60% less |
 | **Llama 3.1 (8B) Alpaca**      | [▶️ Start for free](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-Alpaca.ipynb)               | 2x faster | 70% less |
@@ -186,6 +148,8 @@ Train for free with our notebooks. Read our [guide](https://unsloth.ai/docs/get-
 - See detailed documentation for Unsloth [here](https://unsloth.ai/docs)

 ## 🦥 Unsloth News
+- **Qwen3.6**: Qwen3.6-35B-A3B can now be trained and run in Unsloth Studio. [Blog](https://unsloth.ai/docs/models/qwen3.6)
+- **Gemma 4**: Run and train Google’s new models directly in Unsloth. [Blog](https://unsloth.ai/docs/models/gemma-4)
 - **Introducing Unsloth Studio**: our new web UI for running and training LLMs. [Blog](https://unsloth.ai/docs/new/studio)
 - **Qwen3.5** - 0.8B, 2B, 4B, 9B, 27B, 35-A3B, 112B-A10B are now supported. [Guide + notebooks](https://unsloth.ai/docs/models/qwen3.5/fine-tune)
 - Train **MoE LLMs 12x faster** with 35% less VRAM - DeepSeek, GLM, Qwen and gpt-oss. [Blog](https://unsloth.ai/docs/new/faster-moe)
@@ -196,13 +160,83 @@ Train for free with our notebooks. Read our [guide](https://unsloth.ai/docs/get-
 - **FP8 & Vision RL**: You can now do FP8 & VLM GRPO on consumer GPUs. [FP8 Blog](https://unsloth.ai/docs/get-started/reinforcement-learning-rl-guide/fp8-reinforcement-learning) • [Vision RL](https://unsloth.ai/docs/get-started/reinforcement-learning-rl-guide/vision-reinforcement-learning-vlm-rl)
 - **gpt-oss** by OpenAI: Read our [RL blog](https://unsloth.ai/docs/models/gpt-oss-how-to-run-and-fine-tune/gpt-oss-reinforcement-learning), [Flex Attention](https://unsloth.ai/docs/models/gpt-oss-how-to-run-and-fine-tune/long-context-gpt-oss-training) blog and [Guide](https://unsloth.ai/docs/models/gpt-oss-how-to-run-and-fine-tune).

-## 🔗 Links and Resources
+## 📥 Advanced Installation
+The below advanced instructions are for Unsloth Studio. For Unsloth Core advanced installation, [view our docs](https://unsloth.ai/docs/get-started/install/pip-install#advanced-pip-installation).
+#### Developer installs: macOS, Linux, WSL:
+```bash
+git clone https://github.com/unslothai/unsloth
+cd unsloth
+./install.sh --local
+unsloth studio -H 0.0.0.0 -p 8888
+```
+Then to update :
+```bash
+unsloth studio update
+```
+
+#### Developer installs: Windows PowerShell:
+```powershell
+git clone https://github.com/unslothai/unsloth.git
+cd unsloth
+Set-ExecutionPolicy -Scope Process -ExecutionPolicy Bypass
+.\install.ps1 --local
+unsloth studio -H 0.0.0.0 -p 8888
+```
+Then to update :
+```bash
+unsloth studio update
+```
+
+#### Nightly: MacOS, Linux, WSL:
+```bash
+git clone https://github.com/unslothai/unsloth
+cd unsloth
+git checkout nightly
+./install.sh --local
+unsloth studio -H 0.0.0.0 -p 8888
+```
+Then to launch every time:
+```bash
+unsloth studio -H 0.0.0.0 -p 8888
+```
+
+#### Nightly: Windows:
+Run in Windows Powershell:
+```bash
+git clone https://github.com/unslothai/unsloth.git
+cd unsloth
+git checkout nightly
+Set-ExecutionPolicy -Scope Process -ExecutionPolicy Bypass
+.\install.ps1 --local
+unsloth studio -H 0.0.0.0 -p 8888
+```
+Then to launch every time:
+```bash
+unsloth studio -H 0.0.0.0 -p 8888
+```
+
+#### Uninstall
+You can uninstall Unsloth Studio by deleting its install folder usually located under `$HOME/.unsloth/studio` on Mac/Linux/WSL and `%USERPROFILE%\.unsloth\studio` on Windows. Using the `rm -rf` commands will **delete everything**, including your history, cache:
+
+*  **MacOS, WSL, Linux:** `rm -rf ~/.unsloth/studio`
+*  **Windows (PowerShell):** `Remove-Item -Recurse -Force "$HOME\.unsloth\studio"`
+
+For more info, [see our docs](https://unsloth.ai/docs/new/studio/install#uninstall).
+
+#### Deleting model files
+
+You can delete old model files either from the bin icon in model search or by removing the relevant cached model folder from the default Hugging Face cache directory. By default, HF uses:
+
+*  **MacOS, Linux, WSL:** `~/.cache/huggingface/hub/`
+*  **Windows:** `%USERPROFILE%\.cache\huggingface\hub\`
+
+## 💚 Community and Links
 | Type                                                                                                                                      | Links                                                                          |
 | ----------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------ |
+| <img width="16" src="https://cdn.prod.website-files.com/6257adef93867e50d84d30e2/66e3d80db9971f10a9757c99_Symbol.svg" />  **Discord**                       | [Join Discord server](https://discord.com/invite/unsloth)                          |
 | <img width="15" src="https://redditinc.com/hs-fs/hubfs/Reddit%20Inc/Brand/Reddit_Logo.png" />  **r/unsloth Reddit**                       | [Join Reddit community](https://reddit.com/r/unsloth)                          |
 | 📚 **Documentation & Wiki**                                                                                                               | [Read Our Docs](https://unsloth.ai/docs)                                       |
 | <img width="13" src="https://upload.wikimedia.org/wikipedia/commons/0/09/X_(formerly_Twitter)_logo_late_2025.svg" />  **Twitter (aka X)** | [Follow us on X](https://twitter.com/unslothai)                                |
-| 💾 **Installation**                                                                                                                       | [Pip & Docker Install](https://unsloth.ai/docs/get-started/install) |
 | 🔮 **Our Models**                                                                                                                         | [Unsloth Catalog](https://unsloth.ai/docs/get-started/unsloth-model-catalog)   |
 | ✍️ **Blog**                                                                                                                               | [Read our Blogs](https://unsloth.ai/blog)                                      |

--- a/build.sh
+++ b/build.sh
@@ -29,7 +29,22 @@ _restore_gitignores() {
 }
 trap _restore_gitignores EXIT

-npm install
+# Use bun for install if available (faster), fall back to npm.
+_install_ok=false
+if command -v bun &>/dev/null; then
+    if bun install; then
+        _install_ok=true
+    else
+        echo "⚠ bun install failed, falling back to npm"
+        rm -rf node_modules
+    fi
+fi
+if [ "$_install_ok" != "true" ]; then
+    if ! npm install; then
+        echo "❌ ERROR: package install failed" >&2
+        exit 1
+    fi
+fi
 npm run build       # outputs to studio/frontend/dist/

 _restore_gitignores
--- a/images/unsloth
+++ b/images/unsloth
--- a/images/unsloth
+++ b/images/unsloth
--- a/install.ps1
+++ b/install.ps1
--- a/install.sh
+++ b/install.sh
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -46,12 +46,9 @@ studio = [
    "*.ps1",
    "*.bat",
    "frontend/dist/**/*",
-    "frontend/public/**/*",
-    "frontend/src/**/*",
    "frontend/*.json",
    "frontend/*.ts",
    "frontend/*.js",
-    "frontend/*.lock",
    "frontend/*.html",
    "frontend/*.yaml",
    "frontend/.git*",
@@ -61,7 +58,8 @@ studio = [
 ]

 [tool.setuptools.packages.find]
-exclude = ["images*", "tests*", "kernels/moe*"]
+include = ["unsloth*", "unsloth_cli*", "studio", "studio.backend*"]
+exclude = ["images*", "tests*", "*.node_modules", "*.node_modules.*"]

 [project.optional-dependencies]
 triton = [
@@ -84,13 +82,13 @@ huggingfacenotorch = [
    "huggingface_hub>=0.34.0",
    "hf_transfer",
    "diffusers",
-    "transformers>=4.51.3,!=4.52.0,!=4.52.1,!=4.52.2,!=4.52.3,!=4.53.0,!=4.54.0,!=4.55.0,!=4.55.1,!=4.57.0,!=4.57.4,!=4.57.5,!=5.0.0,!=5.1.0,<=5.3.0",
+    "transformers>=4.51.3,!=4.52.0,!=4.52.1,!=4.52.2,!=4.52.3,!=4.53.0,!=4.54.0,!=4.55.0,!=4.55.1,!=4.57.0,!=4.57.4,!=4.57.5,!=5.0.0,!=5.1.0,<=5.5.0",
    "trl>=0.18.2,!=0.19.0,<=0.24.0",
    "sentence-transformers",
 ]
 huggingface = [
    "unsloth[huggingfacenotorch]",
-    "unsloth_zoo>=2026.3.4",
+    "unsloth_zoo>=2026.4.8",
    "torchvision",
    "unsloth[triton]",
 ]
@@ -580,10 +578,10 @@ colab-ampere-torch220 = [
    "flash-attn>=2.6.3 ; ('linux' in sys_platform)",
 ]
 colab-new = [
-    "unsloth_zoo>=2026.3.4",
+    "unsloth_zoo>=2026.4.8",
    "packaging",
    "tyro",
-    "transformers>=4.51.3,!=4.52.0,!=4.52.1,!=4.52.2,!=4.52.3,!=4.53.0,!=4.54.0,!=4.55.0,!=4.55.1,!=4.57.0,!=4.57.4,!=4.57.5,!=5.0.0,!=5.1.0,<=5.3.0",
+    "transformers>=4.51.3,!=4.52.0,!=4.52.1,!=4.52.2,!=4.52.3,!=4.53.0,!=4.54.0,!=4.55.0,!=4.55.1,!=4.57.0,!=4.57.4,!=4.57.5,!=5.0.0,!=5.1.0,<=5.5.0",
    "datasets>=3.4.1,!=4.0.*,!=4.1.0,<4.4.0",
    "sentencepiece>=0.2.0",
    "tqdm",
--- a/scripts/install_gemma4_mlx.sh
+++ b/scripts/install_gemma4_mlx.sh
@@ -0,0 +1,169 @@
+#!/bin/bash
+set -e
+
+# ============================================================
+# Gemma 4 MLX — One-command setup + inference
+#
+# Usage:
+#   bash install_gemma4_mlx.sh [--venv-dir DIR]
+#
+# This script:
+#   1. Creates a Python virtual environment
+#   2. Installs uv, mlx-vlm, transformers
+# ============================================================
+
+# ── Output style (inspired by unsloth/install.sh) ─────────────
+RULE=""
+_rule_i=0
+while [ "$_rule_i" -lt 52 ]; do
+    RULE="${RULE}─"
+    _rule_i=$((_rule_i + 1))
+done
+
+if [ -n "${NO_COLOR:-}" ]; then
+    C_TITLE= C_DIM= C_OK= C_WARN= C_ERR= C_RST=
+elif [ -t 1 ] || [ -n "${FORCE_COLOR:-}" ]; then
+    _ESC="$(printf '\033')"
+    C_TITLE="${_ESC}[38;5;117m"
+    C_DIM="${_ESC}[38;5;245m"
+    C_OK="${_ESC}[38;5;108m"
+    C_WARN="${_ESC}[38;5;136m"
+    C_ERR="${_ESC}[91m"
+    C_RST="${_ESC}[0m"
+else
+    C_TITLE= C_DIM= C_OK= C_WARN= C_ERR= C_RST=
+fi
+
+step()    { printf "  ${C_DIM}%-18.18s${C_RST}${3:-$C_OK}%s${C_RST}\n" "$1" "$2"; }
+substep() { printf "  ${C_DIM}%-18s${2:-$C_DIM}%s${C_RST}\n" "" "$1"; }
+fail()    { step "error" "$1" "$C_ERR"; exit 1; }
+
+# ── Parse flags ───────────────────────────────────────────────
+VENV_DIR=""
+_next_is_venv=false
+
+for arg in "$@"; do
+    if [ "$_next_is_venv" = true ]; then
+        VENV_DIR="$arg"
+        _next_is_venv=false
+        continue
+    fi
+    case "$arg" in
+        --venv-dir)  _next_is_venv=true ;;
+    esac
+done
+
+# Default venv location
+if [ -z "$VENV_DIR" ]; then
+    VENV_DIR="$HOME/.unsloth/unsloth_gemma4_mlx"
+fi
+
+# ── Banner ────────────────────────────────────────────────────
+echo ""
+printf "  ${C_TITLE}%s${C_RST}\n" "💎 Gemma 4 MLX Installer"
+printf "  ${C_DIM}%s${C_RST}\n" "$RULE"
+echo ""
+
+# ── Platform check ────────────────────────────────────────────
+if [ "$(uname)" != "Darwin" ]; then
+    fail "MLX requires macOS with Apple Silicon. Detected: $(uname)"
+fi
+
+_ARCH=$(uname -m)
+if [ "$_ARCH" != "arm64" ]; then
+    step "warning" "Apple Silicon recommended (detected: $_ARCH)" "$C_WARN"
+fi
+
+step "platform" "macOS ($_ARCH)"
+
+# ── Detect Python ─────────────────────────────────────────────
+PYTHON=""
+for _candidate in python3.12 python3.11 python3.13 python3; do
+    if command -v "$_candidate" >/dev/null 2>&1; then
+        PYTHON="$_candidate"
+        break
+    fi
+done
+
+if [ -z "$PYTHON" ]; then
+    fail "Python 3 not found. Install via: brew install python@3.12"
+fi
+
+_PY_VERSION=$("$PYTHON" -c "import sys; print(f'{sys.version_info.major}.{sys.version_info.minor}.{sys.version_info.micro}')")
+step "python" "$PYTHON ($_PY_VERSION)"
+
+# ── Create virtual environment ────────────────────────────────
+if [ -x "$VENV_DIR/bin/python" ]; then
+    step "venv" "using existing environment"
+    substep "$VENV_DIR"
+else
+    step "venv" "creating virtual environment"
+    substep "$VENV_DIR"
+    mkdir -p "$(dirname "$VENV_DIR")"
+    "$PYTHON" -m venv "$VENV_DIR"
+fi
+
+# ── Install uv ───────────────────────────────────────────────
+if ! command -v uv >/dev/null 2>&1; then
+    step "uv" "installing uv package manager..."
+    _uv_tmp=$(mktemp)
+    curl -LsSf "https://astral.sh/uv/install.sh" -o "$_uv_tmp"
+    sh "$_uv_tmp" </dev/null >/dev/null 2>&1
+    rm -f "$_uv_tmp"
+    if [ -f "$HOME/.local/bin/env" ]; then
+        . "$HOME/.local/bin/env"
+    fi
+    export PATH="$HOME/.local/bin:$PATH"
+    substep "done"
+else
+    step "uv" "found $(uv --version 2>/dev/null || echo 'uv')"
+fi
+
+_VENV_PY="$VENV_DIR/bin/python"
+
+# ── Install dependencies ──────────────────────────────────────
+step "install" "installing mlx-vlm..."
+uv pip install --python "$_VENV_PY" -q mlx-vlm
+substep "done"
+
+step "install" "installing transformers>=5.5.0..."
+if uv pip install --python "$_VENV_PY" -q "transformers>=5.5.0" 2>/dev/null; then
+    substep "installed from PyPI"
+else
+    substep "PyPI install failed (Python <3.10?), trying GitHub..."
+    if uv pip install --python "$_VENV_PY" -q "git+https://github.com/huggingface/transformers.git@v5.5-release" 2>/dev/null; then
+        substep "installed from huggingface/transformers v5.5-release"
+    else
+        step "warning" "could not install transformers>=5.5.0" "$C_WARN"
+        substep "tried: PyPI, huggingface/transformers v5.5-release"
+    fi
+fi
+
+# ── Verify installation ──────────────────────────────────────
+if "$_VENV_PY" -c "import mlx_vlm"; then
+    substep "mlx-vlm verified"
+else
+    fail "Installation verification failed."
+fi
+
+# ── Done ──────────────────────────────────────────────────────
+echo ""
+printf "  ${C_TITLE}%s${C_RST}\n" "Gemma 4 MLX installed!"
+printf "  ${C_DIM}%s${C_RST}\n" "$RULE"
+echo ""
+step "available models" "unsloth/gemma-4-E2B-it-UD-MLX-4bit"
+substep "unsloth/gemma-4-E4B-it-UD-MLX-4bit"
+substep "unsloth/gemma-4-26b-a4b-it-UD-MLX-4bit"
+substep "unsloth/gemma-4-31b-it-UD-MLX-4bit"
+echo ""
+step "venv activate" "source ${VENV_DIR}/bin/activate"
+echo ""
+step "text chat" "python -m mlx_vlm.chat --model unsloth/gemma-4-E2B-it-UD-MLX-4bit"
+echo ""
+step "vision chat" "python -m mlx_vlm.chat --model unsloth/gemma-4-31b-it-UD-MLX-4bit"
+substep "Use /image path/to/image.jpg to load an image"
+echo ""
+step "gradio UI" "python -m mlx_vlm.chat_ui --model unsloth/gemma-4-31b-it-UD-MLX-4bit"
+echo ""
+printf "  ${C_DIM}%s${C_RST}\n" "$RULE"
+echo ""
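Reviewer note: the interpreter detection in the script above tries a fixed list of names and keeps the first one found on `PATH`. The same idea can be sketched in Python. `pick_python` is a hypothetical helper, not part of the repository, and its `which` parameter exists only so the lookup can be stubbed in tests.

```python
import shutil
from typing import Callable, Optional, Sequence

# Same preference order as the shell loop: 3.12, then 3.11, then 3.13, then plain python3.
CANDIDATES = ("python3.12", "python3.11", "python3.13", "python3")


def pick_python(
    candidates: Sequence[str] = CANDIDATES,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return the first candidate name that resolves on PATH, or None."""
    for name in candidates:
        if which(name) is not None:
            return name
    return None
```

Because the list is ordered by preference rather than by version number, a machine with both `python3.13` and `python3.11` installed would end up on `python3.11`.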
--- a/scripts/install_qwen3_6_mlx.sh
+++ b/scripts/install_qwen3_6_mlx.sh
@@ -0,0 +1,191 @@
+#!/bin/bash
+set -e
+
+# ============================================================
+# Qwen3.6 MLX — One-command setup + inference
+#
+# Usage:
+#   bash install_qwen3_6_mlx.sh [--venv-dir DIR]
+#
+# This script:
+#   1. Creates a Python virtual environment
+#   2. Installs uv, mlx-vlm, transformers, torch, torchvision, and patches mlx-vlm
+# ============================================================
+
+# ── Output style (inspired by unsloth/install.sh) ─────────────
+RULE=""
+_rule_i=0
+while [ "$_rule_i" -lt 52 ]; do
+    RULE="${RULE}─"
+    _rule_i=$((_rule_i + 1))
+done
+
+if [ -n "${NO_COLOR:-}" ]; then
+    C_TITLE= C_DIM= C_OK= C_WARN= C_ERR= C_RST=
+elif [ -t 1 ] || [ -n "${FORCE_COLOR:-}" ]; then
+    _ESC="$(printf '\033')"
+    C_TITLE="${_ESC}[38;5;117m"
+    C_DIM="${_ESC}[38;5;245m"
+    C_OK="${_ESC}[38;5;108m"
+    C_WARN="${_ESC}[38;5;136m"
+    C_ERR="${_ESC}[91m"
+    C_RST="${_ESC}[0m"
+else
+    C_TITLE= C_DIM= C_OK= C_WARN= C_ERR= C_RST=
+fi
+
+step()    { printf "  ${C_DIM}%-18.18s${C_RST}${3:-$C_OK}%s${C_RST}\n" "$1" "$2"; }
+substep() { printf "  ${C_DIM}%-18s${2:-$C_DIM}%s${C_RST}\n" "" "$1"; }
+fail()    { step "error" "$1" "$C_ERR"; exit 1; }
+
+# ── Parse flags ───────────────────────────────────────────────
+VENV_DIR=""
+_next_is_venv=false
+
+for arg in "$@"; do
+    if [ "$_next_is_venv" = true ]; then
+        VENV_DIR="$arg"
+        _next_is_venv=false
+        continue
+    fi
+    case "$arg" in
+        --venv-dir)  _next_is_venv=true ;;
+    esac
+done
+
+# Default venv location
+if [ -z "$VENV_DIR" ]; then
+    VENV_DIR="$HOME/.unsloth/unsloth_qwen3_6_mlx"
+fi
+
+# ── Banner ────────────────────────────────────────────────────
+echo ""
+printf "  ${C_TITLE}%s${C_RST}\n" "Qwen3.6 MLX Installer"
+printf "  ${C_DIM}%s${C_RST}\n" "$RULE"
+echo ""
+
+# ── Platform check ────────────────────────────────────────────
+if [ "$(uname)" != "Darwin" ]; then
+    fail "MLX requires macOS with Apple Silicon. Detected: $(uname)"
+fi
+
+_ARCH=$(uname -m)
+if [ "$_ARCH" != "arm64" ]; then
+    step "warning" "Apple Silicon recommended (detected: $_ARCH)" "$C_WARN"
+fi
+
+step "platform" "macOS ($_ARCH)"
+
+# ── Detect Python ─────────────────────────────────────────────
+PYTHON=""
+for _candidate in python3.12 python3.11 python3.13 python3; do
+    if command -v "$_candidate" >/dev/null 2>&1; then
+        PYTHON="$_candidate"
+        break
+    fi
+done
+
+if [ -z "$PYTHON" ]; then
+    fail "Python 3 not found. Install via: brew install python@3.12"
+fi
+
+_PY_VERSION=$("$PYTHON" -c "import sys; print(f'{sys.version_info.major}.{sys.version_info.minor}.{sys.version_info.micro}')")
+step "python" "$PYTHON ($_PY_VERSION)"
+
+# ── Create virtual environment ────────────────────────────────
+if [ -x "$VENV_DIR/bin/python" ]; then
+    step "venv" "using existing environment"
+    substep "$VENV_DIR"
+else
+    step "venv" "creating virtual environment"
+    substep "$VENV_DIR"
+    mkdir -p "$(dirname "$VENV_DIR")"
+    "$PYTHON" -m venv "$VENV_DIR"
+fi
+
+# ── Install uv ───────────────────────────────────────────────
+if ! command -v uv >/dev/null 2>&1; then
+    step "uv" "installing uv package manager..."
+    _uv_tmp=$(mktemp)
+    curl -LsSf "https://astral.sh/uv/install.sh" -o "$_uv_tmp"
+    sh "$_uv_tmp" </dev/null
+    rm -f "$_uv_tmp"
+    if [ -f "$HOME/.local/bin/env" ]; then
+        . "$HOME/.local/bin/env"
+    fi
+    export PATH="$HOME/.local/bin:$PATH"
+    substep "done"
+else
+    step "uv" "found $(uv --version 2>/dev/null || echo 'uv')"
+fi
+
+_VENV_PY="$VENV_DIR/bin/python"
+
+# ── Install dependencies ──────────────────────────────────────
+step "install" "installing mlx-vlm..."
+uv pip install --python "$_VENV_PY" -q mlx-vlm
+substep "done"
+
+step "install" "installing transformers>=5.2.0..."
+if uv pip install --python "$_VENV_PY" -q "transformers>=5.2.0"; then
+    substep "installed from PyPI"
+else
+    substep "PyPI install failed, trying GitHub..."
+    if uv pip install --python "$_VENV_PY" -q "git+https://github.com/huggingface/transformers.git"; then
+        substep "installed from huggingface/transformers main"
+    else
+        fail "Could not install transformers>=5.2.0 (required for Qwen3.5/3.6 model support). Please check your Python version (>=3.10 required) and network connection, then try again."
+    fi
+fi
+
+step "install" "installing torch + torchvision (needed for Qwen3 VL processor)..."
+uv pip install --python "$_VENV_PY" -q torch torchvision
+substep "done"
+
+# ── Verify installation ──────────────────────────────────────
+if "$_VENV_PY" -c "import mlx_vlm; import torch; import torchvision; import transformers"; then
+    substep "mlx-vlm + torch + transformers verified"
+else
+    fail "Installation verification failed. Please ensure Python >=3.10 and try again."
+fi
+
+# ── Apply patches for multi-turn image chat ──────────────────
+_PATCH_BASE="https://raw.githubusercontent.com/unslothai/unsloth/refs/heads/fix/ui-fix/unsloth/models/patches/mlx_vlm_qwen3_5"
+_SITE_PKGS=$("$_VENV_PY" -c "import site; print(site.getsitepackages()[0])")
+
+step "patch" "fixing multi-turn image chat..."
+
+if curl -sSLf "${_PATCH_BASE}/qwen3_5.py" -o "${_SITE_PKGS}/mlx_vlm/models/qwen3_5/qwen3_5.py"; then
+    substep "patched qwen3_5.py (MRoPE position reset)"
+else
+    step "warning" "failed to download qwen3_5.py patch — multi-turn image chat may not work" "$C_WARN"
+fi
+
+if curl -sSLf "${_PATCH_BASE}/generate.py" -o "${_SITE_PKGS}/mlx_vlm/generate.py"; then
+    substep "patched generate.py (mask trim on cache reuse)"
+else
+    step "warning" "failed to download generate.py patch — multi-turn image chat may not work" "$C_WARN"
+fi
+
+# Clear pycache so patches take effect
+find "${_SITE_PKGS}/mlx_vlm" -name "__pycache__" -type d -exec rm -rf {} + 2>/dev/null || true
+substep "cleared bytecode cache"
+
+# ── Done ──────────────────────────────────────────────────────
+echo ""
+printf "  ${C_TITLE}%s${C_RST}\n" "Qwen3.6 MLX installed!"
+printf "  ${C_DIM}%s${C_RST}\n" "$RULE"
+echo ""
+step "available models" "unsloth/Qwen3.6-35B-A3B-UD-MLX-3bit"
+substep "unsloth/Qwen3.6-35B-A3B-UD-MLX-4bit"
+substep "unsloth/Qwen3.6-35B-A3B-MLX-8bit"
+echo ""
+step "venv activate" "source ${VENV_DIR}/bin/activate"
+echo ""
+step "vision chat" "python -m mlx_vlm.chat --model unsloth/Qwen3.6-35B-A3B-UD-MLX-4bit"
+substep "Use /image path/to/image.jpg to load an image"
+echo ""
+step "gradio UI" "python -m mlx_vlm.chat_ui --model unsloth/Qwen3.6-35B-A3B-UD-MLX-4bit"
+echo ""
+printf "  ${C_DIM}%s${C_RST}\n" "$RULE"
+echo ""
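Reviewer note: after overwriting files inside the installed `mlx_vlm` package, the script above deletes every `__pycache__` directory so the patched sources are recompiled on next import. A minimal Python sketch of that cleanup step follows. `clear_pycache` is a hypothetical helper, not something the script or mlx-vlm provides.

```python
import shutil
from pathlib import Path


def clear_pycache(package_dir) -> int:
    """Remove every __pycache__ directory under package_dir; return how many were removed."""
    removed = 0
    # Collect matches first so deleting directories does not disturb the traversal.
    for cache_dir in list(Path(package_dir).rglob("__pycache__")):
        if cache_dir.is_dir():
            shutil.rmtree(cache_dir, ignore_errors=True)
            removed += 1
    return removed
```

Source files are left untouched; only compiled bytecode caches are dropped.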
--- a/studio/Unsloth_Studio_Colab.ipynb
+++ b/studio/Unsloth_Studio_Colab.ipynb
@@ -1,157 +1,153 @@
 {
-  "cells": [
-    {
-      "cell_type": "markdown",
-      "metadata": {
-        "id": "view-in-github",
-        "colab_type": "text"
-      },
-      "source": [
-        "<a href=\"https://colab.research.google.com/github/unslothai/unsloth/blob/main/studio/Unsloth_Studio_Colab.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
-      ]
-    },
-    {
-      "cell_type": "markdown",
-      "id": "6b87de59",
-      "metadata": {
-        "id": "6b87de59"
-      },
-      "source": [
-        "To run this, press \"*Runtime*\" and press \"*Run all*\" on a **free** Tesla T4 Google Colab instance!\n",
-        "<div class=\"align-center\">\n",
-        "<a href=\"https://unsloth.ai/\"><img src=\"https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png\" width=\"115\"></a>\n",
-        "<a href=\"https://discord.gg/unsloth\"><img src=\"https://github.com/unslothai/unsloth/raw/main/images/Discord button.png\" width=\"145\"></a>\n",
-        "<a href=\"https://unsloth.ai/docs/\"><img src=\"https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true\" width=\"125\"></a> Join Discord if you need help + ⭐ <i>Star us on <a href=\"https://github.com/unslothai/unsloth\">Github</a> </i> ⭐\n",
-        "</div>\n",
-        "\n",
-        "To install Unsloth Studio on your local device, follow [our guide](https://unsloth.ai/docs/new/unsloth-studio/install). Unsloth Studio is licensed [AGPL-3.0](https://github.com/unslothai/unsloth/blob/main/studio/LICENSE.AGPL-3.0).\n",
-        "\n",
-        "### Unsloth Studio\n",
-        "\n",
-        "Train and run open models with [**Unsloth Studio**](https://unsloth.ai/docs/new/unsloth-studio/start). Currently, installation may take 30+ mins so use a newer GPU.\n",
-        "\n",
-        "\n",
-        "We are actively working on making Unsloth Studio install on Colab T4 GPUs faster.\n",
-        "\n",
-        "[Features](https://unsloth.ai/docs/new/unsloth-studio#features) • [Quickstart](https://unsloth.ai/docs/new/unsloth-studio/start) • [Data Recipes](https://unsloth.ai/docs/new/unsloth-studio/data-recipe) • [Studio Chat](https://unsloth.ai/docs/new/unsloth-studio/chat) • [Export](https://unsloth.ai/docs/new/unsloth-studio/export)"
-      ]
-    },
-    {
-      "cell_type": "markdown",
-      "id": "e4206349",
-      "metadata": {
-        "id": "e4206349"
-      },
-      "source": [
-        "<p align=\"left\"><img src=\"https://github.com/unslothai/unsloth/raw/main/studio/frontend/public/studio%20github%20landscape%20colab%20display.png\" width=\"600\"></p>"
-      ]
-    },
-    {
-      "cell_type": "markdown",
-      "id": "27da2957",
-      "metadata": {
-        "id": "27da2957"
-      },
-      "source": [
-        "### Setup: Clone repo and run setup"
-      ]
-    },
-    {
-      "cell_type": "code",
-      "execution_count": null,
-      "id": "27e68f91",
-      "metadata": {
-        "id": "27e68f91"
-      },
-      "outputs": [],
-      "source": [
-        "!git clone --depth 1 --branch main https://github.com/unslothai/unsloth.git\n",
-        "%cd /content/unsloth\n",
-        "!chmod +x studio/setup.sh && ./studio/setup.sh"
-      ]
-    },
-    {
-      "cell_type": "markdown",
-      "id": "3e1771a9",
-      "metadata": {
-        "id": "3e1771a9"
-      },
-      "source": [
-        "### Start Unsloth Studio"
-      ]
-    },
-    {
-      "cell_type": "code",
-      "execution_count": null,
-      "id": "277e431e",
-      "metadata": {
-        "id": "277e431e"
-      },
-      "outputs": [],
-      "source": [
-        "import sys, time\n",
-        "sys.path.insert(0, \"/content/unsloth/studio/backend\")\n",
-        "from colab import start\n",
-        "start()"
-      ]
-    },
-    {
-      "cell_type": "code",
-      "source": [
-        "from google.colab import output\n",
-        "output.serve_kernel_port_as_iframe(8888, height = 1200, width = \"100%\")\n",
-        "for _ in range(10000): time.sleep(300), print(\"=\", end = \"\")"
-      ],
-      "metadata": {
-        "id": "wb9UELh--XzX"
-      },
-      "id": "wb9UELh--XzX",
-      "execution_count": null,
-      "outputs": []
-    },
-    {
-      "cell_type": "markdown",
-      "id": "f2b0c6a1",
-      "metadata": {
-        "id": "f2b0c6a1"
-      },
-      "source": [
-        "And we're done! If you have any questions on Unsloth, we have a [Discord](https://discord.gg/unsloth) channel! If you find any bugs or want to keep updated with the latest LLM stuff, or need help, join projects etc, feel free to join our Discord!\n",
-        "\n",
-        "Some other resources:\n",
-        "1. Looking to use Unsloth locally? Read our [Installation Guide](https://unsloth.ai/docs/get-started/install) for details on installing Unsloth on Windows, Docker, AMD, Intel GPUs.\n",
-        "2. Learn how to do Reinforcement Learning with our [RL Guide and notebooks](https://unsloth.ai/docs/get-started/reinforcement-learning-rl-guide).\n",
-        "3. Read our guides and notebooks for [Text-to-speech (TTS)](https://unsloth.ai/docs/basics/text-to-speech-tts-fine-tuning) and [vision](https://unsloth.ai/docs/basics/vision-fine-tuning) model support.\n",
-        "4. Explore our [LLM Tutorials Directory](https://unsloth.ai/docs/models/tutorials-how-to-fine-tune-and-run-llms) to find dedicated guides for each model.\n",
-        "5. Need help with Inference? Read our [Inference & Deployment page](https://unsloth.ai/docs/basics/inference-and-deployment) for details on using vLLM, llama.cpp, Ollama etc.\n",
-        "\n",
-        "<div class=\"align-center\">\n",
-        "  <a href=\"https://unsloth.ai\"><img src=\"https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png\" width=\"115\"></a>\n",
-        "  <a href=\"https://discord.gg/unsloth\"><img src=\"https://github.com/unslothai/unsloth/raw/main/images/Discord.png\" width=\"145\"></a>\n",
-        "  <a href=\"https://unsloth.ai/docs/\"><img src=\"https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true\" width=\"125\"></a>\n",
-        "\n",
-        "  Join Discord if you need help + ⭐️ <i>Star us on <a href=\"https://github.com/unslothai/unsloth\">Github</a> </i> ⭐️\n",
-        "\n",
-        "  <b>This notebook is licensed <a href=\"https://github.com/unslothai/unsloth/blob/main/studio/LICENSE.AGPL-3.0\">AGPL-3.0</a></b>\n",
-        "</div>"
-      ]
-    }
-  ],
-  "metadata": {
-    "accelerator": "GPU",
-    "colab": {
-      "gpuType": "T4",
-      "provenance": [],
-      "include_colab_link": true
-    },
-    "kernelspec": {
-      "display_name": "Python 3",
-      "name": "python3"
-    },
-    "language_info": {
-      "name": "python"
-    }
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "view-in-github",
+    "colab_type": "text"
+   },
+   "source": [
+    "<a href=\"https://colab.research.google.com/github/unslothai/unsloth/blob/main/studio/Unsloth_Studio_Colab.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
+   ]
  },
-  "nbformat": 4,
-  "nbformat_minor": 5
-}
+  {
+   "cell_type": "markdown",
+   "id": "6b87de59",
+   "metadata": {
+    "id": "6b87de59"
+   },
+   "source": [
+    "To run this, press \"*Runtime*\" and press \"*Run all*\" on a **free** Tesla T4 Google Colab instance!\n",
+    "<div class=\"align-center\">\n",
+    "<a href=\"https://unsloth.ai/\"><img src=\"https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png\" width=\"115\"></a>\n",
+    "<a href=\"https://discord.gg/unsloth\"><img src=\"https://github.com/unslothai/unsloth/raw/main/images/Discord button.png\" width=\"145\"></a>\n",
+    "<a href=\"https://unsloth.ai/docs/\"><img src=\"https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true\" width=\"125\"></a> Join Discord if you need help + ⭐ <i>Star us on <a href=\"https://github.com/unslothai/unsloth\">Github</a> </i> ⭐\n",
+    "</div>\n",
+    "\n",
+    "To install Unsloth Studio on your local device, follow [our guide](https://unsloth.ai/docs/new/unsloth-studio/install). Unsloth Studio is licensed [AGPL-3.0](https://github.com/unslothai/unsloth/blob/main/studio/LICENSE.AGPL-3.0).\n",
+    "\n",
+    "### Unsloth Studio\n",
+    "\n",
+    "Train and run open models with [**Unsloth Studio**](https://unsloth.ai/docs/new/unsloth-studio/start). NEW! Installation should now only take 2 mins!\n",
+    "\n",
+    "\n",
+    "We are actively working on making Unsloth Studio install on Colab T4 GPUs faster.\n",
+    "\n",
+    "[Features](https://unsloth.ai/docs/new/unsloth-studio#features) • [Quickstart](https://unsloth.ai/docs/new/unsloth-studio/start) • [Data Recipes](https://unsloth.ai/docs/new/unsloth-studio/data-recipe) • [Studio Chat](https://unsloth.ai/docs/new/unsloth-studio/chat) • [Export](https://unsloth.ai/docs/new/unsloth-studio/export)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e4206349",
+   "metadata": {
+    "id": "e4206349"
+   },
+   "source": [
+    "<p align=\"left\"><img src=\"https://github.com/unslothai/unsloth/raw/main/studio/frontend/public/studio%20github%20landscape%20colab%20display.png\" width=\"600\"></p>"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "27da2957",
+   "metadata": {
+    "id": "27da2957"
+   },
+   "source": [
+    "### Setup: Clone repo and run setup"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "27e68f91",
+   "metadata": {
+    "id": "27e68f91"
+   },
+   "outputs": [],
+   "source": "!git clone --depth 1 --branch main https://github.com/unslothai/unsloth.git\n%cd /content/unsloth\n!chmod +x studio/setup.sh && ./studio/setup.sh"
+  },
+  {
+   "cell_type": "markdown",
+   "id": "3e1771a9",
+   "metadata": {
+    "id": "3e1771a9"
+   },
+   "source": [
+    "### Start Unsloth Studio"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "277e431e",
+   "metadata": {
+    "id": "277e431e"
+   },
+   "outputs": [],
+   "source": [
+    "import sys, time\n",
+    "sys.path.insert(0, \"/content/unsloth/studio/backend\")\n",
+    "from colab import start\n",
+    "start()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "source": [
+    "from google.colab import output\n",
+    "output.serve_kernel_port_as_iframe(8888, height = 1200, width = \"100%\")\n",
+    "for _ in range(10000): time.sleep(300), print(\"=\", end = \"\")"
+   ],
+   "metadata": {
+    "id": "wb9UELh--XzX"
+   },
+   "id": "wb9UELh--XzX",
+   "execution_count": null,
+   "outputs": []
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f2b0c6a1",
+   "metadata": {
+    "id": "f2b0c6a1"
+   },
+   "source": [
+    "And we're done! If you have any questions on Unsloth, we have a [Discord](https://discord.gg/unsloth) channel! If you find any bugs or want to keep updated with the latest LLM stuff, or need help, join projects etc, feel free to join our Discord!\n",
+    "\n",
+    "Some other resources:\n",
+    "1. Looking to use Unsloth locally? Read our [Installation Guide](https://unsloth.ai/docs/get-started/install) for details on installing Unsloth on Windows, Docker, AMD, Intel GPUs.\n",
+    "2. Learn how to do Reinforcement Learning with our [RL Guide and notebooks](https://unsloth.ai/docs/get-started/reinforcement-learning-rl-guide).\n",
+    "3. Read our guides and notebooks for [Text-to-speech (TTS)](https://unsloth.ai/docs/basics/text-to-speech-tts-fine-tuning) and [vision](https://unsloth.ai/docs/basics/vision-fine-tuning) model support.\n",
+    "4. Explore our [LLM Tutorials Directory](https://unsloth.ai/docs/models/tutorials-how-to-fine-tune-and-run-llms) to find dedicated guides for each model.\n",
+    "5. Need help with Inference? Read our [Inference & Deployment page](https://unsloth.ai/docs/basics/inference-and-deployment) for details on using vLLM, llama.cpp, Ollama etc.\n",
+    "\n",
+    "<div class=\"align-center\">\n",
+    "  <a href=\"https://unsloth.ai\"><img src=\"https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png\" width=\"115\"></a>\n",
+    "  <a href=\"https://discord.gg/unsloth\"><img src=\"https://github.com/unslothai/unsloth/raw/main/images/Discord.png\" width=\"145\"></a>\n",
+    "  <a href=\"https://unsloth.ai/docs/\"><img src=\"https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true\" width=\"125\"></a>\n",
+    "\n",
+    "  Join Discord if you need help + ⭐️ <i>Star us on <a href=\"https://github.com/unslothai/unsloth\">Github</a> </i> ⭐️\n",
+    "\n",
+    "  <b>This notebook is licensed <a href=\"https://github.com/unslothai/unsloth/blob/main/studio/LICENSE.AGPL-3.0\">AGPL-3.0</a></b>\n",
+    "</div>"
+   ]
+  }
+ ],
+ "metadata": {
+  "accelerator": "GPU",
+  "colab": {
+   "gpuType": "T4",
+   "provenance": [],
+   "include_colab_link": true
+  },
+  "kernelspec": {
+   "display_name": "Python 3",
+   "name": "python3"
+  },
+  "language_info": {
+   "name": "python"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
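Reviewer note: besides the re-indentation, this hunk rewrites the setup cell's `source` from a list of lines to a single string. The nbformat 4 schema accepts either form for a cell's `source`, so any tool that reads the cell text should handle both. A minimal sketch, with `cell_source_text` as a hypothetical helper:

```python
def cell_source_text(cell: dict) -> str:
    """Return a notebook cell's source as one string, whichever form it is stored in."""
    source = cell.get("source", "")
    if isinstance(source, list):
        # List form: each element already carries its own trailing newline.
        return "".join(source)
    return source
```

Both encodings of the setup cell yield the same three-line text, so the change should not affect what the cell runs.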
--- a/studio/backend/assets/configs/full_finetune.yaml
+++ b/studio/backend/assets/configs/full_finetune.yaml
@@ -10,13 +10,13 @@ training:
  load_in_4bit: false
  output_dir: outputs
  num_epochs: 1
-  learning_rate: 0.0002
+  learning_rate: 2e-5
  batch_size: 1
  gradient_accumulation_steps: 4
  warmup_steps: 5
  max_steps: 0
  save_steps: 0
-  weight_decay: 0.01
+  weight_decay: 0.001
  random_seed: 3407
  packing: false
  train_on_completions: false
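Reviewer note: the new `learning_rate: 2e-5` is written without a decimal point. YAML 1.1 loaders such as PyYAML resolve that spelling as a string rather than a float, so whatever consumes this config may need to coerce it. How Studio loads these files is not shown in this diff. The sketch below assumes a hypothetical `as_float` helper applied to the value after loading.

```python
def as_float(value, name: str = "value") -> float:
    """Coerce a config value that may arrive as a number or a numeric string."""
    if isinstance(value, bool):
        # bool is a subclass of int; reject it so `true` is never read as 1.0.
        raise TypeError(f"{name} must be numeric, got bool")
    if isinstance(value, (int, float)):
        return float(value)
    if isinstance(value, str):
        # float() accepts exponent notation such as "2e-5"; it raises ValueError otherwise.
        return float(value.strip())
    raise TypeError(f"{name} must be numeric, got {type(value).__name__}")
```

For example, `as_float("2e-5", "learning_rate")` and `as_float(0.001, "weight_decay")` both return plain floats.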
--- a/studio/backend/assets/configs/inference_defaults.json
+++ b/studio/backend/assets/configs/inference_defaults.json
@@ -1,6 +1,14 @@
 {
  "_comment": "Per-model-family inference parameter defaults. Sources: (1) Ollama params blobs, (2) Existing Unsloth Studio YAML configs. Patterns ordered longest-match-first.",
  "families": {
+    "qwen3.6": {
+      "temperature": 0.7,
+      "top_p": 0.8,
+      "top_k": 20,
+      "min_p": 0.0,
+      "repetition_penalty": 1.0,
+      "presence_penalty": 1.5
+    },
    "qwen3.5": {
      "temperature": 0.7,
      "top_p": 0.8,
@@ -93,6 +101,14 @@
      "min_p": 0.0,
      "repetition_penalty": 1.0
    },
+    "gemma-4": {
+      "temperature": 1.0,
+      "top_p": 0.95,
+      "top_k": 64,
+      "min_p": 0.0,
+      "repetition_penalty": 1.0,
+      "presence_penalty": 0.0
+    },
    "gemma-3n": {
      "temperature": 1.0,
      "top_p": 0.95,
@@ -361,12 +377,12 @@
    }
  },
  "patterns": [
-    "qwen3.5",
+    "qwen3.6", "qwen3.5",
    "qwen3-coder", "qwen3-next", "qwen3-vl", "qwen3",
    "qwen2.5-coder", "qwen2.5-vl", "qwen2.5-omni", "qwen2.5-math", "qwen2.5",
    "qwen2-vl", "qwen2",
    "qwq",
-    "gemma-3n", "gemma-3", "medgemma", "gemma-2",
+    "gemma-4", "gemma-3n", "gemma-3", "medgemma", "gemma-2",
    "llama-4", "llama-3.3", "llama-3.2", "llama-3.1", "llama-3",
    "phi-4", "phi-3",
    "mistral-nemo", "mistral-small", "mistral-large", "magistral", "ministral",
--- a/studio/backend/assets/configs/lora_text.yaml
+++ b/studio/backend/assets/configs/lora_text.yaml
@@ -16,7 +16,7 @@ training:
  warmup_steps: 5
  max_steps: 0
  save_steps: 0
-  weight_decay: 0.01
+  weight_decay: 0.001
  random_seed: 3407
  packing: false
  train_on_completions: false
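Reviewer note on the `inference_defaults.json` change earlier in this diff: its `_comment` says patterns are ordered longest-match-first, which is why `qwen3.6` is placed ahead of `qwen3` and `gemma-4` ahead of `gemma-3`. The lookup code itself is not part of this diff. The sketch below assumes a case-insensitive substring match that returns the first hit, and `match_family` is a hypothetical name.

```python
from typing import List, Optional

# Excerpt of the "patterns" list, in the same relative order as the config.
PATTERNS = ["qwen3.6", "qwen3.5", "qwen3-coder", "qwen3", "gemma-4", "gemma-3n", "gemma-3"]


def match_family(model_id: str, patterns: List[str] = PATTERNS) -> Optional[str]:
    """Return the first pattern contained in the lowercased model id, or None."""
    name = model_id.lower()
    for pattern in patterns:
        if pattern in name:
            return pattern
    return None
```

Under this assumption, listing `qwen3` before `qwen3.6` would make a Qwen3.6 model pick up the generic `qwen3` defaults, so the more specific pattern has to come first.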
--- a/studio/backend/assets/configs/model_defaults/default.yaml
+++ b/studio/backend/assets/configs/model_defaults/default.yaml
@@ -6,13 +6,13 @@ training:
  max_seq_length: 2048
  # num_epochs: 4
  num_epochs: 0
-  learning_rate: 5e-5
+  learning_rate: 2e-4
  batch_size: 2
  gradient_accumulation_steps: 4
  warmup_ratio: 0.1
  max_steps: 30
  save_steps: 30
-  weight_decay: 0.01
+  weight_decay: 0.001
  random_seed: 3407
  packing: false
  train_on_completions: true
--- a/studio/backend/assets/configs/model_defaults/embedding/unsloth_Qwen3-Embedding-0.6B.yaml
+++ b/studio/backend/assets/configs/model_defaults/embedding/unsloth_Qwen3-Embedding-0.6B.yaml
@@ -12,7 +12,7 @@ training:
  warmup_ratio: 0.03
  max_steps: 30
  save_steps: 30
-  weight_decay: 0.01
+  weight_decay: 0.001
  random_seed: 3407
  packing: false
  train_on_completions: false
--- a/studio/backend/assets/configs/model_defaults/embedding/unsloth_all-MiniLM-L6-v2.yaml
+++ b/studio/backend/assets/configs/model_defaults/embedding/unsloth_all-MiniLM-L6-v2.yaml
@@ -11,7 +11,7 @@ training:
  warmup_ratio: 0.03
  max_steps: 30
  save_steps: 30
-  weight_decay: 0.01
+  weight_decay: 0.001
  random_seed: 3407
  packing: false
  train_on_completions: false
--- a/studio/backend/assets/configs/model_defaults/embedding/unsloth_bge-m3.yaml
+++ b/studio/backend/assets/configs/model_defaults/embedding/unsloth_bge-m3.yaml
@@ -11,7 +11,7 @@ training:
  warmup_ratio: 0.03
  max_steps: 30
  save_steps: 30
-  weight_decay: 0.01
+  weight_decay: 0.001
  random_seed: 3407
  packing: false
  train_on_completions: false
--- a/studio/backend/assets/configs/model_defaults/embedding/unsloth_embeddinggemma-300m.yaml
+++ b/studio/backend/assets/configs/model_defaults/embedding/unsloth_embeddinggemma-300m.yaml
@@ -11,7 +11,7 @@ training:
  warmup_ratio: 0.03
  max_steps: 30
  save_steps: 30
-  weight_decay: 0.01
+  weight_decay: 0.001
  random_seed: 3407
  packing: false
  train_on_completions: false
--- a/studio/backend/assets/configs/model_defaults/embedding/unsloth_gte-modernbert-base.yaml
+++ b/studio/backend/assets/configs/model_defaults/embedding/unsloth_gte-modernbert-base.yaml
@@ -11,7 +11,7 @@ training:
  warmup_ratio: 0.03
  max_steps: 30
  save_steps: 30
-  weight_decay: 0.01
+  weight_decay: 0.001
  random_seed: 3407
  packing: false
  train_on_completions: false
--- a/studio/backend/assets/configs/model_defaults/falcon/tiiuae_Falcon-H1-0.5B-Instruct.yaml
+++ b/studio/backend/assets/configs/model_defaults/falcon/tiiuae_Falcon-H1-0.5B-Instruct.yaml
@@ -13,7 +13,7 @@ training:
  warmup_steps: 5
  max_steps: 30
  save_steps: 30
-  weight_decay: 0.01
+  weight_decay: 0.001
  random_seed: 3407
  packing: false
  train_on_completions: true
--- a/studio/backend/assets/configs/model_defaults/gemma/unsloth_gemma-2-2b.yaml
+++ b/studio/backend/assets/configs/model_defaults/gemma/unsloth_gemma-2-2b.yaml
@@ -13,7 +13,7 @@ training:
  warmup_steps: 5
  max_steps: 30
  save_steps: 30
-  weight_decay: 0.01
+  weight_decay: 0.001
  random_seed: 3407
  packing: false
  train_on_completions: true
--- a/studio/backend/assets/configs/model_defaults/gemma/unsloth_gemma-4-26B-A4B-it.yaml
+++ b/studio/backend/assets/configs/model_defaults/gemma/unsloth_gemma-4-26B-A4B-it.yaml
@@ -0,0 +1,47 @@
+# Model defaults for unsloth/gemma-4-26B-A4B-it
+# Also applies to: google/gemma-4-26B-A4B-it, unsloth/gemma-4-26B-A4B-it-GGUF
+
+training:
+  trust_remote_code: false
+  max_seq_length: 2048
+  num_epochs: 0
+  learning_rate: 2e-4
+  batch_size: 2
+  gradient_accumulation_steps: 4
+  warmup_steps: 5
+  max_steps: 30
+  save_steps: 30
+  weight_decay: 0.001
+  random_seed: 3407
+  packing: false
+  train_on_completions: true
+  gradient_checkpointing: "unsloth"
+  optim: "adamw_8bit"
+  lr_scheduler_type: "linear"
+
+lora:
+  lora_r: 8
+  lora_alpha: 8
+  lora_dropout: 0.0
+  target_modules:
+    - "all-linear"
+  use_rslora: false
+  use_loftq: false
+  finetune_vision_layers: true
+  finetune_language_layers: true
+  finetune_attention_modules: true
+  finetune_mlp_modules: true
+
+logging:
+  enable_wandb: false
+  wandb_project: "llm-finetuning"
+  enable_tensorboard: false
+  tensorboard_dir: "runs"
+  log_frequency: 10
+
+inference:
+  trust_remote_code: false
+  temperature: 1.0
+  top_p: 0.95
+  top_k: 64
+  min_p: 0.0
--- a/studio/backend/assets/configs/model_defaults/gemma/unsloth_gemma-4-26B-A4B.yaml
+++ b/studio/backend/assets/configs/model_defaults/gemma/unsloth_gemma-4-26B-A4B.yaml
@@ -0,0 +1,47 @@
+# Model defaults for unsloth/gemma-4-26B-A4B (base/pretrained)
+# Also applies to: google/gemma-4-26B-A4B
+
+training:
+  trust_remote_code: false
+  max_seq_length: 2048
+  num_epochs: 0
+  learning_rate: 2e-4
+  batch_size: 2
+  gradient_accumulation_steps: 4
+  warmup_steps: 5
+  max_steps: 30
+  save_steps: 30
+  weight_decay: 0.001
+  random_seed: 3407
+  packing: false
+  train_on_completions: true
+  gradient_checkpointing: "unsloth"
+  optim: "adamw_8bit"
+  lr_scheduler_type: "linear"
+
+lora:
+  lora_r: 8
+  lora_alpha: 8
+  lora_dropout: 0.0
+  target_modules:
+    - "all-linear"
+  use_rslora: false
+  use_loftq: false
+  finetune_vision_layers: true
+  finetune_language_layers: true
+  finetune_attention_modules: true
+  finetune_mlp_modules: true
+
+logging:
+  enable_wandb: false
+  wandb_project: "llm-finetuning"
+  enable_tensorboard: false
+  tensorboard_dir: "runs"
+  log_frequency: 10
+
+inference:
+  trust_remote_code: false
+  temperature: 1.0
+  top_p: 0.95
+  top_k: 64
+  min_p: 0.0
--- a/studio/backend/assets/configs/model_defaults/gemma/unsloth_gemma-4-31B-it.yaml
+++ b/studio/backend/assets/configs/model_defaults/gemma/unsloth_gemma-4-31B-it.yaml
@@ -0,0 +1,47 @@
+# Model defaults for unsloth/gemma-4-31B-it
+# Also applies to: google/gemma-4-31B-it, unsloth/gemma-4-31B-it-GGUF
+
+training:
+  trust_remote_code: false
+  max_seq_length: 2048
+  num_epochs: 0
+  learning_rate: 2e-4
+  batch_size: 2
+  gradient_accumulation_steps: 4
+  warmup_steps: 5
+  max_steps: 30
+  save_steps: 30
+  weight_decay: 0.001
+  random_seed: 3407
+  packing: false
+  train_on_completions: true
+  gradient_checkpointing: "unsloth"
+  optim: "adamw_8bit"
+  lr_scheduler_type: "linear"
+
+lora:
+  lora_r: 8
+  lora_alpha: 8
+  lora_dropout: 0.0
+  target_modules:
+    - "all-linear"
+  use_rslora: false
+  use_loftq: false
+  finetune_vision_layers: true
+  finetune_language_layers: true
+  finetune_attention_modules: true
+  finetune_mlp_modules: true
+
+logging:
+  enable_wandb: false
+  wandb_project: "llm-finetuning"
+  enable_tensorboard: false
+  tensorboard_dir: "runs"
+  log_frequency: 10
+
+inference:
+  trust_remote_code: false
+  temperature: 1.0
+  top_p: 0.95
+  top_k: 64
+  min_p: 0.0
--- a/studio/backend/assets/configs/model_defaults/gemma/unsloth_gemma-4-31B.yaml
+++ b/studio/backend/assets/configs/model_defaults/gemma/unsloth_gemma-4-31B.yaml
@@ -0,0 +1,47 @@
+# Model defaults for unsloth/gemma-4-31B (base/pretrained)
+# Also applies to: google/gemma-4-31B
+
+training:
+  trust_remote_code: false
+  max_seq_length: 2048
+  num_epochs: 0
+  learning_rate: 2e-4
+  batch_size: 2
+  gradient_accumulation_steps: 4
+  warmup_steps: 5
+  max_steps: 30
+  save_steps: 30
+  weight_decay: 0.001
+  random_seed: 3407
+  packing: false
+  train_on_completions: true
+  gradient_checkpointing: "unsloth"
+  optim: "adamw_8bit"
+  lr_scheduler_type: "linear"
+
+lora:
+  lora_r: 8
+  lora_alpha: 8
+  lora_dropout: 0.0
+  target_modules:
+    - "all-linear"
+  use_rslora: false
+  use_loftq: false
+  finetune_vision_layers: true
+  finetune_language_layers: true
+  finetune_attention_modules: true
+  finetune_mlp_modules: true
+
+logging:
+  enable_wandb: false
+  wandb_project: "llm-finetuning"
+  enable_tensorboard: false
+  tensorboard_dir: "runs"
+  log_frequency: 10
+
+inference:
+  trust_remote_code: false
+  temperature: 1.0
+  top_p: 0.95
+  top_k: 64
+  min_p: 0.0
--- a/studio/backend/assets/configs/model_defaults/gemma/unsloth_gemma-4-E2B-it.yaml
+++ b/studio/backend/assets/configs/model_defaults/gemma/unsloth_gemma-4-E2B-it.yaml
@@ -0,0 +1,47 @@
+# Model defaults for unsloth/gemma-4-E2B-it
+# Also applies to: google/gemma-4-E2B-it, unsloth/gemma-4-E2B-it-GGUF
+
+training:
+  trust_remote_code: false
+  max_seq_length: 2048
+  num_epochs: 0
+  learning_rate: 2e-4
+  batch_size: 2
+  gradient_accumulation_steps: 4
+  warmup_steps: 5
+  max_steps: 30
+  save_steps: 30
+  weight_decay: 0.001
+  random_seed: 3407
+  packing: false
+  train_on_completions: true
+  gradient_checkpointing: "unsloth"
+  optim: "adamw_8bit"
+  lr_scheduler_type: "linear"
+
+lora:
+  lora_r: 8
+  lora_alpha: 8
+  lora_dropout: 0.0
+  target_modules:
+    - "all-linear"
+  use_rslora: false
+  use_loftq: false
+  finetune_vision_layers: true
+  finetune_language_layers: true
+  finetune_attention_modules: true
+  finetune_mlp_modules: true
+
+logging:
+  enable_wandb: false
+  wandb_project: "llm-finetuning"
+  enable_tensorboard: false
+  tensorboard_dir: "runs"
+  log_frequency: 10
+
+inference:
+  trust_remote_code: false
+  temperature: 1.0
+  top_p: 0.95
+  top_k: 64
+  min_p: 0.0
--- a/studio/backend/assets/configs/model_defaults/gemma/unsloth_gemma-4-E2B.yaml
+++ b/studio/backend/assets/configs/model_defaults/gemma/unsloth_gemma-4-E2B.yaml
@@ -0,0 +1,47 @@
+# Model defaults for unsloth/gemma-4-E2B (base/pretrained)
+# Also applies to: google/gemma-4-E2B
+
+training:
+  trust_remote_code: false
+  max_seq_length: 2048
+  num_epochs: 0
+  learning_rate: 2e-4
+  batch_size: 2
+  gradient_accumulation_steps: 4
+  warmup_steps: 5
+  max_steps: 30
+  save_steps: 30
+  weight_decay: 0.001
+  random_seed: 3407
+  packing: false
+  train_on_completions: true
+  gradient_checkpointing: "unsloth"
+  optim: "adamw_8bit"
+  lr_scheduler_type: "linear"
+
+lora:
+  lora_r: 8
+  lora_alpha: 8
+  lora_dropout: 0.0
+  target_modules:
+    - "all-linear"
+  use_rslora: false
+  use_loftq: false
+  finetune_vision_layers: true
+  finetune_language_layers: true
+  finetune_attention_modules: true
+  finetune_mlp_modules: true
+
+logging:
+  enable_wandb: false
+  wandb_project: "llm-finetuning"
+  enable_tensorboard: false
+  tensorboard_dir: "runs"
+  log_frequency: 10
+
+inference:
+  trust_remote_code: false
+  temperature: 1.0
+  top_p: 0.95
+  top_k: 64
+  min_p: 0.0
--- a/studio/backend/assets/configs/model_defaults/gemma/unsloth_gemma-4-E4B-it.yaml
+++ b/studio/backend/assets/configs/model_defaults/gemma/unsloth_gemma-4-E4B-it.yaml
@@ -0,0 +1,47 @@
+# Model defaults for unsloth/gemma-4-E4B-it
+# Also applies to: google/gemma-4-E4B-it, unsloth/gemma-4-E4B-it-GGUF
+
+training:
+  trust_remote_code: false
+  max_seq_length: 2048
+  num_epochs: 0
+  learning_rate: 2e-4
+  batch_size: 2
+  gradient_accumulation_steps: 4
+  warmup_steps: 5
+  max_steps: 30
+  save_steps: 30
+  weight_decay: 0.001
+  random_seed: 3407
+  packing: false
+  train_on_completions: true
+  gradient_checkpointing: "unsloth"
+  optim: "adamw_8bit"
+  lr_scheduler_type: "linear"
+
+lora:
+  lora_r: 8
+  lora_alpha: 8
+  lora_dropout: 0.0
+  target_modules:
+    - "all-linear"
+  use_rslora: false
+  use_loftq: false
+  finetune_vision_layers: true
+  finetune_language_layers: true
+  finetune_attention_modules: true
+  finetune_mlp_modules: true
+
+logging:
+  enable_wandb: false
+  wandb_project: "llm-finetuning"
+  enable_tensorboard: false
+  tensorboard_dir: "runs"
+  log_frequency: 10
+
+inference:
+  trust_remote_code: false
+  temperature: 1.0
+  top_p: 0.95
+  top_k: 64
+  min_p: 0.0
--- a/studio/backend/assets/configs/model_defaults/gemma/unsloth_gemma-4-E4B.yaml
+++ b/studio/backend/assets/configs/model_defaults/gemma/unsloth_gemma-4-E4B.yaml
@@ -0,0 +1,47 @@
+# Model defaults for unsloth/gemma-4-E4B (base/pretrained)
+# Also applies to: google/gemma-4-E4B
+
+training:
+  trust_remote_code: false
+  max_seq_length: 2048
+  num_epochs: 0
+  learning_rate: 2e-4
+  batch_size: 2
+  gradient_accumulation_steps: 4
+  warmup_steps: 5
+  max_steps: 30
+  save_steps: 30
+  weight_decay: 0.001
+  random_seed: 3407
+  packing: false
+  train_on_completions: true
+  gradient_checkpointing: "unsloth"
+  optim: "adamw_8bit"
+  lr_scheduler_type: "linear"
+
+lora:
+  lora_r: 8
+  lora_alpha: 8
+  lora_dropout: 0.0
+  target_modules:
+    - "all-linear"
+  use_rslora: false
+  use_loftq: false
+  finetune_vision_layers: true
+  finetune_language_layers: true
+  finetune_attention_modules: true
+  finetune_mlp_modules: true
+
+logging:
+  enable_wandb: false
+  wandb_project: "llm-finetuning"
+  enable_tensorboard: false
+  tensorboard_dir: "runs"
+  log_frequency: 10
+
+inference:
+  trust_remote_code: false
+  temperature: 1.0
+  top_p: 0.95
+  top_k: 64
+  min_p: 0.0
--- a/studio/backend/assets/configs/model_defaults/llama/unsloth_Llama-3.2-1B-Instruct.yaml
+++ b/studio/backend/assets/configs/model_defaults/llama/unsloth_Llama-3.2-1B-Instruct.yaml
@@ -13,7 +13,7 @@ training:
  warmup_steps: 0
  max_steps: 30
  save_steps: 30
-  weight_decay: 0.01
+  weight_decay: 0.001
  random_seed: 3407
  packing: false
  train_on_completions: true
--- a/studio/backend/assets/configs/vision_lora.yaml
+++ b/studio/backend/assets/configs/vision_lora.yaml
@@ -16,7 +16,7 @@ training:
  warmup_steps: 5
  max_steps: 0
  save_steps: 0
-  weight_decay: 0.01
+  weight_decay: 0.001
  random_seed: 3407
  packing: false
  train_on_completions: false
--- a/studio/backend/auth/authentication.py
+++ b/studio/backend/auth/authentication.py
@@ -10,10 +10,12 @@ from fastapi.security import HTTPAuthorizationCredentials, HTTPBearer
 import jwt

 from .storage import (
+    API_KEY_PREFIX,
    get_jwt_secret,
    get_user_and_secret,
    load_jwt_secret,
    save_refresh_token,
+    validate_api_key,
    verify_refresh_token,
 )

@@ -137,6 +139,18 @@ async def _get_current_subject(
            ...
    """
    token = credentials.credentials
+
+    # --- API key path (sk-unsloth-...) ---
+    if token.startswith(API_KEY_PREFIX):
+        username = validate_api_key(token)
+        if username is None:
+            raise HTTPException(
+                status_code = status.HTTP_401_UNAUTHORIZED,
+                detail = "Invalid or expired API key",
+            )
+        return username
+
+    # --- JWT path ---
    subject = _decode_subject_without_verification(token)
    if subject is None:
        raise HTTPException(
--- a/studio/backend/auth/storage.py
+++ b/studio/backend/auth/storage.py
@@ -72,7 +72,22 @@ def clear_bootstrap_password() -> None:


 def _hash_token(token: str) -> str:
-    """SHA-256 hash helper used for refresh token storage."""
+    """SHA-256 hash helper used for refresh token storage.
+
+    Plain SHA-256 is intentional here: refresh tokens are high-entropy
+    random strings from ``secrets.token_urlsafe(48)`` (384 bits of
+    entropy), so a slow KDF (Argon2 / bcrypt / PBKDF2) provides zero
+    additional security — no attacker can brute-force 2^384 regardless
+    of hash speed — while adding tens of ms of CPU to every refresh.
+    See the OWASP Password Storage Cheat Sheet on fast-vs-slow hashing
+    of high-entropy inputs.
+
+    API keys use the separate ``_pbkdf2_api_key`` helper below, which
+    runs PBKDF2-HMAC-SHA256 with a persistent server-side salt — not
+    for cryptographic reasons (128-bit random tokens don't need slow
+    hashing), but because CodeQL's ``py/weak-sensitive-data-hashing``
+    query mislabels API keys as passwords and demands a KDF.
+    """
    return hashlib.sha256(token.encode("utf-8")).hexdigest()


@@ -103,6 +118,29 @@ def get_connection() -> sqlite3.Connection:
        );
        """
    )
+    conn.execute(
+        """
+        CREATE TABLE IF NOT EXISTS api_keys (
+            id         INTEGER PRIMARY KEY AUTOINCREMENT,
+            username   TEXT NOT NULL,
+            key_prefix TEXT NOT NULL,
+            key_hash   TEXT NOT NULL UNIQUE,
+            name       TEXT NOT NULL DEFAULT '',
+            created_at TEXT NOT NULL,
+            last_used_at TEXT,
+            expires_at TEXT,
+            is_active  INTEGER NOT NULL DEFAULT 1
+        );
+        """
+    )
+    conn.execute(
+        """
+        CREATE TABLE IF NOT EXISTS app_secrets (
+            key   TEXT PRIMARY KEY,
+            value TEXT NOT NULL
+        );
+        """
+    )
    columns = {row["name"] for row in conn.execute("PRAGMA table_info(auth_user)")}
    if "must_change_password" not in columns:
        conn.execute(
@@ -112,6 +150,89 @@ def get_connection() -> sqlite3.Connection:
    return conn


+# ── API-key PBKDF2 salt ────────────────────────────────────────────────
+#
+# Module-level cache for the persistent API-key PBKDF2 salt. Populated
+# lazily on first use via ``_get_or_create_api_key_pbkdf2_salt``. Not
+# protected by a lock because (a) the ``INSERT OR IGNORE`` provides
+# atomicity at the SQLite layer and (b) concurrent populations converge
+# on the same value, so the worst case is a harmless duplicate read on
+# startup.
+_api_key_pbkdf2_salt_cache: Optional[bytes] = None
+
+
+def _get_or_create_api_key_pbkdf2_salt() -> bytes:
+    """Return the persistent API-key PBKDF2 salt, generating it once if missing.
+
+    Stored as a hex-encoded 32-byte random value in the ``app_secrets``
+    table under key ``"api_key_pbkdf2_salt"``. Regenerated only if the row
+    is missing (i.e. fresh install, or operator manually deleted the row
+    and accepts invalidating existing API keys).
+    """
+    global _api_key_pbkdf2_salt_cache
+    if _api_key_pbkdf2_salt_cache is not None:
+        return _api_key_pbkdf2_salt_cache
+
+    conn = get_connection()
+    try:
+        cur = conn.execute(
+            "SELECT value FROM app_secrets WHERE key = ?",
+            ("api_key_pbkdf2_salt",),
+        )
+        row = cur.fetchone()
+        if row is None:
+            new_value = secrets.token_hex(32)  # 32 bytes -> 64 hex chars
+            conn.execute(
+                "INSERT OR IGNORE INTO app_secrets (key, value) VALUES (?, ?)",
+                ("api_key_pbkdf2_salt", new_value),
+            )
+            conn.commit()
+            cur = conn.execute(
+                "SELECT value FROM app_secrets WHERE key = ?",
+                ("api_key_pbkdf2_salt",),
+            )
+            row = cur.fetchone()
+        salt = bytes.fromhex(row["value"])
+    finally:
+        conn.close()
+
+    _api_key_pbkdf2_salt_cache = salt
+    return salt
+
+
+_API_KEY_PBKDF2_ITERATIONS = 100_000
+
+
+def _pbkdf2_api_key(raw_key: str) -> str:
+    """PBKDF2-HMAC-SHA256 an API key with a persistent server-side salt.
+
+    Used for API-key storage ONLY, not refresh tokens. Matches the
+    PBKDF2 algorithm + iteration count used by the password hasher in
+    ``auth/hashing.py`` so the codebase is consistent on which KDF it
+    uses for credential storage.
+
+    Notes on why a slow KDF here is *only* a CodeQL appeasement and
+    *not* a cryptographic requirement: API keys are cryptographically
+    random 128-bit tokens (via ``secrets.token_hex``), so brute force
+    against 2^128 is infeasible regardless of hash speed. CodeQL's
+    ``py/weak-sensitive-data-hashing`` query mislabels these tokens as
+    "password" sensitive data and then demands a KDF from its
+    allowlist (Argon2 / scrypt / bcrypt / PBKDF2). Per the query's
+    own recommendation page we use PBKDF2. The persistent salt is
+    still loaded from ``app_secrets`` so an attacker dumping the
+    ``api_keys`` table alone cannot derive hashes for candidate
+    tokens without also obtaining the salt row.
+    """
+    salt = _get_or_create_api_key_pbkdf2_salt()
+    dk = hashlib.pbkdf2_hmac(
+        "sha256",
+        raw_key.encode("utf-8"),
+        salt,
+        _API_KEY_PBKDF2_ITERATIONS,
+    )
+    return dk.hex()
+
+
 def is_initialized() -> bool:
    """Check if auth is ready for login (at least one user exists in DB)."""
    conn = get_connection()
@@ -357,3 +478,105 @@ def revoke_user_refresh_tokens(username: str) -> None:
        conn.commit()
    finally:
        conn.close()
+
+
+# ---------------------------------------------------------------------------
+# API key management
+# ---------------------------------------------------------------------------
+
+API_KEY_PREFIX = "sk-unsloth-"
+
+
+def create_api_key(
+    username: str,
+    name: str,
+    expires_at: Optional[str] = None,
+) -> Tuple[str, dict]:
+    """Create a new API key for *username*.
+
+    Returns ``(raw_key, row_dict)`` where *raw_key* is shown to the user
+    exactly once.  The database only stores the PBKDF2-HMAC-SHA256 hash.
+    """
+    raw_key = API_KEY_PREFIX + secrets.token_hex(16)
+    key_hash = _pbkdf2_api_key(raw_key)
+    key_prefix = raw_key[len(API_KEY_PREFIX) : len(API_KEY_PREFIX) + 8]
+    now = datetime.now(timezone.utc).isoformat()
+
+    conn = get_connection()
+    try:
+        conn.execute(
+            """
+            INSERT INTO api_keys (username, key_prefix, key_hash, name, created_at, expires_at)
+            VALUES (?, ?, ?, ?, ?, ?)
+            """,
+            (username, key_prefix, key_hash, name, now, expires_at),
+        )
+        conn.commit()
+        cur = conn.execute("SELECT * FROM api_keys WHERE key_hash = ?", (key_hash,))
+        row = cur.fetchone()
+        return raw_key, dict(row)
+    finally:
+        conn.close()
+
+
+def list_api_keys(username: str) -> list:
+    """Return all API keys for *username* (never exposes ``key_hash``)."""
+    conn = get_connection()
+    try:
+        cur = conn.execute(
+            """
+            SELECT id, username, key_prefix, name, created_at, last_used_at, expires_at, is_active
+            FROM api_keys
+            WHERE username = ?
+            ORDER BY created_at DESC
+            """,
+            (username,),
+        )
+        return [dict(row) for row in cur.fetchall()]
+    finally:
+        conn.close()
+
+
+def revoke_api_key(username: str, key_id: int) -> bool:
+    """Soft-delete an API key.  Returns True if a matching row was found."""
+    conn = get_connection()
+    try:
+        cursor = conn.execute(
+            "UPDATE api_keys SET is_active = 0 WHERE id = ? AND username = ?",
+            (key_id, username),
+        )
+        conn.commit()
+        return cursor.rowcount > 0
+    finally:
+        conn.close()
+
+
+def validate_api_key(raw_key: str) -> Optional[str]:
+    """Validate *raw_key* and return the owning username, or ``None``.
+
+    Also updates ``last_used_at`` on success.
+    """
+    key_hash = _pbkdf2_api_key(raw_key)
+    conn = get_connection()
+    try:
+        cur = conn.execute(
+            "SELECT id, username, is_active, expires_at FROM api_keys WHERE key_hash = ?",
+            (key_hash,),
+        )
+        row = cur.fetchone()
+        if row is None:
+            return None
+        if not row["is_active"]:
+            return None
+        if row["expires_at"] is not None:
+            expires = datetime.fromisoformat(row["expires_at"])
+            if datetime.now(timezone.utc) > expires:
+                return None
+        conn.execute(
+            "UPDATE api_keys SET last_used_at = ? WHERE id = ?",
+            (datetime.now(timezone.utc).isoformat(), row["id"]),
+        )
+        conn.commit()
+        return row["username"]
+    finally:
+        conn.close()
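The API-key flow added above (issue a prefixed random key, store only its salted PBKDF2 digest, look the key up again by digest) can be sketched in isolation as follows. This is a minimal illustration against an in-memory SQLite table, not the Studio implementation: `hash_key`, `make_db`, `issue`, and `validate` are hypothetical names, the table is reduced to four columns, and the treatment of naive `expires_at` timestamps as UTC is an assumption of this sketch that the diff itself does not make.

```python
import hashlib
import secrets
import sqlite3
from datetime import datetime, timedelta, timezone
from typing import Optional

API_KEY_PREFIX = "sk-unsloth-"
ITERATIONS = 100_000  # same value as _API_KEY_PBKDF2_ITERATIONS in the diff


def hash_key(raw_key: str, salt: bytes) -> str:
    """Deterministic PBKDF2-HMAC-SHA256 digest, so equal keys map to equal rows."""
    return hashlib.pbkdf2_hmac(
        "sha256", raw_key.encode("utf-8"), salt, ITERATIONS
    ).hex()


def make_db() -> sqlite3.Connection:
    """Hypothetical, reduced version of the api_keys table."""
    conn = sqlite3.connect(":memory:")
    conn.row_factory = sqlite3.Row
    conn.execute(
        "CREATE TABLE api_keys ("
        " key_hash TEXT NOT NULL UNIQUE,"
        " username TEXT NOT NULL,"
        " expires_at TEXT,"
        " is_active INTEGER NOT NULL DEFAULT 1)"
    )
    return conn


def issue(conn, salt: bytes, username: str, expires_at: Optional[str] = None) -> str:
    """Create a key; only the digest is stored, the raw key is returned once."""
    raw_key = API_KEY_PREFIX + secrets.token_hex(16)  # 128 random bits
    conn.execute(
        "INSERT INTO api_keys (key_hash, username, expires_at) VALUES (?, ?, ?)",
        (hash_key(raw_key, salt), username, expires_at),
    )
    return raw_key


def validate(conn, salt: bytes, raw_key: str) -> Optional[str]:
    """Return the owning username, or None if unknown, revoked, or expired."""
    row = conn.execute(
        "SELECT username, expires_at, is_active FROM api_keys WHERE key_hash = ?",
        (hash_key(raw_key, salt),),
    ).fetchone()
    if row is None or not row["is_active"]:
        return None
    if row["expires_at"] is not None:
        expires = datetime.fromisoformat(row["expires_at"])
        if expires.tzinfo is None:
            # Assumption of this sketch: naive timestamps are UTC. Comparing
            # an aware "now" with a naive datetime would raise TypeError.
            expires = expires.replace(tzinfo=timezone.utc)
        if datetime.now(timezone.utc) > expires:
            return None
    return row["username"]
```

Because the salt is shared by all keys, the digest is deterministic, which is what allows the `WHERE key_hash = ?` lookup; a per-key random salt would make that lookup impossible without an additional key identifier.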
--- a/studio/backend/colab.py
+++ b/studio/backend/colab.py
@@ -18,31 +18,6 @@ if _backend_dir not in sys.path:
 import _platform_compat  # noqa: F401


-def _bootstrap_studio_venv() -> None:
-    """Expose the Studio venv's site-packages to the current interpreter.
-
-    On Colab, notebook cells run outside the venv subshell. Instead of
-    installing the full stack into system Python, we prepend the venv's
-    site-packages so that packages like structlog, fastapi, etc. are
-    importable from notebook cells and take priority over system copies.
-    """
-    venv_lib = Path.home() / ".unsloth" / "studio" / ".venv" / "lib"
-    if not venv_lib.exists():
-        import warnings
-
-        warnings.warn(
-            f"Studio venv not found at {venv_lib.parent} -- run 'unsloth studio setup' first",
-            stacklevel = 2,
-        )
-        return
-    for sp in venv_lib.glob("python*/site-packages"):
-        sp_str = str(sp)
-        if sp_str not in sys.path:
-            sys.path.insert(0, sp_str)
-
-
-_bootstrap_studio_venv()
-
 from loggers import get_logger

 logger = get_logger(__name__)
@@ -91,7 +66,10 @@ def show_link(port: int = 8888):
            <svg xmlns="http://www.w3.org/2000/svg" width="18" height="18" viewBox="0 0 24 24" fill="white"><polygon points="5,3 19,12 5,21"/></svg>
            Open Unsloth Studio
        </a>
-        <p style="color: #333333; margin: 16px 0 0 0; font-size: 13px; font-family: monospace;">
+        <p style="color: #333333; margin: 12px 0 0 0; font-size: 14px; font-weight: bold;">
+            If the link doesn't work, you can scroll down to view the UI generated directly in Colab.
+        </p>
+        <p style="color: #333333; margin: 16px 0 0 0; font-size: 13px; font-family: monospace; font-weight: bold;">
            {short_url}
        </p>
    </div>
--- a/studio/backend/core/__init__.py
+++ b/studio/backend/core/__init__.py
@@ -31,6 +31,7 @@ __all__ = [
    # Config
    "ModelConfig",
    "is_vision_model",
+    "scan_trained_models",
    "scan_trained_loras",
    "load_model_defaults",
    "get_base_model_from_lora",
@@ -72,6 +73,7 @@ def __getattr__(name):
    if name in (
        "is_vision_model",
        "ModelConfig",
+        "scan_trained_models",
        "scan_trained_loras",
        "load_model_defaults",
        "get_base_model_from_lora",
@@ -79,14 +81,15 @@ def __getattr__(name):
        from utils.models import (
            is_vision_model,
            ModelConfig,
-            scan_trained_loras,
+            scan_trained_models,
            load_model_defaults,
            get_base_model_from_lora,
        )

        globals()["is_vision_model"] = is_vision_model
        globals()["ModelConfig"] = ModelConfig
-        globals()["scan_trained_loras"] = scan_trained_loras
+        globals()["scan_trained_models"] = scan_trained_models
+        globals()["scan_trained_loras"] = scan_trained_models
        globals()["load_model_defaults"] = load_model_defaults
        globals()["get_base_model_from_lora"] = get_base_model_from_lora
        return globals()[name]
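The hunk above keeps the old `scan_trained_loras` name importable by binding it to the renamed `scan_trained_models` inside the module-level `__getattr__` hook (PEP 562). A minimal, self-contained sketch of that aliasing pattern follows; the module name `lazy_core` and the stand-in function are hypothetical, and the module is built with `types.ModuleType` only so the example runs without a package on disk.

```python
import types


def _scan_trained_models():
    """Stand-in for the real scanner; the returned names are made up."""
    return ["model-a", "model-b"]


# Hypothetical module object standing in for the package's __init__ module.
lazy = types.ModuleType("lazy_core")


def _module_getattr(name):
    # Called only when normal attribute lookup on the module fails (PEP 562).
    if name in ("scan_trained_models", "scan_trained_loras"):
        # Cache both names; the legacy name is an alias of the new function.
        lazy.__dict__["scan_trained_models"] = _scan_trained_models
        lazy.__dict__["scan_trained_loras"] = _scan_trained_models
        return lazy.__dict__[name]
    raise AttributeError(f"module {lazy.__name__!r} has no attribute {name!r}")


lazy.__getattr__ = _module_getattr
```

After the first access both names resolve from the module dictionary directly, so the hook is not consulted again for them.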
--- a/studio/backend/core/data_recipe/oxc-validator/package.json
+++ b/studio/backend/core/data_recipe/oxc-validator/package.json
@@ -4,7 +4,7 @@
  "version": "0.0.1",
  "type": "module",
  "dependencies": {
-    "oxc-parser": "^0.116.0",
+    "oxc-parser": "^0.123.0",
    "oxlint": "^1.51.0"
  }
 }
--- a/studio/backend/core/data_recipe/service.py
+++ b/studio/backend/core/data_recipe/service.py
@@ -167,12 +167,7 @@ def _validate_recipe_runtime_support(
    recipe: dict[str, Any],
    model_providers: list[Any],
 ) -> None:
-    if not _recipe_has_llm_columns(recipe):
-        raise ValueError(
-            "Recipe Studio currently requires at least one AI generation step."
-        )
-
-    if not model_providers:
+    if _recipe_has_llm_columns(recipe) and not model_providers:
        raise ValueError("Add a Provider connection block before running this recipe.")


@@ -266,6 +261,21 @@ def create_data_designer(
    model_providers = build_model_providers(recipe)
    _validate_recipe_runtime_support(recipe, model_providers)

+    # DataDesigner requires at least one model provider in its registry even
+    # when the pipeline contains no LLM columns.  Supply a lightweight stub
+    # so sampler/expression-only recipes can run without a real provider.
+    if not model_providers:
+        from data_designer.config.models import ModelProvider
+
+        model_providers = [
+            ModelProvider(
+                name = "_unused",
+                endpoint = "http://localhost",
+                provider_type = "openai",
+                api_key = None,
+            )
+        ]
+
    return DataDesigner(
        artifact_path = artifact_path,
        model_providers = model_providers,
--- a/studio/backend/core/export/export.py
+++ b/studio/backend/core/export/export.py
@@ -310,7 +310,7 @@ class ExportBackend:
        repo_id: Optional[str] = None,
        hf_token: Optional[str] = None,
        private: bool = False,
-    ) -> Tuple[bool, str]:
+    ) -> Tuple[bool, str, Optional[str]]:
        """
        Export merged model (for PEFT models).

@@ -323,14 +323,21 @@ class ExportBackend:
            private: Whether to make the repo private

        Returns:
-            Tuple of (success: bool, message: str)
+            Tuple of (success, message, output_path). output_path is the
+            resolved absolute on-disk directory of the saved model when
+            ``save_directory`` was set, else None.
        """
        if not self.current_model or not self.current_tokenizer:
-            return False, "No model loaded. Please select a checkpoint first."
+            return False, "No model loaded. Please select a checkpoint first.", None

        if not self.is_peft:
-            return False, "This is not a PEFT model. Use 'Export Base Model' instead."
+            return (
+                False,
+                "This is not a PEFT model. Use 'Export Base Model' instead.",
+                None,
+            )

+        output_path: Optional[str] = None
        try:
            # Determine save method
            if format_type == "4-bit (FP4)":
@@ -354,6 +361,7 @@ class ExportBackend:
                # Write export metadata so the Chat page can identify the base model
                self._write_export_metadata(save_directory)
                logger.info(f"Model saved successfully to {save_directory}")
+                output_path = str(Path(save_directory).resolve())

            # Push to hub if requested
            if push_to_hub:
@@ -361,6 +369,7 @@ class ExportBackend:
                    return (
                        False,
                        "Repository ID and Hugging Face token required for Hub upload",
+                        None,
                    )

                logger.info(f"Pushing merged model to Hub: {repo_id}")
@@ -378,14 +387,14 @@ class ExportBackend:
                )
                logger.info(f"Model pushed successfully to {repo_id}")

-            return True, "Model exported successfully"
+            return True, "Model exported successfully", output_path

        except Exception as e:
            logger.error(f"Error exporting merged model: {e}")
            import traceback

            logger.error(traceback.format_exc())
-            return False, f"Export failed: {str(e)}"
+            return False, f"Export failed: {str(e)}", None

    def export_base_model(
        self,
@@ -395,22 +404,26 @@ class ExportBackend:
        hf_token: Optional[str] = None,
        private: bool = False,
        base_model_id: Optional[str] = None,
-    ) -> Tuple[bool, str]:
+    ) -> Tuple[bool, str, Optional[str]]:
        """
        Export base model (for non-PEFT models).

        Returns:
-            Tuple of (success: bool, message: str)
+            Tuple of (success, message, output_path). output_path is the
+            resolved absolute on-disk directory of the saved model when
+            ``save_directory`` was set, else None.
        """
        if not self.current_model or not self.current_tokenizer:
-            return False, "No model loaded. Please select a checkpoint first."
+            return False, "No model loaded. Please select a checkpoint first.", None

        if self.is_peft:
            return (
                False,
                "This is a PEFT model. Use 'Merged Model' export type instead.",
+                None,
            )

+        output_path: Optional[str] = None
        try:
            # Save locally if requested
            if save_directory:
@@ -424,6 +437,7 @@ class ExportBackend:
                # Write export metadata so the Chat page can identify the base model
                self._write_export_metadata(save_directory)
                logger.info(f"Model saved successfully to {save_directory}")
+                output_path = str(Path(save_directory).resolve())

            # Push to hub if requested
            if push_to_hub:
@@ -431,6 +445,7 @@ class ExportBackend:
                    return (
                        False,
                        "Repository ID and Hugging Face token required for Hub upload",
+                        None,
                    )

                logger.info(f"Pushing base model to Hub: {repo_id}")
@@ -472,16 +487,16 @@ class ExportBackend:
                    )
                    logger.info(f"Model pushed successfully to {repo_id}")
                else:
-                    return False, "Local save directory required for Hub upload"
+                    return False, "Local save directory required for Hub upload", None

-            return True, "Model exported successfully"
+            return True, "Model exported successfully", output_path

        except Exception as e:
            logger.error(f"Error exporting base model: {e}")
            import traceback

            logger.error(traceback.format_exc())
-            return False, f"Export failed: {str(e)}"
+            return False, f"Export failed: {str(e)}", None

    def export_gguf(
        self,
@@ -490,7 +505,7 @@ class ExportBackend:
        push_to_hub: bool = False,
        repo_id: Optional[str] = None,
        hf_token: Optional[str] = None,
-    ) -> Tuple[bool, str]:
+    ) -> Tuple[bool, str, Optional[str]]:
        """
        Export model in GGUF format.

@@ -502,11 +517,14 @@ class ExportBackend:
            hf_token: Hugging Face token

        Returns:
-            Tuple of (success: bool, message: str)
+            Tuple of (success, message, output_path). output_path is the
+            resolved absolute on-disk directory containing the .gguf
+            files when ``save_directory`` was set, else None.
        """
        if not self.current_model or not self.current_tokenizer:
-            return False, "No model loaded. Please select a checkpoint first."
+            return False, "No model loaded. Please select a checkpoint first.", None

+        output_path: Optional[str] = None
        try:
            # Convert quantization method to lowercase for unsloth
            quant_method = quantization_method.lower()
@@ -601,6 +619,7 @@ class ExportBackend:
                    abs_save_dir,
                    "\n  ".join(os.path.basename(f) for f in final_ggufs) or "(none)",
                )
+                output_path = str(Path(abs_save_dir).resolve())

            # Push to hub if requested
            if push_to_hub:
@@ -608,6 +627,7 @@ class ExportBackend:
                    return (
                        False,
                        "Repository ID and Hugging Face token required for Hub upload",
+                        None,
                    )

                logger.info(f"Pushing GGUF model to Hub: {repo_id}")
@@ -620,14 +640,18 @@ class ExportBackend:
                )
                logger.info(f"GGUF model pushed successfully to {repo_id}")

-            return True, f"GGUF model exported successfully ({quantization_method})"
+            return (
+                True,
+                f"GGUF model exported successfully ({quantization_method})",
+                output_path,
+            )

        except Exception as e:
            logger.error(f"Error exporting GGUF model: {e}")
            import traceback

            logger.error(traceback.format_exc())
-            return False, f"GGUF export failed: {str(e)}"
+            return False, f"GGUF export failed: {str(e)}", None

    def export_lora_adapter(
        self,
@@ -636,19 +660,22 @@ class ExportBackend:
        repo_id: Optional[str] = None,
        hf_token: Optional[str] = None,
        private: bool = False,
-    ) -> Tuple[bool, str]:
+    ) -> Tuple[bool, str, Optional[str]]:
        """
        Export LoRA adapter only (not merged).

        Returns:
-            Tuple of (success: bool, message: str)
+            Tuple of (success, message, output_path). output_path is the
+            resolved absolute on-disk directory of the saved adapter
+            when ``save_directory`` was set, else None.
        """
        if not self.current_model or not self.current_tokenizer:
-            return False, "No model loaded. Please select a checkpoint first."
+            return False, "No model loaded. Please select a checkpoint first.", None

        if not self.is_peft:
-            return False, "This is not a PEFT model. No adapter to export."
+            return False, "This is not a PEFT model. No adapter to export.", None

+        output_path: Optional[str] = None
        try:
            # Save locally if requested
            if save_directory:
@@ -659,6 +686,7 @@ class ExportBackend:
                self.current_model.save_pretrained(save_directory)
                self.current_tokenizer.save_pretrained(save_directory)
                logger.info(f"Adapter saved successfully to {save_directory}")
+                output_path = str(Path(save_directory).resolve())

            # Push to hub if requested
            if push_to_hub:
@@ -666,6 +694,7 @@ class ExportBackend:
                    return (
                        False,
                        "Repository ID and Hugging Face token required for Hub upload",
+                        None,
                    )

                logger.info(f"Pushing LoRA adapter to Hub: {repo_id}")
@@ -676,14 +705,14 @@ class ExportBackend:
                )
                logger.info(f"Adapter pushed successfully to {repo_id}")

-            return True, "LoRA adapter exported successfully"
+            return True, "LoRA adapter exported successfully", output_path

        except Exception as e:
            logger.error(f"Error exporting LoRA adapter: {e}")
            import traceback

            logger.error(traceback.format_exc())
-            return False, f"Adapter export failed: {str(e)}"
+            return False, f"Adapter export failed: {str(e)}", None


 # Global export backend instance
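The export methods in the diff above now return a three-element tuple instead of two. A minimal sketch of that contract follows, assuming a hypothetical `fake_export` stand-in (not part of the PR) that mirrors only the `Path(save_directory).resolve()` normalisation and returns `None` for the path when no local save was requested:

```python
import os
import tempfile
from pathlib import Path
from typing import Optional, Tuple

# (success, message, output_path): the shape the diff introduces.
ExportResult = Tuple[bool, str, Optional[str]]


def fake_export(save_directory: str = "") -> ExportResult:
    """Hypothetical stand-in for an export method; it exports nothing."""
    output_path: Optional[str] = None
    if save_directory:
        Path(save_directory).mkdir(parents=True, exist_ok=True)
        # Same normalisation as the diff: resolved absolute path as a str.
        output_path = str(Path(save_directory).resolve())
    return True, "exported", output_path


with tempfile.TemporaryDirectory() as tmp:
    ok, message, where = fake_export(os.path.join(tmp, "adapter"))
    hub_only = fake_export()  # no save_directory, so no on-disk path
```

Callers that previously unpacked two values need to unpack three; the third is `None` on every failure path and on Hub-only exports.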
--- a/studio/backend/core/export/orchestrator.py
+++ b/studio/backend/core/export/orchestrator.py
@ -16,19 +16,25 @@ Pattern follows core/inference/orchestrator.py.

 import atexit
 import structlog
+from collections import deque
 from loggers import get_logger
 import multiprocessing as mp
 import queue
 import threading
 import time
 from pathlib import Path
-from typing import Any, List, Optional, Tuple
+from typing import Any, Deque, Dict, List, Optional, Tuple
 from utils.paths import outputs_root

 logger = get_logger(__name__)

 _CTX = mp.get_context("spawn")

+# Maximum number of captured log lines kept in memory per export
+# orchestrator. Acts as scrollback for the live export log panel in the
+# UI. 4000 lines is ~1 MB worst-case at 256 chars/line.
+_LOG_BUFFER_MAXLEN = 4000
+

 class ExportOrchestrator:
    """
@ -44,6 +50,9 @@ class ExportOrchestrator:
        self._proc: Optional[mp.Process] = None
        self._cmd_queue: Any = None
        self._resp_queue: Any = None
+        # Serializes export operations (load_checkpoint, export_*,
+        # cleanup) so concurrent HTTP requests can never interleave
+        # commands on the subprocess queue. Previously unused.
        self._lock = threading.Lock()

        # Local state mirrors (updated from subprocess responses)
@ -51,9 +60,103 @@ class ExportOrchestrator:
        self.is_vision: bool = False
        self.is_peft: bool = False

+        # ── Live log capture ─────────────────────────────────────
+        # Thread-safe ring buffer of log lines forwarded from the
+        # worker subprocess. Powers the GET /api/export/logs/stream
+        # SSE endpoint that the export dialog consumes.
+        self._log_buffer: Deque[Dict[str, Any]] = deque(maxlen = _LOG_BUFFER_MAXLEN)
+        self._log_lock = threading.Lock()
+        # Monotonically increasing sequence number. Never reset across
+        # operations, so SSE clients can use it as a stable cursor even
+        # if clear_logs() is called mid-session.
+        self._log_seq: int = 0
+        # Snapshot of _log_seq captured at the start of the current run
+        # (updated by clear_logs()). The SSE endpoint defaults its
+        # cursor to this value so a client that connects AFTER the
+        # worker has already emitted its first lines still sees the
+        # full run. Every line appended during the current run has seq
+        # strictly greater than _run_start_seq, and every line from
+        # prior runs has seq less than or equal to it.
+        self._run_start_seq: int = 0
+        # True while an export operation (load/export/cleanup) is
+        # running. The SSE endpoint ends the stream 1 second after
+        # this flips back to False to drain any trailing log lines.
+        self._export_active: bool = False
+
        atexit.register(self._cleanup)
        logger.info("ExportOrchestrator initialized (subprocess mode)")

+    # ------------------------------------------------------------------
+    # Live log capture helpers
+    # ------------------------------------------------------------------
+
+    def _append_log(self, entry: Dict[str, Any]) -> None:
+        """Append a log line from the worker subprocess to the buffer.
+
+        Entries look like {"type": "log", "stream": "stdout"|"stderr",
+        "line": "...", "ts": ...}. Each is stamped with a monotonic
+        seq number before it lands in the buffer so SSE clients can
+        cursor through new lines.
+        """
+        line = entry.get("line")
+        if not line:
+            return
+        with self._log_lock:
+            self._log_seq += 1
+            self._log_buffer.append(
+                {
+                    "seq": self._log_seq,
+                    "stream": entry.get("stream", "stdout"),
+                    "line": line,
+                    "ts": entry.get("ts", time.time()),
+                }
+            )
+
+    def clear_logs(self) -> None:
+        """Drop any buffered log lines from a previous operation.
+
+        Called at the start of each export op so the UI shows only the
+        output of the current run. The seq counter is NOT reset, so an
+        SSE client that captured the cursor before clear_logs() will
+        still see new lines (with strictly greater seq numbers).
+
+        Also snapshots the current seq into ``_run_start_seq`` so the
+        SSE endpoint can anchor its default cursor at the start of
+        this run. Anything appended after this call has seq strictly
+        greater than the snapshot and is reachable via
+        ``get_logs_since(get_run_start_seq())``.
+        """
+        with self._log_lock:
+            self._log_buffer.clear()
+            self._run_start_seq = self._log_seq
+
+    def get_logs_since(self, cursor: int) -> Tuple[List[Dict[str, Any]], int]:
+        """Return log entries with seq > cursor, plus the new cursor."""
+        with self._log_lock:
+            new_entries = [entry for entry in self._log_buffer if entry["seq"] > cursor]
+        if new_entries:
+            return new_entries, new_entries[-1]["seq"]
+        return [], cursor
+
+    def get_current_log_seq(self) -> int:
+        """Return the current seq counter without reading any entries."""
+        with self._log_lock:
+            return self._log_seq
+
+    def get_run_start_seq(self) -> int:
+        """Return the seq value captured at the start of the current run.
+
+        The SSE endpoint uses this as the default cursor so a client
+        that connects AFTER the worker has already started emitting
+        output still sees every line from the current run.
+        """
+        with self._log_lock:
+            return self._run_start_seq
+
+    def is_export_active(self) -> bool:
+        """True while an export / load / cleanup command is running."""
+        return self._export_active
+
    # ------------------------------------------------------------------
    # Subprocess lifecycle
    # ------------------------------------------------------------------
@ -179,8 +282,26 @@ class ExportOrchestrator:
                error_msg = resp.get("error", "Unknown error")
                raise RuntimeError(f"Subprocess error: {error_msg}")

+            if rtype == "log":
+                # Forwarded stdout/stderr line from the worker process.
+                self._append_log(resp)
+                continue
+
            if rtype == "status":
-                logger.info("Export subprocess status: %s", resp.get("message", ""))
+                message = resp.get("message", "")
+                logger.info("Export subprocess status: %s", message)
+                # Surface status messages in the live log panel too so
+                # users see high-level progress (e.g. "Importing
+                # Unsloth...", "Loading checkpoint: ...") alongside
+                # subprocess output.
+                if message:
+                    self._append_log(
+                        {
+                            "stream": "status",
+                            "line": message,
+                            "ts": resp.get("ts", time.time()),
+                        }
+                    )
                continue

            # Other response types during wait — skip
@ -217,6 +338,7 @@ class ExportOrchestrator:
        max_seq_length: int = 2048,
        load_in_4bit: bool = True,
        trust_remote_code: bool = False,
+        hf_token: Optional[str] = None,
    ) -> Tuple[bool, str]:
        """Load a checkpoint for export.

@ -227,39 +349,50 @@ class ExportOrchestrator:
            "max_seq_length": max_seq_length,
            "load_in_4bit": load_in_4bit,
            "trust_remote_code": trust_remote_code,
+            "hf_token": hf_token,
        }

-        # Always kill existing subprocess and spawn fresh.
-        if self._ensure_subprocess_alive():
-            self._shutdown_subprocess()
-        elif self._proc is not None:
-            self._shutdown_subprocess(timeout = 2)
+        with self._lock:
+            # Start a fresh log buffer for this operation so the UI
+            # sees only the current run's output.
+            self.clear_logs()
+            self._export_active = True
+            try:
+                # Always kill existing subprocess and spawn fresh.
+                if self._ensure_subprocess_alive():
+                    self._shutdown_subprocess()
+                elif self._proc is not None:
+                    self._shutdown_subprocess(timeout = 2)

-        logger.info("Spawning fresh export subprocess for '%s'", checkpoint_path)
-        self._spawn_subprocess(sub_config)
+                logger.info(
+                    "Spawning fresh export subprocess for '%s'", checkpoint_path
+                )
+                self._spawn_subprocess(sub_config)

-        try:
-            resp = self._wait_response("loaded", timeout = 300)
-        except RuntimeError as exc:
-            self._shutdown_subprocess(timeout = 5)
-            self.current_checkpoint = None
-            self.is_vision = False
-            self.is_peft = False
-            return False, str(exc)
+                try:
+                    resp = self._wait_response("loaded")
+                except RuntimeError as exc:
+                    self._shutdown_subprocess(timeout = 5)
+                    self.current_checkpoint = None
+                    self.is_vision = False
+                    self.is_peft = False
+                    return False, str(exc)

-        if resp.get("success"):
-            self.current_checkpoint = resp.get("checkpoint")
-            self.is_vision = resp.get("is_vision", False)
-            self.is_peft = resp.get("is_peft", False)
-            logger.info("Checkpoint '%s' loaded in subprocess", checkpoint_path)
-            return True, resp.get("message", "Loaded successfully")
-        else:
-            error = resp.get("message", "Failed to load checkpoint")
-            logger.error("Failed to load checkpoint: %s", error)
-            self.current_checkpoint = None
-            self.is_vision = False
-            self.is_peft = False
-            return False, error
+                if resp.get("success"):
+                    self.current_checkpoint = resp.get("checkpoint")
+                    self.is_vision = resp.get("is_vision", False)
+                    self.is_peft = resp.get("is_peft", False)
+                    logger.info("Checkpoint '%s' loaded in subprocess", checkpoint_path)
+                    return True, resp.get("message", "Loaded successfully")
+                else:
+                    error = resp.get("message", "Failed to load checkpoint")
+                    logger.error("Failed to load checkpoint: %s", error)
+                    self.current_checkpoint = None
+                    self.is_vision = False
+                    self.is_peft = False
+                    return False, error
+            finally:
+                self._export_active = False

    def export_merged_model(
        self,
@ -269,7 +402,7 @@ class ExportOrchestrator:
        repo_id: Optional[str] = None,
        hf_token: Optional[str] = None,
        private: bool = False,
-    ) -> Tuple[bool, str]:
+    ) -> Tuple[bool, str, Optional[str]]:
        """Export merged PEFT model."""
        return self._run_export(
            "merged",
@ -291,7 +424,7 @@ class ExportOrchestrator:
        hf_token: Optional[str] = None,
        private: bool = False,
        base_model_id: Optional[str] = None,
-    ) -> Tuple[bool, str]:
+    ) -> Tuple[bool, str, Optional[str]]:
        """Export base model (non-PEFT)."""
        return self._run_export(
            "base",
@ -312,7 +445,7 @@ class ExportOrchestrator:
        push_to_hub: bool = False,
        repo_id: Optional[str] = None,
        hf_token: Optional[str] = None,
-    ) -> Tuple[bool, str]:
+    ) -> Tuple[bool, str, Optional[str]]:
        """Export model in GGUF format."""
        return self._run_export(
            "gguf",
@ -332,7 +465,7 @@ class ExportOrchestrator:
        repo_id: Optional[str] = None,
        hf_token: Optional[str] = None,
        private: bool = False,
-    ) -> Tuple[bool, str]:
+    ) -> Tuple[bool, str, Optional[str]]:
        """Export LoRA adapter only."""
        return self._run_export(
            "lora",
@ -345,46 +478,74 @@ class ExportOrchestrator:
            },
        )

-    def _run_export(self, export_type: str, params: dict) -> Tuple[bool, str]:
-        """Send an export command to the subprocess and wait for result."""
-        if not self._ensure_subprocess_alive():
-            return False, "No export subprocess running. Load a checkpoint first."
+    def _run_export(
+        self, export_type: str, params: dict
+    ) -> Tuple[bool, str, Optional[str]]:
+        """Send an export command to the subprocess and wait for result.

-        cmd = {"type": "export", "export_type": export_type, **params}
+        Returns ``(success, message, output_path)``. ``output_path`` is the
+        resolved on-disk directory the worker actually wrote to (None when
+        the export only pushed to Hub or failed before any file was
+        written). Surfaced via the export route's ``details.output_path``
+        so the dialog's success screen can show the user where the model
+        landed.
+        """
+        with self._lock:
+            if not self._ensure_subprocess_alive():
+                return (
+                    False,
+                    "No export subprocess running. Load a checkpoint first.",
+                    None,
+                )

-        try:
-            self._send_cmd(cmd)
-            resp = self._wait_response(
-                f"export_{export_type}_done",
-                timeout = 3600,  # GGUF for 30B+ models can take 30+ min
-            )
-            return resp.get("success", False), resp.get("message", "")
-        except RuntimeError as exc:
-            return False, str(exc)
+            self.clear_logs()
+            self._export_active = True
+            try:
+                cmd = {"type": "export", "export_type": export_type, **params}
+                try:
+                    self._send_cmd(cmd)
+                    resp = self._wait_response(
+                        f"export_{export_type}_done",
+                        timeout = 3600,  # GGUF for 30B+ models can take 30+ min
+                    )
+                    return (
+                        resp.get("success", False),
+                        resp.get("message", ""),
+                        resp.get("output_path"),
+                    )
+                except RuntimeError as exc:
+                    return False, str(exc), None
+            finally:
+                self._export_active = False

    def cleanup_memory(self) -> bool:
        """Cleanup export-related models from memory."""
-        if not self._ensure_subprocess_alive():
-            # No subprocess — just clear local state
-            self.current_checkpoint = None
-            self.is_vision = False
-            self.is_peft = False
-            return True
+        with self._lock:
+            if not self._ensure_subprocess_alive():
+                # No subprocess — just clear local state
+                self.current_checkpoint = None
+                self.is_vision = False
+                self.is_peft = False
+                return True

-        try:
-            self._send_cmd({"type": "cleanup"})
-            resp = self._wait_response("cleanup_done", timeout = 30)
-            success = resp.get("success", False)
-        except RuntimeError:
-            success = False
+            self._export_active = True
+            try:
+                try:
+                    self._send_cmd({"type": "cleanup"})
+                    resp = self._wait_response("cleanup_done", timeout = 30)
+                    success = resp.get("success", False)
+                except RuntimeError:
+                    success = False

-        # Shut down subprocess after cleanup — no model loaded
-        self._shutdown_subprocess()
+                # Shut down subprocess after cleanup — no model loaded
+                self._shutdown_subprocess()

-        self.current_checkpoint = None
-        self.is_vision = False
-        self.is_peft = False
-        return success
+                self.current_checkpoint = None
+                self.is_vision = False
+                self.is_peft = False
+                return success
+            finally:
+                self._export_active = False

    def scan_checkpoints(
        self, outputs_dir: str = str(outputs_root())
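The live log capture added to `ExportOrchestrator` above combines a bounded `deque` with a sequence counter that is never reset. The sketch below isolates that idea in a hypothetical `LogRing` class (the class and method names are illustrative, not the PR's), to show why a cursor taken at the start of a run still works after the buffer is cleared or old entries are evicted:

```python
import threading
from collections import deque
from typing import Any, Deque, Dict, List, Tuple


class LogRing:
    """Hypothetical stand-in for the orchestrator's log buffer."""

    def __init__(self, maxlen: int = 4000) -> None:
        self._buf: Deque[Dict[str, Any]] = deque(maxlen=maxlen)
        self._lock = threading.Lock()
        self._seq = 0            # never reset, so old cursors stay valid
        self._run_start_seq = 0  # snapshot of _seq taken by clear()

    def append(self, line: str, stream: str = "stdout") -> None:
        if not line:
            return
        with self._lock:
            self._seq += 1
            self._buf.append({"seq": self._seq, "stream": stream, "line": line})

    def clear(self) -> None:
        """Drop buffered lines and mark the start of a new run."""
        with self._lock:
            self._buf.clear()
            self._run_start_seq = self._seq

    def since(self, cursor: int) -> Tuple[List[Dict[str, Any]], int]:
        """Return entries with seq > cursor, plus the advanced cursor."""
        with self._lock:
            new = [e for e in self._buf if e["seq"] > cursor]
        return (new, new[-1]["seq"]) if new else ([], cursor)

    def run_start_seq(self) -> int:
        with self._lock:
            return self._run_start_seq


ring = LogRing(maxlen=3)
ring.append("old run line")
ring.clear()                       # new run starts; seq stays at 1
for text in ("a", "b", "c", "d"):  # four lines into a three-slot ring
    ring.append(text)

entries, cursor = ring.since(ring.run_start_seq())
lines = [e["line"] for e in entries]
```

Because the ring holds only three entries, the oldest line of the run is evicted and the late-joining reader receives the remaining three. Polling again with the returned cursor yields nothing until new lines arrive.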
--- a/studio/backend/core/export/worker.py
+++ b/studio/backend/core/export/worker.py
@ -17,10 +17,12 @@ Pattern follows core/inference/worker.py and core/training/worker.py.

 from __future__ import annotations

+import errno
 import structlog
 from loggers import get_logger
 import os
 import sys
+import threading
 import time
 import traceback
 from pathlib import Path
@ -29,38 +31,164 @@ from typing import Any
 logger = get_logger(__name__)


-def _activate_transformers_version(model_name: str) -> None:
-    """Activate the correct transformers version BEFORE any ML imports.
+# Gate that controls whether captured stdout/stderr lines are forwarded
+# to the parent's resp_queue (and from there to the export-dialog SSE
+# stream). Closed by default so the noisy bootstrap phase -- transformers
+# venv activation, Unsloth/torch imports, base-model resolution, "Top
+# GGUF/hub models" lists, vision detection, weight loading bars -- is
+# suppressed in the UI. _handle_export() opens the gate at the start of
+# the actual export work and leaves it open; the orchestrator always
+# spawns a fresh subprocess for the next checkpoint load (see
+# orchestrator._spawn_subprocess) which resets this state.
+#
+# Lines dropped while the gate is closed are still echoed to the saved
+# original stdout/stderr fds so the server console / log file keeps the
+# full output for debugging.
+_log_forward_gate = threading.Event()

-    If the model needs transformers 5.x, prepend the pre-installed .venv_t5/
-    directory to sys.path. Otherwise do nothing (default 4.57.x in .venv/).
+
+def _setup_log_capture(resp_queue: Any) -> None:
+    """Redirect fds 1 and 2 through pipes so every line printed by this
+    worker process and any child process it spawns is forwarded to the
+    parent process via resp_queue as {"type": "log", ...} messages.
+
+    Must be called BEFORE LogConfig.setup_logging and BEFORE any ML
+    imports, otherwise library handlers may capture the original stderr
+    reference and bypass the pipe.
+
+    Lines are also echoed back to the original stdout/stderr so the
+    server console keeps receiving the full subprocess output, even
+    while ``_log_forward_gate`` is closed.
    """
+
+    try:
+        saved_out_fd = os.dup(1)
+        saved_err_fd = os.dup(2)
+    except OSError:
+        # dup failed (exotic platforms) - give up quietly, export still
+        # works, just no live log streaming.
+        return
+
+    try:
+        r_out, w_out = os.pipe()
+        r_err, w_err = os.pipe()
+    except OSError:
+        os.close(saved_out_fd)
+        os.close(saved_err_fd)
+        return
+
+    try:
+        os.dup2(w_out, 1)
+        os.dup2(w_err, 2)
+    except OSError:
+        for fd in (saved_out_fd, saved_err_fd, r_out, w_out, r_err, w_err):
+            try:
+                os.close(fd)
+            except OSError:
+                pass
+        return
+
+    # Close the write ends we just dup2'd (fds 1 and 2 are the real
+    # write ends now).
+    os.close(w_out)
+    os.close(w_err)
+
+    # Replace Python's sys.stdout/sys.stderr with line-buffered writers
+    # bound to the (now-redirected) fds 1 and 2.
+    try:
+        sys.stdout = os.fdopen(1, "w", buffering = 1, encoding = "utf-8", errors = "replace")
+        sys.stderr = os.fdopen(2, "w", buffering = 1, encoding = "utf-8", errors = "replace")
+    except Exception:
+        pass
+
+    def _reader(read_fd: int, stream_name: str, echo_fd: int) -> None:
+        buf = bytearray()
+        while True:
+            try:
+                chunk = os.read(read_fd, 4096)
+            except OSError as exc:
+                if exc.errno == errno.EINTR:
+                    continue
+                break
+            if not chunk:
+                break
+            # Echo to the original fd so the server console still sees
+            # the full output.
+            try:
+                os.write(echo_fd, chunk)
+            except OSError:
+                pass
+            buf.extend(chunk)
+            # Split on \n OR \r so tqdm-style progress bars update.
+            while True:
+                nl = -1
+                for i, b in enumerate(buf):
+                    if b == 0x0A or b == 0x0D:
+                        nl = i
+                        break
+                if nl < 0:
+                    break
+                line = bytes(buf[:nl]).decode("utf-8", errors = "replace")
+                del buf[: nl + 1]
+                if not line:
+                    continue
+                if not _log_forward_gate.is_set():
+                    # Gate closed (bootstrap phase) -- already echoed to
+                    # the saved console fd above; drop the line so the
+                    # export dialog doesn't see import / vendoring noise.
+                    continue
+                try:
+                    resp_queue.put_nowait(
+                        {
+                            "type": "log",
+                            "stream": stream_name,
+                            "line": line,
+                            "ts": time.time(),
+                        }
+                    )
+                except Exception:
+                    # Queue put failed (full, closed, etc.) - drop the
+                    # line rather than crash the reader thread.
+                    pass
+        if buf and _log_forward_gate.is_set():
+            try:
+                resp_queue.put_nowait(
+                    {
+                        "type": "log",
+                        "stream": stream_name,
+                        "line": bytes(buf).decode("utf-8", errors = "replace"),
+                        "ts": time.time(),
+                    }
+                )
+            except Exception:
+                pass
+
+    t_out = threading.Thread(
+        target = _reader,
+        args = (r_out, "stdout", saved_out_fd),
+        daemon = True,
+        name = "export-log-stdout",
+    )
+    t_err = threading.Thread(
+        target = _reader,
+        args = (r_err, "stderr", saved_err_fd),
+        daemon = True,
+        name = "export-log-stderr",
+    )
+    t_out.start()
+    t_err.start()
+
+
+def _activate_transformers_version(model_name: str) -> None:
+    """Activate the correct transformers version BEFORE any ML imports."""
    # Ensure backend is on path for utils imports
    backend_path = str(Path(__file__).resolve().parent.parent.parent)
    if backend_path not in sys.path:
        sys.path.insert(0, backend_path)

-    from utils.transformers_version import (
-        needs_transformers_5,
-        _resolve_base_model,
-        _ensure_venv_t5_exists,
-        _VENV_T5_DIR,
-    )
+    from utils.transformers_version import activate_transformers_for_subprocess

-    resolved = _resolve_base_model(model_name)
-    if needs_transformers_5(resolved):
-        if not _ensure_venv_t5_exists():
-            raise RuntimeError(
-                f"Cannot activate transformers 5.x: .venv_t5 missing at {_VENV_T5_DIR}"
-            )
-        if _VENV_T5_DIR not in sys.path:
-            sys.path.insert(0, _VENV_T5_DIR)
-        logger.info("Activated transformers 5.x from %s", _VENV_T5_DIR)
-        # Propagate to child subprocesses (e.g. GGUF converter)
-        _pp = os.environ.get("PYTHONPATH", "")
-        os.environ["PYTHONPATH"] = _VENV_T5_DIR + (os.pathsep + _pp if _pp else "")
-    else:
-        logger.info("Using default transformers (4.57.x) for %s", model_name)
+    activate_transformers_for_subprocess(model_name)


 def _send_response(resp_queue: Any, response: dict) -> None:
@ -78,6 +206,19 @@ def _handle_load(backend, cmd: dict, resp_queue: Any) -> None:
    load_in_4bit = cmd.get("load_in_4bit", True)
    trust_remote_code = cmd.get("trust_remote_code", False)

+    # Auto-enable trust_remote_code for NemotronH/Nano models.
+    if not trust_remote_code:
+        _NEMOTRON_TRUST_SUBSTRINGS = ("nemotron_h", "nemotron-h", "nemotron-3-nano")
+        _cp_lower = checkpoint_path.lower()
+        if any(sub in _cp_lower for sub in _NEMOTRON_TRUST_SUBSTRINGS) and (
+            _cp_lower.startswith("unsloth/") or _cp_lower.startswith("nvidia/")
+        ):
+            trust_remote_code = True
+            logger.info(
+                "Auto-enabled trust_remote_code for Nemotron model: %s",
+                checkpoint_path,
+            )
+
    try:
        _send_response(
            resp_queue,
@ -126,9 +267,17 @@ def _handle_export(backend, cmd: dict, resp_queue: Any) -> None:
    export_type = cmd["export_type"]  # "merged", "base", "gguf", "lora"
    response_type = f"export_{export_type}_done"

+    # Open the log forwarding gate so the user sees the actual export
+    # progress (Unsloth merge bars, file copies, GGUF conversion, etc.)
+    # in the live log panel. The gate stays open for the rest of this
+    # subprocess's life; the orchestrator spawns a fresh subprocess for
+    # the next checkpoint load, which resets the gate to closed.
+    _log_forward_gate.set()
+
+    output_path: Any = None
    try:
        if export_type == "merged":
-            success, message = backend.export_merged_model(
+            success, message, output_path = backend.export_merged_model(
                save_directory = cmd.get("save_directory", ""),
                format_type = cmd.get("format_type", "16-bit (FP16)"),
                push_to_hub = cmd.get("push_to_hub", False),
@ -137,7 +286,7 @@ def _handle_export(backend, cmd: dict, resp_queue: Any) -> None:
                private = cmd.get("private", False),
            )
        elif export_type == "base":
-            success, message = backend.export_base_model(
+            success, message, output_path = backend.export_base_model(
                save_directory = cmd.get("save_directory", ""),
                push_to_hub = cmd.get("push_to_hub", False),
                repo_id = cmd.get("repo_id"),
@ -146,7 +295,7 @@ def _handle_export(backend, cmd: dict, resp_queue: Any) -> None:
                base_model_id = cmd.get("base_model_id"),
            )
        elif export_type == "gguf":
-            success, message = backend.export_gguf(
+            success, message, output_path = backend.export_gguf(
                save_directory = cmd.get("save_directory", ""),
                quantization_method = cmd.get("quantization_method", "Q4_K_M"),
                push_to_hub = cmd.get("push_to_hub", False),
@ -154,7 +303,7 @@ def _handle_export(backend, cmd: dict, resp_queue: Any) -> None:
                hf_token = cmd.get("hf_token"),
            )
        elif export_type == "lora":
-            success, message = backend.export_lora_adapter(
+            success, message, output_path = backend.export_lora_adapter(
                save_directory = cmd.get("save_directory", ""),
                push_to_hub = cmd.get("push_to_hub", False),
                repo_id = cmd.get("repo_id"),
@ -170,6 +319,7 @@ def _handle_export(backend, cmd: dict, resp_queue: Any) -> None:
                "type": response_type,
                "success": success,
                "message": message,
+                "output_path": output_path,
                "ts": time.time(),
            },
        )
@ -181,6 +331,7 @@ def _handle_export(backend, cmd: dict, resp_queue: Any) -> None:
                "type": response_type,
                "success": False,
                "message": str(exc),
+                "output_path": None,
                "stack": traceback.format_exc(limit = 20),
                "ts": time.time(),
            },
@ -226,10 +377,26 @@ def run_export_process(
    """
    import queue as _queue

+    # Install fd-level stdout/stderr capture FIRST so every subsequent
+    # print and every child process inherits the redirected fds. This
+    # is what powers the live export log stream in the UI.
+    _setup_log_capture(resp_queue)
+
    os.environ["TOKENIZERS_PARALLELISM"] = "false"
    os.environ["PYTHONWARNINGS"] = (
        "ignore"  # Suppress warnings at C-level before imports
    )
+    # Force unbuffered output from any child Python process (e.g. the
+    # GGUF converter) so their prints surface in the log stream as they
+    # happen rather than at the end.
+    os.environ["PYTHONUNBUFFERED"] = "1"
+    # tqdm defaults to a 10-second mininterval when stdout is not a tty
+    # (which it isn't here -- we redirected fd 1/2 to a pipe). That makes
+    # multi-step progress bars look frozen in the export log panel. Force
+    # frequent flushes so the user sees movement during merge / GGUF
+    # conversion. Has no effect on single-step bars (e.g. "Copying 1
+    # files") which only emit start/end events regardless.
+    os.environ.setdefault("TQDM_MININTERVAL", "0.5")

    import warnings
    from loggers.config import LogConfig
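The reader thread in `_setup_log_capture` above splits captured bytes on either `\n` or `\r`, so that carriage-return progress bars produce one log line per update. A minimal sketch of just that splitting step, assuming a hypothetical helper `drain_lines` (not in the PR) that uses `bytearray.find` rather than a byte-by-byte scan:

```python
from typing import List


def drain_lines(buf: bytearray) -> List[str]:
    """Pop every complete line from buf, treating LF and CR as terminators.

    Bytes after the last terminator stay in buf for the next read.
    """
    lines: List[str] = []
    while True:
        # Earliest terminator currently buffered, if any.
        hits = [i for i in (buf.find(b"\n"), buf.find(b"\r")) if i >= 0]
        if not hits:
            break
        cut = min(hits)
        line = bytes(buf[:cut]).decode("utf-8", errors="replace")
        del buf[: cut + 1]
        if line:  # skip empty lines, e.g. the gap inside "\r\n"
            lines.append(line)
    return lines


buf = bytearray(b"10%\r50%\r100%\r\ndone\npart")
complete = drain_lines(buf)
```

The trailing `part` has no terminator yet, so it remains buffered, much as the reader keeps its partial line until more bytes arrive or the pipe closes.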
--- a/studio/backend/core/inference/_html_to_md.py
+++ b/studio/backend/core/inference/_html_to_md.py
@ -0,0 +1,447 @@
+# SPDX-License-Identifier: AGPL-3.0-only
+# Copyright 2026-present the Unsloth AI Inc. team. All rights reserved. See /studio/LICENSE.AGPL-3.0
+
+"""
+Minimal HTML-to-Markdown converter using only the standard library.
+
+Replaces the external ``html2text`` (GPL-3.0) dependency with a ~250-line
+``html.parser.HTMLParser`` subclass.  Covers headings, links, bold/italic,
+lists, tables, blockquotes, code blocks, and entity decoding.
+"""
+
+from __future__ import annotations
+
+import html
+import re
+from html.parser import HTMLParser
+
+__all__ = ["html_to_markdown"]
+
+_SKIP_TAGS = frozenset(
+    {
+        "script",
+        "style",
+        "head",
+        "noscript",
+        "svg",
+        "math",
+        "nav",
+        "footer",
+    }
+)
+_BLOCK_TAGS = frozenset(
+    {
+        "p",
+        "div",
+        "section",
+        "article",
+        "main",
+        "aside",
+        "figure",
+        "figcaption",
+        "details",
+        "summary",
+        "dl",
+        "dt",
+        "dd",
+    }
+)
+_HEADING_TAGS = frozenset({"h1", "h2", "h3", "h4", "h5", "h6"})
+_INLINE_EMPHASIS = {"strong": "**", "b": "**", "em": "*", "i": "*"}
+
+
+class _MarkdownRenderer(HTMLParser):
+    """HTMLParser subclass that emits Markdown tokens into a list."""
+
+    def __init__(self):
+        super().__init__(convert_charrefs = False)
+        self._out: list[str] = []
+        self._skip_depth: int = 0
+
+        # Link state
+        self._link_href: str | None = None
+        self._link_text_parts: list[str] = []
+        self._in_link: bool = False
+
+        # List state
+        self._list_stack: list[str] = []  # "ul" or "ol"
+        self._ol_counter: list[int] = []
+
+        # Table state
+        self._in_table: bool = False
+        self._current_row: list[str] = []
+        self._cell_parts: list[str] = []
+        self._in_cell: bool = False
+        self._header_row_done: bool = False
+        self._row_has_th: bool = False
+        self._is_first_row: bool = False
+
+        # Pre/code state
+        self._in_pre: bool = False
+        self._pre_parts: list[str] = []
+        self._in_inline_code: bool = False
+
+        # Blockquote state -- stack of output buffers so nested
+        # blockquotes each collect their own content and get prefixed
+        # with the correct number of ">" markers on close.
+        self._bq_stack: list[list[str]] = []
+
+    # ------------------------------------------------------------------
+    def _emit(self, text: str) -> None:
+        if self._in_link:
+            self._link_text_parts.append(text)
+        elif self._in_cell:
+            self._cell_parts.append(text)
+        elif self._in_pre:
+            self._pre_parts.append(text)
+        elif self._bq_stack:
+            self._bq_stack[-1].append(text)
+        else:
+            self._out.append(text)
+
+    # ------------------------------------------------------------------
+    def _prefix_blockquote(self, content: str) -> str:
+        """Prefix every line of *content* with ``> ``."""
+        # Strip trailing whitespace first, then collapse blank lines
+        content = re.sub(r"[ \t]+$", "", content, flags = re.MULTILINE)
+        content = re.sub(r"\n{3,}", "\n\n", content).strip()
+        if not content:
+            return ""
+        lines = content.split("\n")
+        prefixed: list[str] = []
+        for line in lines:
+            if line.strip():
+                prefixed.append("> " + line)
+            else:
+                prefixed.append(">")
+        return "\n".join(prefixed)
+
+    # ------------------------------------------------------------------
+    # Table helpers -- flush open cells and rows so that HTML with
+    # omitted optional end tags (</td>, </tr>) does not lose data.
+    # ------------------------------------------------------------------
+    def _finish_cell(self) -> None:
+        if not self._in_cell:
+            return
+        self._in_cell = False
+        cell_text = "".join(self._cell_parts).strip().replace("\n", " ")
+        cell_text = cell_text.replace("|", "\\|")
+        self._current_row.append(cell_text)
+        self._cell_parts = []
+
+    def _finish_row(self) -> None:
+        if not self._current_row:
+            return
+        line = "| " + " | ".join(self._current_row) + " |"
+        self._emit(line + "\n")
+        if not self._header_row_done and (self._row_has_th or self._is_first_row):
+            sep = "| " + " | ".join("---" for _ in self._current_row) + " |"
+            self._emit(sep + "\n")
+            self._header_row_done = True
+        self._is_first_row = False
+        self._current_row = []
+        self._row_has_th = False
+
+    # ------------------------------------------------------------------
+    # Link text helper -- normalize whitespace so block-level content
+    # inside an <a> does not produce multiline Markdown link labels.
+    # ------------------------------------------------------------------
+    def _finish_link(self) -> None:
+        text = re.sub(r"\s+", " ", "".join(self._link_text_parts)).strip()
+        href = self._link_href or ""
+        self._in_link = False
+        if href and text:
+            self._emit(f"[{text}]({href})")
+        elif text:
+            self._emit(text)
+
+    # ------------------------------------------------------------------
+    # Tag handlers
+    # ------------------------------------------------------------------
+    def handle_starttag(self, tag: str, attrs: list[tuple[str, str | None]]) -> None:
+        tag = tag.lower()
+
+        if tag in _SKIP_TAGS:
+            self._skip_depth += 1
+            return
+        if self._skip_depth:
+            return
+
+        attr_dict = dict(attrs)
+
+        if tag in _HEADING_TAGS:
+            level = int(tag[1])
+            self._emit("\n\n" + "#" * level + " ")
+
+        elif tag == "a":
+            self._link_href = attr_dict.get("href")
+            self._link_text_parts = []
+            self._in_link = True
+
+        elif tag in _INLINE_EMPHASIS:
+            self._emit(_INLINE_EMPHASIS[tag])
+
+        elif tag == "br":
+            self._emit("\n")
+
+        elif tag in _BLOCK_TAGS:
+            self._emit("\n\n")
+
+        elif tag == "hr":
+            self._emit("\n\n---\n\n")
+
+        elif tag == "blockquote":
+            self._emit("\n\n")
+            self._bq_stack.append([])
+
+        elif tag == "ul":
+            self._list_stack.append("ul")
+            self._emit("\n")
+
+        elif tag == "ol":
+            self._list_stack.append("ol")
+            start_attr = attr_dict.get("start")
+            try:
+                start = int(start_attr) if start_attr is not None else 1
+            except (ValueError, TypeError):
+                start = 1
+            self._ol_counter.append(start - 1)
+            self._emit("\n")
+
+        elif tag == "li":
+            indent = "  " * max(0, len(self._list_stack) - 1)
+            if self._list_stack and self._list_stack[-1] == "ol":
+                if self._ol_counter:
+                    self._ol_counter[-1] += 1
+                    self._emit(f"\n{indent}{self._ol_counter[-1]}. ")
+                else:
+                    self._emit(f"\n{indent}1. ")
+            else:
+                self._emit(f"\n{indent}* ")
+
+        elif tag == "pre":
+            self._pre_parts = []
+            self._in_pre = True
+
+        elif tag == "code" and not self._in_pre:
+            self._in_inline_code = True
+            self._emit("`")
+
+        elif tag == "table":
+            self._in_table = True
+            self._header_row_done = False
+            self._is_first_row = True
+            self._emit("\n\n")
+
+        elif tag == "tr":
+            # Flush any open cell/row from a previous row that may
+            # have omitted its optional </td> or </tr> end tags.
+            self._finish_cell()
+            self._finish_row()
+
+        elif tag in ("th", "td"):
+            # Flush any open cell (handles omitted </td> / </th>)
+            self._finish_cell()
+            self._cell_parts = []
+            self._in_cell = True
+            if tag == "th":
+                self._row_has_th = True
+
+        elif tag == "img":
+            # Skip images -- keeps fetched page text focused on readable
+            # content and avoids data-URI amplification.
+            return
+
+    def handle_endtag(self, tag: str) -> None:
+        tag = tag.lower()
+
+        if tag in _SKIP_TAGS:
+            self._skip_depth = max(0, self._skip_depth - 1)
+            return
+        if self._skip_depth:
+            return
+
+        if tag in _HEADING_TAGS:
+            self._emit("\n\n")
+
+        elif tag == "a":
+            self._finish_link()
+
+        elif tag in _INLINE_EMPHASIS:
+            self._emit(_INLINE_EMPHASIS[tag])
+
+        elif tag in _BLOCK_TAGS:
+            self._emit("\n\n")
+
+        elif tag == "blockquote":
+            if self._bq_stack:
+                content = "".join(self._bq_stack.pop())
+                prefixed = self._prefix_blockquote(content)
+                if prefixed:
+                    self._emit("\n\n" + prefixed + "\n\n")
+
+        elif tag == "ul":
+            if self._list_stack and self._list_stack[-1] == "ul":
+                self._list_stack.pop()
+            self._emit("\n")
+
+        elif tag == "ol":
+            if self._list_stack and self._list_stack[-1] == "ol":
+                self._list_stack.pop()
+                if self._ol_counter:
+                    self._ol_counter.pop()
+            self._emit("\n")
+
+        elif tag == "pre":
+            raw = "".join(self._pre_parts)
+            self._in_pre = False
+            block = "```\n" + raw + "\n```"
+            self._emit("\n\n" + block + "\n\n")
+
+        elif tag == "code" and not self._in_pre:
+            self._in_inline_code = False
+            self._emit("`")
+
+        elif tag in ("th", "td"):
+            self._finish_cell()
+
+        elif tag == "tr":
+            self._finish_cell()
+            self._finish_row()
+
+        elif tag == "table":
+            # Flush any remaining row (handles omitted </tr>)
+            self._finish_cell()
+            self._finish_row()
+            self._in_table = False
+            self._emit("\n")
+
+    # ------------------------------------------------------------------
+    # Text / entity handlers
+    # ------------------------------------------------------------------
+    def handle_data(self, data: str) -> None:
+        if self._skip_depth:
+            return
+        if self._in_pre:
+            self._pre_parts.append(data)
+            return
+        # Preserve literal whitespace inside inline <code> spans
+        if self._in_inline_code:
+            self._emit(data)
+            return
+        # Collapse all whitespace (including newlines) per HTML rules
+        text = re.sub(r"\s+", " ", data)
+        # Suppress whitespace-only text nodes between table structural
+        # elements (indentation from source HTML) to prevent leading
+        # spaces from breaking Markdown table row alignment.
+        if self._in_table and not self._in_cell and not text.strip():
+            return
+        self._emit(text)
+
+    def handle_entityref(self, name: str) -> None:
+        if self._skip_depth:
+            return
+        self._emit(html.unescape(f"&{name};"))
+
+    def handle_charref(self, name: str) -> None:
+        if self._skip_depth:
+            return
+        self._emit(html.unescape(f"&#{name};"))
+
+    # ------------------------------------------------------------------
+    # Flush pending buffers (handles truncated HTML from capped fetches)
+    # ------------------------------------------------------------------
+    def flush_pending(self) -> None:
+        """Flush any open side-buffers into ``_out``.
+
+        Called after ``close()`` to recover content from truncated HTML
+        where closing tags were never seen (common when ``_fetch_page_text``
+        caps the download by byte count).
+        """
+        # Flush innermost buffers first so their content propagates outward.
+
+        if self._in_link:
+            self._finish_link()
+
+        if self._in_inline_code:
+            self._in_inline_code = False
+            self._emit("`")
+
+        self._finish_cell()
+        self._finish_row()
+
+        if self._in_pre:
+            raw = "".join(self._pre_parts)
+            self._in_pre = False
+            block = "```\n" + raw + "\n```"
+            self._emit("\n\n" + block + "\n\n")
+
+        # Flatten any open blockquote buffers (innermost first)
+        while self._bq_stack:
+            content = "".join(self._bq_stack.pop())
+            prefixed = self._prefix_blockquote(content)
+            if not prefixed:
+                continue
+            if self._bq_stack:
+                self._bq_stack[-1].append("\n\n" + prefixed + "\n\n")
+            else:
+                self._out.append("\n\n" + prefixed + "\n\n")
+
+
+# ------------------------------------------------------------------
+# Post-processing
+# ------------------------------------------------------------------
+def _cleanup(text: str) -> str:
+    """Normalize whitespace and blank lines in the final output.
+
+    Preserves content inside fenced code blocks verbatim so that
+    intentional blank lines in ``<pre>`` content are not collapsed.
+    """
+    lines = text.split("\n")
+    out: list[str] = []
+    in_fence = False
+    blank_run = 0
+
+    for line in lines:
+        stripped = line.rstrip(" \t")
+        if stripped.startswith("```"):
+            in_fence = not in_fence
+            blank_run = 0
+            out.append(stripped)
+            continue
+
+        if in_fence:
+            # Preserve code block content exactly as-is
+            out.append(line)
+            continue
+
+        if not stripped:
+            blank_run += 1
+            if blank_run <= 1:
+                out.append("")
+            continue
+
+        blank_run = 0
+        out.append(stripped)
+
+    return "\n".join(out).strip()
+
+
+# ------------------------------------------------------------------
+# Public API
+# ------------------------------------------------------------------
+def html_to_markdown(source_html: str) -> str:
+    """Convert an HTML string to Markdown.
+
+    Handles headings, links, bold/italic, lists (ordered and unordered),
+    tables, blockquotes, code blocks, and HTML entities.  Subtrees of the
+    tags in ``_SKIP_TAGS`` (``<script>``, ``<style>``, ``<head>``, ...) and ``<img>`` are dropped.
+    """
+    # Normalize line endings before parsing
+    source_html = source_html.replace("\r\n", "\n").replace("\r", "\n")
+    renderer = _MarkdownRenderer()
+    renderer.feed(source_html)
+    renderer.close()
+    renderer.flush_pending()
+    raw = "".join(renderer._out)
+    return _cleanup(raw)
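
Review note: the skip-depth counter used for `_SKIP_TAGS` can be illustrated in isolation. The following is a minimal sketch, not the module's code; `TextOnly`, `SKIP`, and `visible_text` are hypothetical names introduced only for this example, and it collects plain text rather than Markdown.

```python
from html.parser import HTMLParser

# Hypothetical, reduced skip set; the real module lists more tags.
SKIP = frozenset({"script", "style"})


class TextOnly(HTMLParser):
    """Illustrative parser: keeps visible text, drops skipped subtrees."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP:
            # Clamp at zero so a stray end tag cannot make the depth negative.
            self.skip_depth = max(0, self.skip_depth - 1)

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def visible_text(source):
    parser = TextOnly()
    parser.feed(source)
    parser.close()
    # Collapse runs of whitespace, as HTML rendering would.
    return " ".join("".join(parser.parts).split())


print(visible_text("<p>Hello <script>var x = 1;</script>world</p>"))
```

Clamping the counter at zero mirrors the `max(0, ...)` in `handle_endtag` above: an unmatched `</script>` in truncated or malformed input does not suppress the text that follows it.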
--- a/studio/backend/core/inference/anthropic_compat.py
+++ b/studio/backend/core/inference/anthropic_compat.py
@ -0,0 +1,521 @@
+# SPDX-License-Identifier: AGPL-3.0-only
+# Copyright 2026-present the Unsloth AI Inc. team. All rights reserved.
+
+"""
+Anthropic Messages API ↔ OpenAI format translation utilities.
+
+Pure functions and a stateful stream emitter — no FastAPI, no I/O.
+"""
+
+from __future__ import annotations
+
+import json
+from typing import Any, Optional, Union
+
+
+def anthropic_messages_to_openai(
+    messages: list[dict],
+    system: Optional[Union[str, list]] = None,
+) -> list[dict]:
+    """Convert Anthropic messages + system to OpenAI-format message dicts."""
+    result: list[dict] = []
+
+    # System prompt
+    if system:
+        if isinstance(system, str):
+            result.append({"role": "system", "content": system})
+        elif isinstance(system, list):
+            parts = []
+            for block in system:
+                if isinstance(block, dict) and block.get("type") == "text":
+                    parts.append(block["text"])
+                elif isinstance(block, str):
+                    parts.append(block)
+            if parts:
+                result.append({"role": "system", "content": "\n".join(parts)})
+
+    for msg in messages:
+        role = msg["role"] if isinstance(msg, dict) else msg.role
+        content = msg["content"] if isinstance(msg, dict) else msg.content
+
+        if isinstance(content, str):
+            result.append({"role": role, "content": content})
+            continue
+
+        # Content is a list of blocks
+        text_parts: list[str] = []
+        tool_calls: list[dict] = []
+        tool_results: list[dict] = []
+
+        for block in content:
+            b = block if isinstance(block, dict) else block.model_dump()
+            btype = b.get("type", "")
+
+            if btype == "text":
+                text_parts.append(b["text"])
+            elif btype == "tool_use":
+                tool_calls.append(
+                    {
+                        "id": b["id"],
+                        "type": "function",
+                        "function": {
+                            "name": b["name"],
+                            "arguments": json.dumps(b["input"]),
+                        },
+                    }
+                )
+            elif btype == "tool_result":
+                tc = b.get("content", "")
+                if isinstance(tc, list):
+                    tc = " ".join(
+                        p["text"]
+                        for p in tc
+                        if isinstance(p, dict) and p.get("type") == "text"
+                    )
+                tool_results.append(
+                    {
+                        "role": "tool",
+                        "tool_call_id": b["tool_use_id"],
+                        "content": str(tc),
+                    }
+                )
+
+        if role == "assistant":
+            msg_dict: dict[str, Any] = {"role": "assistant"}
+            if text_parts:
+                msg_dict["content"] = "\n".join(text_parts)
+            if tool_calls:
+                msg_dict["tool_calls"] = tool_calls
+            result.append(msg_dict)
+        elif role == "user":
+            # Tool results must directly follow the assistant tool_calls turn.
+            result.extend(tool_results)
+            if text_parts:
+                result.append({"role": "user", "content": "\n".join(text_parts)})
+
+    return result
+
+
+def anthropic_tools_to_openai(tools: list) -> list[dict]:
+    """Convert Anthropic tool definitions to OpenAI function-tool format."""
+    result = []
+    for t in tools:
+        td = t if isinstance(t, dict) else t.model_dump()
+        result.append(
+            {
+                "type": "function",
+                "function": {
+                    "name": td["name"],
+                    "description": td.get("description", ""),
+                    "parameters": td.get("input_schema", {}),
+                },
+            }
+        )
+    return result
+
+
+def anthropic_tool_choice_to_openai(tc: Any) -> Any:
+    """Translate Anthropic ``tool_choice`` into OpenAI ``tool_choice``.
+
+    Anthropic formats (all dict shapes with a ``type`` discriminator):
+
+    - ``{"type": "auto"}``                       → ``"auto"``
+    - ``{"type": "any"}``                        → ``"required"``
+    - ``{"type": "none"}``                       → ``"none"``
+    - ``{"type": "tool", "name": "get_weather"}``
+          → ``{"type": "function", "function": {"name": "get_weather"}}``
+
+    Returns ``None`` for ``None`` or any unrecognized shape (caller may
+    then fall back to its own default, typically ``"auto"``).
+    """
+    if tc is None:
+        return None
+    if not isinstance(tc, dict):
+        return None
+    t = tc.get("type")
+    if t == "auto":
+        return "auto"
+    if t == "any":
+        return "required"
+    if t == "none":
+        return "none"
+    if t == "tool":
+        name = tc.get("name")
+        if not name:
+            return None
+        return {"type": "function", "function": {"name": name}}
+    return None
+
+
+def build_anthropic_sse_event(event_type: str, data: dict) -> str:
+    """Format a single Anthropic SSE event."""
+    return f"event: {event_type}\ndata: {json.dumps(data)}\n\n"
+
+
+class AnthropicStreamEmitter:
+    """Converts generator events from generate_chat_completion_with_tools()
+    into Anthropic Messages SSE strings."""
+
+    def __init__(self) -> None:
+        self.block_index: int = 0
+        self._text_block_open: bool = False
+        self._prev_text: str = ""
+        self._usage: dict = {}
+
+    def start(self, message_id: str, model: str) -> list[str]:
+        """Emit message_start and open the first text content block."""
+        events = []
+        events.append(
+            build_anthropic_sse_event(
+                "message_start",
+                {
+                    "type": "message_start",
+                    "message": {
+                        "id": message_id,
+                        "type": "message",
+                        "role": "assistant",
+                        "content": [],
+                        "model": model,
+                        "stop_reason": None,
+                        "stop_sequence": None,
+                        "usage": {"input_tokens": 0, "output_tokens": 0},
+                    },
+                },
+            )
+        )
+        events.extend(self._open_text_block())
+        return events
+
+    def feed(self, event: dict) -> list[str]:
+        """Process one generator event, return SSE strings."""
+        etype = event.get("type", "")
+        if etype == "content":
+            return self._handle_content(event)
+        elif etype == "tool_start":
+            return self._handle_tool_start(event)
+        elif etype == "tool_end":
+            return self._handle_tool_end(event)
+        elif etype == "metadata":
+            self._usage = event.get("usage", {})
+            return []
+        # status events — no Anthropic equivalent
+        return []
+
+    def finish(self, stop_reason: str = "end_turn") -> list[str]:
+        """Close any open block and emit message_delta + message_stop."""
+        events = []
+        if self._text_block_open:
+            events.append(self._close_block())
+        events.append(
+            build_anthropic_sse_event(
+                "message_delta",
+                {
+                    "type": "message_delta",
+                    "delta": {"stop_reason": stop_reason, "stop_sequence": None},
+                    "usage": {
+                        "output_tokens": self._usage.get("completion_tokens", 0),
+                    },
+                },
+            )
+        )
+        events.append(
+            build_anthropic_sse_event(
+                "message_stop",
+                {
+                    "type": "message_stop",
+                },
+            )
+        )
+        return events
+
+    def _handle_content(self, event: dict) -> list[str]:
+        cumulative = event.get("text", "")
+        new_text = cumulative[len(self._prev_text) :]
+        self._prev_text = cumulative
+        if not new_text:
+            return []
+        if not self._text_block_open:
+            events = self._open_text_block()
+        else:
+            events = []
+        events.append(
+            build_anthropic_sse_event(
+                "content_block_delta",
+                {
+                    "type": "content_block_delta",
+                    "index": self.block_index,
+                    "delta": {"type": "text_delta", "text": new_text},
+                },
+            )
+        )
+        return events
+
+    def _handle_tool_start(self, event: dict) -> list[str]:
+        events = []
+        # Close current text block if open
+        if self._text_block_open:
+            events.append(self._close_block())
+        # Open a tool_use block
+        self.block_index += 1
+        events.append(
+            build_anthropic_sse_event(
+                "content_block_start",
+                {
+                    "type": "content_block_start",
+                    "index": self.block_index,
+                    "content_block": {
+                        "type": "tool_use",
+                        "id": event.get("tool_call_id", ""),
+                        "name": event.get("tool_name", ""),
+                        "input": {},
+                    },
+                },
+            )
+        )
+        # Emit the arguments as input_json_delta
+        args = event.get("arguments", {})
+        if args:
+            events.append(
+                build_anthropic_sse_event(
+                    "content_block_delta",
+                    {
+                        "type": "content_block_delta",
+                        "index": self.block_index,
+                        "delta": {
+                            "type": "input_json_delta",
+                            "partial_json": json.dumps(args),
+                        },
+                    },
+                )
+            )
+        return events
+
+    def _handle_tool_end(self, event: dict) -> list[str]:
+        events = []
+        # Close the tool_use block
+        events.append(self._close_block())
+        # Emit custom tool_result event (non-standard, ignored by SDKs)
+        events.append(
+            build_anthropic_sse_event(
+                "tool_result",
+                {
+                    "type": "tool_result",
+                    "tool_use_id": event.get("tool_call_id", ""),
+                    "content": event.get("result", ""),
+                },
+            )
+        )
+        # Open a new text block for the model's next response
+        self.block_index += 1
+        events.extend(self._open_text_block())
+        # Reset text tracking for the next synthesis turn
+        self._prev_text = ""
+        return events
+
+    def _open_text_block(self) -> list[str]:
+        self._text_block_open = True
+        return [
+            build_anthropic_sse_event(
+                "content_block_start",
+                {
+                    "type": "content_block_start",
+                    "index": self.block_index,
+                    "content_block": {"type": "text", "text": ""},
+                },
+            )
+        ]
+
+    def _close_block(self) -> str:
+        self._text_block_open = False
+        return build_anthropic_sse_event(
+            "content_block_stop",
+            {
+                "type": "content_block_stop",
+                "index": self.block_index,
+            },
+        )
+
+
+class AnthropicPassthroughEmitter:
+    """Converts llama-server's OpenAI-format streaming chunks into Anthropic SSE.
+
+    Used for the client-side tool-use pass-through path: the client (e.g. Claude
+    Code) sends its own tool definitions in the ``tools`` field and expects to
+    execute them itself. We forward them to llama-server and translate the
+    streaming response back to Anthropic format without executing anything.
+    """
+
+    def __init__(self) -> None:
+        self.block_index: int = -1
+        self._current_block_type: Optional[str] = None  # "text" | "tool_use" | None
+        self._tool_call_states: dict = {}  # delta index -> {block_index, id, name}
+        self._usage: dict = {}
+        self._stop_reason: str = "end_turn"
+
+    def start(self, message_id: str, model: str) -> list[str]:
+        return [
+            build_anthropic_sse_event(
+                "message_start",
+                {
+                    "type": "message_start",
+                    "message": {
+                        "id": message_id,
+                        "type": "message",
+                        "role": "assistant",
+                        "content": [],
+                        "model": model,
+                        "stop_reason": None,
+                        "stop_sequence": None,
+                        "usage": {"input_tokens": 0, "output_tokens": 0},
+                    },
+                },
+            )
+        ]
+
+    def feed_chunk(self, chunk: dict) -> list[str]:
+        """Process one OpenAI streaming chat.completion.chunk."""
+        events: list[str] = []
+
+        # usage-only chunks carry token totals
+        usage = chunk.get("usage")
+        if usage:
+            self._usage = usage
+
+        choices = chunk.get("choices") or []
+        if not choices:
+            return events
+
+        choice = choices[0]
+        delta = choice.get("delta") or {}
+        finish_reason = choice.get("finish_reason")
+
+        # ── Text content ──
+        content = delta.get("content")
+        if content:
+            if self._current_block_type != "text":
+                if self._current_block_type is not None:
+                    events.append(self._close_current_block())
+                events.extend(self._open_text_block())
+            events.append(
+                build_anthropic_sse_event(
+                    "content_block_delta",
+                    {
+                        "type": "content_block_delta",
+                        "index": self.block_index,
+                        "delta": {"type": "text_delta", "text": content},
+                    },
+                )
+            )
+
+        # ── Tool calls (streaming deltas) ──
+        tool_calls = delta.get("tool_calls") or []
+        for tc in tool_calls:
+            tc_idx = tc.get("index", 0)
+            fn = tc.get("function") or {}
+            if tc_idx not in self._tool_call_states:
+                # New tool call — close prior block, open tool_use block
+                if self._current_block_type is not None:
+                    events.append(self._close_current_block())
+                tc_id = tc.get("id", "")
+                tc_name = fn.get("name", "")
+                self.block_index += 1
+                self._current_block_type = "tool_use"
+                self._tool_call_states[tc_idx] = {
+                    "block_index": self.block_index,
+                    "id": tc_id,
+                    "name": tc_name,
+                }
+                events.append(
+                    build_anthropic_sse_event(
+                        "content_block_start",
+                        {
+                            "type": "content_block_start",
+                            "index": self.block_index,
+                            "content_block": {
+                                "type": "tool_use",
+                                "id": tc_id,
+                                "name": tc_name,
+                                "input": {},
+                            },
+                        },
+                    )
+                )
+
+            args_delta = fn.get("arguments", "")
+            if args_delta:
+                events.append(
+                    build_anthropic_sse_event(
+                        "content_block_delta",
+                        {
+                            "type": "content_block_delta",
+                            "index": self._tool_call_states[tc_idx]["block_index"],
+                            "delta": {
+                                "type": "input_json_delta",
+                                "partial_json": args_delta,
+                            },
+                        },
+                    )
+                )
+
+        # ── Finish reason ──
+        if finish_reason:
+            if finish_reason == "tool_calls":
+                self._stop_reason = "tool_use"
+            elif finish_reason == "length":
+                self._stop_reason = "max_tokens"
+            else:
+                self._stop_reason = "end_turn"
+
+        return events
+
+    def finish(self) -> list[str]:
+        events: list[str] = []
+        if self._current_block_type is not None:
+            events.append(self._close_current_block())
+        events.append(
+            build_anthropic_sse_event(
+                "message_delta",
+                {
+                    "type": "message_delta",
+                    "delta": {
+                        "stop_reason": self._stop_reason,
+                        "stop_sequence": None,
+                    },
+                    "usage": {
+                        "output_tokens": self._usage.get("completion_tokens", 0),
+                    },
+                },
+            )
+        )
+        events.append(
+            build_anthropic_sse_event(
+                "message_stop",
+                {"type": "message_stop"},
+            )
+        )
+        return events
+
+    def _open_text_block(self) -> list[str]:
+        self.block_index += 1
+        self._current_block_type = "text"
+        return [
+            build_anthropic_sse_event(
+                "content_block_start",
+                {
+                    "type": "content_block_start",
+                    "index": self.block_index,
+                    "content_block": {"type": "text", "text": ""},
+                },
+            )
+        ]
+
+    def _close_current_block(self) -> str:
+        idx = self.block_index
+        self._current_block_type = None
+        return build_anthropic_sse_event(
+            "content_block_stop",
+            {
+                "type": "content_block_stop",
+                "index": idx,
+            },
+        )
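
Review note: both emitters frame every event as `event: <type>\ndata: <json>\n\n`. The sketch below shows how a consumer could split such a stream back into events, which may help when writing tests for the emitters. It is an illustration under assumptions: `build_sse_event` re-implements the same framing locally, `parse_sse_events` is a hypothetical helper that does not exist in the module, and it handles only the single-line `event:` / `data:` fields produced here, not the full SSE grammar.

```python
import json


def build_sse_event(event_type, data):
    # Same framing as the diff's helper: one event line, one data line, blank line.
    return f"event: {event_type}\ndata: {json.dumps(data)}\n\n"


def parse_sse_events(stream):
    """Hypothetical test helper: split a framed stream into (type, payload) pairs."""
    events = []
    for frame in stream.split("\n\n"):
        if not frame.strip():
            continue
        event_type, payload = None, None
        for line in frame.split("\n"):
            if line.startswith("event: "):
                event_type = line[len("event: "):]
            elif line.startswith("data: "):
                payload = json.loads(line[len("data: "):])
        events.append((event_type, payload))
    return events


stream = "".join(
    [
        build_sse_event(
            "content_block_start",
            {"type": "content_block_start", "index": 0,
             "content_block": {"type": "text", "text": ""}},
        ),
        build_sse_event(
            "content_block_delta",
            {"type": "content_block_delta", "index": 0,
             "delta": {"type": "text_delta", "text": "line one\nline two"}},
        ),
        build_sse_event("content_block_stop", {"type": "content_block_stop", "index": 0}),
    ]
)
parsed = parse_sse_events(stream)
print([event_type for event_type, _ in parsed])
```

Because `json.dumps` escapes newlines inside strings, each payload stays on one `data:` line, so splitting frames on a blank line is safe for text deltas that themselves contain newlines.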
--- a/studio/backend/core/inference/defaults.py
+++ b/studio/backend/core/inference/defaults.py
@ -6,6 +6,15 @@
 import utils.hardware.hardware as hw

 DEFAULT_MODELS_GGUF = [
+    "unsloth/gemma-4-E2B-it-GGUF",
+    "unsloth/gemma-4-E4B-it-GGUF",
+    "unsloth/gemma-4-31B-it-GGUF",
+    "unsloth/gemma-4-26B-A4B-it-GGUF",
+    "unsloth/Qwen3.6-35B-A3B-GGUF",
+    "unsloth/Qwen3.5-4B-GGUF",
+    "unsloth/Qwen3.5-9B-GGUF",
+    "unsloth/Qwen3.5-35B-A3B-GGUF",
+    "unsloth/Qwen3.5-0.8B-GGUF",
    "unsloth/Llama-3.2-1B-Instruct-GGUF",
    "unsloth/Llama-3.2-3B-Instruct-GGUF",
    "unsloth/Llama-3.1-8B-Instruct-GGUF",
@@ -15,6 +24,19 @@ DEFAULT_MODELS_GGUF = [
 ]

 DEFAULT_MODELS_STANDARD = [
+    "unsloth/gemma-4-E2B-it-GGUF",
+    "unsloth/gemma-4-E4B-it-GGUF",
+    "unsloth/gemma-4-31B-it-GGUF",
+    "unsloth/gemma-4-26B-A4B-it-GGUF",
+    "unsloth/Qwen3.6-35B-A3B-GGUF",
+    "unsloth/Qwen3.5-4B-GGUF",
+    "unsloth/Qwen3.5-9B-GGUF",
+    "unsloth/Qwen3.5-35B-A3B-GGUF",
+    "unsloth/Qwen3.5-0.8B-GGUF",
+    "unsloth/gemma-4-E2B-it",
+    "unsloth/gemma-4-E4B-it",
+    "unsloth/gemma-4-31B-it",
+    "unsloth/gemma-4-26B-A4B-it",
    "unsloth/Qwen3-4B-Instruct-2507",
    "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    "unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit",
--- a/studio/backend/core/inference/inference.py
+++ b/studio/backend/core/inference/inference.py
@@ -18,7 +18,14 @@ from typing import Optional, Union, Generator, Tuple
 from utils.models import ModelConfig, get_base_model_from_lora
 from utils.paths import is_model_cached
 from utils.utils import format_error_message
-from utils.hardware import get_device, clear_gpu_cache, log_gpu_memory
+from utils.hardware import (
+    get_device,
+    clear_gpu_cache,
+    log_gpu_memory,
+    get_device_map,
+    raise_if_offloaded,
+    get_visible_gpu_count,
+)
 from core.inference.audio_codecs import AudioCodecManager
 from io import StringIO
 import structlog
@@ -241,10 +248,15 @@ class InferenceBackend:
        load_in_4bit: bool = True,
        hf_token: Optional[str] = None,
        trust_remote_code: bool = False,
+        gpu_ids: Optional[list[int]] = None,
    ) -> bool:
        """
        Load any model: base, LoRA adapter, text, or vision.
        """
+        # GGUF uses max_seq_length=0 as "model default"; Unsloth crashes on it.
+        if max_seq_length <= 0:
+            max_seq_length = 2048
+
        try:
            model_name = config.identifier

@@ -260,6 +272,10 @@ class InferenceBackend:
                return False

            self.loading_models.add(model_name)
+            device_map = get_device_map(gpu_ids)
+            logger.info(
+                f"Using device_map='{device_map}' ({get_visible_gpu_count()} GPU(s) visible)"
+            )

            self.models[model_name] = {
                "is_vision": config.is_vision,
@@ -290,6 +306,7 @@ class InferenceBackend:
                        config.path,
                        auto_model = CsmForConditionalGeneration,
                        load_in_4bit = False,
+                        device_map = device_map,
                        token = hf_token if hf_token and hf_token.strip() else None,
                        trust_remote_code = trust_remote_code,
                    )
@@ -325,6 +342,7 @@ class InferenceBackend:
                            config.path,
                            dtype = torch.float32,
                            load_in_4bit = False,
+                            device_map = device_map,
                            token = hf_token if hf_token and hf_token.strip() else None,
                            trust_remote_code = trust_remote_code,
                        )
@@ -345,6 +363,7 @@ class InferenceBackend:
                            llm_path,
                            dtype = torch.float32,
                            load_in_4bit = False,
+                            device_map = device_map,
                            token = hf_token if hf_token and hf_token.strip() else None,
                            trust_remote_code = trust_remote_code,
                        )
@@ -361,6 +380,7 @@ class InferenceBackend:
                        config.path,
                        max_seq_length = max_seq_length,
                        load_in_4bit = False,
+                        device_map = device_map,
                        token = hf_token if hf_token and hf_token.strip() else None,
                        trust_remote_code = trust_remote_code,
                    )
@@ -378,6 +398,7 @@ class InferenceBackend:
                        whisper_language = "English",
                        whisper_task = "transcribe",
                        load_in_4bit = False,
+                        device_map = device_map,
                        token = hf_token if hf_token and hf_token.strip() else None,
                        trust_remote_code = trust_remote_code,
                    )
@@ -405,6 +426,7 @@ class InferenceBackend:
                        model_name = config.path,
                        max_seq_length = max_seq_length,
                        load_in_4bit = False,
+                        device_map = device_map,
                        token = hf_token if hf_token and hf_token.strip() else None,
                        trust_remote_code = trust_remote_code,
                    )
@@ -420,6 +442,11 @@ class InferenceBackend:
                        audio_type, self.device, model_repo_path = model_repo_path
                    )

+                # Reject CPU/disk offload for audio models too
+                raise_if_offloaded(
+                    self.models[model_name]["model"], device_map, "Inference"
+                )
+
                self.active_model_name = model_name
                self.loading_models.discard(model_name)
                logger.info(f"Successfully loaded audio model: {model_name}")
@@ -441,6 +468,7 @@ class InferenceBackend:
                    max_seq_length = max_seq_length,
                    dtype = dtype,
                    load_in_4bit = load_in_4bit,
+                    device_map = device_map,
                    token = hf_token if hf_token and hf_token.strip() else None,
                    trust_remote_code = trust_remote_code,
                )
@@ -497,6 +525,7 @@ class InferenceBackend:
                    max_seq_length = max_seq_length,
                    dtype = dtype,
                    load_in_4bit = load_in_4bit,
+                    device_map = device_map,
                    token = hf_token if hf_token and hf_token.strip() else None,
                    trust_remote_code = trust_remote_code,
                )
@@ -507,6 +536,10 @@ class InferenceBackend:
                self.models[model_name]["model"] = model
                self.models[model_name]["tokenizer"] = tokenizer

+            raise_if_offloaded(
+                self.models[model_name]["model"], device_map, "Inference"
+            )
+
            # Load chat template info
            self._load_chat_template_info(model_name)

@@ -615,6 +648,7 @@ class InferenceBackend:
        dtype = None,
        load_in_4bit: bool = True,
        hf_token: Optional[str] = None,
+        gpu_ids: Optional[list[int]] = None,
    ) -> Tuple[bool, Optional[str], Optional[str]]:
        """
        Final Corrected Version:
@@ -639,7 +673,12 @@ class InferenceBackend:
                    base_model_name, None, is_lora = False
                )
                if not self.load_model(
-                    base_config, max_seq_length, dtype, load_in_4bit, hf_token
+                    base_config,
+                    max_seq_length,
+                    dtype,
+                    load_in_4bit,
+                    hf_token,
+                    gpu_ids = gpu_ids,
                ):
                    return False, None, None

@@ -927,6 +966,12 @@ class InferenceBackend:
            logger.warning(f"Could not apply get_chat_template: {e}")

        # Step 2: Format with tokenizer.apply_chat_template()
+        if system_prompt:
+            template_messages = [
+                {"role": "system", "content": system_prompt}
+            ] + messages
+        else:
+            template_messages = messages
        try:
            if not (hasattr(tokenizer, "chat_template") and tokenizer.chat_template):
                raise ValueError(
@@ -937,7 +982,7 @@ class InferenceBackend:
                    f"one via tokenizer.chat_template before inference."
                )
            formatted_prompt = tokenizer.apply_chat_template(
-                messages, tokenize = False, add_generation_prompt = True
+                template_messages, tokenize = False, add_generation_prompt = True
            )
            logger.debug(f"Formatted prompt: {formatted_prompt[:200]}...")
        except Exception as e:
@@ -992,30 +1037,51 @@ class InferenceBackend:

        # Prepare vision messages
        if image:
-            vision_messages = [
-                {
-                    "role": "user",
-                    "content": [
-                        {"type": "image"},
-                        {"type": "text", "text": user_message},
-                    ],
-                }
-            ]
+            user_msg = {
+                "role": "user",
+                "content": [
+                    {"type": "image"},
+                    {"type": "text", "text": user_message},
+                ],
+            }
+            if system_prompt:
+                vision_messages = [
+                    {
+                        "role": "system",
+                        "content": [{"type": "text", "text": system_prompt}],
+                    },
+                    user_msg,
+                ]
+            else:
+                vision_messages = [user_msg]

-            input_text = processor.apply_chat_template(
-                vision_messages, add_generation_prompt = True, tokenize = False
-            )
+            try:
+                input_text = processor.apply_chat_template(
+                    vision_messages, add_generation_prompt = True, tokenize = False
+                )
+            except Exception as e:
+                if system_prompt:
+                    logger.warning(
+                        f"Vision processor for '{self.active_model_name}' may not support "
+                        f"system messages; retrying without. Original error: {e}"
+                    )
+                    vision_messages = [user_msg]
+                    input_text = processor.apply_chat_template(
+                        vision_messages, add_generation_prompt = True, tokenize = False
+                    )
+                else:
+                    raise
            inputs = processor(
                image,
                input_text,
                add_special_tokens = False,
                return_tensors = "pt",
-            ).to(self.device)
+            ).to(model.device)
        else:
            # Text-only for vision model
            formatted_prompt = self.format_chat_prompt(messages, system_prompt)
            inputs = raw_tokenizer(formatted_prompt, return_tensors = "pt").to(
-                self.device
+                model.device
            )

        # Stream with TextIteratorStreamer + background thread
@@ -1155,7 +1221,7 @@ class InferenceBackend:
            return_dict = True,
            return_tensors = "pt",
            truncation = False,
-        ).to(self.device)
+        ).to(model.device)

        try:
            from transformers import TextIteratorStreamer
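
The vision path above prepends a system message and, if the processor's chat template rejects it, retries with the user message alone. A minimal sketch of that fallback follows. `strict_template` and `apply_with_system_fallback` are hypothetical names used only for illustration and are not part of the Studio codebase; the real code calls `processor.apply_chat_template`.

```python
from typing import Callable, Optional


def strict_template(messages: list[dict]) -> str:
    """Hypothetical template that refuses system turns, like some vision processors."""
    if any(m["role"] == "system" for m in messages):
        raise ValueError("system role not supported")
    return "".join(f"<{m['role']}>" for m in messages)


def apply_with_system_fallback(
    apply_template: Callable[[list[dict]], str],
    user_msg: dict,
    system_prompt: Optional[str] = None,
) -> str:
    """Try the template with a system turn first; retry without it on failure."""
    if system_prompt:
        messages = [
            {"role": "system", "content": [{"type": "text", "text": system_prompt}]},
            user_msg,
        ]
    else:
        messages = [user_msg]
    try:
        return apply_template(messages)
    except Exception:
        if not system_prompt:
            # Nothing to strip, so the failure is unrelated to the system turn.
            raise
        return apply_template([user_msg])
```

The trade-off in this pattern is that the system prompt is silently dropped for templates that cannot express it, which is why the diff logs a warning before retrying.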
--- a/studio/backend/core/inference/llama_cpp.py
+++ b/studio/backend/core/inference/llama_cpp.py
--- a/studio/backend/core/inference/orchestrator.py
+++ b/studio/backend/core/inference/orchestrator.py
@@ -17,6 +17,7 @@ Pattern follows core/training/training.py.

 import atexit
 import base64
+import os
 import structlog
 from loggers import get_logger
 import multiprocessing as mp
@@ -27,11 +28,17 @@ import uuid
 from io import BytesIO
 from pathlib import Path
 from typing import Any, Generator, Optional, Tuple, Union
+from utils.hardware import prepare_gpu_selection

 logger = get_logger(__name__)

 _CTX = mp.get_context("spawn")

+
+class DownloadStallError(RuntimeError):
+    """Raised when the worker reports no download progress for too long."""
+
+
 # Dispatcher timeout constants (seconds)
 _DISPATCH_READ_TIMEOUT = 30.0
 _DISPATCH_POLL_INTERVAL = 0.5
@@ -102,12 +109,13 @@ class InferenceOrchestrator:
        self._top_models_ready.wait(timeout = 5)
        top_gguf = self._top_gguf_cache or []
        top_hub = self._top_hub_cache or []
-        # GGUFs first, then hub models, then static fallbacks.
+        # Curated static defaults first (editorial picks like new models),
+        # then HF download-ranked models to backfill.
        # Send extras so the frontend still has 4 per category
        # after removing already-downloaded models.
        result: list[str] = []
        seen: set[str] = set()
-        for m in top_gguf + top_hub + self._static_models:
+        for m in self._static_models + top_gguf + top_hub:
            if m not in seen:
                result.append(m)
                seen.add(m)
@@ -262,12 +270,17 @@ class InferenceOrchestrator:
        except (EOFError, OSError, ValueError):
            return None

-    def _wait_response(self, expected_type: str, timeout: float = 120.0) -> dict:
+    def _wait_response(self, expected_type: str, timeout: float = 300.0) -> dict:
        """Block until a response of the expected type arrives.

        Also handles 'status' and 'error' events during the wait.
        Returns the matching response dict.
        Raises RuntimeError on timeout or subprocess crash.
+
+        The *timeout* is an **inactivity** timeout: it resets whenever the
+        subprocess sends a status message, so long-running operations (large
+        downloads, slow model loads) won't be killed as long as the subprocess
+        keeps reporting progress.
        """
        deadline = time.monotonic() + timeout

@@ -292,8 +305,15 @@ class InferenceOrchestrator:

            if rtype == "status":
                logger.info("Subprocess status: %s", resp.get("message", ""))
+                # Reset deadline — subprocess is still alive and working
+                deadline = time.monotonic() + timeout
                continue

+            if rtype == "stall":
+                msg = resp.get("message", "Download stalled")
+                logger.warning("Subprocess reported stall: %s", msg)
+                raise DownloadStallError(msg)
+
            # Other response types during wait — skip
            logger.debug(
                "Skipping response type '%s' while waiting for '%s'",
@@ -302,7 +322,8 @@ class InferenceOrchestrator:
            )

        raise RuntimeError(
-            f"Timeout waiting for '{expected_type}' response after {timeout}s"
+            f"Timeout waiting for '{expected_type}' response "
+            f"(no activity for {timeout}s)"
        )

    def _drain_queue(self) -> list:
@@ -571,6 +592,7 @@ class InferenceOrchestrator:
        load_in_4bit: bool = True,
        hf_token: Optional[str] = None,
        trust_remote_code: bool = False,
+        gpu_ids: Optional[list[int]] = None,
    ) -> bool:
        """Load a model for inference.

@@ -594,7 +616,16 @@ class InferenceOrchestrator:
                "hf_token": hf_token or "",
                "gguf_variant": getattr(config, "gguf_variant", None),
                "trust_remote_code": trust_remote_code,
+                "gpu_ids": gpu_ids,
            }
+            resolved_gpu_ids, gpu_selection = prepare_gpu_selection(
+                gpu_ids,
+                model_name = model_name,
+                hf_token = hf_token,
+                load_in_4bit = load_in_4bit,
+            )
+            sub_config["resolved_gpu_ids"] = resolved_gpu_ids
+            sub_config["gpu_selection"] = gpu_selection

            # Always kill existing subprocess and spawn fresh.
            # Reusing a subprocess after unsloth patches torch internals
@@ -608,36 +639,66 @@ class InferenceOrchestrator:
                # Dead subprocess — clean up
                self._shutdown_subprocess(timeout = 2)

-            logger.info(
-                "Spawning fresh inference subprocess for '%s' (transformers %s.x)",
-                model_name,
-                needed_major,
+            disable_xet = sub_config.get("disable_xet", False) or (
+                os.environ.get("HF_HUB_DISABLE_XET") == "1"
            )
-            self._spawn_subprocess(sub_config)
-            resp = self._wait_response("loaded", timeout = 180)

-            # Update local state from response
-            if resp.get("success"):
-                self._current_transformers_major = needed_major
-                model_info = resp.get("model_info", {})
-                self.active_model_name = model_info.get("identifier", model_name)
-                self.models[self.active_model_name] = {
-                    "is_vision": model_info.get("is_vision", False),
-                    "is_lora": model_info.get("is_lora", False),
-                    "display_name": model_info.get("display_name", model_name),
-                    "is_audio": model_info.get("is_audio", False),
-                    "audio_type": model_info.get("audio_type"),
-                    "has_audio_input": model_info.get("has_audio_input", False),
-                }
-                self.loading_models.discard(model_name)
-                logger.info("Model '%s' loaded successfully in subprocess", model_name)
-                return True
-            else:
-                error = resp.get("error", "Failed to load model")
-                self.loading_models.discard(model_name)
-                self.active_model_name = None
-                self.models.clear()
-                raise Exception(error)
+            for attempt in range(2):
+                logger.info(
+                    "Spawning fresh inference subprocess for '%s' "
+                    "(transformers %s.x, attempt %d/2%s)",
+                    model_name,
+                    needed_major,
+                    attempt + 1,
+                    ", xet disabled" if disable_xet else "",
+                )
+                sub_config["disable_xet"] = disable_xet
+                self._spawn_subprocess(sub_config)
+
+                try:
+                    resp = self._wait_response("loaded")
+                except DownloadStallError:
+                    # First stall and Xet was enabled -> retry with Xet disabled
+                    if attempt == 0 and not disable_xet:
+                        logger.warning(
+                            "Download stalled for '%s' -- retrying with "
+                            "HF_HUB_DISABLE_XET=1",
+                            model_name,
+                        )
+                        self._shutdown_subprocess(timeout = 5)
+                        disable_xet = True
+                        continue
+                    # Second stall (or already had xet disabled) -> give up
+                    self._shutdown_subprocess(timeout = 5)
+                    raise RuntimeError(
+                        f"Download stalled for '{model_name}' even with "
+                        f"HF_HUB_DISABLE_XET=1 -- check your network connection"
+                    )
+
+                # Got a response — check success
+                if resp.get("success"):
+                    self._current_transformers_major = needed_major
+                    model_info = resp.get("model_info", {})
+                    self.active_model_name = model_info.get("identifier", model_name)
+                    self.models[self.active_model_name] = {
+                        "is_vision": model_info.get("is_vision", False),
+                        "is_lora": model_info.get("is_lora", False),
+                        "display_name": model_info.get("display_name", model_name),
+                        "is_audio": model_info.get("is_audio", False),
+                        "audio_type": model_info.get("audio_type"),
+                        "has_audio_input": model_info.get("has_audio_input", False),
+                    }
+                    self.loading_models.discard(model_name)
+                    logger.info(
+                        "Model '%s' loaded successfully in subprocess", model_name
+                    )
+                    return True
+                else:
+                    error = resp.get("error", "Failed to load model")
+                    self.loading_models.discard(model_name)
+                    self.active_model_name = None
+                    self.models.clear()
+                    raise Exception(error)

        except Exception:
            self.loading_models.discard(model_name)
@@ -661,7 +722,7 @@ class InferenceOrchestrator:
                    "model_name": model_name,
                }
            )
-            resp = self._wait_response("unloaded", timeout = 30)
+            resp = self._wait_response("unloaded")

            # Update local state
            self.models.pop(model_name, None)
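
The `_wait_response` change above turns a fixed deadline into an inactivity timeout: each `status` message pushes the deadline forward, and a `stall` message raises `DownloadStallError` so the caller can retry with Xet disabled. The sketch below shows that loop in isolation. It assumes a plain `queue.Queue` in place of the orchestrator's subprocess pipe, and `wait_response` with its `poll` parameter is an illustrative name, not the method's real signature.

```python
import queue
import time


class DownloadStallError(RuntimeError):
    """Raised when the worker reports no download progress for too long."""


def wait_response(
    q: "queue.Queue[dict]",
    expected_type: str,
    timeout: float = 300.0,
    poll: float = 0.05,
) -> dict:
    """Block until a message of expected_type arrives or activity stops."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            resp = q.get(timeout=poll)
        except queue.Empty:
            continue
        rtype = resp.get("type")
        if rtype == expected_type:
            return resp
        if rtype == "status":
            # The worker is alive and reporting progress, so restart the clock.
            deadline = time.monotonic() + timeout
            continue
        if rtype == "stall":
            raise DownloadStallError(resp.get("message", "Download stalled"))
        # Any other message type is ignored while waiting.
    raise RuntimeError(
        f"Timeout waiting for '{expected_type}' response (no activity for {timeout}s)"
    )
```

Because `DownloadStallError` subclasses `RuntimeError`, a caller that wants the retry behaviour has to catch it before any broader `RuntimeError` handler, which matches the `try` / `except DownloadStallError` placement in `load_model`.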
--- a/studio/backend/core/inference/tools.py
+++ b/studio/backend/core/inference/tools.py
@@ -8,25 +8,260 @@ Supports web search (DuckDuckGo), Python code execution, and terminal commands.
 """

 import ast
+import http.client
 import os

 os.environ["UNSLOTH_IS_PRESENT"] = "1"

+import random
+import re
+import shlex
+import ssl
 import subprocess
 import sys
 import tempfile
 import threading
+import urllib.request

 from loggers import get_logger

 logger = get_logger(__name__)

 _EXEC_TIMEOUT = 300  # 5 minutes
+
+# Pre-import modules used in _sandbox_preexec at module level so that
+# the preexec_fn closure does not trigger the import machinery in the
+# forked child (which can deadlock in multi-threaded servers).
+_libc = None
+if sys.platform == "linux":
+    try:
+        import ctypes
+        import ctypes.util
+
+        _libc_name = ctypes.util.find_library("c")
+        if _libc_name:
+            _libc = ctypes.CDLL(_libc_name, use_errno = True)
+    except (OSError, AttributeError):
+        pass
+
+_resource = None
+if sys.platform != "win32":
+    try:
+        import resource as _resource
+    except ImportError:
+        pass
+
+# Strict raster-image allowlist for sandbox file serving.
+# No .svg (XSS risk via embedded scripts), no .html, no .pdf.
+_IMAGE_EXTS = frozenset({".png", ".jpg", ".jpeg", ".gif", ".webp", ".bmp"})
 _MAX_OUTPUT_CHARS = 8000  # truncate long output
-_BASH_BLOCKED_WORDS = {"rm", "sudo", "dd", "chmod", "mkfs", "shutdown", "reboot"}
+_BLOCKED_COMMANDS_COMMON = frozenset(
+    {
+        "rm",
+        "sudo",
+        "su",
+        "dd",
+        "chmod",
+        "chown",
+        "mkfs",
+        "shutdown",
+        "reboot",
+        "passwd",
+        "mount",
+        "umount",
+        "fdisk",
+        "kill",
+        "killall",
+        "pkill",
+    }
+)
+_BLOCKED_COMMANDS_WIN = frozenset(
+    {
+        "rmdir",
+        "takeown",
+        "icacls",
+        "runas",
+        "powershell",
+        "pwsh",
+    }
+)
+_BLOCKED_COMMANDS = (
+    _BLOCKED_COMMANDS_COMMON | _BLOCKED_COMMANDS_WIN
+    if sys.platform == "win32"
+    else _BLOCKED_COMMANDS_COMMON
+)
+
+
+def _find_blocked_commands(command: str) -> set[str]:
+    """Detect blocked commands using shlex tokenization and regex scanning.
+
+    Catches: full paths (/usr/bin/sudo), quoted strings ("sudo"),
+    split-quotes (su""do), backslash escapes (\\rm), and command-position
+    words after ;, |, &&, $().
+    """
+    blocked = set()
+
+    # 1. shlex tokenization (handles quotes, escapes, concatenation)
+    try:
+        tokens = (
+            shlex.split(command)
+            if sys.platform != "win32"
+            else shlex.split(command, posix = False)
+        )
+    except ValueError:
+        tokens = command.split()
+
+    for token in tokens:
+        base = os.path.basename(token).lower()
+        # Strip common Windows executable extensions so that
+        # runas.exe, shutdown.bat, etc. match the blocklist.
+        stem, ext = os.path.splitext(base)
+        if ext in {".exe", ".com", ".bat", ".cmd"}:
+            base = stem
+        if base in _BLOCKED_COMMANDS:
+            blocked.add(base)
+
+    # 2. Regex: catch blocked words at shell command boundaries
+    #    (semicolons, pipes, &&, ||, backticks, $(), <(), subshells, newlines)
+    #    Uses a single combined pattern for all blocked words.
+    #    Handles optional Unix path prefix (/usr/bin/) and Windows drive
+    #    letter prefix (C:\Windows\...\).
+    lowered = command.lower()
+    if _BLOCKED_COMMANDS:
+        words_alt = "|".join(re.escape(w) for w in sorted(_BLOCKED_COMMANDS))
+        pattern = (
+            rf"(?:^|[;&|`\n(]\s*|[$]\(\s*|<\(\s*)"
+            rf"(?:[\w./\\-]*/|[a-zA-Z]:[/\\][\w./\\-]*)?"
+            rf"({words_alt})(?:\.(?:exe|com|bat|cmd))?\b"
+        )
+        blocked.update(re.findall(pattern, lowered))
+
+    # 3. Check for nested shell invocations (bash -c 'sudo whoami',
+    #    bash -lc '...', bash --login -c '...', cmd /c '...').
+    #    When a -c or /c flag is found, look backwards for a shell name
+    #    (skipping intermediate flags like --login, -l, -x) and recursively
+    #    scan the nested command string.
+    _SHELLS = {"bash", "sh", "zsh", "dash", "ksh", "csh", "tcsh", "fish"}
+    _SHELLS_WIN = {"cmd", "cmd.exe"}
+    for i, token in enumerate(tokens):
+        tok_lower = token.lower()
+        # Match -c exactly, or combined flags ending in c (e.g. -lc, -xc)
+        is_unix_c = tok_lower == "-c" or (
+            tok_lower.startswith("-")
+            and tok_lower.endswith("c")
+            and not tok_lower.startswith("--")
+        )
+        is_win_c = tok_lower == "/c"
+        if not (is_unix_c or is_win_c) or i < 1 or i + 1 >= len(tokens):
+            continue
+        # Look backwards past any flags to find the shell binary.
+        # On Unix, flags start with - (skip those). On Windows, flags
+        # start with / but so do absolute paths, so only skip short
+        # single-char /X flags (not /bin/bash style paths).
+        for j in range(i - 1, -1, -1):
+            prev = tokens[j]
+            if prev.startswith("-"):
+                continue  # skip Unix flags like --login, -l
+            if is_win_c and prev.startswith("/") and len(prev) <= 3:
+                continue  # skip Windows flags like /s, /q (not /bin/bash)
+            prev_base = os.path.basename(prev).lower()
+            if is_unix_c and prev_base in _SHELLS:
+                blocked |= _find_blocked_commands(tokens[i + 1])
+            elif is_win_c and prev_base in _SHELLS_WIN:
+                blocked |= _find_blocked_commands(tokens[i + 1])
+            break  # stop at first non-flag token
+
+    return blocked
+
+
+def _build_safe_env(workdir: str) -> dict[str, str]:
+    """Build a minimal, credential-free environment for sandboxed subprocesses.
+
+    Strips HF_TOKEN, WANDB_API_KEY, AWS_*, GH_TOKEN, LD_PRELOAD, DYLD_*, etc.
+    Preserves the active Python interpreter and virtualenv directories in PATH
+    so that pip, uv, and packages installed in the Studio runtime remain
+    accessible.
+    """
+    # Start with the directory containing the running Python interpreter
+    # so that subprocess calls to 'python', 'pip', etc. resolve to the
+    # same environment the Studio server is running in.
+    exe_dir = os.path.dirname(sys.executable)
+    path_entries = [exe_dir] if exe_dir else []
+
+    # If a virtualenv is active, include its bin/Scripts directory.
+    venv = os.environ.get("VIRTUAL_ENV")
+    if venv:
+        venv_bin = os.path.join(venv, "Scripts" if sys.platform == "win32" else "bin")
+        if venv_bin not in path_entries:
+            path_entries.append(venv_bin)
+
+    if sys.platform == "win32":
+        sysroot = os.environ.get("SystemRoot", r"C:\Windows")
+        path_entries.extend([os.path.join(sysroot, "System32"), sysroot])
+    else:
+        path_entries.extend(["/usr/local/bin", "/usr/bin", "/bin"])
+
+    # Deduplicate while preserving order
+    deduped = list(dict.fromkeys(p for p in path_entries if p))
+
+    env = {
+        "PATH": os.pathsep.join(deduped),
+        "HOME": workdir,
+        "TMPDIR": workdir,
+        "LANG": os.environ.get("LANG", "C.UTF-8"),
+        "TERM": "dumb",
+        "PYTHONIOENCODING": "utf-8",
+    }
+    if venv:
+        env["VIRTUAL_ENV"] = venv
+    # Windows needs SystemRoot for Python/subprocess to work
+    if sys.platform == "win32":
+        env["SystemRoot"] = os.environ.get("SystemRoot", r"C:\Windows")
+    return env
+
+
+def _sandbox_preexec():
+    """Pre-exec hook: drop privilege escalation ability and set resource limits.
+
+    On Linux, applies PR_SET_NO_NEW_PRIVS so sudo/su/pkexec fail at the
+    kernel level. On Linux and macOS, sets RLIMIT_FSIZE.
+    No-op on Windows (use creationflags instead).
+
+    Note: RLIMIT_NPROC is intentionally NOT set because Linux enforces it
+    per real UID, not per process tree, so it would starve the Studio
+    server and other sessions sharing the same user account.
+
+    All modules and handles are resolved at import time (module level) so
+    this function does not trigger Python imports in the forked child,
+    avoiding potential deadlocks in multi-threaded servers.
+    """
+    if _libc is not None:
+        try:
+            # PR_SET_NO_NEW_PRIVS = 38, arg2 = 1 (enable)
+            _libc.prctl(38, 1, 0, 0, 0)
+        except (OSError, AttributeError):
+            pass  # Not available (container, old kernel, etc.)
+
+    if _resource is not None:
+        try:
+            # Limit file size to 100MB (prevents disk filling)
+            _resource.setrlimit(
+                _resource.RLIMIT_FSIZE, (100 * 1024 * 1024, 100 * 1024 * 1024)
+            )
+        except (ValueError, OSError):
+            pass
+
+
+def _get_shell_cmd(command: str) -> list[str]:
+    """Return the platform-appropriate shell invocation for a command string."""
+    if sys.platform == "win32":
+        return ["cmd", "/c", command]
+    return ["bash", "-c", command]
+

 # Per-session working directories so each chat thread gets its own sandbox.
-# Falls back to a shared ~/studio_sandbox/ for API callers without a session_id.
+# Falls back to a shared ~/studio_sandbox/_default for API callers without a
+# session_id.
 _workdirs: dict[str, str] = {}


@@ -47,7 +282,7 @@ def _get_workdir(session_id: str | None = None) -> str:
            if not os.path.realpath(workdir).startswith(os.path.realpath(sandbox_root)):
                workdir = os.path.join(sandbox_root, "_invalid")
        else:
-            workdir = sandbox_root
+            workdir = os.path.join(sandbox_root, "_default")
        os.makedirs(workdir, exist_ok = True)
        _workdirs[key] = workdir
    return _workdirs[key]
@@ -57,16 +292,23 @@ WEB_SEARCH_TOOL = {
    "type": "function",
    "function": {
        "name": "web_search",
-        "description": "Search the web for current information, recent events, or facts you are uncertain about.",
+        "description": (
+            "Search the web and fetch page content. Returns snippets for all results. "
+            "Use the url parameter to fetch full page text from a specific URL."
+        ),
        "parameters": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "The search query",
-                }
+                },
+                "url": {
+                    "type": "string",
+                    "description": "A URL to fetch full page content from (instead of searching). Use this to read a page found in search results.",
+                },
            },
-            "required": ["query"],
+            "required": [],
        },
    },
 }
@@ -131,7 +373,11 @@ def execute_tool(
    )
    effective_timeout = _EXEC_TIMEOUT if timeout is _TIMEOUT_UNSET else timeout
    if name == "web_search":
-        return _web_search(arguments.get("query", ""), timeout = effective_timeout)
+        return _web_search(
+            arguments.get("query", ""),
+            url = arguments.get("url"),
+            timeout = effective_timeout,
+        )
    if name == "python":
        return _python_exec(
            arguments.get("code", ""), cancel_event, effective_timeout, session_id
@@ -143,9 +389,226 @@ def execute_tool(
    return f"Unknown tool: {name}"


-def _web_search(query: str, max_results: int = 5, timeout: int = _EXEC_TIMEOUT) -> str:
-    """Search the web using DuckDuckGo and return formatted results."""
-    if not query.strip():
+_MAX_PAGE_CHARS = 16000  # limit fetched page text (after HTML-to-MD conversion)
+# Raw download cap.  Must be larger than _MAX_PAGE_CHARS because SSR pages
+# embed large <head> sections (CSS, JS, SVGs) that are stripped during
+# HTML-to-Markdown conversion.  512 KB is enough to reach article content
+# on GitBook / Next.js / Docusaurus pages whose <head> alone can be 200 KB.
+_MAX_FETCH_BYTES = 512 * 1024
+
+_USER_AGENTS = (
+    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
+    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
+    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
+    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:133.0) Gecko/20100101 Firefox/133.0",
+    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:133.0) Gecko/20100101 Firefox/133.0",
+    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/18.2 Safari/605.1.15",
+)
+
+_tls_ctx = ssl.create_default_context()
+
+
+class _NoRedirect(urllib.request.HTTPRedirectHandler):
+    def redirect_request(self, req, fp, code, msg, headers, newurl):
+        return None
+
+
+class _PinnedHTTPSConnection(http.client.HTTPSConnection):
+    """HTTPS connection that connects to a pinned IP but uses a different
+    hostname for SNI and certificate verification.
+
+    The SSRF IP-pinning rewrites URLs to raw IPs.  A normal HTTPSConnection
+    would then send no SNI and verify the cert against the IP, both of which
+    fail.  This subclass splits the two concerns: TCP connects to the pinned
+    IP (``host`` parameter) while TLS uses ``sni_hostname`` for the
+    ClientHello and cert check.
+    """
+
+    def __init__(self, host: str, *, sni_hostname: str, **kwargs):
+        super().__init__(host, **kwargs)
+        self._sni_hostname = sni_hostname
+
+    def connect(self):
+        # TCP connect to the pinned IP stored in self.host (+ tunnel if
+        # a proxy is configured via set_tunnel, though we do not use one).
+        http.client.HTTPConnection.connect(self)
+        # TLS handshake with the real hostname for SNI + cert verification.
+        self.sock = self._context.wrap_socket(
+            self.sock,
+            server_hostname = self._sni_hostname,
+        )
+
+
+class _SNIHTTPSHandler(urllib.request.HTTPSHandler):
+    """HTTPS handler that sends the correct SNI hostname during TLS handshake.
+
+    The SSRF IP-pinning rewrites URLs to raw IPs, which breaks SNI and cert
+    verification.  This handler returns a ``_PinnedHTTPSConnection`` that
+    connects to the pinned IP but verifies TLS against the original hostname.
+    """
+
+    def __init__(self, hostname: str):
+        super().__init__(context = _tls_ctx)
+        self._sni_hostname = hostname
+
+    def https_open(self, req):
+        return self.do_open(self._sni_connection, req)
+
+    def _sni_connection(self, host, **kwargs):
+        kwargs["context"] = _tls_ctx
+        return _PinnedHTTPSConnection(host, sni_hostname = self._sni_hostname, **kwargs)
+
+
+def _validate_and_resolve_host(hostname: str, port: int) -> tuple[bool, str, str]:
+    """Resolve *hostname*, reject non-public IPs, return a pinned IP string.
+
+    Returns ``(ok, reason_or_empty, resolved_ip)``.  The caller should
+    connect to *resolved_ip* (with a ``Host`` header) to prevent DNS
+    rebinding between validation and the actual fetch.
+    """
+    import ipaddress
+    import socket
+
+    try:
+        infos = socket.getaddrinfo(hostname, port, type = socket.SOCK_STREAM)
+    except OSError as e:
+        return False, f"Failed to resolve host: {e}", ""
+
+    if not infos:
+        return False, f"Failed to resolve host: no addresses for {hostname!r}", ""
+
+    for *_, sockaddr in infos:
+        ip = ipaddress.ip_address(sockaddr[0])
+        if (
+            ip.is_private
+            or ip.is_loopback
+            or ip.is_link_local
+            or ip.is_multicast
+            or ip.is_reserved
+            or ip.is_unspecified
+        ):
+            return False, f"Blocked: refusing to fetch non-public address {ip}.", ""
+
+    # Return the first resolved address for pinning
+    first_ip = infos[0][4][0]
+    return True, "", first_ip
+
+
+def _fetch_page_text(
+    url: str, max_chars: int = _MAX_PAGE_CHARS, timeout: int = 30
+) -> str:
+    """Fetch a URL and return its content converted from HTML to Markdown.
+
+    Blocks private/loopback/link-local targets (SSRF protection) and caps
+    the download size to avoid unbounded memory usage.
+    """
+    from urllib.parse import urlparse
+
+    parsed = urlparse(url)
+    if parsed.scheme not in ("http", "https"):
+        return f"Blocked: only http/https URLs are allowed (got {parsed.scheme!r})."
+    if not parsed.hostname:
+        return "Blocked: URL is missing a hostname."
+
+    port = parsed.port or (443 if parsed.scheme == "https" else 80)
+    ok, reason, pinned_ip = _validate_and_resolve_host(parsed.hostname, port)
+    if not ok:
+        return reason
+
+    try:
+        from urllib.error import HTTPError as _HTTPError
+        from urllib.parse import urljoin, urlunparse
+
+        max_bytes = _MAX_FETCH_BYTES
+        current_url = url
+        current_host = parsed.hostname
+        ua = random.choice(_USER_AGENTS)
+
+        for _hop in range(5):
+            # Pin to the validated IP to prevent DNS rebinding.
+            # Rewrite the URL to use the IP and set the Host header.
+            cp = urlparse(current_url)
+            # Bracket IPv6 addresses so the netloc is valid in a URL.
+            ip_str = f"[{pinned_ip}]" if ":" in pinned_ip else pinned_ip
+            ip_netloc = f"{ip_str}:{cp.port}" if cp.port else ip_str
+            pinned_url = urlunparse(cp._replace(netloc = ip_netloc))
+
+            opener = urllib.request.build_opener(
+                _NoRedirect,
+                _SNIHTTPSHandler(current_host),
+            )
+
+            req = urllib.request.Request(
+                pinned_url,
+                headers = {
+                    "User-Agent": ua,
+                    "Host": current_host,
+                },
+            )
+            try:
+                resp = opener.open(req, timeout = timeout)
+            except _HTTPError as e:
+                if e.code not in (301, 302, 303, 307, 308):
+                    return (
+                        f"Failed to fetch URL: HTTP {e.code} {getattr(e, 'reason', '')}"
+                    )
+                location = e.headers.get("Location")
+                if not location:
+                    return "Failed to fetch URL: redirect missing Location header."
+                current_url = urljoin(current_url, location)
+                rp = urlparse(current_url)
+                if rp.scheme not in ("http", "https") or not rp.hostname:
+                    return "Blocked: redirect target is not a valid http/https URL."
+                rp_port = rp.port or (443 if rp.scheme == "https" else 80)
+                ok2, reason2, pinned_ip = _validate_and_resolve_host(
+                    rp.hostname,
+                    rp_port,
+                )
+                if not ok2:
+                    return reason2
+                current_host = rp.hostname
+                continue
+            # Success -- read capped body
+            raw_bytes = resp.read(max_bytes)
+            break
+        else:
+            return "Failed to fetch URL: too many redirects."
+
+        charset = resp.headers.get_content_charset() or "utf-8"
+        raw_html = raw_bytes.decode(charset, errors = "replace")
+    except _HTTPError as e:
+        return f"Failed to fetch URL: HTTP {e.code} {getattr(e, 'reason', '')}"
+    except Exception as e:
+        return f"Failed to fetch URL: {e}"
+
+    # Convert HTML to Markdown using the built-in converter (no external deps)
+    from ._html_to_md import html_to_markdown
+
+    text = html_to_markdown(raw_html)
+
+    if not text:
+        return "(page returned no readable text)"
+    if len(text) > max_chars:
+        text = text[:max_chars] + f"\n\n... (truncated, {len(text)} chars total)"
+    return text
+
+
+def _web_search(
+    query: str,
+    max_results: int = 5,
+    timeout: int = _EXEC_TIMEOUT,
+    url: str | None = None,
+) -> str:
+    """Search the web using DuckDuckGo and return formatted results.
+
+    If ``url`` is provided, fetches that page directly instead of searching.
+    """
+    # Direct URL fetch mode
+    if url and url.strip():
+        fetch_timeout = 60 if timeout is None else min(timeout, 60)
+        return _fetch_page_text(url.strip(), timeout = fetch_timeout)
+
+    if not query or not query.strip():
        return "No query provided."
    try:
        from ddgs import DDGS
@@ -160,7 +623,13 @@ def _web_search(query: str, max_results: int = 5, timeout: int = _EXEC_TIMEOUT)
                f"URL: {r.get('href', '')}\n"
                f"Snippet: {r.get('body', '')}"
            )
-        return "\n\n---\n\n".join(parts)
+        text = "\n\n---\n\n".join(parts)
+        text += (
+            "\n\n---\n\nIMPORTANT: These are only short snippets. "
+            "To get the full page content, call web_search with "
+            'the url parameter (e.g. {"url": "<URL>"}).'
+        )
+        return text
    except Exception as e:
        return f"Search failed: {e}"
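Reviewer note: the address filter in `_validate_and_resolve_host` can be tried in isolation. The sketch below assumes the same six `ipaddress` properties as the patch and skips DNS resolution entirely. `is_public_ip` and `all_public` are illustrative names, not functions from this change.

```python
import ipaddress


def is_public_ip(ip_text: str) -> bool:
    """Return True only for addresses outside the blocked categories."""
    ip = ipaddress.ip_address(ip_text)
    return not (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_multicast
        or ip.is_reserved
        or ip.is_unspecified
    )


def all_public(resolved: list[str]) -> bool:
    # Every resolved address must pass; a single non-public record
    # is enough to reject the host, mirroring the loop in the patch.
    return bool(resolved) and all(is_public_ip(addr) for addr in resolved)


if __name__ == "__main__":
    for sample in ("8.8.8.8", "127.0.0.1", "10.0.0.5", "169.254.169.254", "::1"):
        print(sample, is_public_ip(sample))
```

Rejecting the host when any record is non-public, rather than only the first, matters because the patch pins the first address but a resolver may return records in a different order on a later lookup.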

@@ -186,6 +655,7 @@ def _check_signal_escape_patterns(code: str):

    signal_tampering = []
    exception_catching = []
+    shell_escapes = []
    warnings = []

    def _ast_name_matches(node, names):
@@ -203,10 +673,84 @@ def _check_signal_escape_patterns(code: str):
            return full_name in names
        return False

+    # Dangerous os/subprocess functions that can execute shell commands
+    _SHELL_EXEC_FUNCS = frozenset(
+        {
+            "os.system",
+            "os.popen",
+            "os.popen2",
+            "os.popen3",
+            "os.popen4",
+            "os.execl",
+            "os.execle",
+            "os.execlp",
+            "os.execlpe",
+            "os.execv",
+            "os.execve",
+            "os.execvp",
+            "os.execvpe",
+            "os.spawnl",
+            "os.spawnle",
+            "os.spawnlp",
+            "os.spawnlpe",
+            "os.spawnv",
+            "os.spawnve",
+            "os.spawnvp",
+            "os.spawnvpe",
+            "os.posix_spawn",
+            "os.posix_spawnp",
+            "subprocess.run",
+            "subprocess.call",
+            "subprocess.check_call",
+            "subprocess.check_output",
+            "subprocess.Popen",
+            "subprocess.getoutput",
+            "subprocess.getstatusoutput",
+        }
+    )
+
+    def _extract_string_from_node(node):
+        """Extract a plain string value from an AST node, if it is a constant."""
+        if isinstance(node, ast.Constant) and isinstance(node.value, str):
+            return node.value
+        return None
+
+    def _extract_strings_from_list(node):
+        """Extract string elements from an AST List or Tuple node."""
+        if isinstance(node, (ast.List, ast.Tuple)):
+            parts = []
+            for elt in node.elts:
+                s = _extract_string_from_node(elt)
+                if s is not None:
+                    parts.append(s)
+            return parts
+        return []
+
+    # Keyword argument names that carry command content (as opposed to
+    # control flags like check=True, text=True, capture_output=True).
+    _CMD_KWARGS = frozenset({"args", "command", "executable", "path", "file"})
+
+    def _check_args_for_blocked(args_nodes):
+        """Check if any call arguments contain blocked commands."""
+        found = set()
+        for arg in args_nodes:
+            s = _extract_string_from_node(arg)
+            if s is not None:
+                found |= _find_blocked_commands(s)
+            strs = _extract_strings_from_list(arg)
+            for s in strs:
+                found |= _find_blocked_commands(s)
+        return found
+
    class SignalEscapeVisitor(ast.NodeVisitor):
        def __init__(self):
            self.imports_signal = False
            self.signal_aliases = {"signal"}
+            self.os_aliases = {"os"}
+            self.subprocess_aliases = {"subprocess"}
+            # Maps bare function names to their fully-qualified form
+            # for from-import tracking (e.g. "system" -> "os.system")
+            self.shell_exec_aliases: dict[str, str] = {}
            self.loop_depth = 0

        def visit_Import(self, node):
@@ -215,6 +759,10 @@ def _check_signal_escape_patterns(code: str):
                    self.imports_signal = True
                    if alias.asname:
                        self.signal_aliases.add(alias.asname)
+                elif alias.name == "os":
+                    self.os_aliases.add(alias.asname or "os")
+                elif alias.name == "subprocess":
+                    self.subprocess_aliases.add(alias.asname or "subprocess")
            self.generic_visit(node)

        def visit_ImportFrom(self, node):
@@ -232,6 +780,16 @@ def _check_signal_escape_patterns(code: str):
                        "alarm",
                    ):
                        self.signal_aliases.add(alias.asname or alias.name)
+            elif node.module in ("os", "subprocess"):
+                if node.module == "os":
+                    self.os_aliases.add("os")
+                else:
+                    self.subprocess_aliases.add("subprocess")
+                # Track from-imports of dangerous functions
+                for alias in node.names:
+                    fq = f"{node.module}.{alias.name}"
+                    if fq in _SHELL_EXEC_FUNCS:
+                        self.shell_exec_aliases[alias.asname or alias.name] = fq
            self.generic_visit(node)

        def visit_While(self, node):
@@ -296,6 +854,111 @@ def _check_signal_escape_patterns(code: str):
                            "description": "Modifies signal mask (may block SIGALRM)",
                        }
                    )
+
+            # --- Shell escape detection ---
+            # Resolve the fully qualified function name for os.*/subprocess.*
+            shell_func = None
+            if isinstance(func, ast.Attribute):
+                if isinstance(func.value, ast.Name):
+                    if func.value.id in self.os_aliases:
+                        shell_func = f"os.{func.attr}"
+                    elif func.value.id in self.subprocess_aliases:
+                        shell_func = f"subprocess.{func.attr}"
+            elif isinstance(func, ast.Name):
+                # Check from-import aliases: from os import system; system(...)
+                shell_func = self.shell_exec_aliases.get(func.id)
+
+            if shell_func and shell_func in _SHELL_EXEC_FUNCS:
+                # Expand **kwargs dicts to inspect their keys
+                expanded_kwargs: dict[str, ast.AST] = {}
+                has_opaque_kwargs = False
+                for kw in node.keywords:
+                    if kw.arg is not None:
+                        expanded_kwargs[kw.arg] = kw.value
+                    elif isinstance(kw.value, ast.Dict):
+                        for k, v in zip(kw.value.keys, kw.value.values):
+                            key = _extract_string_from_node(k) if k else None
+                            if key is not None:
+                                expanded_kwargs[key] = v
+                    else:
+                        has_opaque_kwargs = True
+
+                cmd_kw_values = [
+                    v for k, v in expanded_kwargs.items() if k in _CMD_KWARGS
+                ]
+                all_call_args = list(node.args) + cmd_kw_values
+                blocked_in_args = _check_args_for_blocked(all_call_args)
+
+                if has_opaque_kwargs:
+                    # Can't inspect dynamic **kwargs -- flag as unsafe
+                    shell_escapes.append(
+                        {
+                            "type": "shell_escape_dynamic",
+                            "line": node.lineno,
+                            "description": (
+                                f"{shell_func}() called with dynamic **kwargs"
+                            ),
+                        }
+                    )
+                elif blocked_in_args:
+                    shell_escapes.append(
+                        {
+                            "type": "shell_escape",
+                            "line": node.lineno,
+                            "description": (
+                                f"{shell_func}() invokes blocked command(s): "
+                                f"{', '.join(sorted(blocked_in_args))}"
+                            ),
+                        }
+                    )
+                else:
+                    # Only flag dynamic args for functions that interpret
+                    # strings as shell commands, or when shell= might be
+                    # enabled.  Treat any non-literal-False shell= value
+                    # as potentially True (conservative).
+                    _STRING_SHELL_FUNCS = frozenset(
+                        {
+                            "os.system",
+                            "os.popen",
+                            "os.popen2",
+                            "os.popen3",
+                            "os.popen4",
+                            "subprocess.getoutput",
+                            "subprocess.getstatusoutput",
+                        }
+                    )
+                    shell_node = expanded_kwargs.get("shell")
+                    shell_safe = shell_node is None or (
+                        isinstance(shell_node, ast.Constant)
+                        and shell_node.value is False
+                    )
+                    if shell_func in _STRING_SHELL_FUNCS or not shell_safe:
+
+                        def _is_safe_literal(n):
+                            if _extract_string_from_node(n) is not None:
+                                return True
+                            if isinstance(n, (ast.List, ast.Tuple)):
+                                return all(
+                                    _extract_string_from_node(e) is not None
+                                    for e in n.elts
+                                )
+                            return False
+
+                        has_non_literal = any(
+                            not _is_safe_literal(a) for a in all_call_args
+                        )
+                        if has_non_literal:
+                            shell_escapes.append(
+                                {
+                                    "type": "shell_escape_dynamic",
+                                    "line": node.lineno,
+                                    "description": (
+                                        f"{shell_func}() called with non-literal "
+                                        f"shell command (potential shell escape)"
+                                    ),
+                                }
+                            )
+
            self.generic_visit(node)

        def visit_ExceptHandler(self, node):
@@ -311,7 +974,12 @@ def _check_signal_escape_patterns(code: str):
                    }
                )
            elif isinstance(node.type, ast.Name):
-                if node.type.id in ("TimeoutError", "BaseException", "Exception"):
+                # Only flag BaseException and TimeoutError, NOT Exception.
+                # except Exception does not catch SystemExit or
+                # KeyboardInterrupt, so it cannot suppress timeout
+                # enforcement.  Flagging Exception causes false positives
+                # on normal error-handling patterns.
+                if node.type.id in ("TimeoutError", "BaseException"):
                    exception_catching.append(
                        {
                            "type": f"catches_{node.type.id}_in_loop",
@@ -322,7 +990,7 @@ def _check_signal_escape_patterns(code: str):
            elif isinstance(node.type, ast.Tuple):
                for elt in node.type.elts:
                    if isinstance(elt, ast.Name):
-                        if elt.id in ("TimeoutError", "BaseException", "Exception"):
+                        if elt.id in ("TimeoutError", "BaseException"):
                            exception_catching.append(
                                {
                                    "type": f"catches_{elt.id}_in_loop",
@@ -338,10 +1006,15 @@ def _check_signal_escape_patterns(code: str):
    if visitor.imports_signal and not signal_tampering:
        warnings.append("Code imports 'signal' module - review manually for safety")

-    is_safe = len(signal_tampering) == 0 and len(exception_catching) == 0
+    is_safe = (
+        len(signal_tampering) == 0
+        and len(exception_catching) == 0
+        and len(shell_escapes) == 0
+    )
    return is_safe, {
        "signal_tampering": signal_tampering,
        "exception_catching": exception_catching,
+        "shell_escapes": shell_escapes,
        "warnings": warnings,
    }
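Reviewer note: the alias tracking that feeds `shell_escapes` can be reduced to a small standalone visitor. This is a simplified sketch, not the patch's logic: it uses a shortened function set, only recognises modules that were explicitly imported, and does not inspect arguments. `ShellCallFinder` and `find_shell_calls` are hypothetical names.

```python
import ast

# Shortened, illustrative subset of the functions flagged in the patch.
SHELL_FUNCS = {"os.system", "os.popen", "subprocess.run", "subprocess.Popen"}


class ShellCallFinder(ast.NodeVisitor):
    def __init__(self):
        self.module_aliases = {}  # local name -> module, e.g. "sp" -> "subprocess"
        self.func_aliases = {}    # local name -> qualified name, e.g. "run_it" -> "os.system"
        self.hits = []            # (line number, qualified name)

    def visit_Import(self, node):
        for alias in node.names:
            if alias.name in ("os", "subprocess"):
                self.module_aliases[alias.asname or alias.name] = alias.name
        self.generic_visit(node)

    def visit_ImportFrom(self, node):
        if node.module in ("os", "subprocess"):
            for alias in node.names:
                qualified = f"{node.module}.{alias.name}"
                if qualified in SHELL_FUNCS:
                    self.func_aliases[alias.asname or alias.name] = qualified
        self.generic_visit(node)

    def visit_Call(self, node):
        func = node.func
        qualified = None
        if isinstance(func, ast.Attribute) and isinstance(func.value, ast.Name):
            # Attribute call such as sp.run(...): resolve the module alias.
            module = self.module_aliases.get(func.value.id)
            if module:
                qualified = f"{module}.{func.attr}"
        elif isinstance(func, ast.Name):
            # Bare call such as run_it(...): resolve the from-import alias.
            qualified = self.func_aliases.get(func.id)
        if qualified in SHELL_FUNCS:
            self.hits.append((node.lineno, qualified))
        self.generic_visit(node)


def find_shell_calls(code: str):
    finder = ShellCallFinder()
    finder.visit(ast.parse(code))
    return finder.hits


if __name__ == "__main__":
    sample = (
        "import subprocess as sp\n"
        "from os import system as run_it\n"
        "sp.run(['ls'])\n"
        "run_it('ls')\n"
    )
    print(find_shell_calls(sample))
```

Like any static check of this kind, the sketch misses indirection such as `getattr(os, "system")` or `__import__`, so it complements process-level sandboxing and does not replace it.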

@@ -353,13 +1026,27 @@ def _check_code_safety(code: str) -> str | None:
    """
    safe, info = _check_signal_escape_patterns(code)
    if not safe:
+        # SyntaxError from ast.parse -- let these through so the subprocess
+        # produces a normal Python traceback instead of a misleading
+        # "unsafe code detected" message.
+        if info.get("error"):
+            return None
+
        reasons = [
            item.get("description", "") for item in info.get("signal_tampering", [])
        ]
-        return (
-            f"Error: unsafe code detected ({'; '.join(reasons)}). "
-            f"Please remove signal manipulation from your code."
-        )
+        shell_reasons = [
+            item.get("description", "") for item in info.get("shell_escapes", [])
+        ]
+        exception_reasons = [
+            item.get("description", "") for item in info.get("exception_catching", [])
+        ]
+        all_reasons = [r for r in reasons + shell_reasons + exception_reasons if r]
+        if all_reasons:
+            return (
+                f"Error: unsafe code detected ({'; '.join(all_reasons)}). "
+                f"Please remove unsafe patterns from your code."
+            )

    return None

@@ -396,6 +1083,17 @@ def _python_exec(

    tmp_path = None
    workdir = _get_workdir(session_id)
+    # Snapshot image mtimes so we detect both new and overwritten files.
+    _before: dict[str, int] = {}
+    if os.path.isdir(workdir):
+        for _name in os.listdir(workdir):
+            if os.path.splitext(_name)[1].lower() in _IMAGE_EXTS:
+                _p = os.path.join(workdir, _name)
+                if os.path.isfile(_p):
+                    try:
+                        _before[_name] = os.stat(_p).st_mtime_ns
+                    except OSError:
+                        pass
    try:
        fd, tmp_path = tempfile.mkstemp(
            suffix = ".py", prefix = "studio_exec_", dir = workdir
@@ -403,13 +1101,20 @@ def _python_exec(
        with os.fdopen(fd, "w") as f:
            f.write(code)

-        proc = subprocess.Popen(
-            [sys.executable, tmp_path],
+        safe_env = _build_safe_env(workdir)
+        popen_kwargs = dict(
            stdout = subprocess.PIPE,
            stderr = subprocess.STDOUT,
            text = True,
            cwd = workdir,
+            env = safe_env,
        )
+        if sys.platform != "win32":
+            popen_kwargs["preexec_fn"] = _sandbox_preexec
+        else:
+            popen_kwargs["creationflags"] = subprocess.CREATE_NO_WINDOW
+
+        proc = subprocess.Popen([sys.executable, tmp_path], **popen_kwargs)

        # Spawn cancel watcher if we have a cancel event
        if cancel_event is not None:
@@ -431,7 +1136,29 @@ def _python_exec(
        result = output or ""
        if proc.returncode != 0:
            result = f"Exit code {proc.returncode}:\n{result}"
-        return _truncate(result) if result.strip() else "(no output)"
+        result = _truncate(result) if result.strip() else "(no output)"
+
+        # Detect new or overwritten image files and append sentinel for frontend
+        if session_id and os.path.isdir(workdir):
+            new_images = []
+            for _name in os.listdir(workdir):
+                if os.path.splitext(_name)[1].lower() not in _IMAGE_EXTS:
+                    continue
+                _p = os.path.join(workdir, _name)
+                if not os.path.isfile(_p):
+                    continue
+                try:
+                    _mtime = os.stat(_p).st_mtime_ns
+                except OSError:
+                    continue
+                if _name not in _before or _mtime != _before[_name]:
+                    new_images.append(_name)
+            if new_images:
+                import json as _json
+
+                result += f"\n__IMAGES__:{_json.dumps(sorted(new_images))}"
+
+        return result

    except Exception as e:
        return f"Execution error: {e}"
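Reviewer note: the before/after mtime comparison used in `_python_exec` to find new or overwritten images can be sketched on its own. The extension set below is an assumption, since `_IMAGE_EXTS` is defined outside this hunk, and `snapshot_mtimes`, `changed_images` and `demo` are illustrative names.

```python
import os
import tempfile

# Assumed extension set; the real _IMAGE_EXTS is not shown in this diff.
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".gif", ".svg"}


def snapshot_mtimes(directory: str) -> dict[str, int]:
    """Map image file names in directory to their mtime in nanoseconds."""
    snap = {}
    for name in os.listdir(directory):
        if os.path.splitext(name)[1].lower() not in IMAGE_EXTS:
            continue
        path = os.path.join(directory, name)
        try:
            if os.path.isfile(path):
                snap[name] = os.stat(path).st_mtime_ns
        except OSError:
            pass  # file vanished or is unreadable; ignore it
    return snap


def changed_images(before: dict[str, int], after: dict[str, int]) -> list[str]:
    # A file counts as changed if it is new or its mtime differs.
    return sorted(name for name, mtime in after.items() if before.get(name) != mtime)


def demo() -> list[str]:
    with tempfile.TemporaryDirectory() as workdir:

        def write(name: str, mtime_ns: int) -> None:
            path = os.path.join(workdir, name)
            with open(path, "wb") as f:
                f.write(b"x")
            # Set the timestamp explicitly so the demo is deterministic.
            os.utime(path, ns=(mtime_ns, mtime_ns))

        write("old.png", 2_000_000_000)
        write("kept.png", 2_000_000_000)
        before = snapshot_mtimes(workdir)
        write("old.png", 4_000_000_000)    # overwritten image
        write("new.jpg", 4_000_000_000)    # newly created image
        write("notes.txt", 4_000_000_000)  # not an image, ignored
        return changed_images(before, snapshot_mtimes(workdir))


if __name__ == "__main__":
    print(demo())
```

Comparing mtimes instead of only file names is what lets an overwritten `old.png` show up in the result. A file rewritten within the filesystem's timestamp granularity could still be missed.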
@@ -453,21 +1180,27 @@ def _bash_exec(
    if not command or not command.strip():
        return "No command provided."

-    # Block dangerous commands
-    tokens = set(command.lower().split())
-    blocked = tokens & _BASH_BLOCKED_WORDS
+    # Block dangerous commands (shlex + regex based)
+    blocked = _find_blocked_commands(command)
    if blocked:
        return f"Blocked command(s) for safety: {', '.join(sorted(blocked))}"

    try:
        workdir = _get_workdir(session_id)
-        proc = subprocess.Popen(
-            ["bash", "-c", command],
+        safe_env = _build_safe_env(workdir)
+        popen_kwargs = dict(
            stdout = subprocess.PIPE,
            stderr = subprocess.STDOUT,
            text = True,
            cwd = workdir,
+            env = safe_env,
        )
+        if sys.platform != "win32":
+            popen_kwargs["preexec_fn"] = _sandbox_preexec
+        else:
+            popen_kwargs["creationflags"] = subprocess.CREATE_NO_WINDOW
+
+        proc = subprocess.Popen(_get_shell_cmd(command), **popen_kwargs)

        if cancel_event is not None:
            watcher = threading.Thread(
--- a/studio/backend/core/inference/worker.py
+++ b/studio/backend/core/inference/worker.py
@@ -22,6 +22,7 @@ from loggers import get_logger
 import os
 import queue as _queue
 import sys
+import threading
 import time
 import traceback
 from io import BytesIO
@@ -29,40 +30,19 @@ from pathlib import Path
 from typing import Any

 logger = get_logger(__name__)
+from utils.hardware import apply_gpu_ids


 def _activate_transformers_version(model_name: str) -> None:
-    """Activate the correct transformers version BEFORE any ML imports.
-
-    If the model needs transformers 5.x, prepend the pre-installed .venv_t5/
-    directory to sys.path. Otherwise do nothing (default 4.57.x in .venv/).
-    """
+    """Activate the correct transformers version BEFORE any ML imports."""
    # Ensure backend is on path for utils imports
    backend_path = str(Path(__file__).resolve().parent.parent.parent)
    if backend_path not in sys.path:
        sys.path.insert(0, backend_path)

-    from utils.transformers_version import (
-        needs_transformers_5,
-        _resolve_base_model,
-        _ensure_venv_t5_exists,
-        _VENV_T5_DIR,
-    )
+    from utils.transformers_version import activate_transformers_for_subprocess

-    resolved = _resolve_base_model(model_name)
-    if needs_transformers_5(resolved):
-        if not _ensure_venv_t5_exists():
-            raise RuntimeError(
-                f"Cannot activate transformers 5.x: .venv_t5 missing at {_VENV_T5_DIR}"
-            )
-        if _VENV_T5_DIR not in sys.path:
-            sys.path.insert(0, _VENV_T5_DIR)
-        logger.info("Activated transformers 5.x from %s", _VENV_T5_DIR)
-        # Propagate to child subprocesses (e.g. GGUF converter)
-        _pp = os.environ.get("PYTHONPATH", "")
-        os.environ["PYTHONPATH"] = _VENV_T5_DIR + (os.pathsep + _pp if _pp else "")
-    else:
-        logger.info("Using default transformers (4.57.x) for %s", model_name)
+    activate_transformers_for_subprocess(model_name)


 def _decode_image(image_base64: str):
@@ -113,6 +93,157 @@ def _build_model_config(config: dict):
    return mc


+def _get_hf_download_state(
+    model_names: list[str] | None = None,
+) -> tuple[int, bool] | None:
+    """Return (total_bytes, has_incomplete) for the HF Hub cache, or None on error.
+
+    When *model_names* is provided, only those models' ``blobs/``
+    directories are checked instead of scanning every cached model --
+    much faster on systems with many models. Accepts multiple names so
+    that LoRA loads can watch both the adapter repo and the base model
+    repo simultaneously.
+
+    *has_incomplete* is True when any ``*.incomplete`` files exist in the
+    watched blobs directories, indicating that ``huggingface_hub`` is
+    actively downloading.
+
+    Returns None if the state cannot be determined (import error,
+    permission error, etc.) so callers can skip stall logic.
+    """
+    try:
+        from huggingface_hub.constants import HF_HUB_CACHE
+
+        cache = Path(HF_HUB_CACHE)
+        if not cache.exists():
+            return (0, False)
+
+        total = 0
+        has_incomplete = False
+        blobs_dirs: list[Path] = []
+
+        if model_names:
+            from utils.paths import resolve_cached_repo_id_case
+
+            for name in model_names:
+                if not name:
+                    continue
+                # Skip local filesystem paths -- HF model IDs use forward
+                # slashes (org/model) but never start with / . ~ or contain
+                # backslashes. This distinguishes them from absolute paths,
+                # relative paths, and Windows paths.
+                if name.startswith(("/", ".", "~")) or "\\" in name:
+                    continue
+                name = resolve_cached_repo_id_case(name)
+                # HF cache dir format: models--org--name (slashes -> --)
+                cache_dir_name = "models--" + name.replace("/", "--")
+                blobs_dir = cache / cache_dir_name / "blobs"
+                if blobs_dir.exists():
+                    blobs_dirs.append(blobs_dir)
+        else:
+            blobs_dirs = list(cache.glob("models--*/blobs"))
+
+        for bdir in blobs_dirs:
+            for f in bdir.iterdir():
+                try:
+                    if f.is_file():
+                        total += f.stat().st_size
+                        if f.name.endswith(".incomplete"):
+                            has_incomplete = True
+                except OSError:
+                    pass
+
+        return (total, has_incomplete)
+    except Exception as e:
+        logger.debug("Failed to determine HF download state: %s", e)
+        return None
+
+
+def _start_heartbeat(
+    resp_queue: Any,
+    interval: float = 30.0,
+    stall_timeout: float = 180.0,
+    xet_disabled: bool = False,
+    model_names: list[str] | None = None,
+) -> threading.Event:
+    """Start a daemon thread that sends periodic status heartbeats.
+
+    Monitors the HF Hub cache directory for download activity. A stall
+    is only reported when ``*.incomplete`` files are present (indicating
+    ``huggingface_hub`` is actively downloading) **and** the total cache
+    size has not changed for *stall_timeout* seconds.
+
+    Once the download finishes (no more ``.incomplete`` files), the stall
+    timer resets, so post-download initialization (quantization, GPU
+    weight loading) is never misclassified as a stalled download.
+
+    Returns a stop event -- set it to terminate the heartbeat thread.
+    """
+    stop = threading.Event()
+    transport = "https" if xet_disabled else "xet"
+
+    def _beat():
+        state = _get_hf_download_state(model_names)
+        last_size = state[0] if state is not None else 0
+        last_change = time.monotonic()
+
+        while not stop.wait(interval):
+            state = _get_hf_download_state(model_names)
+            now = time.monotonic()
+
+            # Skip stall logic if we cannot measure the cache
+            if state is None:
+                _send_response(
+                    resp_queue,
+                    {
+                        "type": "status",
+                        "message": f"Loading model ({transport} transport)...",
+                        "ts": time.time(),
+                    },
+                )
+                continue
+
+            current_size, has_incomplete = state
+
+            if current_size != last_size:
+                last_size = current_size
+                last_change = now
+
+            # Only fire stall when .incomplete files are present,
+            # confirming a download is actively in progress.
+            # Once downloads finish (no .incomplete), reset the timer
+            # so model init time is not counted as a stall.
+            if not has_incomplete:
+                last_change = now
+            elif now - last_change >= stall_timeout:
+                _send_response(
+                    resp_queue,
+                    {
+                        "type": "stall",
+                        "message": (
+                            f"Download appears stalled ({transport} transport) "
+                            f"-- no progress for {int(now - last_change)}s"
+                        ),
+                        "ts": time.time(),
+                    },
+                )
+                # Only fire once -- the orchestrator will kill us
+                return
+
+            _send_response(
+                resp_queue,
+                {
+                    "type": "status",
+                    "message": f"Loading model ({transport} transport)...",
+                    "ts": time.time(),
+                },
+            )
+
+    t = threading.Thread(target = _beat, daemon = True)
+    t.start()
+    return stop
+
+
 def _handle_load(backend, config: dict, resp_queue: Any) -> None:
    """Handle a load command: load a model into the backend."""
    try:
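Reviewer note: the stall rule inside `_start_heartbeat` can be read as a small state machine. The sketch below is not code from this patch. `StallTracker` and `observe` are hypothetical names, and the clock is passed in explicitly instead of being read from `time.monotonic()`, so the rule can be exercised without threads or a real cache directory.

```python
from dataclasses import dataclass


@dataclass
class StallTracker:
    """Hypothetical helper mirroring the stall rule used by the heartbeat loop."""

    stall_timeout: float
    last_size: int = 0
    last_change: float = 0.0

    def observe(self, now: float, size: int, has_incomplete: bool) -> bool:
        """Return True when the download should be reported as stalled."""
        if size != self.last_size:
            # Total blob size changed: progress was made, restart the timer.
            self.last_size = size
            self.last_change = now
        if not has_incomplete:
            # No *.incomplete files means nothing is downloading, so time spent
            # on weight loading or quantization must not count toward a stall.
            self.last_change = now
            return False
        return now - self.last_change >= self.stall_timeout


tracker = StallTracker(stall_timeout = 180.0)
print(tracker.observe(now = 30.0, size = 100, has_incomplete = True))    # False
print(tracker.observe(now = 240.0, size = 100, has_incomplete = True))   # True
print(tracker.observe(now = 500.0, size = 100, has_incomplete = False))  # False
```

The second call reports a stall because 210 seconds passed with an unchanged size while `.incomplete` files were still present. The third call resets the timer because the download has finished.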
@@ -156,13 +287,52 @@ def _handle_load(backend, config: dict, resp_queue: Any) -> None:
                except Exception as e:
                    logger.warning("Could not read adapter_config.json: %s", e)

-        success = backend.load_model(
-            config = mc,
-            max_seq_length = config.get("max_seq_length", 2048),
-            load_in_4bit = load_in_4bit,
-            hf_token = hf_token,
-            trust_remote_code = config.get("trust_remote_code", False),
+        # Auto-enable trust_remote_code for NemotronH/Nano models only.
+        # NemotronH has config parsing bugs requiring trust_remote_code=True.
+        # Other transformers 5.x models are native and do NOT need it.
+        # NOTE: Must NOT match Llama-Nemotron (standard Llama architecture).
+        _NEMOTRON_TRUST_SUBSTRINGS = ("nemotron_h", "nemotron-h", "nemotron-3-nano")
+        trust_remote_code = config.get("trust_remote_code", False)
+        if not trust_remote_code:
+            model_name = config["model_name"]
+            _mn_lower = model_name.lower()
+            if any(sub in _mn_lower for sub in _NEMOTRON_TRUST_SUBSTRINGS) and (
+                _mn_lower.startswith("unsloth/") or _mn_lower.startswith("nvidia/")
+            ):
+                trust_remote_code = True
+                logger.info(
+                    "Auto-enabled trust_remote_code for Nemotron model: %s",
+                    model_name,
+                )
+
+        # Send heartbeats every 30s so the orchestrator knows we're still alive
+        # (download / weight loading can take a long time on slow connections)
+        xet_disabled = os.environ.get("HF_HUB_DISABLE_XET") == "1"
+
+        # Watch both the model repo and base model repo (for LoRA loads
+        # where the base model download is the actual bottleneck)
+        watch_repos = [mc.identifier]
+        base = getattr(mc, "base_model", None)
+        if base and str(base) != mc.identifier:
+            watch_repos.append(str(base))
+
+        heartbeat_stop = _start_heartbeat(
+            resp_queue,
+            interval = 30.0,
+            xet_disabled = xet_disabled,
+            model_names = watch_repos,
        )
+        try:
+            success = backend.load_model(
+                config = mc,
+                max_seq_length = config.get("max_seq_length", 2048),
+                load_in_4bit = load_in_4bit,
+                hf_token = hf_token,
+                trust_remote_code = trust_remote_code,
+                gpu_ids = config.get("resolved_gpu_ids"),
+            )
+        finally:
+            heartbeat_stop.set()

        if success:
            # Build model_info for the parent to mirror
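Reviewer note: the auto-enable rule for `trust_remote_code` in the hunk above can be expressed as a standalone predicate. This is a minimal sketch under assumptions, not the patch itself. `needs_trust_remote_code` is a hypothetical name, the substrings and org prefixes are copied from the diff, and the model IDs in the usage lines are illustrative strings, not verified repository names.

```python
NEMOTRON_TRUST_SUBSTRINGS = ("nemotron_h", "nemotron-h", "nemotron-3-nano")
TRUSTED_ORG_PREFIXES = ("unsloth/", "nvidia/")


def needs_trust_remote_code(model_name: str) -> bool:
    """Return True only for NemotronH/Nano IDs published under a trusted org."""
    name = model_name.lower()
    # Both conditions must hold: a trusted org prefix AND a NemotronH/Nano marker.
    return name.startswith(TRUSTED_ORG_PREFIXES) and any(
        sub in name for sub in NEMOTRON_TRUST_SUBSTRINGS
    )


# Illustrative IDs only (assumed for the example).
print(needs_trust_remote_code("nvidia/Nemotron-H-8B-Base"))
print(needs_trust_remote_code("nvidia/Llama-3.1-Nemotron-70B-Instruct"))
print(needs_trust_remote_code("someone-else/nemotron-h-finetune"))
```

Requiring the org prefix keeps an arbitrary third-party repository from getting remote code execution merely by including `nemotron-h` in its name.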
@@ -474,6 +644,10 @@ def run_inference_process(
        "ignore"  # Suppress warnings at C-level before imports
    )

+    if config.get("disable_xet"):
+        os.environ["HF_HUB_DISABLE_XET"] = "1"
+        logger.info("Xet transport disabled (HF_HUB_DISABLE_XET=1)")
+
    import warnings
    from loggers.config import LogConfig

@@ -485,6 +659,8 @@ def run_inference_process(
        env = os.getenv("ENVIRONMENT_TYPE", "production"),
    )

+    apply_gpu_ids(config.get("resolved_gpu_ids"))
+
    model_name = config["model_name"]

    # ── 1. Activate correct transformers version BEFORE any ML imports ──
--- a/studio/backend/core/training/trainer.py
+++ b/studio/backend/core/training/trainer.py
@@ -33,7 +33,14 @@ if sys.platform in ("win32", "darwin"):
        sys.path.insert(0, _compile_cache)

 import torch
-from utils.hardware import clear_gpu_cache, safe_num_proc, dataset_map_num_proc
+from utils.hardware import (
+    clear_gpu_cache,
+    safe_num_proc,
+    dataset_map_num_proc,
+    get_device_map,
+    raise_if_offloaded,
+    get_visible_gpu_count,
+)

 torch._dynamo.config.recompile_limit = 64
 from unsloth import FastLanguageModel, FastVisionModel, is_bfloat16_supported
@@ -81,8 +88,8 @@ class TrainingProgress:
    epoch: float = 0
    step: int = 0
    total_steps: int = 0
-    loss: float = 0.0
-    learning_rate: float = 0.0
+    loss: Optional[float] = None
+    learning_rate: Optional[float] = None
    is_training: bool = False
    is_completed: bool = False
    error: Optional[str] = None
@@ -183,7 +190,11 @@ class UnslothTrainer:
            self._cuda_audio_used = False

        # --- Detect VLM ---
-        vision = is_vision_model(model_name) if not self.is_audio else False
+        vision = (
+            is_vision_model(model_name, hf_token = hf_token)
+            if not self.is_audio
+            else False
+        )
        self.is_vlm = not self.is_audio_vlm and vision and is_dataset_image

        logger.info(
@@ -244,7 +255,7 @@ class UnslothTrainer:
            def on_log(self, args, state, control, logs = None, **kwargs):
                if not logs:
                    return
-                loss_value = logs.get("loss", logs.get("train_loss", 0.0))
+                loss_value = logs.get("loss", logs.get("train_loss", None))
                current_step = state.global_step
                grad_norm = logs.get("grad_norm", None)

@@ -268,7 +279,7 @@ class UnslothTrainer:
                    step = current_step,
                    epoch = round(state.epoch, 2) if state.epoch else 0,
                    loss = loss_value,
-                    learning_rate = logs.get("learning_rate", 0.0),
+                    learning_rate = logs.get("learning_rate", None),
                    elapsed_seconds = elapsed_seconds,
                    eta_seconds = eta_seconds,
                    grad_norm = grad_norm,
@@ -487,6 +498,7 @@ class UnslothTrainer:
        is_dataset_audio: bool = False,
        trust_remote_code: bool = False,
        full_finetuning: bool = False,
+        gpu_ids: Optional[list[int]] = None,
    ) -> bool:
        """Load model for training (supports both text and vision models)"""
        self.load_in_4bit = load_in_4bit  # Store for training_meta.json
@@ -550,7 +562,11 @@ class UnslothTrainer:
                self._cuda_audio_used = False

            # VLM: vision model with image dataset (mutually exclusive with audio paths)
-            vision = is_vision_model(model_name) if not self.is_audio else False
+            vision = (
+                is_vision_model(model_name, hf_token = hf_token)
+                if not self.is_audio
+                else False
+            )
            self.is_vlm = not self.is_audio_vlm and vision and is_dataset_image
            self.model_name = model_name
            self.max_seq_length = max_seq_length
@@ -624,6 +640,11 @@ class UnslothTrainer:
                        self._update_progress(error = friendly, is_training = False)
                        return False

+            device_map = get_device_map(gpu_ids)
+            logger.info(
+                f"Using device_map='{device_map}' ({get_visible_gpu_count()} GPU(s) visible)"
+            )
+
            # Branch based on model type
            if self._audio_type == "csm":
                # CSM: FastModel + auto_model=CsmForConditionalGeneration + load_in_4bit=False
@@ -636,6 +657,7 @@ class UnslothTrainer:
                    dtype = None,
                    auto_model = CsmForConditionalGeneration,
                    load_in_4bit = False,
+                    device_map = device_map,
                    full_finetuning = full_finetuning,
                    token = hf_token,
                    trust_remote_code = trust_remote_code,
@@ -651,6 +673,7 @@ class UnslothTrainer:
                    model_name = model_name,
                    dtype = None,
                    load_in_4bit = False,
+                    device_map = device_map,
                    full_finetuning = full_finetuning,
                    auto_model = WhisperForConditionalGeneration,
                    whisper_language = "English",
@@ -672,6 +695,7 @@ class UnslothTrainer:
                    max_seq_length = max_seq_length,
                    dtype = None,
                    load_in_4bit = load_in_4bit,
+                    device_map = device_map,
                    full_finetuning = full_finetuning,
                    token = hf_token,
                    trust_remote_code = trust_remote_code,
@@ -711,6 +735,7 @@ class UnslothTrainer:
                    max_seq_length = max_seq_length,
                    dtype = torch.float32,  # Spark-TTS requires float32
                    load_in_4bit = False,
+                    device_map = device_map,
                    full_finetuning = full_finetuning,
                    token = hf_token,
                    trust_remote_code = trust_remote_code,
@@ -725,6 +750,7 @@ class UnslothTrainer:
                    model_name,
                    max_seq_length = max_seq_length,
                    load_in_4bit = False,
+                    device_map = device_map,
                    full_finetuning = full_finetuning,
                    token = hf_token,
                    trust_remote_code = trust_remote_code,
@@ -741,6 +767,7 @@ class UnslothTrainer:
                    max_seq_length = max_seq_length,
                    dtype = None,
                    load_in_4bit = load_in_4bit,
+                    device_map = device_map,
                    full_finetuning = full_finetuning,
                    token = hf_token,
                    trust_remote_code = trust_remote_code,
@@ -754,6 +781,7 @@ class UnslothTrainer:
                    max_seq_length = max_seq_length,
                    dtype = None,  # Auto-detect
                    load_in_4bit = load_in_4bit,
+                    device_map = device_map,
                    full_finetuning = full_finetuning,
                    token = hf_token,
                    trust_remote_code = trust_remote_code,
@@ -786,12 +814,15 @@ class UnslothTrainer:
                    max_seq_length = max_seq_length,
                    dtype = None,  # Auto-detect
                    load_in_4bit = load_in_4bit,
+                    device_map = device_map,
                    full_finetuning = full_finetuning,
                    token = hf_token,
                    trust_remote_code = trust_remote_code,
                )
                logger.info("Loaded text model")

+            raise_if_offloaded(self.model, device_map, "Studio training")
+
            if self.should_stop:
                return False

@@ -824,6 +855,7 @@ class UnslothTrainer:
                    is_dataset_audio = is_dataset_audio,
                    trust_remote_code = trust_remote_code,
                    full_finetuning = full_finetuning,
+                    gpu_ids = gpu_ids,
                )
            error_msg = str(e)
            error_lower = error_msg.lower()
@@ -2634,14 +2666,14 @@ class UnslothTrainer:
        eval_steps: float = 0.00,
        output_dir: str | None = None,
        num_epochs: int = 3,
-        learning_rate: float = 5e-5,
+        learning_rate: float = 2e-4,
        batch_size: int = 2,
        gradient_accumulation_steps: int = 4,
        warmup_steps: int = None,
        warmup_ratio: float = None,
        max_steps: int = 0,
        save_steps: int = 0,
-        weight_decay: float = 0.01,
+        weight_decay: float = 0.001,
        random_seed: int = 3407,
        packing: bool = False,
        train_on_completions: bool = False,
@@ -3010,7 +3042,7 @@ class UnslothTrainer:
                "fp16": not is_bfloat16_supported(),
                "bf16": is_bfloat16_supported(),
                "logging_steps": 1,
-                "weight_decay": training_args.get("weight_decay", 0.01),
+                "weight_decay": training_args.get("weight_decay", 0.001),
                "seed": training_args.get("random_seed", 3407),
                "output_dir": output_dir,
                "report_to": _build_report_targets(training_args),
--- a/studio/backend/core/training/training.py
+++ b/studio/backend/core/training/training.py
@@ -14,18 +14,21 @@ worker's mp.Queue, and exposes the same API surface to routes/training.py.
 Pattern follows core/data_recipe/jobs/manager.py.
 """

+import json as _json
 import math
 import multiprocessing as mp
 import queue
 import threading
 import time
 import structlog
+from datetime import datetime, timezone
 from loggers import get_logger
 from dataclasses import dataclass, field
 from pathlib import Path
 from typing import Optional, Tuple, Any

 import matplotlib.pyplot as plt
+from utils.hardware import prepare_gpu_selection

 logger = get_logger(__name__)

@@ -44,8 +47,8 @@ class TrainingProgress:
    epoch: float = 0
    step: int = 0
    total_steps: int = 0
-    loss: float = 0.0
-    learning_rate: float = 0.0
+    loss: Optional[float] = None
+    learning_rate: Optional[float] = None
    is_training: bool = False
    is_completed: bool = False
    error: Optional[str] = None
@@ -63,6 +66,8 @@ class TrainingBackend:
    Launches a fresh subprocess per training job, communicates via mp.Queue.
    """

+    FLUSH_THRESHOLD: int = 10
+
    def __init__(self):
        # Subprocess state
        self._proc: Optional[mp.Process] = None
@@ -91,13 +96,21 @@ class TrainingBackend:
        self.current_job_id: Optional[str] = None
        self._output_dir: Optional[str] = None

+        # DB persistence
+        self._metric_buffer: list[dict] = []
+        self._run_finalized: bool = False
+        self._db_run_created: bool = False
+        self._db_total_steps_set: bool = False
+        self._db_config: Optional[dict] = None
+        self._db_started_at: Optional[str] = None
+
        logger.info("TrainingBackend initialized (subprocess mode)")

    # ------------------------------------------------------------------
    # Public API (called by routes/training.py)
    # ------------------------------------------------------------------

-    def start_training(self, **kwargs) -> bool:
+    def start_training(self, job_id: str, **kwargs) -> bool:
        """Spawn a subprocess to run the full training pipeline.

        All kwargs are serialized into a config dict and sent to the worker.
@@ -108,30 +121,16 @@ class TrainingBackend:
                logger.warning("Training subprocess already running")
                return False

-        # Join prior pump thread to prevent it from consuming events
-        # from the new job's queue (it reads self._event_queue dynamically).
+        # Join prior pump thread — refuse to start if it won't die
        if self._pump_thread is not None and self._pump_thread.is_alive():
            self._pump_thread.join(timeout = 5.0)
            if self._pump_thread.is_alive():
-                logger.warning("Previous pump thread did not exit within 5s")
+                logger.warning(
+                    "Previous pump thread did not exit within 5s — refusing to start"
+                )
+                return False
        self._pump_thread = None

-        # Reset state
-        self._should_stop = False
-        self._cancel_requested = False
-        self._progress = TrainingProgress(
-            is_training = True, status_message = "Initializing training..."
-        )
-        self.loss_history.clear()
-        self.lr_history.clear()
-        self.step_history.clear()
-        self.grad_norm_history.clear()
-        self.grad_norm_step_history.clear()
-        self.eval_loss_history.clear()
-        self.eval_step_history.clear()
-        self.eval_enabled = False
-        self._output_dir = None
-
        # Build config dict for the subprocess
        config = {
            "model_name": kwargs["model_name"],
@@ -161,7 +160,7 @@ class TrainingBackend:
            "warmup_ratio": kwargs.get("warmup_ratio"),
            "max_steps": kwargs.get("max_steps", 0),
            "save_steps": kwargs.get("save_steps", 0),
-            "weight_decay": kwargs.get("weight_decay", 0.01),
+            "weight_decay": kwargs.get("weight_decay", 0.001),
            "random_seed": kwargs.get("random_seed", 3407),
            "packing": kwargs.get("packing", False),
            "optim": kwargs.get("optim", "adamw_8bit"),
@@ -187,29 +186,85 @@ class TrainingBackend:
            "enable_tensorboard": kwargs.get("enable_tensorboard", False),
            "tensorboard_dir": kwargs.get("tensorboard_dir", "runs"),
            "trust_remote_code": kwargs.get("trust_remote_code", False),
+            "gpu_ids": kwargs.get("gpu_ids"),
        }

        # Derive load_in_4bit from training_type
        if config["training_type"] != "LoRA/QLoRA":
            config["load_in_4bit"] = False

-        # Spawn subprocess
+        # Spawn subprocess — use locals so state is untouched on failure
+        resolved_gpu_ids, gpu_selection = prepare_gpu_selection(
+            kwargs.get("gpu_ids"),
+            model_name = config["model_name"],
+            hf_token = config["hf_token"] or None,
+            training_type = config["training_type"],
+            load_in_4bit = config["load_in_4bit"],
+            batch_size = config.get("batch_size", 4),
+            max_seq_length = config.get("max_seq_length", 2048),
+            lora_rank = config.get("lora_r", 16),
+            target_modules = config.get("target_modules"),
+            gradient_checkpointing = config.get("gradient_checkpointing", "unsloth"),
+            optimizer = config.get("optim", "adamw_8bit"),
+        )
+        config["resolved_gpu_ids"] = resolved_gpu_ids
+        config["gpu_selection"] = gpu_selection
+
        from .worker import run_training_process

-        self._event_queue = _CTX.Queue()
-        self._stop_queue = _CTX.Queue()
+        event_queue = _CTX.Queue()
+        stop_queue = _CTX.Queue()

-        self._proc = _CTX.Process(
+        proc = _CTX.Process(
            target = run_training_process,
            kwargs = {
-                "event_queue": self._event_queue,
-                "stop_queue": self._stop_queue,
+                "event_queue": event_queue,
+                "stop_queue": stop_queue,
                "config": config,
            },
            daemon = True,
        )
-        self._proc.start()
-        logger.info("Training subprocess started (pid=%s)", self._proc.pid)
+        try:
+            proc.start()
+        except Exception:
+            logger.error("Failed to start training subprocess", exc_info = True)
+            return False
+
+        logger.info("Training subprocess started (pid=%s)", proc.pid)
+
+        # Reset state — safe because old pump thread is confirmed dead
+        # and proc.start() succeeded
+        self.current_job_id = job_id
+        self._should_stop = False
+        self._cancel_requested = False
+        self._progress = TrainingProgress(
+            is_training = True, status_message = "Initializing training..."
+        )
+        self.loss_history.clear()
+        self.lr_history.clear()
+        self.step_history.clear()
+        self.grad_norm_history.clear()
+        self.grad_norm_step_history.clear()
+        self.eval_loss_history.clear()
+        self.eval_step_history.clear()
+        self.eval_enabled = False
+        self._output_dir = None
+        self._metric_buffer.clear()
+        self._run_finalized = False
+        self._db_run_created = False
+        self._db_total_steps_set = False
+        self._db_config = {
+            k: v for k, v in config.items() if k not in {"hf_token", "wandb_token"}
+        }
+        self._db_started_at = datetime.now(timezone.utc).isoformat()
+
+        # Assign subprocess handles after state reset
+        self._event_queue = event_queue
+        self._stop_queue = stop_queue
+        self._proc = proc
+
+        # Eagerly create DB run row so the run appears in history during model loading
+        self._ensure_db_run_created()

        # Start event pump thread
        self._pump_thread = threading.Thread(target = self._pump_loop, daemon = True)
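Reviewer note: the reordering in this hunk (build queues and the process in locals, start it, and only then reset and assign instance state) follows a commit-after-success pattern. The sketch below illustrates the idea in isolation and is not the patch's code. `Launcher`, `start`, and the `spawn` callback are hypothetical, and a plain `queue.Queue` stands in for the multiprocessing queues.

```python
import queue
from typing import Callable, Optional


class Launcher:
    """Hypothetical stand-in for a backend that owns one worker at a time."""

    def __init__(self) -> None:
        self.job_id: Optional[str] = None
        self.events: Optional[queue.Queue] = None

    def start(self, job_id: str, spawn: Callable[[queue.Queue], None]) -> bool:
        # Build new resources in locals so a failure cannot clobber old state.
        events: queue.Queue = queue.Queue()
        try:
            spawn(events)
        except Exception:
            # self.job_id and self.events still describe the previous job.
            return False
        # Spawn succeeded: commit the new state in one place.
        self.job_id = job_id
        self.events = events
        return True


def failing_spawn(_events: queue.Queue) -> None:
    raise OSError("cannot start worker")


launcher = Launcher()
print(launcher.start("job-1", lambda events: events.put("started")))
print(launcher.start("job-2", failing_spawn))
print(launcher.job_id)
```

After the failed second start, `job_id` still reads `job-1`, which is the property the patch is after: status endpoints keep reporting the previous run instead of a half-initialized one.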
@@ -252,6 +307,11 @@ class TrainingBackend:
                proc.kill()
                proc.join(timeout = 2.0)

+        # Wait for pump thread to finish DB finalization before returning
+        # (8s covers SQLite's default 5s lock timeout plus execution overhead)
+        if self._pump_thread is not None and self._pump_thread.is_alive():
+            self._pump_thread.join(timeout = 8.0)
+
    def is_training_active(self) -> bool:
        """Check if training is currently active."""
        with self._lock:
@@ -389,20 +449,54 @@ class TrainingBackend:
                            self._progress.error
                            or "Training process exited unexpectedly"
                        )
+
+            self._ensure_db_run_created()
+            self._finalize_run_in_db(
+                status = "stopped" if self._should_stop else "error",
+                error_message = None
+                if self._should_stop
+                else "Training process terminated unexpectedly",
+            )
            return

    def _handle_event(self, event: dict) -> None:
-        """Apply a subprocess event to local state."""
+        """Apply a subprocess event to local state.
+
+        State updates happen inside self._lock; DB I/O happens after
+        releasing it so status-polling API endpoints are never blocked
+        by slow SQLite writes.
+        """
        etype = event.get("type")
+        db_action: Optional[str] = None
+        db_action_kwargs: dict = {}

        with self._lock:
            if etype == "progress":
                self._progress.step = event.get("step", self._progress.step)
                self._progress.epoch = event.get("epoch", self._progress.epoch)
-                self._progress.loss = event.get("loss", self._progress.loss)
-                self._progress.learning_rate = event.get(
-                    "learning_rate", self._progress.learning_rate
-                )
+                # loss/lr are sanitized below; update progress after coercion
+                _raw_loss = event.get("loss")
+                _raw_lr = event.get("learning_rate")
+                try:
+                    _safe_loss = float(_raw_loss) if _raw_loss is not None else None
+                except (TypeError, ValueError):
+                    logger.debug("Could not convert loss to float: %s", _raw_loss)
+                    _safe_loss = None
+                if _safe_loss is not None and not math.isfinite(_safe_loss):
+                    _safe_loss = None
+                try:
+                    _safe_lr = float(_raw_lr) if _raw_lr is not None else None
+                except (TypeError, ValueError):
+                    logger.debug(
+                        "Could not convert learning_rate to float: %s", _raw_lr
+                    )
+                    _safe_lr = None
+                if _safe_lr is not None and not math.isfinite(_safe_lr):
+                    _safe_lr = None
+                if _safe_loss is not None:
+                    self._progress.loss = _safe_loss
+                if _safe_lr is not None:
+                    self._progress.learning_rate = _safe_lr
                self._progress.total_steps = event.get(
                    "total_steps", self._progress.total_steps
                )
@@ -416,30 +510,85 @@ class TrainingBackend:
                if status:
                    self._progress.status_message = status

-                # Update metric histories
+                # Update metric histories — reuse sanitized values from above
                step = event.get("step", 0)
-                loss = event.get("loss", 0.0)
-                lr = event.get("learning_rate", 0.0)
-                if step >= 0 and loss > 0:
+                loss = _safe_loss
+                lr = _safe_lr
+                if step > 0 and loss is not None:
                    self.loss_history.append(loss)
-                    self.lr_history.append(lr)
+                    self.lr_history.append(lr if lr is not None else 0.0)
                    self.step_history.append(step)

                grad_norm = event.get("grad_norm")
+                gn = None
                if grad_norm is not None:
                    try:
                        gn = float(grad_norm)
                    except (TypeError, ValueError):
                        gn = None
-                    if gn is not None and math.isfinite(gn):
+                    if step > 0 and gn is not None and math.isfinite(gn):
                        self.grad_norm_history.append(gn)
                        self.grad_norm_step_history.append(step)
+                    else:
+                        gn = None

                eval_loss = event.get("eval_loss")
                if eval_loss is not None:
-                    self.eval_loss_history.append(eval_loss)
-                    self.eval_step_history.append(step)
-                    self.eval_enabled = True
+                    try:
+                        eval_loss = float(eval_loss)
+                    except (TypeError, ValueError):
+                        logger.debug(
+                            "Could not convert eval_loss to float: %s", eval_loss
+                        )
+                        eval_loss = None
+                    if step > 0 and eval_loss is not None and math.isfinite(eval_loss):
+                        self.eval_loss_history.append(eval_loss)
+                        self.eval_step_history.append(step)
+                        self.eval_enabled = True
+                    else:
+                        eval_loss = None
+
+                # Buffer metric for DB flush (loss/lr already sanitized above)
+                self._metric_buffer.append(
+                    {
+                        "step": step,
+                        "loss": loss,
+                        "learning_rate": lr,
+                        "grad_norm": gn,
+                        "eval_loss": eval_loss,
+                        "epoch": event.get("epoch"),
+                        "num_tokens": event.get("num_tokens"),
+                        "elapsed_seconds": event.get("elapsed_seconds"),
+                    }
+                )
+
+                # Decide which DB action to take after releasing the lock
+                if not self._db_run_created and self.current_job_id and self._db_config:
+                    db_action = "create_run"
+                    db_action_kwargs = {
+                        "job_id": self.current_job_id,
+                        "model_name": self._db_config["model_name"],
+                        "dataset_name": self._db_config.get("hf_dataset")
+                        or next(
+                            iter(self._db_config.get("local_datasets") or []), "unknown"
+                        ),
+                        "config_json": _json.dumps(self._db_config),
+                        "started_at": self._db_started_at
+                        or datetime.now(timezone.utc).isoformat(),
+                        "total_steps": event.get("total_steps"),
+                    }
+                elif (
+                    event.get("total_steps")
+                    and self._db_run_created
+                    and not self._db_total_steps_set
+                ):
+                    db_action = "update_total_steps"
+                    db_action_kwargs = {
+                        "job_id": self.current_job_id,
+                        "total_steps": event["total_steps"],
+                    }
+                elif len(self._metric_buffer) >= self.FLUSH_THRESHOLD:
+                    db_action = "flush"

            elif etype == "eval_configured":
                self.eval_enabled = True
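Reviewer note: the hunk above repeats the same coercion for `loss`, `learning_rate`, and `eval_loss`: convert to `float`, swallow `TypeError` / `ValueError`, and discard non-finite values. A minimal sketch of that rule as one helper follows. `to_finite_float` is a hypothetical name and is not part of this patch, which inlines the logic.

```python
import math
from typing import Any, Optional


def to_finite_float(value: Any) -> Optional[float]:
    """Return value as a finite float, or None if it is missing or unusable."""
    if value is None:
        return None
    try:
        result = float(value)
    except (TypeError, ValueError):
        # Not convertible (for example a list or a non-numeric string).
        return None
    # NaN and +/-inf would poison plots and DB rows, so treat them as missing.
    return result if math.isfinite(result) else None


print(to_finite_float("0.25"))
print(to_finite_float(float("nan")))
print(to_finite_float([1.0]))
```

Returning `None` instead of `0.0` matches the change to `TrainingProgress` in this diff, where a missing loss is no longer reported as a real zero.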
@@ -454,6 +603,14 @@ class TrainingBackend:
                self._output_dir = event.get("output_dir")
                msg = event.get("status_message", "Training completed")
                self._progress.status_message = msg
+                if not self._db_run_created and self.current_job_id and self._db_config:
+                    db_action = "create_and_finalize"
+                else:
+                    db_action = "finalize"
+                db_action_kwargs = {
+                    "status": "stopped" if self._should_stop else "completed",
+                    "output_dir": self._output_dir,
+                }

            elif etype == "error":
                self._progress.is_training = False
@@ -462,6 +619,149 @@ class TrainingBackend:
                stack = event.get("stack", "")
                if stack:
                    logger.error("Stack trace:\n%s", stack)
+                if not self._db_run_created and self.current_job_id and self._db_config:
+                    db_action = "create_and_finalize"
+                else:
+                    db_action = "finalize"
+                db_action_kwargs = {
+                    "status": "stopped" if self._should_stop else "error",
+                    "error_message": event.get("error", "Unknown error"),
+                }
+
+        # --- DB I/O outside the lock ---
+        if db_action == "create_run":
+            try:
+                from storage.studio_db import create_run
+
+                create_run(
+                    id = db_action_kwargs["job_id"],
+                    model_name = db_action_kwargs["model_name"],
+                    dataset_name = db_action_kwargs["dataset_name"],
+                    config_json = db_action_kwargs["config_json"],
+                    started_at = db_action_kwargs["started_at"],
+                    total_steps = db_action_kwargs["total_steps"],
+                )
+                self._db_run_created = True
+                if db_action_kwargs["total_steps"]:
+                    self._db_total_steps_set = True
+            except Exception:
+                logger.warning("Failed to create DB run record", exc_info = True)
+        elif db_action == "create_and_finalize":
+            self._ensure_db_run_created()
+            self._finalize_run_in_db(**db_action_kwargs)
+        elif db_action == "update_total_steps":
+            try:
+                from storage.studio_db import update_run_total_steps
+
+                update_run_total_steps(
+                    db_action_kwargs["job_id"], db_action_kwargs["total_steps"]
+                )
+                self._db_total_steps_set = True
+            except Exception:
+                logger.warning("Failed to update total_steps in DB", exc_info = True)
+        elif db_action == "flush":
+            self._flush_metrics_to_db()
+        elif db_action == "finalize":
+            self._finalize_run_in_db(**db_action_kwargs)
+
+    def _ensure_db_run_created(self) -> None:
+        """Create the DB row if it doesn't exist yet. Called outside the lock."""
+        if self._db_run_created or not self.current_job_id or not self._db_config:
+            return
+        try:
+            from storage.studio_db import create_run
+
+            dataset_name = self._db_config.get("hf_dataset") or next(
+                iter(self._db_config.get("local_datasets") or []), "unknown"
+            )
+            create_run(
+                id = self.current_job_id,
+                model_name = self._db_config["model_name"],
+                dataset_name = dataset_name,
+                config_json = _json.dumps(self._db_config),
+                started_at = self._db_started_at
+                or datetime.now(timezone.utc).isoformat(),
+                total_steps = self._progress.total_steps or None,
+            )
+            self._db_run_created = True
+        except Exception:
+            logger.warning(
+                "Failed to create DB run record for early failure", exc_info = True
+            )
+
+    def _finalize_run_in_db(
+        self,
+        status: str,
+        error_message: Optional[str] = None,
+        output_dir: Optional[str] = None,
+    ) -> None:
+        """Flush remaining metrics and mark a run as finished in the DB."""
+        if not self.current_job_id or not self._db_run_created or self._run_finalized:
+            return
+        self._flush_metrics_to_db()
+        try:
+            from storage.studio_db import finish_run
+            from utils.downsample import downsample
+
+            sparkline = downsample(self.loss_history, 50)
+            finish_run(
+                id = self.current_job_id,
+                status = status,
+                ended_at = datetime.now(timezone.utc).isoformat(),
+                final_step = self._progress.step,
+                final_loss = self._progress.loss
+                if (
+                    self._progress.loss is not None
+                    and math.isfinite(self._progress.loss)
+                )
+                else None,
+                duration_seconds = self._progress.elapsed_seconds,
+                loss_sparkline = _json.dumps(sparkline),
+                output_dir = output_dir,
+                error_message = error_message,
+            )
+            self._run_finalized = True
+        except Exception:
+            logger.warning(
+                "Failed to finalize run in DB (status=%s)", status, exc_info = True
+            )
+
+    def _flush_metrics_to_db(self) -> None:
+        """Flush buffered metrics to the database and update live progress."""
+        if (
+            not self._metric_buffer
+            or not self.current_job_id
+            or not self._db_run_created
+        ):
+            return
+        # Cap buffer to prevent unbounded memory growth
+        if len(self._metric_buffer) > 500:
+            logger.warning(
+                "Metric buffer exceeded 500 entries (%d) — trimming oldest",
+                len(self._metric_buffer),
+            )
+            self._metric_buffer = self._metric_buffer[-500:]
+        # Snapshot before insert so metrics arriving during the write are preserved
+        batch = list(self._metric_buffer)
+        try:
+            from storage.studio_db import insert_metrics_batch, update_run_progress
+
+            insert_metrics_batch(self.current_job_id, batch)
+            del self._metric_buffer[: len(batch)]
+            update_run_progress(
+                id = self.current_job_id,
+                step = self._progress.step,
+                loss = self._progress.loss
+                if (
+                    self._progress.loss is not None
+                    and math.isfinite(self._progress.loss)
+                )
+                else None,
+                duration_seconds = self._progress.elapsed_seconds,
+            )
+        except Exception:
+            # Leave buffer intact for retry on next flush
+            logger.warning("Failed to flush metrics to DB", exc_info = True)

    @staticmethod
    def _read_queue(q: Any, timeout_sec: float) -> Optional[dict]:
@ -561,11 +861,13 @@ class TrainingBackend:
            if progress.error:
                title = f"Error: {progress.error}"
            elif progress.is_completed:
-                title = f"Training completed! Final loss: {progress.loss:.4f}"
+                loss_str = f"{progress.loss:.4f}" if progress.loss is not None else "--"
+                title = f"Training completed! Final loss: {loss_str}"
            elif progress.status_message:
                title = progress.status_message
            elif progress.step > 0:
-                title = f"Epoch: {progress.epoch} | Step: {progress.step}/{progress.total_steps} | Loss: {progress.loss:.4f}"
+                loss_str = f"{progress.loss:.4f}" if progress.loss is not None else "--"
+                title = f"Epoch: {progress.epoch} | Step: {progress.step}/{progress.total_steps} | Loss: {loss_str}"
            else:
                title = "Training Loss"
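
The metric flush in the training.py hunks above snapshots the buffer, writes the snapshot, and then deletes only the written prefix, so a failed write leaves everything queued for retry. A minimal sketch of that pattern in isolation follows. `MetricBuffer`, `flush`, and the `write_batch` callback are hypothetical names standing in for `_flush_metrics_to_db` and `insert_metrics_batch`; they are not the project's API.

```python
from typing import Callable


class MetricBuffer:
    """Hypothetical stand-in for the backend's metric buffer."""

    MAX_BUFFERED = 500  # same cap as the diff above

    def __init__(self, write_batch: Callable[[list[dict]], None]) -> None:
        self._items: list[dict] = []
        self._write_batch = write_batch  # assumed DB writer; may raise

    def add(self, metric: dict) -> None:
        self._items.append(metric)

    def flush(self) -> bool:
        """Write buffered metrics; keep them for retry if the write fails."""
        if not self._items:
            return True
        if len(self._items) > self.MAX_BUFFERED:
            # Drop the oldest entries so memory stays bounded.
            self._items = self._items[-self.MAX_BUFFERED:]
        batch = list(self._items)  # snapshot before the write
        try:
            self._write_batch(batch)
        except Exception:
            return False  # buffer left intact for the next flush
        # Remove only what was written; later arrivals stay queued.
        del self._items[: len(batch)]
        return True
```

Deleting by prefix length rather than clearing the list is what preserves metrics appended while the write is in flight, assuming new entries are only ever appended at the end.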

--- a/studio/backend/core/training/worker.py
+++ b/studio/backend/core/training/worker.py
@ -16,47 +16,315 @@ from __future__ import annotations
 import structlog
 from loggers import get_logger
 import os
+import shutil
 import sys
 import time
 import traceback
+import subprocess as _sp
 from pathlib import Path
-from typing import Any
+from typing import Any, Callable

 logger = get_logger(__name__)
+from utils.hardware import apply_gpu_ids
+from utils.wheel_utils import (
+    direct_wheel_url,
+    flash_attn_wheel_url,
+    install_wheel,
+    probe_torch_wheel_env,
+    url_exists,
+)
+
+
+_CAUSAL_CONV1D_RELEASE_TAG = "v1.6.1.post4"
+_CAUSAL_CONV1D_PACKAGE_VERSION = "1.6.1"
+_MAMBA_SSM_RELEASE_TAG = "v2.3.1"
+_MAMBA_SSM_PACKAGE_VERSION = "2.3.1"
+_FLASH_ATTN_RUNTIME_MIN_SEQ_LEN = 32768
+_FLASH_ATTN_SKIP_ENV = "UNSLOTH_STUDIO_SKIP_FLASHATTN_INSTALL"
+
+
+def _model_wants_causal_conv1d(model_name: str) -> bool:
+    name = model_name.lower()
+    return any(
+        key in name
+        for key in (
+            "qwen3.5",
+            "qwen3_5",
+            "qwen3-next",
+            "qwen3_next",
+            "nemotron_h",
+            "nemotron-h",
+            "nemotron-3-nano",
+            "falcon_h1",
+            "falcon-h1",
+            "granite-4.0-h",
+            "granitemoehybrid",
+            "lfm2",
+        )
+    )
+
+
+def _install_package_wheel_first(
+    *,
+    event_queue: Any,
+    import_name: str,
+    display_name: str,
+    pypi_name: str,
+    pypi_version: str | None = None,
+    filename_prefix: str | None = None,
+    release_tag: str | None = None,
+    release_base_url: str | None = None,
+    wheel_url_builder: Callable[[dict[str, str] | None], str | None] | None = None,
+    pypi_spec: str | None = None,
+    pypi_status_message: str | None = None,
+) -> bool:
+    try:
+        __import__(import_name)
+        logger.info("%s already installed", display_name)
+        return True
+    except ImportError:
+        pass
+
+    env = probe_torch_wheel_env(timeout = 30)
+    if wheel_url_builder is not None:
+        wheel_url = wheel_url_builder(env)
+    else:
+        wheel_url = direct_wheel_url(
+            filename_prefix = filename_prefix,
+            package_version = pypi_version,
+            release_tag = release_tag,
+            release_base_url = release_base_url,
+            env = env,
+        )
+
+    if wheel_url is None:
+        logger.info("No compatible %s wheel candidate", display_name)
+    elif url_exists(wheel_url):
+        _send_status(event_queue, f"Installing prebuilt {display_name} wheel...")
+        for installer, result in install_wheel(
+            wheel_url,
+            python_executable = sys.executable,
+            use_uv = bool(shutil.which("uv")),
+            run = _sp.run,
+        ):
+            if result.returncode == 0:
+                logger.info("Installed prebuilt %s wheel successfully", display_name)
+                return True
+            logger.warning(
+                "%s failed to install %s wheel:\n%s",
+                installer,
+                display_name,
+                result.stdout,
+            )
+    else:
+        logger.info("No published %s wheel found: %s", display_name, wheel_url)
+
+    is_hip = env and env.get("hip_version")
+    if is_hip and not shutil.which("hipcc"):
+        logger.error(
+            "%s requires hipcc for source compilation on ROCm. "
+            "Install the ROCm HIP SDK: https://rocm.docs.amd.com",
+            display_name,
+        )
+        _send_status(
+            event_queue,
+            f"{display_name}: hipcc not found (ROCm HIP SDK required)",
+        )
+        return False
+
+    if pypi_spec is None:
+        pypi_spec = f"{pypi_name}=={pypi_version}"
+
+    if pypi_status_message is None:
+        if is_hip:
+            pypi_status_message = (
+                f"Compiling {display_name} from source for ROCm "
+                "(this may take several minutes)..."
+            )
+        else:
+            pypi_status_message = f"Installing {display_name} from PyPI..."
+
+    _send_status(event_queue, pypi_status_message)
+
+    # Prefer uv for faster dependency resolution when available
+    plain_pypi_install = pypi_version is None
+    if plain_pypi_install:
+        if shutil.which("uv"):
+            pypi_cmd = [
+                "uv",
+                "pip",
+                "install",
+                "--python",
+                sys.executable,
+                pypi_spec,
+            ]
+        else:
+            pypi_cmd = [sys.executable, "-m", "pip", "install", pypi_spec]
+    else:
+        if shutil.which("uv"):
+            pypi_cmd = [
+                "uv",
+                "pip",
+                "install",
+                "--python",
+                sys.executable,
+                "--no-build-isolation",
+                "--no-deps",
+            ]
+            # Avoid stale cache artifacts from partial HIP source builds
+            if is_hip:
+                pypi_cmd.append("--no-cache")
+            pypi_cmd.append(pypi_spec)
+        else:
+            pypi_cmd = [
+                sys.executable,
+                "-m",
+                "pip",
+                "install",
+                "--no-build-isolation",
+                "--no-deps",
+                "--no-cache-dir",
+                pypi_spec,
+            ]
+
+    # Source compilation on ROCm can take 10-30 minutes, so HIP builds get
+    # a generous 30-minute timeout. Non-HIP installs keep the pre-existing
+    # "no timeout" behaviour so unrelated slow installs (e.g. a
+    # causal-conv1d source build on Linux aarch64 or an unsupported
+    # torch/CUDA combination) are not aborted partway through.
+    _run_kwargs: dict[str, Any] = {
+        "stdout": _sp.PIPE,
+        "stderr": _sp.STDOUT,
+        "text": True,
+    }
+    if is_hip:
+        _run_kwargs["timeout"] = 1800
+
+    try:
+        result = _sp.run(pypi_cmd, **_run_kwargs)
+    except _sp.TimeoutExpired:
+        logger.error(
+            "%s installation timed out after %ds",
+            display_name,
+            _run_kwargs.get("timeout"),
+        )
+        _send_status(
+            event_queue,
+            f"{display_name} installation timed out after "
+            f"{_run_kwargs.get('timeout')}s",
+        )
+        return False
+
+    if result.returncode != 0:
+        if is_hip:
+            # Surface a clear error for ROCm source build failures
+            error_lines = (result.stdout or "").strip().splitlines()
+            snippet = "\n".join(error_lines[-5:]) if error_lines else "(no output)"
+            logger.error(
+                "Failed to compile %s for ROCm:\n%s",
+                display_name,
+                result.stdout,
+            )
+            _send_status(
+                event_queue,
+                f"Failed to compile {display_name} for ROCm. "
+                "Check that hipcc and ROCm development headers are installed.\n"
+                f"{snippet}",
+            )
+        else:
+            logger.error(
+                "Failed to install %s from PyPI:\n%s",
+                display_name,
+                result.stdout,
+            )
+        return False
+
+    if is_hip:
+        logger.info("Compiled and installed %s from source for ROCm", display_name)
+    else:
+        logger.info("Installed %s from PyPI", display_name)
+    return True
+
+
+def _ensure_causal_conv1d_fast_path(event_queue: Any, model_name: str) -> None:
+    if not _model_wants_causal_conv1d(model_name):
+        return
+
+    _install_package_wheel_first(
+        event_queue = event_queue,
+        import_name = "causal_conv1d",
+        display_name = "causal-conv1d",
+        pypi_name = "causal-conv1d",
+        pypi_version = _CAUSAL_CONV1D_PACKAGE_VERSION,
+        filename_prefix = "causal_conv1d",
+        release_tag = _CAUSAL_CONV1D_RELEASE_TAG,
+        release_base_url = "https://github.com/Dao-AILab/causal-conv1d/releases/download",
+    )
+
+
+_SSM_MODEL_SUBSTRINGS = (
+    "nemotron_h",
+    "nemotron-h",
+    "nemotron-3-nano",
+    "falcon_h1",
+    "falcon-h1",
+    "granite-4.0-h",
+    "granitemoehybrid",
+)
+
+
+def _ensure_mamba_ssm(event_queue: Any, model_name: str) -> None:
+    if not any(sub in model_name.lower() for sub in _SSM_MODEL_SUBSTRINGS):
+        return
+
+    logger.info("SSM model detected; setting up mamba-ssm after causal-conv1d")
+    _install_package_wheel_first(
+        event_queue = event_queue,
+        import_name = "mamba_ssm",
+        display_name = "mamba-ssm",
+        pypi_name = "mamba-ssm",
+        pypi_version = _MAMBA_SSM_PACKAGE_VERSION,
+        filename_prefix = "mamba_ssm",
+        release_tag = _MAMBA_SSM_RELEASE_TAG,
+        release_base_url = "https://github.com/state-spaces/mamba/releases/download",
+    )
+
+
+def _should_try_runtime_flash_attn_install(max_seq_length: int) -> bool:
+    if os.getenv(_FLASH_ATTN_SKIP_ENV) == "1":
+        return False
+    if max_seq_length < _FLASH_ATTN_RUNTIME_MIN_SEQ_LEN:
+        return False
+    return sys.platform.startswith("linux")
+
+
+def _ensure_flash_attn_for_long_context(event_queue: Any, max_seq_length: int) -> None:
+    if not _should_try_runtime_flash_attn_install(max_seq_length):
+        return
+
+    installed = _install_package_wheel_first(
+        event_queue = event_queue,
+        import_name = "flash_attn",
+        display_name = "flash-attn",
+        pypi_name = "flash-attn",
+        wheel_url_builder = flash_attn_wheel_url,
+        pypi_spec = "flash-attn",
+        pypi_status_message = "Installing flash-attn from PyPI for long-context training...",
+    )
+    if not installed:
+        _send_status(event_queue, "Continuing without flash-attn")


 def _activate_transformers_version(model_name: str) -> None:
-    """Activate the correct transformers version BEFORE any ML imports.
-
-    If the model needs transformers 5.x, prepend the pre-installed .venv_t5/
-    directory to sys.path. Otherwise do nothing (default 4.57.x in .venv/).
-    """
+    """Activate the correct transformers version BEFORE any ML imports."""
    # Ensure backend is on path for utils imports
    backend_path = str(Path(__file__).resolve().parent.parent.parent)
    if backend_path not in sys.path:
        sys.path.insert(0, backend_path)

-    from utils.transformers_version import (
-        needs_transformers_5,
-        _resolve_base_model,
-        _ensure_venv_t5_exists,
-        _VENV_T5_DIR,
-    )
+    from utils.transformers_version import activate_transformers_for_subprocess

-    resolved = _resolve_base_model(model_name)
-    if needs_transformers_5(resolved):
-        if not _ensure_venv_t5_exists():
-            raise RuntimeError(
-                f"Cannot activate transformers 5.x: .venv_t5 missing at {_VENV_T5_DIR}"
-            )
-        if _VENV_T5_DIR not in sys.path:
-            sys.path.insert(0, _VENV_T5_DIR)
-        logger.info("Activated transformers 5.x from %s", _VENV_T5_DIR)
-        # Propagate to child subprocesses (e.g. GGUF converter)
-        _pp = os.environ.get("PYTHONPATH", "")
-        os.environ["PYTHONPATH"] = _VENV_T5_DIR + (os.pathsep + _pp if _pp else "")
-    else:
-        logger.info("Using default transformers (4.57.x) for %s", model_name)
+    activate_transformers_for_subprocess(model_name)


 def run_training_process(
@ -88,6 +356,8 @@ def run_training_process(
        env = os.getenv("ENVIRONMENT_TYPE", "production"),
    )

+    apply_gpu_ids(config.get("resolved_gpu_ids"))
+
    model_name = config["model_name"]

    # ── 1. Activate correct transformers version BEFORE any ML imports ──
@ -104,62 +374,47 @@ def run_training_process(
        )
        return

-    # ── 1a. Auto-enable trust_remote_code for unsloth/* transformers 5.x models ──
-    # Some newer architectures (e.g. NemotronH) have config parsing bugs in
-    # transformers that require trust_remote_code=True as a workaround.
-    # Only auto-enable for unsloth/* prefixed models (trusted source).
-    from utils.transformers_version import needs_transformers_5
-
+    # ── 1a. Auto-enable trust_remote_code for NemotronH/Nano models ──
+    # NemotronH has config parsing bugs in transformers that require
+    # trust_remote_code=True as a workaround. Other transformers 5.x models
+    # (Qwen3.5, Gemma 4, etc.) are native and do NOT need it — enabling it
+    # bypasses the compiler (disabling fused CE).
+    # NOTE: Must NOT match Llama-Nemotron (standard Llama architecture).
+    _NEMOTRON_TRUST_SUBSTRINGS = ("nemotron_h", "nemotron-h", "nemotron-3-nano")
+    _lowered = model_name.lower()
    if (
-        needs_transformers_5(model_name)
-        and model_name.lower().startswith("unsloth/")
+        any(sub in _lowered for sub in _NEMOTRON_TRUST_SUBSTRINGS)
+        and (_lowered.startswith("unsloth/") or _lowered.startswith("nvidia/"))
        and not config.get("trust_remote_code", False)
    ):
        config["trust_remote_code"] = True
        logger.info(
-            "Auto-enabled trust_remote_code for unsloth/* transformers 5.x model: %s",
+            "Auto-enabled trust_remote_code for Nemotron model: %s",
            model_name,
        )

-    # ── 1b. Auto-install mamba-ssm for SSM/hybrid models (NemotronH, Falcon-H1) ──
-    _SSM_MODEL_SUBSTRINGS = ("nemotron_h", "nemotron-3-nano", "falcon_h1", "falcon-h1")
-    if any(sub in model_name.lower() for sub in _SSM_MODEL_SUBSTRINGS):
-        try:
-            import mamba_ssm  # noqa: F401
-
-            logger.info("mamba-ssm already installed")
-        except ImportError:
-            logger.info(
-                "SSM model detected — installing mamba-ssm and causal-conv1d (this may take several minutes)..."
-            )
-            _send_status(
-                event_queue, "Installing mamba-ssm (first time only, ~7 min)..."
-            )
-            import subprocess as _sp
-
-            # --no-build-isolation: compile against current torch (no version conflicts)
-            # --no-deps: don't pull in torch/transformers/triton (already installed)
-            for _pkg in ["causal_conv1d", "mamba_ssm"]:
-                _r = _sp.run(
-                    [
-                        sys.executable,
-                        "-m",
-                        "pip",
-                        "install",
-                        "--no-build-isolation",
-                        "--no-deps",
-                        "--no-cache-dir",
-                        _pkg,
-                    ],
-                    stdout = _sp.PIPE,
-                    stderr = _sp.STDOUT,
-                    text = True,
-                )
-                if _r.returncode != 0:
-                    logger.error("Failed to install %s:\n%s", _pkg, _r.stdout)
-                else:
-                    logger.info("Installed %s successfully", _pkg)
-            logger.info("mamba-ssm installation complete")
+    # ── 1b. Set up causal-conv1d, then mamba-ssm if needed, then flash-attn for long context ──
+    try:
+        _ensure_causal_conv1d_fast_path(event_queue, model_name)
+        _ensure_mamba_ssm(event_queue, model_name)
+        _ensure_flash_attn_for_long_context(
+            event_queue,
+            int(config.get("max_seq_length", 2048)),
+        )
+    except Exception as exc:
+        event_queue.put(
+            {
+                "type": "error",
+                "error": (
+                    f"Please choose another model to train, since "
+                    f"causal-conv1d / mamba-ssm failed to install "
+                    f"with error: {exc}"
+                ),
+                "stack": traceback.format_exc(limit = 20),
+                "ts": time.time(),
+            }
+        )
+        return

    # ── 1c. Set fork start method so dataset.map() can multiprocess ──
    # The parent launched us via spawn (clean process), but the compiled
@ -242,7 +497,7 @@ def run_training_process(

    # Wire up progress callback → event_queue
    def _on_progress(progress: TrainingProgress):
-        has_train_loss = progress.step >= 0 and progress.loss > 0
+        has_train_loss = progress.step > 0 and progress.loss is not None
        has_eval_loss = progress.eval_loss is not None
        if has_train_loss or has_eval_loss:
            event_queue.put(
@ -424,6 +679,7 @@ def run_training_process(
            is_dataset_image = config.get("is_dataset_image", False),
            is_dataset_audio = config.get("is_dataset_audio", False),
            trust_remote_code = config.get("trust_remote_code", False),
+            gpu_ids = config.get("resolved_gpu_ids"),
        )
        if not success or trainer.should_stop:
            if trainer.should_stop:
@ -533,7 +789,7 @@ def run_training_process(
            warmup_ratio = config.get("warmup_ratio"),
            max_steps = max_steps if max_steps and max_steps > 0 else 0,
            save_steps = save_steps if save_steps and save_steps > 0 else 0,
-            weight_decay = config.get("weight_decay", 0.01),
+            weight_decay = config.get("weight_decay", 0.001),
            random_seed = config.get("random_seed", 3407),
            packing = config.get("packing", False),
            train_on_completions = config.get("train_on_completions", False),
@ -879,7 +1135,7 @@ def _run_embedding_training(event_queue: Any, stop_queue: Any, config: dict) ->
        "lr_scheduler_type": config.get("lr_scheduler_type", "linear"),
        "batch_sampler": BatchSamplers.NO_DUPLICATES,
        "optim": config.get("optim", "adamw_8bit"),
-        "weight_decay": config.get("weight_decay", 0.01),
+        "weight_decay": config.get("weight_decay", 0.001),
        "seed": config.get("random_seed", 3407),
    }

@ -918,7 +1174,7 @@ def _run_embedding_training(event_queue: Any, stop_queue: Any, config: dict) ->
        def on_log(self, args, state, control, logs = None, **kwargs):
            if not logs:
                return
-            loss_value = logs.get("loss", logs.get("train_loss", 0.0))
+            loss_value = logs.get("loss", logs.get("train_loss", None))
            current_step = state.global_step

            elapsed = time.time() - training_start_time
@ -934,7 +1190,7 @@ def _run_embedding_training(event_queue: Any, stop_queue: Any, config: dict) ->
                    "step": current_step,
                    "epoch": round(state.epoch, 2) if state.epoch else 0,
                    "loss": loss_value,
-                    "learning_rate": logs.get("learning_rate", 0.0),
+                    "learning_rate": logs.get("learning_rate", None),
                    "total_steps": total_steps,
                    "elapsed_seconds": elapsed,
                    "eta_seconds": eta,
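
The PyPI fallback in `_install_package_wheel_first` above picks its command from three inputs: whether the package is version-pinned, whether `uv` is available, and whether the torch build is ROCm/HIP. The branching can be sketched as a pure function, which makes it easy to check by hand. `build_pypi_command` and its parameters are hypothetical names for illustration; the worker builds the list inline rather than through a helper like this.

```python
import sys


def build_pypi_command(
    spec: str,
    *,
    pinned: bool,
    is_hip: bool = False,
    have_uv: bool = False,
    python: str = sys.executable,
) -> list[str]:
    """Return the install command for the PyPI fallback (hypothetical helper)."""
    if have_uv:
        cmd = ["uv", "pip", "install", "--python", python]
    else:
        cmd = [python, "-m", "pip", "install"]
    if pinned:
        # Pinned packages compile against the already-installed torch.
        cmd += ["--no-build-isolation", "--no-deps"]
        if not have_uv:
            cmd.append("--no-cache-dir")
        elif is_hip:
            # Avoid stale artifacts from partial ROCm source builds.
            cmd.append("--no-cache")
    cmd.append(spec)
    return cmd
```

For example, `build_pypi_command("flash-attn", pinned=False, python="py")` yields a plain `pip install`, while a pinned spec adds the build-isolation and dependency flags seen in the diff.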
--- a/studio/backend/main.py
+++ b/studio/backend/main.py
@ -23,9 +23,23 @@ if _backend_dir not in sys.path:
 # See: https://github.com/python/cpython/issues/102396
 import _platform_compat  # noqa: F401

+import mimetypes
 import shutil
 import warnings
 from contextlib import asynccontextmanager
+from importlib.metadata import PackageNotFoundError, version as package_version
+
+# Fix broken Windows registry MIME types.  Some Windows installs map .js to
+# "text/plain" in the registry (HKCR\.js\Content Type).  Python's mimetypes
+# module reads from the registry, and FastAPI/Starlette's StaticFiles uses
+# mimetypes.guess_type() to set Content-Type headers.  Browsers enforce strict
+# MIME checking for ES module scripts (<script type="module">) and will refuse
+# to execute .js files served as text/plain — resulting in a blank page.
+# Calling add_type() *before* StaticFiles is instantiated ensures the correct
+# types are used regardless of the OS registry.
+if sys.platform == "win32":
+    mimetypes.add_type("application/javascript", ".js")
+    mimetypes.add_type("text/css", ".css")

 # Suppress annoying dependency warnings in production
 if os.getenv("ENVIRONMENT_TYPE", "production") == "production":
@ -34,7 +48,7 @@ if os.getenv("ENVIRONMENT_TYPE", "production") == "production":
    # warnings.filterwarnings("ignore", category=DeprecationWarning)
    # warnings.filterwarnings("ignore", module="triton.*")

-from fastapi import FastAPI
+from fastapi import Depends, FastAPI, Request
 from fastapi.middleware.cors import CORSMiddleware
 from fastapi.staticfiles import StaticFiles
 from fastapi.responses import FileResponse, HTMLResponse, Response
@ -49,15 +63,43 @@ from routes import (
    export_router,
    inference_router,
    models_router,
+    training_history_router,
    training_router,
 )
 from auth import storage
-from utils.hardware import detect_hardware, get_device, DeviceType
+from auth.authentication import get_current_subject
+from utils.hardware import (
+    detect_hardware,
+    get_device,
+    DeviceType,
+    get_backend_visible_gpu_info,
+)
 import utils.hardware.hardware as _hw_module

 from utils.cache_cleanup import clear_unsloth_compiled_cache


+def get_unsloth_version() -> str:
+    try:
+        return package_version("unsloth")
+    except PackageNotFoundError:
+        pass
+
+    version_file = (
+        _Path(__file__).resolve().parents[2] / "unsloth" / "models" / "_utils.py"
+    )
+    try:
+        for line in version_file.read_text(encoding = "utf-8").splitlines():
+            if line.startswith("__version__ = "):
+                return line.split("=", 1)[1].strip().strip('"').strip("'")
+    except OSError:
+        pass
+    return "dev"
+
+
+UNSLOTH_VERSION = get_unsloth_version()
+
+
@asynccontextmanager
 async def lifespan(app: FastAPI):
    """Startup: detect hardware, seed default admin if needed. Shutdown: clean up compiled cache."""
@ -73,6 +115,17 @@ async def lifespan(app: FastAPI):
    # Detect hardware first — sets DEVICE global used everywhere
    detect_hardware()

+    from storage.studio_db import cleanup_orphaned_runs
+
+    try:
+        cleanup_orphaned_runs()
+    except Exception as exc:
+        import structlog
+
+        structlog.get_logger(__name__).warning(
+            "cleanup_orphaned_runs failed at startup: %s", exc
+        )
+
    # Pre-cache the helper GGUF model for LLM-assisted dataset detection.
    # Runs in a background thread so it doesn't block server startup.
    import threading
@ -90,13 +143,13 @@ async def lifespan(app: FastAPI):
    if storage.ensure_default_admin():
        bootstrap_pw = storage.get_bootstrap_password()
        app.state.bootstrap_password = bootstrap_pw
+
+        bootstrap_path = storage.DB_PATH.parent / ".bootstrap_password"
        print("\n" + "=" * 60)
        print("DEFAULT ADMIN ACCOUNT CREATED")
-        print(
-            "Sign in with the seeded credentials and change the password immediately:\n"
-        )
        print(f"    username: {storage.DEFAULT_ADMIN_USERNAME}")
-        print(f"    password: {bootstrap_pw}\n")
+        print(f"    password saved to: {bootstrap_path}")
+        print("    Open the Studio UI to sign in and change it.")
        print("=" * 60 + "\n")
    else:
        app.state.bootstrap_password = storage.get_bootstrap_password()
@ -109,7 +162,7 @@ async def lifespan(app: FastAPI):
 # Create FastAPI app
 app = FastAPI(
    title = "Unsloth UI Backend",
-    version = "1.0.0",
+    version = UNSLOTH_VERSION,
    description = "Backend API for Unsloth UI - Training and Model Management",
    lifespan = lifespan,
 )
@ -149,6 +202,9 @@ app.include_router(inference_router, prefix = "/v1", tags = ["openai-compat"])
 app.include_router(datasets_router, prefix = "/api/datasets", tags = ["datasets"])
 app.include_router(data_recipe_router, prefix = "/api/data-recipe", tags = ["data-recipe"])
 app.include_router(export_router, prefix = "/api/export", tags = ["export"])
+app.include_router(
+    training_history_router, prefix = "/api/train", tags = ["training-history"]
+)


 # ============ Health and System Endpoints ============
@ -164,78 +220,53 @@ async def health_check():
        "status": "healthy",
        "timestamp": datetime.now().isoformat(),
        "service": "Unsloth UI Backend",
+        "version": UNSLOTH_VERSION,
        "device_type": device_type,
        "chat_only": _hw_module.CHAT_ONLY,
    }


+@app.post("/api/shutdown")
+async def shutdown_server(
+    request: Request,
+    current_subject: str = Depends(get_current_subject),
+):
+    """Gracefully shut down the Unsloth Studio server.
+
+    Called by the frontend quit dialog so users can stop the server from the UI
+    without needing to use the CLI or kill the process manually.
+    """
+    import asyncio
+
+    async def _delayed_shutdown():
+        await asyncio.sleep(0.2)  # Let the HTTP response return first
+        trigger = getattr(request.app.state, "trigger_shutdown", None)
+        if trigger is not None:
+            trigger()
+        else:
+            # Fallback when not launched via run_server() (e.g. direct uvicorn)
+            import signal
+            import os
+
+            os.kill(os.getpid(), signal.SIGTERM)
+
+    request.app.state._shutdown_task = asyncio.create_task(_delayed_shutdown())
+    return {"status": "shutting_down"}
+
+
@app.get("/api/system")
 async def get_system_info():
    """Get system information"""
    import platform
-    import subprocess
    import psutil
-    from utils.hardware import get_device, get_gpu_memory_info, DeviceType
+    from utils.hardware import get_device
+    from utils.hardware.hardware import _backend_label

-    # GPU Info — query nvidia-smi for physical GPUs, filtered by
-    # CUDA_VISIBLE_DEVICES when set (the frontend uses this for GGUF
-    # fit estimation and llama-server respects CVD too).
-    import os
-
-    gpu_info: dict = {"available": False, "devices": []}
-
-    device = get_device()
-    if device == DeviceType.CUDA:
-        # Parse CUDA_VISIBLE_DEVICES allowlist
-        allowed_indices = None
-        cvd = os.environ.get("CUDA_VISIBLE_DEVICES")
-        if cvd is not None and cvd.strip():
-            try:
-                allowed_indices = set(int(x.strip()) for x in cvd.split(","))
-            except ValueError:
-                pass  # Non-numeric (e.g. GPU-uuid), show all
-
-        try:
-            result = subprocess.run(
-                [
-                    "nvidia-smi",
-                    "--query-gpu=index,name,memory.total",
-                    "--format=csv,noheader,nounits",
-                ],
-                capture_output = True,
-                text = True,
-                timeout = 10,
-            )
-            if result.returncode == 0:
-                for line in result.stdout.strip().splitlines():
-                    parts = [p.strip() for p in line.split(",")]
-                    if len(parts) == 3:
-                        idx = int(parts[0])
-                        if allowed_indices is not None and idx not in allowed_indices:
-                            continue
-                        gpu_info["devices"].append(
-                            {
-                                "index": idx,
-                                "name": parts[1],
-                                "memory_total_gb": round(int(parts[2]) / 1024, 2),
-                            }
-                        )
-                gpu_info["available"] = len(gpu_info["devices"]) > 0
-        except Exception:
-            pass
-
-    # Fallback to torch-based single-GPU detection
-    if not gpu_info["available"]:
-        mem_info = get_gpu_memory_info()
-        if mem_info.get("available"):
-            gpu_info["available"] = True
-            gpu_info["devices"].append(
-                {
-                    "index": mem_info.get("device", 0),
-                    "name": mem_info.get("device_name", "Unknown"),
-                    "memory_total_gb": round(mem_info.get("total_gb", 0), 2),
-                }
-            )
+    visibility_info = get_backend_visible_gpu_info()
+    gpu_info = {
+        "available": visibility_info["available"],
+        "devices": visibility_info["devices"],
+    }

    # CPU & Memory
    memory = psutil.virtual_memory()
@@ -243,7 +274,10 @@ async def get_system_info():
    return {
        "platform": platform.platform(),
        "python_version": platform.python_version(),
-        "device_backend": get_device().value,
+        # Use the centralized _backend_label helper so the /api/system
+        # endpoint reports "rocm" on AMD hosts instead of "cuda", matching
+        # the /api/system/hardware and /api/system/gpu-visibility endpoints.
+        "device_backend": _backend_label(get_device()),
        "cpu_count": psutil.cpu_count(),
        "memory": {
            "total_gb": round(memory.total / 1e9, 2),
@@ -254,6 +288,13 @@ async def get_system_info():
    }


+@app.get("/api/system/gpu-visibility")
+async def get_gpu_visibility(
+    current_subject: str = Depends(get_current_subject),
+):
+    return get_backend_visible_gpu_info()
+
+
 @app.get("/api/system/hardware")
 async def get_hardware_info():
    """Return GPU name, total VRAM, and key ML package versions."""
@@ -335,7 +376,7 @@ def setup_frontend(app: FastAPI, build_path: Path):

    @app.get("/{full_path:path}")
    async def serve_frontend(full_path: str):
-        if full_path.startswith("api"):
+        if full_path in {"api", "v1"} or full_path.startswith(("api/", "v1/")):
            return {"error": "API endpoint not found"}

        file_path = (build_path / full_path).resolve()
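
The `serve_frontend` change above replaces a bare string-prefix test with a segment-aware one. A minimal sketch of the difference, assuming nothing beyond the two conditions shown in the hunk (the function names `is_api_path_old` and `is_api_path` are hypothetical and do not appear in the source):

```python
def is_api_path_old(full_path: str) -> bool:
    """Previous check: any path that merely starts with the letters 'api'."""
    return full_path.startswith("api")


def is_api_path(full_path: str) -> bool:
    """Updated check: only the 'api' and 'v1' path segments themselves,
    or anything nested beneath them."""
    return full_path in {"api", "v1"} or full_path.startswith(("api/", "v1/"))


if __name__ == "__main__":
    # Print old vs. new classification for a few representative paths.
    for path in ("api", "api/system", "v1/responses", "apidocs/index.html", "v1beta"):
        print(f"{path!r}: old={is_api_path_old(path)} new={is_api_path(path)}")
```

With the old check, a frontend asset such as `apidocs/index.html` would be answered with the JSON "API endpoint not found" body, while an unknown `/v1/...` route would fall through to the frontend's file lookup; the new check avoids both.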
--- a/studio/backend/models/__init__.py
+++ b/studio/backend/models/__init__.py
@@ -10,6 +10,11 @@ from .training import (
    TrainingJobResponse,
    TrainingStatus,
    TrainingProgress,
+    TrainingRunSummary,
+    TrainingRunListResponse,
+    TrainingRunMetrics,
+    TrainingRunDetailResponse,
+    TrainingRunDeleteResponse,
 )
 from .models import (
    CheckpointInfo,
@@ -71,6 +76,11 @@ __all__ = [
    "TrainingJobResponse",
    "TrainingStatus",
    "TrainingProgress",
+    "TrainingRunSummary",
+    "TrainingRunListResponse",
+    "TrainingRunMetrics",
+    "TrainingRunDetailResponse",
+    "TrainingRunDeleteResponse",
    # Model management schemas
    "ModelDetails",
    "LocalModelInfo",
--- a/studio/backend/models/auth.py
+++ b/studio/backend/models/auth.py
@@ -5,6 +5,8 @@
 Pydantic schemas for Authentication API
 """

+from typing import Optional
+
 from pydantic import BaseModel, Field


@@ -45,3 +47,44 @@ class ChangePasswordRequest(BaseModel):
    new_password: str = Field(
        ..., min_length = 8, description = "Replacement password (minimum 8 characters)"
    )
+
+
+# ---------------------------------------------------------------------------
+# API key schemas
+# ---------------------------------------------------------------------------
+
+
+class CreateApiKeyRequest(BaseModel):
+    """Request body to create a new API key."""
+
+    name: str = Field(..., description = "Human-readable label for this key")
+    expires_in_days: Optional[int] = Field(
+        None, description = "Number of days until the key expires (None = never)"
+    )
+
+
+class ApiKeyResponse(BaseModel):
+    """Public representation of an API key (never contains the raw key)."""
+
+    id: int
+    name: str
+    key_prefix: str = Field(
+        ..., description = "First 8 characters after sk-unsloth- for display"
+    )
+    created_at: str
+    last_used_at: Optional[str] = None
+    expires_at: Optional[str] = None
+    is_active: bool
+
+
+class CreateApiKeyResponse(BaseModel):
+    """Returned once when a key is created -- ``key`` is never shown again."""
+
+    key: str = Field(..., description = "Full API key (shown once)")
+    api_key: ApiKeyResponse
+
+
+class ApiKeyListResponse(BaseModel):
+    """List of API keys for the authenticated user."""
+
+    api_keys: list[ApiKeyResponse]
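
The API key schemas above separate the raw key (returned once in `CreateApiKeyResponse.key`) from the fields that are safe to display later (`key_prefix`, `expires_at`). The sketch below shows one way those fields could be derived. It is an illustration only: the helper `make_api_key`, the use of `secrets.token_hex`, and the ISO-8601 timestamp format are assumptions, not the Studio implementation; only the `sk-unsloth-` scheme, the 8-character prefix, and the "None = never" expiry rule come from the field descriptions.

```python
import secrets
from datetime import datetime, timedelta, timezone
from typing import Optional

KEY_SCHEME = "sk-unsloth-"  # scheme named in the key_prefix field description


def make_api_key(
    expires_in_days: Optional[int] = None,
    now: Optional[datetime] = None,
) -> dict:
    """Hypothetical helper: mint a key plus the display fields derived from it."""
    now = now or datetime.now(timezone.utc)
    secret = secrets.token_hex(24)  # assumption: any opaque random string works
    expires_at = None  # None means the key never expires
    if expires_in_days is not None:
        expires_at = (now + timedelta(days=expires_in_days)).isoformat()
    return {
        "key": KEY_SCHEME + secret,  # full key, shown to the user exactly once
        "key_prefix": secret[:8],    # first 8 characters after sk-unsloth-
        "created_at": now.isoformat(),
        "expires_at": expires_at,
    }
```

A server following this pattern would typically persist only a hash of `key` alongside `key_prefix`, which is why `ApiKeyResponse` can be listed safely without ever containing the raw key.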
--- a/studio/backend/models/inference.py
+++ b/studio/backend/models/inference.py
@@ -11,7 +11,7 @@ import time
 import uuid
 from typing import Annotated, Any, Dict, Literal, Optional, List, Union

-from pydantic import BaseModel, Discriminator, Field, Tag
+from pydantic import BaseModel, Discriminator, Field, Tag, model_validator


 class LoadRequest(BaseModel):
@@ -22,7 +22,10 @@ class LoadRequest(BaseModel):
        None, description = "HuggingFace token for gated models"
    )
    max_seq_length: int = Field(
-        4096, ge = 128, le = 32768, description = "Maximum sequence length"
+        0,
+        ge = 0,
+        le = 1048576,
+        description = "Maximum sequence length (0 = model default for GGUF)",
    )
    load_in_4bit: bool = Field(True, description = "Load model in 4-bit quantization")
    is_lora: bool = Field(False, description = "Whether this is a LoRA adapter")
@@ -41,6 +44,14 @@ class LoadRequest(BaseModel):
        None,
        description = "KV cache data type for both K and V (e.g. 'f16', 'bf16', 'q8_0', 'q4_1', 'q5_1')",
    )
+    gpu_ids: Optional[List[int]] = Field(
+        None,
+        description = "Physical GPU indices to use, for example [0, 1]. Omit or pass [] to use automatic selection. Explicit gpu_ids are unsupported when the parent CUDA_VISIBLE_DEVICES uses UUID/MIG entries. Not supported for GGUF models.",
+    )
+    speculative_type: Optional[str] = Field(
+        None,
+        description = "Speculative decoding mode for GGUF models (e.g. 'ngram-simple', 'ngram-mod'). Ignored for non-GGUF and vision models.",
+    )


 class UnloadRequest(BaseModel):
@@ -83,6 +94,10 @@ class ValidateModelResponse(BaseModel):
    is_gguf: bool = Field(False, description = "Whether this is a GGUF model (llama.cpp)")
    is_lora: bool = Field(False, description = "Whether this is a LoRA adapter")
    is_vision: bool = Field(False, description = "Whether this is a vision-capable model")
+    requires_trust_remote_code: bool = Field(
+        False,
+        description = "Whether the model defaults require trust_remote_code to be enabled for loading.",
+    )


 class GenerateRequest(BaseModel):
@@ -126,13 +141,28 @@ class LoadResponse(BaseModel):
    inference: dict = Field(
        ..., description = "Inference parameters (temperature, top_p, top_k, min_p)"
    )
+    requires_trust_remote_code: bool = Field(
+        False,
+        description = "Whether the model defaults require trust_remote_code to be enabled for loading.",
+    )
    context_length: Optional[int] = Field(
        None, description = "Model's native context length (from GGUF metadata)"
    )
+    max_context_length: Optional[int] = Field(
+        None, description = "Maximum context length currently available on this hardware"
+    )
+    native_context_length: Optional[int] = Field(
+        None,
+        description = "Model's native context length from GGUF metadata (not capped by VRAM)",
+    )
    supports_reasoning: bool = Field(
        False,
        description = "Whether model supports thinking/reasoning mode (enable_thinking)",
    )
+    reasoning_always_on: bool = Field(
+        False,
+        description = "Whether reasoning is always on (hardcoded <think> tags, not toggleable)",
+    )
    supports_tools: bool = Field(
        False,
        description = "Whether model supports tool calling (web search, etc.)",
@@ -145,6 +175,10 @@ class LoadResponse(BaseModel):
        None,
        description = "Jinja2 chat template string (from GGUF metadata or tokenizer)",
    )
+    speculative_type: Optional[str] = Field(
+        None,
+        description = "Active speculative decoding mode (e.g. 'ngram-simple', 'ngram-mod'), or None if disabled",
+    )


 class UnloadResponse(BaseModel):
@@ -154,6 +188,39 @@ class UnloadResponse(BaseModel):
    model: str = Field(..., description = "Model identifier that was unloaded")


+class LoadProgressResponse(BaseModel):
+    """Progress of the active GGUF load, sampled on demand.
+
+    Used by the UI to show a real progress bar during the
+    post-download warmup window (mmap + CUDA upload), rather than a
+    generic "Starting model..." spinner that freezes for minutes on
+    large MoE models.
+    """
+
+    phase: Optional[str] = Field(
+        None,
+        description = (
+            "Load phase: 'mmap' (weights paging into RAM via mmap), "
+            "'ready' (llama-server reported healthy), or null when no "
+            "load is in flight."
+        ),
+    )
+    bytes_loaded: int = Field(
+        0,
+        description = (
+            "Bytes of the model already resident in the llama-server "
+            "process (VmRSS on Linux)."
+        ),
+    )
+    bytes_total: int = Field(
+        0,
+        description = "Total bytes across all GGUF shards for the active model.",
+    )
+    fraction: float = Field(
+        0.0, description = "bytes_loaded / bytes_total, clamped to 0..1."
+    )
+
+
 class InferenceStatusResponse(BaseModel):
    """Current inference backend status"""

@@ -187,15 +254,34 @@ class InferenceStatusResponse(BaseModel):
    inference: Optional[Dict[str, Any]] = Field(
        None, description = "Recommended inference parameters for the active model"
    )
+    requires_trust_remote_code: bool = Field(
+        False,
+        description = "Whether the active model requires trust_remote_code to be enabled for loading.",
+    )
    supports_reasoning: bool = Field(
        False, description = "Whether the active model supports reasoning/thinking mode"
    )
+    reasoning_always_on: bool = Field(
+        False, description = "Whether reasoning is always on (not toggleable)"
+    )
    supports_tools: bool = Field(
        False, description = "Whether the active model supports tool calling"
    )
    context_length: Optional[int] = Field(
        None, description = "Context length of the active model"
    )
+    max_context_length: Optional[int] = Field(
+        None,
+        description = "Maximum context length currently available for the active model",
+    )
+    native_context_length: Optional[int] = Field(
+        None,
+        description = "Model's native context length from GGUF metadata (not capped by VRAM)",
+    )
+    speculative_type: Optional[str] = Field(
+        None,
+        description = "Active speculative decoding mode (e.g. 'ngram-simple', 'ngram-mod'), or None if disabled",
+    )


 # =====================================================================
@@ -252,14 +338,68 @@ class ChatMessage(BaseModel):

    ``content`` may be a plain string (text-only) or a list of
    content parts for multimodal messages (OpenAI vision format).
+    Assistant messages that only contain tool calls may set ``content``
+    to ``None`` with ``tool_calls`` populated. ``role="tool"`` messages
+    carry the result of a client-executed tool call and require
+    ``tool_call_id`` per the OpenAI spec.
    """

-    role: Literal["system", "user", "assistant"] = Field(
+    role: Literal["system", "user", "assistant", "tool"] = Field(
        ..., description = "Message role"
    )
-    content: Union[str, list[ContentPart]] = Field(
-        ..., description = "Message content (string or multimodal parts)"
+    content: Optional[Union[str, list[ContentPart]]] = Field(
+        None, description = "Message content (string or multimodal parts)"
    )
+    tool_call_id: Optional[str] = Field(
+        None,
+        description = "OpenAI tool-result messages: id of the tool call this result belongs to.",
+    )
+    tool_calls: Optional[list[dict]] = Field(
+        None,
+        description = "OpenAI assistant messages: structured tool calls the model decided to make.",
+    )
+    name: Optional[str] = Field(
+        None,
+        description = "OpenAI tool-result messages: name of the tool whose result this is.",
+    )
+
+    @model_validator(mode = "after")
+    def _validate_role_shape(self) -> "ChatMessage":
+        # Enforce the per-role OpenAI spec shape at the request boundary.
+        # Without this, malformed messages (e.g. user entries with no
+        # content, tool_calls on a user/system role, role="tool" without
+        # tool_call_id) would be silently forwarded to llama-server via
+        # the passthrough path, surfacing as opaque upstream errors or
+        # broken tool-call reconciliation downstream.
+
+        # Tool-call metadata must appear only on the appropriate role.
+        if self.tool_calls is not None and self.role != "assistant":
+            raise ValueError('"tool_calls" is only valid on role="assistant" messages.')
+        if self.tool_call_id is not None and self.role != "tool":
+            raise ValueError('"tool_call_id" is only valid on role="tool" messages.')
+        if self.name is not None and self.role != "tool":
+            raise ValueError('"name" is only valid on role="tool" messages.')
+
+        # Per-role content requirements.
+        if self.role == "tool":
+            if not self.tool_call_id:
+                raise ValueError(
+                    'role="tool" messages require "tool_call_id" per the OpenAI spec.'
+                )
+            if not self.content:
+                raise ValueError('role="tool" messages require non-empty "content".')
+        elif self.role == "assistant":
+            # Assistant messages may omit content when tool_calls is set.
+            if not self.content and not self.tool_calls:
+                raise ValueError(
+                    'role="assistant" messages require either "content" or "tool_calls".'
+                )
+        else:  # "user" | "system"
+            if not self.content:
+                raise ValueError(
+                    f'role="{self.role}" messages require non-empty "content".'
+                )
+        return self


 class ChatCompletionRequest(BaseModel):
@@ -269,18 +409,49 @@ class ChatCompletionRequest(BaseModel):
    Extensions (non-OpenAI fields) are marked with 'x-unsloth'.
    """

+    # Accept unknown fields defensively so future OpenAI fields (seed,
+    # response_format, logprobs, frequency_penalty, etc.) don't get
+    # silently dropped by Pydantic before route code runs. Mirrors
+    # AnthropicMessagesRequest and ResponsesRequest.
+    model_config = {"extra": "allow"}
+
    model: str = Field(
        "default",
        description = "Model identifier (informational; the active model is used)",
    )
    messages: list[ChatMessage] = Field(..., description = "Conversation messages")
-    stream: bool = Field(True, description = "Whether to stream the response via SSE")
+    stream: bool = Field(
+        False,
+        description = (
+            "Whether to stream the response via SSE. Default matches OpenAI's "
+            "spec (`false`); opt into streaming by sending `stream: true`."
+        ),
+    )
    temperature: float = Field(0.6, ge = 0.0, le = 2.0)
    top_p: float = Field(0.95, ge = 0.0, le = 1.0)
    max_tokens: Optional[int] = Field(
        None, ge = 1, description = "Maximum tokens to generate (None = until EOS)"
    )
    presence_penalty: float = Field(0.0, ge = 0.0, le = 2.0, description = "Presence penalty")
+    stop: Optional[Union[str, list[str]]] = Field(
+        None,
+        description = "OpenAI stop sequences: a single string or list of strings at which generation halts.",
+    )
+    tools: Optional[list[dict]] = Field(
+        None,
+        description = (
+            "OpenAI function-tool definitions. When provided without `enable_tools=true`, "
+            "Studio forwards the tools to the backend so the model returns structured "
+            "tool_calls for the client to execute (standard OpenAI function calling)."
+        ),
+    )
+    tool_choice: Optional[Union[str, dict]] = Field(
+        None,
+        description = (
+            "OpenAI tool choice: 'auto' | 'required' | 'none' | "
+            "{'type': 'function', 'function': {'name': ...}}"
+        ),
+    )

    # ── Unsloth extensions (ignored by standard OpenAI clients) ──
    top_k: int = Field(20, ge = -1, le = 100, description = "[x-unsloth] Top-k sampling")
@@ -288,7 +459,7 @@ class ChatCompletionRequest(BaseModel):
        0.01, ge = 0.0, le = 1.0, description = "[x-unsloth] Min-p sampling threshold"
    )
    repetition_penalty: float = Field(
-        1.1, ge = 1.0, le = 2.0, description = "[x-unsloth] Repetition penalty"
+        1.0, ge = 1.0, le = 2.0, description = "[x-unsloth] Repetition penalty"
    )
    image_base64: Optional[str] = Field(
        None, description = "[x-unsloth] Base64-encoded image for vision models"
@@ -323,7 +494,7 @@ class ChatCompletionRequest(BaseModel):
        description = "[x-unsloth] Auto-detect and fix malformed tool calls from model output.",
    )
    max_tool_calls_per_message: Optional[int] = Field(
-        10,
+        25,
        ge = 0,
        description = "[x-unsloth] Maximum number of tool call iterations per message (0 = disabled, 9999 = unlimited).",
    )
@@ -403,3 +574,434 @@ class ChatCompletion(BaseModel):
    model: str = "default"
    choices: list[CompletionChoice]
    usage: CompletionUsage = Field(default_factory = CompletionUsage)
+
+
+# =====================================================================
+# OpenAI Responses API Models  (/v1/responses)
+# =====================================================================
+
+
+# ── Request models ──────────────────────────────────────────────
+
+
+class ResponsesInputTextPart(BaseModel):
+    """Text content part in a Responses API message (type=input_text)."""
+
+    type: Literal["input_text"]
+    text: str
+
+
+class ResponsesInputImagePart(BaseModel):
+    """Image content part in a Responses API message (type=input_image)."""
+
+    type: Literal["input_image"]
+    image_url: str = Field(..., description = "data:image/png;base64,... or https://...")
+    detail: Optional[Literal["auto", "low", "high"]] = "auto"
+
+
+class ResponsesOutputTextPart(BaseModel):
+    """Assistant ``output_text`` content part replayed on subsequent turns.
+
+    When a client (OpenAI Codex CLI, OpenAI Python SDK agents) loops on a
+    stateless Responses endpoint, prior assistant messages are round-tripped
+    as ``{"role":"assistant","content":[{"type":"output_text","text":...,
+    "annotations":[],"logprobs":[]}]}``. We preserve the text and ignore
+    the annotations/logprobs metadata when flattening into Chat Completions.
+    """
+
+    type: Literal["output_text"]
+    text: str
+    annotations: Optional[list] = None
+    logprobs: Optional[list] = None
+
+    model_config = {"extra": "allow"}
+
+
+class ResponsesUnknownContentPart(BaseModel):
+    """Catch-all for content-part types we don't model explicitly.
+
+    Keeps validation green when a client sends newer part types (e.g.
+    ``input_audio``, ``input_file``) we haven't mapped; these are silently
+    skipped during normalisation rather than rejected with a 422.
+    """
+
+    type: str
+
+    model_config = {"extra": "allow"}
+
+
+ResponsesContentPart = Union[
+    ResponsesInputTextPart,
+    ResponsesInputImagePart,
+    ResponsesOutputTextPart,
+    ResponsesUnknownContentPart,
+]
+
+
+class ResponsesInputMessage(BaseModel):
+    """A single message in the Responses API input array."""
+
+    type: Optional[Literal["message"]] = None
+    role: Literal["system", "user", "assistant", "developer"]
+    content: Union[str, list[ResponsesContentPart]]
+
+    # Codex (gpt-5.3-codex+) attaches a `phase` field ("commentary" |
+    # "final_answer") to assistant messages and requires clients to preserve
+    # it on subsequent turns. We accept and round-trip it; llama-server does
+    # not care about it.
+    model_config = {"extra": "allow"}
+
+
+class ResponsesFunctionCallInputItem(BaseModel):
+    """A prior assistant function_call being replayed in a multi-turn Responses input.
+
+    The Responses API represents tool calls as top-level input items (not
+    nested inside assistant messages), correlated across turns by ``call_id``.
+    """
+
+    type: Literal["function_call"]
+    id: Optional[str] = Field(
+        None, description = "Item id assigned by the server (e.g. fc_...)"
+    )
+    call_id: str = Field(
+        ...,
+        description = "Correlation id matching a function_call_output on the next turn.",
+    )
+    name: str
+    arguments: str = Field(
+        ..., description = "JSON string of the arguments the model produced."
+    )
+    status: Optional[Literal["in_progress", "completed", "incomplete"]] = None
+
+
+class ResponsesFunctionCallOutputInputItem(BaseModel):
+    """A tool result supplied by the client for a prior function_call.
+
+    Replaces Chat Completions' ``role="tool"`` message. Correlated to the
+    originating call by ``call_id``.
+    """
+
+    type: Literal["function_call_output"]
+    id: Optional[str] = None
+    call_id: str
+    output: Union[str, list] = Field(
+        ..., description = "String or content-array result of the tool call."
+    )
+    status: Optional[Literal["in_progress", "completed", "incomplete"]] = None
+
+
+class ResponsesUnknownInputItem(BaseModel):
+    """Catch-all for Responses input item types we don't model explicitly.
+
+    Covers ``reasoning`` items (replayed from prior o-series / gpt-5 turns)
+    and any future item types the client may send. These items are dropped
+    during normalisation — llama-server-backed GGUFs cannot consume them —
+    but keeping them in the request-model union stops unrelated turns from
+    failing validation with a 422.
+    """
+
+    type: str
+
+    model_config = {"extra": "allow"}
+
+
+def _responses_input_item_discriminator(v: Any) -> str:
+    """Route a Responses input item to the correct tagged variant.
+
+    Pydantic's default smart-union matching fails when one variant in the
+    union is tagged with a strict ``Literal`` (``function_call`` /
+    ``function_call_output``) and the incoming dict uses a different
+    ``type`` — the other variants' validation errors are hidden and the
+    outer ``Union[str, list[...]]`` reports a misleading "Input should be a
+    valid string" error. An explicit discriminator makes the routing
+    deterministic and lets us fall through to the catch-all.
+    """
+    if isinstance(v, dict):
+        t = v.get("type")
+        r = v.get("role")
+    else:
+        t = getattr(v, "type", None)
+        r = getattr(v, "role", None)
+    if t == "function_call":
+        return "function_call"
+    if t == "function_call_output":
+        return "function_call_output"
+    if r is not None or t == "message":
+        return "message"
+    return "unknown"
+
+
+ResponsesInputItem = Annotated[
+    Union[
+        Annotated[ResponsesInputMessage, Tag("message")],
+        Annotated[ResponsesFunctionCallInputItem, Tag("function_call")],
+        Annotated[ResponsesFunctionCallOutputInputItem, Tag("function_call_output")],
+        Annotated[ResponsesUnknownInputItem, Tag("unknown")],
+    ],
+    Discriminator(_responses_input_item_discriminator),
+]
+
+
+class ResponsesFunctionTool(BaseModel):
+    """Flat function-tool definition used by the Responses API request.
+
+    Unlike Chat Completions (which nests ``{"name": ..., "parameters": ...}``
+    inside a ``"function"`` key), the Responses API uses a flat shape with
+    ``type``, ``name``, ``description``, ``parameters``, and ``strict`` at the
+    top level of each tool entry.
+    """
+
+    type: Literal["function"]
+    name: str
+    description: Optional[str] = None
+    parameters: Optional[dict] = None
+    strict: Optional[bool] = None
+
+
+class ResponsesRequest(BaseModel):
+    """OpenAI Responses API request."""
+
+    model: str = Field("default", description = "Model identifier")
+    input: Union[str, list[ResponsesInputItem]] = Field(
+        default = [],
+        description = "Input text or list of messages / function_call / function_call_output items",
+    )
+    instructions: Optional[str] = Field(
+        None, description = "System / developer instructions"
+    )
+    temperature: Optional[float] = Field(None, ge = 0.0, le = 2.0)
+    top_p: Optional[float] = Field(None, ge = 0.0, le = 1.0)
+    max_output_tokens: Optional[int] = Field(None, ge = 1)
+    stream: bool = Field(False, description = "Whether to stream the response via SSE")
+
+    # OpenAI function-calling fields — forwarded to llama-server via the
+    # Chat Completions pass-through (see routes/inference.py). Typed as a
+    # plain list so built-in tool shapes (``web_search``, ``file_search``,
+    # ``mcp``, ...) round-trip without validation errors — the translator
+    # picks out only ``type=="function"`` entries for forwarding.
+    tools: Optional[list[dict]] = Field(
+        None,
+        description = (
+            "Responses-shape function tool definitions. Entries with "
+            '`type="function"` are translated to the Chat Completions nested '
+            "shape before being forwarded to llama-server; other tool types "
+            "(built-in web_search, file_search, mcp, ...) are accepted for SDK "
+            "compatibility but ignored on the llama-server passthrough."
+        ),
+    )
+    tool_choice: Optional[Any] = Field(
+        None,
+        description = (
+            "'auto' | 'required' | 'none' | {'type': 'function', 'name': ...} — "
+            "the Responses-shape forcing object is translated to the Chat "
+            "Completions nested shape internally."
+        ),
+    )
+    parallel_tool_calls: Optional[bool] = None
+
+    previous_response_id: Optional[str] = None
+    store: Optional[bool] = None
+    metadata: Optional[dict] = None
+    truncation: Optional[Any] = None
+    user: Optional[str] = None
+    text: Optional[Any] = None
+    reasoning: Optional[Any] = None
+
+    model_config = {"extra": "allow"}
+
+
+# ── Response models ─────────────────────────────────────────────
+
+
+class ResponsesOutputTextContent(BaseModel):
+    """A text content block inside an output message."""
+
+    type: Literal["output_text"] = "output_text"
+    text: str
+    annotations: list = Field(default_factory = list)
+
+
+class ResponsesOutputMessage(BaseModel):
+    """An output message in the Responses API response."""
+
+    type: Literal["message"] = "message"
+    id: str = Field(default_factory = lambda: f"msg_{uuid.uuid4().hex[:12]}")
+    status: Literal["completed", "in_progress"] = "completed"
+    role: Literal["assistant"] = "assistant"
+    content: list[ResponsesOutputTextContent] = Field(default_factory = list)
+
+
+class ResponsesOutputFunctionCall(BaseModel):
+    """A function-call output item in the Responses API response.
+
+    Unlike Chat Completions (which nests tool calls inside the assistant
+    message), the Responses API emits each tool call as its own top-level
+    ``output`` item so clients can correlate results via ``call_id`` on the
+    next turn.
+    """
+
+    type: Literal["function_call"] = "function_call"
+    id: str = Field(default_factory = lambda: f"fc_{uuid.uuid4().hex[:12]}")
+    call_id: str
+    name: str
+    arguments: str = Field(
+        ..., description = "JSON string of the arguments the model produced."
+    )
+    status: Literal["completed", "in_progress", "incomplete"] = "completed"
+
+
+ResponsesOutputItem = Union[ResponsesOutputMessage, ResponsesOutputFunctionCall]
+
+
+class ResponsesUsage(BaseModel):
+    """Token usage for a Responses API response (input_tokens, not prompt_tokens)."""
+
+    input_tokens: int = 0
+    output_tokens: int = 0
+    total_tokens: int = 0
+
+
+class ResponsesResponse(BaseModel):
+    """Top-level Responses API response object."""
+
+    id: str = Field(default_factory = lambda: f"resp_{uuid.uuid4().hex[:12]}")
+    object: Literal["response"] = "response"
+    created_at: int = Field(default_factory = lambda: int(time.time()))
+    status: Literal["completed", "in_progress", "failed"] = "completed"
+    model: str = "default"
+    output: list[ResponsesOutputItem] = Field(default_factory = list)
+    usage: ResponsesUsage = Field(default_factory = ResponsesUsage)
+    error: Optional[Any] = None
+    incomplete_details: Optional[Any] = None
+    instructions: Optional[str] = None
+    metadata: dict = Field(default_factory = dict)
+    temperature: Optional[float] = None
+    top_p: Optional[float] = None
+    max_output_tokens: Optional[int] = None
+    previous_response_id: Optional[str] = None
+    text: Optional[Any] = None
+    tool_choice: Optional[Any] = None
+    tools: list = Field(default_factory = list)
+    truncation: Optional[Any] = None
+
+
+# =====================================================================
+# Anthropic Messages API Models  (/v1/messages)
+# =====================================================================
+
+
+# ── Request models ─────────────────────────────────────────────
+
+
+class AnthropicTextBlock(BaseModel):
+    type: Literal["text"]
+    text: str
+
+
+class AnthropicImageSource(BaseModel):
+    type: Literal["base64", "url"]
+    media_type: Optional[str] = None
+    data: Optional[str] = None
+    url: Optional[str] = None
+
+
+class AnthropicImageBlock(BaseModel):
+    type: Literal["image"]
+    source: AnthropicImageSource
+
+
+class AnthropicToolUseBlock(BaseModel):
+    type: Literal["tool_use"]
+    id: str
+    name: str
+    input: dict
+
+
+class AnthropicToolResultBlock(BaseModel):
+    type: Literal["tool_result"]
+    tool_use_id: str
+    content: Union[str, list] = ""
+
+
+AnthropicContentBlock = Union[
+    AnthropicTextBlock,
+    AnthropicImageBlock,
+    AnthropicToolUseBlock,
+    AnthropicToolResultBlock,
+]
+
+
+class AnthropicMessage(BaseModel):
+    role: Literal["user", "assistant"]
+    content: Union[str, list[AnthropicContentBlock]]
+
+
+class AnthropicTool(BaseModel):
+    name: str
+    description: Optional[str] = None
+    input_schema: dict
+
+
+class AnthropicMessagesRequest(BaseModel):
+    model: str = "default"
+    max_tokens: Optional[int] = None
+    messages: list[AnthropicMessage]
+    system: Optional[Union[str, list]] = None
+    tools: Optional[list[AnthropicTool]] = None
+    tool_choice: Optional[Any] = None
+    stream: bool = False
+    temperature: Optional[float] = None
+    top_p: Optional[float] = None
+    top_k: Optional[int] = None
+    stop_sequences: Optional[list[str]] = None
+    metadata: Optional[dict] = None
+    # [x-unsloth] extensions — mirror the OpenAI endpoint convenience fields
+    min_p: Optional[float] = Field(
+        None, ge = 0.0, le = 1.0, description = "[x-unsloth] Min-p sampling threshold"
+    )
+    repetition_penalty: Optional[float] = Field(
+        None, ge = 1.0, le = 2.0, description = "[x-unsloth] Repetition penalty"
+    )
+    presence_penalty: Optional[float] = Field(
+        None, ge = 0.0, le = 2.0, description = "[x-unsloth] Presence penalty"
+    )
+    enable_tools: Optional[bool] = None
+    enabled_tools: Optional[list[str]] = None
+    session_id: Optional[str] = None
+    model_config = {"extra": "allow"}
+
+
+# ── Response models ────────────────────────────────────────────
+
+
+class AnthropicUsage(BaseModel):
+    input_tokens: int = 0
+    output_tokens: int = 0
+
+
+class AnthropicResponseTextBlock(BaseModel):
+    type: Literal["text"] = "text"
+    text: str
+
+
+class AnthropicResponseToolUseBlock(BaseModel):
+    type: Literal["tool_use"] = "tool_use"
+    id: str
+    name: str
+    input: dict
+
+
+AnthropicResponseBlock = Union[
+    AnthropicResponseTextBlock, AnthropicResponseToolUseBlock
+]
+
+
+class AnthropicMessagesResponse(BaseModel):
+    id: str = Field(default_factory = lambda: f"msg_{uuid.uuid4().hex[:24]}")
+    type: Literal["message"] = "message"
+    role: Literal["assistant"] = "assistant"
+    content: list[AnthropicResponseBlock] = Field(default_factory = list)
+    model: str = "default"
+    stop_reason: Optional[str] = None
+    stop_sequence: Optional[str] = None
+    usage: AnthropicUsage = Field(default_factory = AnthropicUsage)
--- a/studio/backend/models/models.py
+++ b/studio/backend/models/models.py
@@ -165,7 +165,7 @@ class LocalModelInfo(BaseModel):
    id: str = Field(..., description = "Identifier to use for loading/training")
    display_name: str = Field(..., description = "Display label")
    path: str = Field(..., description = "Local path where model data was discovered")
-    source: Literal["models_dir", "hf_cache"] = Field(
+    source: Literal["models_dir", "hf_cache", "lmstudio", "custom"] = Field(
        ...,
        description = "Discovery source",
    )
@@ -189,7 +189,92 @@ class LocalModelListResponse(BaseModel):
        None,
        description = "HF cache root that was scanned",
    )
+    lmstudio_dirs: List[str] = Field(
+        default_factory = list,
+        description = "LM Studio model directories that were scanned",
+    )
    models: List[LocalModelInfo] = Field(
        default_factory = list,
        description = "Discovered local/cached models",
    )
+
+
+class AddScanFolderRequest(BaseModel):
+    """Request body for adding a custom scan folder."""
+
+    path: str = Field(
+        ..., description = "Absolute or relative directory path to scan for models"
+    )
+
+
+class ScanFolderInfo(BaseModel):
+    """A registered custom model scan folder."""
+
+    id: int = Field(..., description = "Database row ID")
+    path: str = Field(..., description = "Normalized absolute path")
+    created_at: str = Field(..., description = "ISO 8601 creation timestamp")
+
+
+class BrowseEntry(BaseModel):
+    """A directory entry surfaced by the folder browser."""
+
+    name: str = Field(..., description = "Entry name (basename, not full path)")
+    has_models: bool = Field(
+        False,
+        description = (
+            "Hint that the directory likely contains models "
+            "(*.gguf, *.safetensors, config.json, or HF-style "
+            "`models--*` subfolders). Used by the UI to highlight "
+            "promising candidates; the scanner itself is authoritative."
+        ),
+    )
+    hidden: bool = Field(
+        False,
+        description = "Name starts with a dot (e.g. `.cache`)",
+    )
+
+
+class BrowseFoldersResponse(BaseModel):
+    """Response schema for the folder browser endpoint."""
+
+    current: str = Field(..., description = "Absolute path of the directory just listed")
+    parent: Optional[str] = Field(
+        None,
+        description = (
+            "Parent directory of `current`, or null if `current` is the "
+            "filesystem root. The frontend uses this to render an `Up` row."
+        ),
+    )
+    entries: List[BrowseEntry] = Field(
+        default_factory = list,
+        description = (
+            "Subdirectories of `current`. Sorted with model-bearing "
+            "directories first, then alphabetically case-insensitive; "
+            "hidden entries come last within each group."
+        ),
+    )
+    suggestions: List[str] = Field(
+        default_factory = list,
+        description = (
+            "Handy starting points (home, HF cache, already-registered "
+            "scan folders). Rendered as quick-pick chips above the list."
+        ),
+    )
+    truncated: bool = Field(
+        False,
+        description = (
+            "True when the listing was capped because the directory had "
+            "more subfolders than the server is willing to enumerate in "
+            "one request. The UI should show a hint telling the user to "
+            "narrow their path."
+        ),
+    )
+    model_files_here: int = Field(
+        0,
+        description = (
+            "Count of GGUF/safetensors files immediately inside "
+            "``current``. Used by the UI to surface a hint on leaf "
+            "model directories (which otherwise look `empty` because "
+            "they contain only files, no subdirectories)."
+        ),
+    )
--- a/studio/backend/models/training.py
+++ b/studio/backend/models/training.py
@@ -81,7 +81,7 @@ class TrainingStartRequest(BaseModel):
    warmup_ratio: Optional[float] = Field(None, description = "Warmup ratio")
    max_steps: Optional[int] = Field(None, description = "Maximum training steps")
    save_steps: int = Field(100, description = "Steps between checkpoints")
-    weight_decay: float = Field(0.01, description = "Weight decay")
+    weight_decay: float = Field(0.001, description = "Weight decay")
    random_seed: int = Field(42, description = "Random seed")
    packing: bool = Field(False, description = "Enable sequence packing")
    optim: str = Field("adamw_8bit", description = "Optimizer")
@@ -128,6 +128,12 @@ class TrainingStartRequest(BaseModel):
    enable_tensorboard: bool = Field(False, description = "Enable TensorBoard logging")
    tensorboard_dir: Optional[str] = Field(None, description = "TensorBoard directory")

+    # GPU selection
+    gpu_ids: Optional[List[int]] = Field(
+        None,
+        description = "Physical GPU indices to use, for example [0, 1]. Omit or pass [] to use automatic selection. Explicit gpu_ids are unsupported when the parent CUDA_VISIBLE_DEVICES uses UUID/MIG entries.",
+    )
+

 class TrainingJobResponse(BaseModel):
    """Immediate response when training is initiated"""
@@ -177,8 +183,8 @@ class TrainingProgress(BaseModel):
    job_id: str = Field(..., description = "Training job identifier")
    step: int = Field(..., description = "Current training step")
    total_steps: int = Field(..., description = "Total training steps")
-    loss: float = Field(..., description = "Current loss value")
-    learning_rate: float = Field(..., description = "Current learning rate")
+    loss: Optional[float] = Field(None, description = "Current loss value")
+    learning_rate: Optional[float] = Field(None, description = "Current learning rate")
    progress_percent: float = Field(
        ..., description = "Progress percentage (0.0 to 100.0)"
    )
@@ -196,3 +202,59 @@ class TrainingProgress(BaseModel):
    eval_loss: Optional[float] = Field(
        None, description = "Eval loss from the most recent evaluation step"
    )
+
+
+class TrainingRunSummary(BaseModel):
+    """Summary of a training run for list views."""
+
+    id: str
+    status: Literal["running", "completed", "stopped", "error"]
+    model_name: str
+    dataset_name: str
+    started_at: str
+    ended_at: Optional[str] = None
+    total_steps: Optional[int] = None
+    final_step: Optional[int] = None
+    final_loss: Optional[float] = None
+    output_dir: Optional[str] = None
+    duration_seconds: Optional[float] = None
+    error_message: Optional[str] = None
+    loss_sparkline: Optional[List[float]] = None
+
+
+class TrainingRunListResponse(BaseModel):
+    """Response for listing training runs."""
+
+    runs: List[TrainingRunSummary]
+    total: int
+
+
+class TrainingRunMetrics(BaseModel):
+    """Metrics arrays for a training run, using paired step arrays per metric."""
+
+    step_history: List[int] = Field(default_factory = list)
+    loss_history: List[float] = Field(default_factory = list)
+    loss_step_history: List[int] = Field(default_factory = list)
+    lr_history: List[float] = Field(default_factory = list)
+    lr_step_history: List[int] = Field(default_factory = list)
+    grad_norm_history: List[float] = Field(default_factory = list)
+    grad_norm_step_history: List[int] = Field(default_factory = list)
+    eval_loss_history: List[float] = Field(default_factory = list)
+    eval_step_history: List[int] = Field(default_factory = list)
+    final_epoch: Optional[float] = None
+    final_num_tokens: Optional[int] = None
+
+
+class TrainingRunDetailResponse(BaseModel):
+    """Response for a single training run with config and metrics."""
+
+    run: TrainingRunSummary
+    config: dict
+    metrics: TrainingRunMetrics
+
+
+class TrainingRunDeleteResponse(BaseModel):
+    """Response for deleting a training run."""
+
+    status: str
+    message: str
--- a/studio/backend/plugins/data-designer-unstructured-seed/pyproject.toml
+++ b/studio/backend/plugins/data-designer-unstructured-seed/pyproject.toml
@@ -11,7 +11,7 @@ version = "0.1.0"
 description = "Local Data Designer unstructured seed reader plugin"
 requires-python = ">=3.11"
 dependencies = [
-  "data-designer-engine>=0.5.1,<0.6",
+  "data-designer-engine>=0.5.4,<0.6",
  "pandas>=2,<3",
  "pymupdf>=1.24.0",
  "pymupdf4llm>=0.0.17",
--- a/studio/backend/requirements/extras-no-deps.txt
+++ b/studio/backend/requirements/extras-no-deps.txt
@@ -2,9 +2,13 @@
 descript-audio-codec
 descript-audiotools
 julius
-torchcodec
+torchcodec==0.10.0
 snac

+# peft 0.19.0 causes export subprocess shutdown issues in Studio;
+# installing with --no-deps to avoid pulling in torch>=0.11.0
+peft==0.18.1
+
 # TRL and related packages
 trl==0.23.1
 git+https://github.com/meta-pytorch/OpenEnv.git
@@ -12,3 +16,5 @@ git+https://github.com/meta-pytorch/OpenEnv.git
 torch-c-dlpack-ext
 sentence_transformers==5.2.0
 transformers==4.57.6
+pytorch_tokenizers
+kernels==0.12.1
--- a/studio/backend/requirements/no-torch-runtime.txt
+++ b/studio/backend/requirements/no-torch-runtime.txt
@@ -0,0 +1,50 @@
+# Runtime dependencies for no-torch (GGUF-only) mode.
+# Installed with --no-deps to prevent transitive torch resolution
+# from packages like accelerate, peft, trl, sentence-transformers.
+#
+# Includes unsloth's own direct deps (typer, pydantic, pyyaml,
+# nest-asyncio) since unsloth is also installed with --no-deps
+# (current PyPI metadata still declares torch as a hard dep).
+
+# unsloth direct deps (from pyproject.toml [project].dependencies)
+typer
+pydantic
+pyyaml
+nest-asyncio
+
+# HF ecosystem (from [huggingfacenotorch] extras in pyproject.toml)
+wheel>=0.42.0
+packaging
+numpy
+tqdm
+psutil
+tyro
+protobuf
+sentencepiece>=0.2.0
+safetensors>=0.4.3
+datasets>=3.4.1,!=4.0.*,!=4.1.0,<4.4.0
+accelerate>=0.34.1
+peft>=0.18.0,!=0.11.0
+huggingface_hub>=0.34.0
+hf_transfer
+diffusers
+
+# Transitive deps required because this file is installed with --no-deps.
+# Without these, `from transformers import AutoConfig` fails at import time.
+regex
+typing_extensions
+filelock
+httpx
+httpcore
+certifi
+idna
+anyio
+sniffio
+h11
+
+tokenizers
+transformers>=4.51.3,!=4.52.0,!=4.52.1,!=4.52.2,!=4.52.3,!=4.53.0,!=4.54.0,!=4.55.0,!=4.55.1,!=4.57.0,!=4.57.4,!=4.57.5,!=5.0.0,!=5.1.0,<=5.3.0
+trl>=0.18.2,!=0.19.0,<=0.24.0
+sentence-transformers
+cut_cross_entropy
+pillow
--- a/studio/backend/requirements/overrides.txt
+++ b/studio/backend/requirements/overrides.txt
@@ -1,6 +1,2 @@
 # Torch AO overrides (installed with --force-reinstall --no-cache-dir)
 torchao==0.14.0
-pytorch_tokenizers
-
-# Kernel packages
-kernels
--- a/studio/backend/requirements/single-env/data-designer-deps.txt
+++ b/studio/backend/requirements/single-env/data-designer-deps.txt
@@ -1,22 +1,25 @@
 # Data Designer runtime deps installed explicitly (single-env mode).
-# DuckDB 1.5 removed Relation.record_batch(); keep <1.5 until upstream ships the fix.
+# Synced with data-designer-engine==0.5.4 requirements.
 anyascii<1,>=0.3.3
-duckdb<1.5,>=1.1.3
+chardet<6,>=3.0.2
+duckdb<2,>=1.5.0
 faker<21,>=20.1.0
+fsspec<2026,>=2025.3.0
 httpx<1,>=0.27.2
 httpx-retries<1,>=0.4.2
 json-repair<1,>=0.48.0
 jsonpath-rust-bindings<2,>=1.0
 jsonschema<5,>=4.0.0
-litellm<1.80.12,>=1.73.6
 lxml<7,>=6.0.2
 marko<3,>=2.1.2
+mcp<2,>=1.26.0
 networkx<4,>=3.0
-python-json-logger<4,>=3
+python-json-logger>=3,<4
 ruff<1,>=0.14.10
 scipy<2,>=1.11.0
 sqlfluff<4,>=3.2.0
 tiktoken<1,>=0.8.0
+# Unstructured-seed plugin deps (plugin installed with --no-deps)
 pymupdf>=1.24.0
 pymupdf4llm>=0.0.17
 mammoth>=1.8.0
--- a/studio/backend/requirements/single-env/data-designer.txt
+++ b/studio/backend/requirements/single-env/data-designer.txt
@@ -1,5 +1,5 @@
 # Install Data Designer in same env as Unsloth.
-data-designer==0.5.2
-data-designer-config==0.5.2
-data-designer-engine==0.5.2
+data-designer==0.5.4
+data-designer-config==0.5.4
+data-designer-engine==0.5.4
 prompt-toolkit>=3,<4
--- a/studio/backend/routes/__init__.py
+++ b/studio/backend/routes/__init__.py
@@ -12,6 +12,7 @@ from routes.datasets import router as datasets_router
 from routes.auth import router as auth_router
 from routes.data_recipe import router as data_recipe_router
 from routes.export import router as export_router
+from routes.training_history import router as training_history_router

 __all__ = [
    "training_router",
@@ -21,4 +22,5 @@ __all__ = [
    "auth_router",
    "data_recipe_router",
    "export_router",
+    "training_history_router",
 ]
--- a/studio/backend/routes/auth.py
+++ b/studio/backend/routes/auth.py
@@ -7,11 +7,17 @@ Authentication API routes

 from fastapi import APIRouter, Depends, HTTPException, status

+from datetime import datetime, timedelta, timezone
+
 from models.auth import (
+    ApiKeyListResponse,
+    ApiKeyResponse,
    AuthLoginRequest,
-    RefreshTokenRequest,
    AuthStatusResponse,
    ChangePasswordRequest,
+    CreateApiKeyRequest,
+    CreateApiKeyResponse,
+    RefreshTokenRequest,
 )
 from models.users import Token
 from auth import storage, hashing
@@ -131,3 +137,68 @@ async def change_password(
        token_type = "bearer",
        must_change_password = False,
    )
+
+
+# ---------------------------------------------------------------------------
+# API key management
+# ---------------------------------------------------------------------------
+
+
+def _row_to_api_key_response(row: dict) -> ApiKeyResponse:
+    return ApiKeyResponse(
+        id = row["id"],
+        name = row["name"],
+        key_prefix = row["key_prefix"],
+        created_at = row["created_at"],
+        last_used_at = row.get("last_used_at"),
+        expires_at = row.get("expires_at"),
+        is_active = bool(row["is_active"]),
+    )
+
+
+@router.post("/api-keys", response_model = CreateApiKeyResponse)
+async def create_api_key(
+    payload: CreateApiKeyRequest,
+    current_subject: str = Depends(get_current_subject),
+) -> CreateApiKeyResponse:
+    """Create a new API key. The raw key is returned once and cannot be retrieved later."""
+    expires_at = None
+    if payload.expires_in_days is not None:
+        expires_at = (
+            datetime.now(timezone.utc) + timedelta(days = payload.expires_in_days)
+        ).isoformat()
+
+    raw_key, row = storage.create_api_key(
+        username = current_subject,
+        name = payload.name,
+        expires_at = expires_at,
+    )
+    return CreateApiKeyResponse(
+        key = raw_key,
+        api_key = _row_to_api_key_response(row),
+    )
+
+
+@router.get("/api-keys", response_model = ApiKeyListResponse)
+async def list_api_keys(
+    current_subject: str = Depends(get_current_subject),
+) -> ApiKeyListResponse:
+    """List all API keys for the authenticated user (raw keys are never exposed)."""
+    rows = storage.list_api_keys(current_subject)
+    return ApiKeyListResponse(
+        api_keys = [_row_to_api_key_response(r) for r in rows],
+    )
+
+
+@router.delete("/api-keys/{key_id}")
+async def revoke_api_key(
+    key_id: int,
+    current_subject: str = Depends(get_current_subject),
+) -> dict:
+    """Revoke (soft-delete) an API key."""
+    if not storage.revoke_api_key(current_subject, key_id):
+        raise HTTPException(
+            status_code = status.HTTP_404_NOT_FOUND,
+            detail = "API key not found",
+        )
+    return {"detail": "API key revoked"}
--- a/studio/backend/routes/data_recipe/jobs.py
+++ b/studio/backend/routes/data_recipe/jobs.py
@@ -5,7 +5,9 @@

 from __future__ import annotations

+from datetime import timedelta
 from typing import Any
+from urllib.parse import urlparse

 from fastapi import APIRouter, HTTPException, Query, Request
 from fastapi.responses import JSONResponse, StreamingResponse
@@ -26,6 +28,161 @@ from models.data_recipe import (
 router = APIRouter()


+def _resolve_local_v1_endpoint(request: Request) -> str:
+    """Return the loopback /v1 URL for the actual backend listen port.
+
+    Resolution order:
+      1. ``app.state.server_port`` - explicitly published by run.py after
+         the uvicorn server has bound. This is the most reliable source
+         because it survives reverse proxies, TLS terminators and tunnels.
+      2. ``request.scope["server"]`` - the real (host, port) tuple uvicorn
+         sets when the request is dispatched. Used when Studio is started
+         outside ``run_server`` (e.g. ``uvicorn studio.backend.main:app``).
+      3. ``request.base_url`` parsed - last resort for test fixtures that
+         do not route through a live uvicorn server.
+    """
+    port: Any = getattr(request.app.state, "server_port", None)
+    if not isinstance(port, int) or port <= 0:
+        server = request.scope.get("server")
+        if (
+            isinstance(server, tuple)
+            and len(server) >= 2
+            and isinstance(server[1], int)
+            and server[1] > 0
+        ):
+            port = server[1]
+        else:
+            parsed = urlparse(str(request.base_url))
+            port = parsed.port if parsed.port is not None else 8888
+    return f"http://127.0.0.1:{int(port)}/v1"
+
+
+def _used_llm_model_aliases(recipe: dict[str, Any]) -> set[str]:
+    """Return the set of model_aliases that are actually referenced by an
+    LLM column. Used to narrow the "Chat model loaded" gate so that orphan
+    model_config nodes on the canvas do not block unrelated recipe runs.
+
+    The ``llm-`` prefix matches the existing convention in
+    ``core/data_recipe/service.py::_recipe_has_llm_columns`` and covers all
+    LLM column types emitted by the frontend (llm-text, llm-code,
+    llm-structured, llm-judge).
+    """
+    aliases: set[str] = set()
+    for column in recipe.get("columns", []):
+        if not isinstance(column, dict):
+            continue
+        column_type = column.get("column_type")
+        if not isinstance(column_type, str) or not column_type.startswith("llm-"):
+            continue
+        alias = column.get("model_alias")
+        if isinstance(alias, str) and alias:
+            aliases.add(alias)
+    return aliases
+
+
+def _inject_local_providers(recipe: dict[str, Any], request: Request) -> None:
+    """
+    Mutate recipe dict in-place: for any provider with is_local=True,
+    generate a JWT and fill in the endpoint pointing at this server.
+    """
+    providers = recipe.get("model_providers")
+    if not providers:
+        return
+
+    # Collect local providers and pop is_local from ALL dicts unconditionally.
+    # Strict `is True` guard so malformed payloads (is_local: 1,
+    # is_local: "true") do not accidentally trigger the loopback rewrite.
+    local_indices: list[int] = []
+    for i, provider in enumerate(providers):
+        if not isinstance(provider, dict):
+            continue
+        is_local = provider.pop("is_local", None)
+        if is_local is True:
+            local_indices.append(i)
+
+    if not local_indices:
+        return
+
+    endpoint = _resolve_local_v1_endpoint(request)
+
+    # Only gate on model-loaded if a local provider is actually reachable
+    # from an LLM column through a model_config. Orphan model_config nodes
+    # that reference a local provider but that no LLM column uses should
+    # not block runs; the recipe would never call /v1 for them.
+    local_names = {
+        providers[i].get("name") for i in local_indices if providers[i].get("name")
+    }
+    used_aliases = _used_llm_model_aliases(recipe)
+    referenced_providers = {
+        mc.get("provider")
+        for mc in recipe.get("model_configs", [])
+        if (
+            isinstance(mc, dict)
+            and mc.get("provider")
+            and mc.get("alias") in used_aliases
+        )
+    }
+
+    token = ""
+    if local_names & referenced_providers:
+        # Verify a model is loaded.
+        # NOTE: This is a point-in-time check (TOCTOU). The model could be unloaded
+        # or swapped after this check but before the recipe subprocess calls /v1.
+        # The inference endpoint returns a clear 400 in that case.
+        #
+        # Imports are deferred to avoid circular dependencies with inference modules.
+        from routes.inference import get_llama_cpp_backend
+        from core.inference import get_inference_backend
+
+        llama = get_llama_cpp_backend()
+        model_loaded = llama.is_loaded
+        if not model_loaded:
+            backend = get_inference_backend()
+            model_loaded = bool(backend.active_model_name)
+        if not model_loaded:
+            raise ValueError(
+                "No model loaded in Chat. Load a model first, then run the recipe."
+            )
+
+        from auth.authentication import (
+            create_access_token,
+        )  # deferred: avoids circular import
+
+        # Uses the "unsloth" admin subject. If the user changes their password,
+        # the JWT secret rotates and this token becomes invalid mid-run.
+        # Acceptable for v1 - recipes typically finish well within one session.
+        token = create_access_token(
+            subject = "unsloth",
+            expires_delta = timedelta(hours = 24),
+        )
+
+    # Defensively strip any stale "external"-only fields the frontend may
+    # have left on the dict (extra_headers/extra_body/api_key_env). The UI
+    # hides these inputs in local mode but the payload builder still serializes
+    # them, so a previously external provider that flipped to local can carry
+    # invalid JSON or rogue auth headers into the local /v1 call.
+    for i in local_indices:
+        providers[i]["endpoint"] = endpoint
+        providers[i]["api_key"] = token
+        providers[i]["provider_type"] = "openai"
+        providers[i].pop("api_key_env", None)
+        providers[i].pop("extra_headers", None)
+        providers[i].pop("extra_body", None)
+
+    # Force skip_health_check on any model_config that references a local
+    # provider. The local /v1/models endpoint only lists the real loaded
+    # model (e.g. "unsloth/llama-3.2-1b") and not the placeholder "local"
+    # that the recipe sends as the model id, so data_designer's pre-flight
+    # health check would otherwise fail before the first completion call.
+    # The backend route ignores the model id field in chat completions, so
+    # skipping the check is safe.
+    for mc in recipe.get("model_configs", []):
+        if not isinstance(mc, dict):
+            continue
+        if mc.get("provider") in local_names:
+            mc["skip_health_check"] = True
+
+
 def _normalize_run_name(value: Any) -> str | None:
    if value is None:
        return None
@@ -40,7 +197,7 @@ def _normalize_run_name(value: Any) -> str | None:


@router.post("/jobs", response_class = JSONResponse, response_model = JobCreateResponse)
-def create_job(payload: RecipePayload):
+def create_job(payload: RecipePayload, request: Request):
    recipe = payload.recipe
    if not recipe.get("columns"):
        raise HTTPException(status_code = 400, detail = "Recipe must include columns.")
@@ -67,6 +224,11 @@ def create_job(payload: RecipePayload):
                status_code = 400, detail = f"invalid run_config: {exc}"
            ) from exc

+    try:
+        _inject_local_providers(recipe, request)
+    except ValueError as exc:
+        raise HTTPException(status_code = 400, detail = str(exc)) from exc
+
    mgr = get_job_manager()
    try:
        job_id = mgr.start(recipe = recipe, run = run)
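Reviewer sketch (not part of the patch): the three-step port lookup described in the `_resolve_local_v1_endpoint` docstring can be exercised without FastAPI. The helper names `resolve_listen_port` and `local_v1_endpoint` are hypothetical, and the plain arguments stand in for `app.state.server_port`, `request.scope["server"]` and `request.base_url`; this is an illustration under those assumptions, not the code the route runs.

```python
from typing import Any
from urllib.parse import urlparse


def resolve_listen_port(
    state_port: Any,
    scope_server: Any,
    base_url: str,
    default_port: int = 8888,
) -> int:
    """Hypothetical stand-alone version of the three-step port lookup."""
    # 1. Port published on app.state after the server has bound.
    if isinstance(state_port, int) and state_port > 0:
        return state_port
    # 2. (host, port) tuple taken from the ASGI scope.
    if (
        isinstance(scope_server, tuple)
        and len(scope_server) >= 2
        and isinstance(scope_server[1], int)
        and scope_server[1] > 0
    ):
        return scope_server[1]
    # 3. Port parsed from the base URL, falling back to the default.
    parsed = urlparse(base_url)
    return parsed.port if parsed.port is not None else default_port


def local_v1_endpoint(port: int) -> str:
    """Build the loopback /v1 URL for the resolved port."""
    return f"http://127.0.0.1:{port}/v1"
```

Each later source is consulted only when the earlier one is missing or not a positive integer, so a port seen through a reverse proxy never overrides the port the server actually bound.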
--- a/studio/backend/routes/data_recipe/seed.py
+++ b/studio/backend/routes/data_recipe/seed.py
@@ -388,7 +388,7 @@ def _extract_text_from_file(file_path: Path, ext: str) -> str:
        import pymupdf4llm

        raw = pymupdf4llm.to_markdown(
-            str(file_path), write_images = False, show_progress = False
+            str(file_path), write_images = False, show_progress = False, use_ocr = False
        )
    elif ext == ".docx":
        import mammoth
--- a/studio/backend/routes/data_recipe/validate.py
+++ b/studio/backend/routes/data_recipe/validate.py
@@ -68,6 +68,20 @@ def _collect_validation_errors(recipe: dict[str, Any]) -> list[ValidateError]:
    return errors


+def _patch_local_providers(recipe: dict[str, Any]) -> None:
+    """Strip is_local and fill a dummy endpoint so validation doesn't choke.
+
+    Uses a strict `is True` check to match _inject_local_providers in
+    jobs.py - malformed payloads with truthy but non-boolean is_local
+    values should not be treated as local.
+    """
+    for provider in recipe.get("model_providers", []):
+        if not isinstance(provider, dict):
+            continue
+        if provider.pop("is_local", None) is True:
+            provider["endpoint"] = "http://127.0.0.1"
+
+
@router.post("/validate", response_model = ValidateResponse)
 def validate(payload: RecipePayload) -> ValidateResponse:
    recipe = payload.recipe
@@ -77,6 +91,8 @@ def validate(payload: RecipePayload) -> ValidateResponse:
            errors = [ValidateError(message = "Recipe must include columns.")],
        )

+    _patch_local_providers(recipe)
+
    try:
        validate_recipe(recipe)
    except RuntimeError as exc:
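Reviewer sketch (not part of the patch): both `_patch_local_providers` and `_inject_local_providers` rely on a strict `is True` comparison after popping `is_local`. The helper below, with the hypothetical name `split_local_providers`, shows the effect on a few malformed payloads; it assumes providers arrive as a list that may contain non-dict entries.

```python
from typing import Any


def split_local_providers(providers: list[Any]) -> list[int]:
    """Return indices of providers whose is_local flag is exactly True.

    The flag is popped from every dict, so it never reaches downstream
    validation regardless of its value.
    """
    local_indices: list[int] = []
    for index, provider in enumerate(providers):
        if not isinstance(provider, dict):
            continue  # skip malformed entries rather than raising
        flag = provider.pop("is_local", None)
        if flag is True:  # identity check: 1 and "true" do not qualify
            local_indices.append(index)
    return local_indices
```

A truthiness test (`if flag:`) would treat `1` or `"true"` as local and rewrite the endpoint to loopback; the identity check limits that rewrite to real booleans.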
--- a/studio/backend/routes/datasets.py
+++ b/studio/backend/routes/datasets.py
@@ -11,10 +11,55 @@ import json
 import sys
 from pathlib import Path
 from uuid import uuid4
-from fastapi import APIRouter, Depends, HTTPException, UploadFile
+from typing import Optional
+from fastapi import APIRouter, Depends, HTTPException, Query, UploadFile
+import re as _re
 import structlog
 from loggers import get_logger

+_VALID_REPO_ID = _re.compile(r"^[A-Za-z0-9._-]+/[A-Za-z0-9._-]+$")
+
+
+def _is_valid_repo_id(repo_id: str) -> bool:
+    return bool(_VALID_REPO_ID.fullmatch(repo_id))
+
+
+_dataset_size_cache: dict[str, int] = {}
+
+
+def _get_dataset_size_cached(repo_id: str) -> int:
+    if repo_id in _dataset_size_cache:
+        return _dataset_size_cache[repo_id]
+    try:
+        from huggingface_hub import dataset_info as hf_dataset_info
+
+        info = hf_dataset_info(repo_id, token = None, files_metadata = True)
+        total = sum(s.size for s in info.siblings if getattr(s, "size", None))
+        _dataset_size_cache[repo_id] = total
+        return total
+    except Exception:
+        return 0
+
+
+def _resolve_hf_cache_realpath(repo_dir: Path) -> Optional[str]:
+    """Pick the most useful on-disk path for a HF cache repo dir.
+
+    Mirrors the helper in routes/models.py: prefer the most-recent
+    snapshot dir, fall back to the cache repo root, return resolved
+    realpath. Duplicated here to keep routes/datasets.py self-contained.
+    """
+    try:
+        snapshots_dir = repo_dir / "snapshots"
+        if snapshots_dir.is_dir():
+            snaps = [s for s in snapshots_dir.iterdir() if s.is_dir()]
+            if snaps:
+                latest = max(snaps, key = lambda s: s.stat().st_mtime)
+                return str(latest.resolve())
+        return str(repo_dir.resolve())
+    except Exception:
+        return None
+
+
 # Add backend directory to path
 backend_path = Path(__file__).parent.parent.parent
 if str(backend_path) not in sys.path:
@@ -308,6 +353,89 @@ def list_local_datasets(
    return LocalDatasetsResponse(datasets = _build_local_dataset_items())


+@router.get("/download-progress")
+async def get_dataset_download_progress(
+    repo_id: str = Query(
+        ..., description = "HuggingFace dataset repo ID, e.g. 'unsloth/LaTeX_OCR'"
+    ),
+    current_subject: str = Depends(get_current_subject),
+):
+    """Return download progress for a HuggingFace dataset repo.
+
+    Mirrors ``GET /api/models/download-progress`` but scans the
+    ``datasets--owner--name`` cache directory under HF_HUB_CACHE.
+    Modern ``datasets``/``huggingface_hub`` caches both raw model and
+    raw dataset blobs in HF_HUB_CACHE; the ``datasets`` library writes
+    its processed Arrow shards elsewhere, but the in-progress *download*
+    bytes are observable here. Returns ``cache_path`` so the UI can
+    show users where the dataset blobs landed on disk.
+    """
+    _empty = {
+        "downloaded_bytes": 0,
+        "expected_bytes": 0,
+        "progress": 0,
+        "cache_path": None,
+    }
+    try:
+        if not _is_valid_repo_id(repo_id):
+            return _empty
+
+        from huggingface_hub import constants as hf_constants
+
+        cache_dir = Path(hf_constants.HF_HUB_CACHE)
+        target = f"datasets--{repo_id.replace('/', '--')}".lower()
+        completed_bytes = 0
+        in_progress_bytes = 0
+        cache_path: Optional[str] = None
+
+        if cache_dir.is_dir():
+            for entry in cache_dir.iterdir():
+                if entry.name.lower() != target:
+                    continue
+                cache_path = _resolve_hf_cache_realpath(entry)
+                blobs_dir = entry / "blobs"
+                if not blobs_dir.is_dir():
+                    break
+                for f in blobs_dir.iterdir():
+                    if not f.is_file():
+                        continue
+                    if f.name.endswith(".incomplete"):
+                        in_progress_bytes += f.stat().st_size
+                    else:
+                        completed_bytes += f.stat().st_size
+                break
+
+        downloaded_bytes = completed_bytes + in_progress_bytes
+        if downloaded_bytes == 0:
+            return {**_empty, "cache_path": cache_path}
+
+        expected_bytes = _get_dataset_size_cached(repo_id)
+        if expected_bytes <= 0:
+            return {
+                "downloaded_bytes": downloaded_bytes,
+                "expected_bytes": 0,
+                "progress": 0,
+                "cache_path": cache_path,
+            }
+
+        # Same 95% completion threshold as the model endpoint -- HF blob
+        # dedup makes completed_bytes drift slightly under expected_bytes,
+        # and inter-file gaps would otherwise look like "done".
+        if completed_bytes >= expected_bytes * 0.95:
+            progress = 1.0
+        else:
+            progress = min(downloaded_bytes / expected_bytes, 0.99)
+        return {
+            "downloaded_bytes": downloaded_bytes,
+            "expected_bytes": expected_bytes,
+            "progress": round(progress, 3),
+            "cache_path": cache_path,
+        }
+    except Exception as e:
+        logger.warning(f"Error checking dataset download progress for {repo_id}: {e}")
+        return _empty
+
+
@router.post("/check-format", response_model = CheckFormatResponse)
 def check_format(
    request: CheckFormatRequest,
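
Note on the dataset progress rule above: the arithmetic can be checked in isolation. The sketch below is not part of the patch. `compute_progress` is a hypothetical helper name, and it assumes the same inputs the endpoint derives from the blobs directory (completed bytes, `.incomplete` bytes, and the expected size).

```python
def compute_progress(
    completed_bytes: int, in_progress_bytes: int, expected_bytes: int
) -> float:
    """Hypothetical stand-alone version of the progress rule in the endpoint."""
    downloaded = completed_bytes + in_progress_bytes
    # Nothing on disk yet, or the expected size is unknown: report 0.
    if downloaded == 0 or expected_bytes <= 0:
        return 0.0
    # Only fully written blobs count toward the 95% "done" threshold.
    if completed_bytes >= expected_bytes * 0.95:
        return 1.0
    # While blobs are still being written, cap the value below 1.0.
    return round(min(downloaded / expected_bytes, 0.99), 3)


if __name__ == "__main__":
    print(compute_progress(25, 25, 100))  # half of the bytes are on disk
    print(compute_progress(90, 20, 100))  # capped while blobs are incomplete
    print(compute_progress(96, 0, 100))   # past the completion threshold
```

The cap matters because `.incomplete` files can push the byte count to or past the expected size before the final rename, which would otherwise read as finished.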
--- a/studio/backend/routes/export.py
+++ b/studio/backend/routes/export.py
@ -5,9 +5,15 @@
 Export API routes: checkpoint discovery and model export operations.
 """

+import asyncio
+import json
 import sys
+import time
 from pathlib import Path
-from fastapi import APIRouter, Depends, HTTPException, Query
+from typing import Any, AsyncGenerator, Dict, List, Optional, Tuple
+
+from fastapi import APIRouter, Depends, HTTPException, Query, Request
+from fastapi.responses import StreamingResponse
 import structlog
 from loggers import get_logger

@ -97,7 +103,11 @@ async def load_checkpoint(
            logger.warning("Could not stop training: %s", e)

        backend = get_export_backend()
-        success, message = backend.load_checkpoint(
+        # load_checkpoint spawns and waits on a subprocess and can take
+        # minutes. Run it in a worker thread so the event loop stays
+        # free to serve the live log SSE stream concurrently.
+        success, message = await asyncio.to_thread(
+            backend.load_checkpoint,
            checkpoint_path = request.checkpoint_path,
            max_seq_length = request.max_seq_length,
            load_in_4bit = request.load_in_4bit,
@ -129,7 +139,7 @@ async def cleanup_export_memory(
    """
    try:
        backend = get_export_backend()
-        success = backend.cleanup_memory()
+        success = await asyncio.to_thread(backend.cleanup_memory)

        if not success:
            raise HTTPException(
@ -173,6 +183,17 @@ async def get_export_status(
        )


+def _export_details(output_path: Optional[str]) -> Optional[Dict[str, Any]]:
+    """Wrap the resolved on-disk export path into the details dict the
+    frontend reads to populate the Export Complete screen. Returns None
+    when the export had no local component (Hub-only push) so the
+    Pydantic field stays absent rather than ``{"output_path": null}``.
+    """
+    if not output_path:
+        return None
+    return {"output_path": output_path}
+
+
@router.post("/export/merged", response_model = ExportOperationResponse)
 async def export_merged_model(
    request: ExportMergedModelRequest,
@ -185,7 +206,8 @@ async def export_merged_model(
    """
    try:
        backend = get_export_backend()
-        success, message = backend.export_merged_model(
+        success, message, output_path = await asyncio.to_thread(
+            backend.export_merged_model,
            save_directory = request.save_directory,
            format_type = request.format_type,
            push_to_hub = request.push_to_hub,
@ -197,7 +219,11 @@ async def export_merged_model(
        if not success:
            raise HTTPException(status_code = 400, detail = message)

-        return ExportOperationResponse(success = True, message = message)
+        return ExportOperationResponse(
+            success = True,
+            message = message,
+            details = _export_details(output_path),
+        )
    except HTTPException:
        raise
    except Exception as e:
@ -220,7 +246,8 @@ async def export_base_model(
    """
    try:
        backend = get_export_backend()
-        success, message = backend.export_base_model(
+        success, message, output_path = await asyncio.to_thread(
+            backend.export_base_model,
            save_directory = request.save_directory,
            push_to_hub = request.push_to_hub,
            repo_id = request.repo_id,
@ -232,7 +259,11 @@ async def export_base_model(
        if not success:
            raise HTTPException(status_code = 400, detail = message)

-        return ExportOperationResponse(success = True, message = message)
+        return ExportOperationResponse(
+            success = True,
+            message = message,
+            details = _export_details(output_path),
+        )
    except HTTPException:
        raise
    except Exception as e:
@ -255,7 +286,8 @@ async def export_gguf(
    """
    try:
        backend = get_export_backend()
-        success, message = backend.export_gguf(
+        success, message, output_path = await asyncio.to_thread(
+            backend.export_gguf,
            save_directory = request.save_directory,
            quantization_method = request.quantization_method,
            push_to_hub = request.push_to_hub,
@ -266,7 +298,11 @@ async def export_gguf(
        if not success:
            raise HTTPException(status_code = 400, detail = message)

-        return ExportOperationResponse(success = True, message = message)
+        return ExportOperationResponse(
+            success = True,
+            message = message,
+            details = _export_details(output_path),
+        )
    except HTTPException:
        raise
    except Exception as e:
@ -289,7 +325,8 @@ async def export_lora_adapter(
    """
    try:
        backend = get_export_backend()
-        success, message = backend.export_lora_adapter(
+        success, message, output_path = await asyncio.to_thread(
+            backend.export_lora_adapter,
            save_directory = request.save_directory,
            push_to_hub = request.push_to_hub,
            repo_id = request.repo_id,
@ -300,7 +337,11 @@ async def export_lora_adapter(
        if not success:
            raise HTTPException(status_code = 400, detail = message)

-        return ExportOperationResponse(success = True, message = message)
+        return ExportOperationResponse(
+            success = True,
+            message = message,
+            details = _export_details(output_path),
+        )
    except HTTPException:
        raise
    except Exception as e:
@ -309,3 +350,155 @@ async def export_lora_adapter(
            status_code = 500,
            detail = f"Failed to export LoRA adapter: {str(e)}",
        )
+
+
+# ─────────────────────────────────────────────────────────────────────
+# Live export log stream (Server-Sent Events)
+# ─────────────────────────────────────────────────────────────────────
+#
+# The export worker subprocess redirects its stdout/stderr into a pipe
+# that a reader thread forwards to the orchestrator as log entries (see
+# core/export/worker.py::_setup_log_capture and
+# core/export/orchestrator.py::_append_log). This endpoint streams
+# those entries to the browser so the export dialog can show a live
+# terminal-style output panel while load_checkpoint / export_merged /
+# export_gguf / export_lora / export_base run.
+#
+# Shape follows the training progress SSE endpoint
+# (routes/training.py::stream_training_progress): each event carries
+# `id`, `event`, and `data` fields, the stream starts with a `retry:`
+# directive, and `Last-Event-ID` is honored on reconnect.
+
+
+def _format_sse(data: str, event: str, event_id: Optional[int] = None) -> str:
+    """Format a single SSE message with id/event/data fields."""
+    lines = []
+    if event_id is not None:
+        lines.append(f"id: {event_id}")
+    lines.append(f"event: {event}")
+    lines.append(f"data: {data}")
+    lines.append("")
+    lines.append("")
+    return "\n".join(lines)
+
+
+@router.get("/logs/stream")
+async def stream_export_logs(
+    request: Request,
+    since: Optional[int] = Query(
+        None,
+        description = "Return log entries with seq strictly greater than this cursor.",
+    ),
+    current_subject: str = Depends(get_current_subject),
+):
+    """
+    Stream live stdout/stderr output from the export worker subprocess
+    as Server-Sent Events.
+
+    Events:
+      - `log`      : a single log line (data: {"stream","line","ts"})
+      - `heartbeat`: periodic keepalive when no new lines are available
+      - `complete` : emitted once the export worker is idle and no new
+                     lines arrived for ~1 second. Clients should close.
+      - `error`    : unrecoverable server-side error
+
+    The `id:` field on each event is the log entry's monotonic seq
+    number so the browser can resume via `Last-Event-ID` on reconnect.
+    """
+    backend = get_export_backend()
+
+    # Determine starting cursor. Explicit `since` wins, then
+    # Last-Event-ID header on reconnect, otherwise start from the
+    # run-start snapshot captured by clear_logs() so the client sees
+    # every line emitted since the current run began -- even if the
+    # SSE connection opened after the POST that kicked off the export.
+    # Using get_current_log_seq() here would lose the early bootstrap
+    # lines that arrive in the gap between POST and SSE connect.
+    last_event_id = request.headers.get("last-event-id")
+    if since is None and last_event_id is not None:
+        try:
+            since = int(last_event_id)
+        except ValueError:
+            pass
+
+    if since is None:
+        cursor = backend.get_run_start_seq()
+    else:
+        cursor = max(0, int(since))
+
+    async def event_generator() -> AsyncGenerator[str, None]:
+        nonlocal cursor
+        # Tell the browser to reconnect after 3 seconds if the
+        # connection drops mid-export.
+        yield "retry: 3000\n\n"
+
+        last_yield = time.monotonic()
+        idle_since: Optional[float] = None
+        try:
+            while True:
+                if await request.is_disconnected():
+                    return
+
+                entries, new_cursor = backend.get_logs_since(cursor)
+                if entries:
+                    for entry in entries:
+                        payload = json.dumps(
+                            {
+                                "stream": entry.get("stream", "stdout"),
+                                "line": entry.get("line", ""),
+                                "ts": entry.get("ts"),
+                            }
+                        )
+                        yield _format_sse(
+                            payload,
+                            event = "log",
+                            event_id = int(entry.get("seq", 0)),
+                        )
+                    cursor = new_cursor
+                    last_yield = time.monotonic()
+                    idle_since = None
+                else:
+                    now = time.monotonic()
+                    if now - last_yield > 10.0:
+                        yield _format_sse("{}", event = "heartbeat")
+                        last_yield = now
+                    if not backend.is_export_active():
+                        # Give the reader thread a moment to drain any
+                        # trailing lines the worker process printed
+                        # just before signalling done.
+                        if idle_since is None:
+                            idle_since = now
+                        elif now - idle_since > 1.0:
+                            yield _format_sse(
+                                "{}",
+                                event = "complete",
+                                event_id = cursor,
+                            )
+                            return
+                    else:
+                        idle_since = None
+
+                await asyncio.sleep(0.1)
+        except asyncio.CancelledError:
+            # Client disconnected mid-yield. Don't re-raise, just end
+            # the generator cleanly so StreamingResponse finalizes.
+            return
+        except Exception as exc:
+            logger.error("Export log stream failed: %s", exc, exc_info = True)
+            try:
+                yield _format_sse(
+                    json.dumps({"error": str(exc)}),
+                    event = "error",
+                )
+            except Exception:
+                pass
+
+    return StreamingResponse(
+        event_generator(),
+        media_type = "text/event-stream",
+        headers = {
+            "Cache-Control": "no-cache",
+            "Connection": "keep-alive",
+            "X-Accel-Buffering": "no",
+        },
+    )
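
Note on the export log stream above: the wire framing and the cursor precedence (`since`, then `Last-Event-ID`, then the run-start snapshot) can be illustrated without a server. This is a minimal sketch, not code from the patch. `format_sse` mirrors `_format_sse`, while `resolve_cursor` and `parse_sse` are hypothetical helpers, and the parser handles only the simple one-line `field: value` frames this endpoint emits.

```python
from typing import Dict, List, Optional


def format_sse(data: str, event: str, event_id: Optional[int] = None) -> str:
    """Frame one SSE message; a blank line terminates the event."""
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")
    lines.append(f"event: {event}")
    lines.append(f"data: {data}")
    return "\n".join(lines) + "\n\n"


def resolve_cursor(
    since: Optional[int], last_event_id: Optional[str], run_start_seq: int
) -> int:
    """Explicit `since` wins, then Last-Event-ID, then the run-start seq."""
    if since is None and last_event_id is not None:
        try:
            since = int(last_event_id)
        except ValueError:
            pass  # Malformed header: fall through to the run-start snapshot.
    return run_start_seq if since is None else max(0, since)


def parse_sse(stream: str) -> List[Dict[str, str]]:
    """Split a buffered stream into events (single-line fields only)."""
    events = []
    for block in stream.split("\n\n"):
        fields = {}
        for line in block.splitlines():
            if ": " in line:
                key, value = line.split(": ", 1)
                fields[key] = value
        if fields:
            events.append(fields)
    return events


if __name__ == "__main__":
    wire = "retry: 3000\n\n" + format_sse('{"line": "hi"}', "log", 7)
    for event in parse_sse(wire):
        print(event)
```

A browser `EventSource` does the parsing and the `Last-Event-ID` replay itself; the parser here exists only to make the framing visible.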
--- a/studio/backend/routes/inference.py
+++ b/studio/backend/routes/inference.py
--- a/studio/backend/routes/models.py
+++ b/studio/backend/routes/models.py
--- a/studio/backend/routes/training.py
+++ b/studio/backend/routes/training.py
@ -14,6 +14,7 @@ import structlog
 from loggers import get_logger
 import asyncio
 from datetime import datetime
+import uuid as _uuid

 # Add backend directory to path
 # The backend code should be in the same directory structure
@ -87,14 +88,22 @@ async def get_hardware_utilization(
    Get a live snapshot of GPU hardware utilization.

    Designed to be polled by the frontend during training.
-    Returns GPU utilization %, temperature, VRAM usage, and power draw
-    via nvidia-smi for maximum accuracy.
+    Returns live GPU memory usage information for the active backend.
    """
    from utils.hardware import get_gpu_utilization

    return get_gpu_utilization()


+@router.get("/hardware/visible")
+async def get_visible_hardware_utilization(
+    current_subject: str = Depends(get_current_subject),
+):
+    from utils.hardware import get_visible_gpu_utilization
+
+    return get_visible_gpu_utilization()
+
+
@router.post("/start")
 async def start_training(
    request: TrainingStartRequest,
@ -115,15 +124,11 @@ async def start_training(

        backend = get_training_backend()

-        # Generate job ID and attach to backend for later status/progress calls
-        job_id = f"job_{datetime.now().strftime('%Y%m%d_%H%M%S')}"
-        backend.current_job_id = job_id
-
-        # Check if training is already active
+        # Check if training is already active (before mutating any state)
        if backend.is_training_active():
            existing_job_id: Optional[str] = getattr(backend, "current_job_id", "")
            return TrainingJobResponse(
-                job_id = existing_job_id or job_id,
+                job_id = existing_job_id or "",
                status = "error",
                message = (
                    "Training is already in progress. "
@ -132,6 +137,12 @@ async def start_training(
                error = "Training already active",
            )

+        # Generate job ID — passed into start_training() which sets it on the
+        # backend only after confirming the old pump thread is dead.
+        job_id = (
+            f"job_{datetime.now().strftime('%Y%m%d_%H%M%S')}_{_uuid.uuid4().hex[:8]}"
+        )
+
        # Validate dataset paths if provided
        if request.local_datasets:
            request.local_datasets = _validate_local_dataset_paths(
@ -199,6 +210,7 @@ async def start_training(
            "enable_tensorboard": request.enable_tensorboard,
            "tensorboard_dir": request.tensorboard_dir or "",
            "trust_remote_code": request.trust_remote_code,
+            "gpu_ids": request.gpu_ids,
        }

        # Training page has no trust_remote_code toggle — the value comes from
@ -248,12 +260,12 @@ async def start_training(
            logger.warning("Could not shut down export subprocess: %s", e)

        # start_training now spawns a subprocess (non-blocking)
-        success = backend.start_training(**training_kwargs)
+        success = backend.start_training(job_id = job_id, **training_kwargs)

        if not success:
            progress_error = backend.trainer.training_progress.error
            return TrainingJobResponse(
-                job_id = job_id,
+                job_id = backend.current_job_id or "",
                status = "error",
                message = progress_error or "Failed to start training subprocess",
                error = progress_error or "subprocess_start_failed",
@ -266,6 +278,9 @@ async def start_training(
            error = None,
        )

+    except ValueError as e:
+        logger.warning("Rejected training GPU selection: %s", e)
+        raise HTTPException(status_code = 400, detail = str(e))
    except Exception as e:
        logger.error(f"Error starting training: {e}", exc_info = True)
        raise HTTPException(
@ -345,7 +360,7 @@ async def reset_training(
            error = None,
            status_message = "Ready to train",
            step = 0,
-            loss = 0.0,
+            loss = None,
            epoch = 0,
            total_steps = 0,
        )
@ -419,8 +434,8 @@ async def get_training_status(
                "epoch": getattr(progress, "epoch", 0),
                "step": getattr(progress, "step", 0),
                "total_steps": getattr(progress, "total_steps", 0),
-                "loss": getattr(progress, "loss", 0.0),
-                "learning_rate": getattr(progress, "learning_rate", 0.0),
+                "loss": getattr(progress, "loss", None),
+                "learning_rate": getattr(progress, "learning_rate", None),
            }

        # Build metric history for chart recovery after SSE reconnection
@ -526,8 +541,8 @@ async def stream_training_progress(
        # ── Helpers ──────────────────────────────────────────────
        def build_progress(
            step: int,
-            loss: float,
-            learning_rate: float,
+            loss: Optional[float],
+            learning_rate: Optional[float],
            total_steps: int,
            epoch: Optional[float] = None,
            progress: Optional[Any] = None,
@ -604,10 +619,10 @@ async def stream_training_progress(
                    loss_val = (
                        backend.loss_history[i]
                        if i < len(backend.loss_history)
-                        else 0.0
+                        else None
                    )
                    lr_val = (
-                        backend.lr_history[i] if i < len(backend.lr_history) else 0.0
+                        backend.lr_history[i] if i < len(backend.lr_history) else None
                    )
                    tp_replay = getattr(
                        getattr(backend, "trainer", None), "training_progress", None
@ -645,8 +660,8 @@ async def stream_training_progress(

            initial_progress = build_progress(
                step = 0,
-                loss = 0.0,
-                learning_rate = 0.0,
+                loss = None,
+                learning_rate = None,
                total_steps = initial_total_steps,
                epoch = initial_epoch,
                progress = tp,
@ -660,9 +675,9 @@ async def stream_training_progress(
                if backend.step_history:
                    final_step = backend.step_history[-1]
                    final_loss = (
-                        backend.loss_history[-1] if backend.loss_history else 0.0
+                        backend.loss_history[-1] if backend.loss_history else None
                    )
-                    final_lr = backend.lr_history[-1] if backend.lr_history else 0.0
+                    final_lr = backend.lr_history[-1] if backend.lr_history else None
                    final_total_steps = (
                        getattr(tp, "total_steps", final_step) if tp else final_step
                    )
@ -680,7 +695,9 @@ async def stream_training_progress(
                    )
                else:
                    yield format_sse(
-                        build_progress(-1, 0.0, 0.0, 0, progress = tp).model_dump_json(),
+                        build_progress(
+                            -1, None, None, 0, progress = tp
+                        ).model_dump_json(),
                        event = "complete",
                        event_id = 0,
                    )
@ -698,9 +715,9 @@ async def stream_training_progress(
                if backend.step_history:
                    current_step = backend.step_history[-1]
                    current_loss = (
-                        backend.loss_history[-1] if backend.loss_history else 0.0
+                        backend.loss_history[-1] if backend.loss_history else None
                    )
-                    current_lr = backend.lr_history[-1] if backend.lr_history else 0.0
+                    current_lr = backend.lr_history[-1] if backend.lr_history else None
                    tp_inner = getattr(
                        getattr(backend, "trainer", None), "training_progress", None
                    )
@ -763,8 +780,8 @@ async def stream_training_progress(
                        )
                        preparing_payload = build_progress(
                            0,
-                            0.0,
-                            0.0,
+                            None,
+                            None,
                            prep_total,
                            progress = tp_prep,
                        )
@ -781,7 +798,7 @@ async def stream_training_progress(
                        getattr(backend, "trainer", None), "training_progress", None
                    )
                    timeout_payload = build_progress(
-                        last_step, 0.0, 0.0, 0, progress = tp_timeout
+                        last_step, None, None, 0, progress = tp_timeout
                    )
                    yield format_sse(
                        timeout_payload.model_dump_json(),
@ -797,7 +814,7 @@ async def stream_training_progress(
                tp_error = getattr(
                    getattr(backend, "trainer", None), "training_progress", None
                )
-                error_payload = build_progress(0, 0.0, 0.0, 0, progress = tp_error)
+                error_payload = build_progress(0, None, None, 0, progress = tp_error)
                yield format_sse(
                    error_payload.model_dump_json(),
                    event = "error",
@ -807,8 +824,8 @@ async def stream_training_progress(

        # ── Final "complete" event ───────────────────────────────
        final_step = backend.step_history[-1] if backend.step_history else last_step
-        final_loss = backend.loss_history[-1] if backend.loss_history else 0.0
-        final_lr = backend.lr_history[-1] if backend.lr_history else 0.0
+        final_loss = backend.loss_history[-1] if backend.loss_history else None
+        final_lr = backend.lr_history[-1] if backend.lr_history else None
        final_tp = getattr(getattr(backend, "trainer", None), "training_progress", None)
        final_total_steps = (
            getattr(final_tp, "total_steps", final_step) if final_tp else final_step
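
Note on the training route changes above: two small behaviors are easy to show on their own. Job IDs now carry a random suffix so two starts within the same second do not collide, and missing metrics are reported as `None` (JSON `null`) instead of `0.0`. The sketch below is illustrative only. `make_job_id` and `value_at` are hypothetical names, not functions from the patch.

```python
import json
import uuid
from datetime import datetime
from typing import List, Optional


def make_job_id(now: Optional[datetime] = None) -> str:
    """Timestamp plus 8 random hex characters, as in the patched route."""
    now = now or datetime.now()
    return f"job_{now.strftime('%Y%m%d_%H%M%S')}_{uuid.uuid4().hex[:8]}"


def value_at(history: List[float], i: int) -> Optional[float]:
    """Return the metric at step index i, or None if it was never recorded."""
    return history[i] if i < len(history) else None


if __name__ == "__main__":
    print(make_job_id())
    # A missing loss serializes as null rather than a misleading 0.0.
    print(json.dumps({"loss": value_at([], 0)}))
```

Using `None` lets a chart skip the point, whereas `0.0` would be plotted as a real loss value.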
--- a/studio/backend/routes/training_history.py
+++ b/studio/backend/routes/training_history.py
@ -0,0 +1,85 @@
+# SPDX-License-Identifier: AGPL-3.0-only
+# Copyright 2026-present the Unsloth AI Inc. team. All rights reserved. See /studio/LICENSE.AGPL-3.0
+
+"""
+Training history API routes — browse, view, and delete past training runs.
+"""
+
+import json
+
+from fastapi import APIRouter, Depends, HTTPException, Query
+from loggers import get_logger
+
+from auth.authentication import get_current_subject
+from models import (
+    TrainingRunDeleteResponse,
+    TrainingRunDetailResponse,
+    TrainingRunListResponse,
+    TrainingRunMetrics,
+    TrainingRunSummary,
+)
+from storage.studio_db import delete_run, get_run, get_run_metrics, list_runs
+
+logger = get_logger(__name__)
+
+router = APIRouter()
+
+
+@router.get("/runs", response_model = TrainingRunListResponse)
+async def list_training_runs(
+    limit: int = Query(50, ge = 1, le = 200),
+    offset: int = Query(0, ge = 0),
+    current_subject: str = Depends(get_current_subject),
+):
+    """List training runs, newest first."""
+    result = list_runs(limit = limit, offset = offset)
+    return TrainingRunListResponse(
+        runs = [TrainingRunSummary(**r) for r in result["runs"]],
+        total = result["total"],
+    )
+
+
+@router.get("/runs/{run_id}", response_model = TrainingRunDetailResponse)
+async def get_training_run_detail(
+    run_id: str,
+    current_subject: str = Depends(get_current_subject),
+):
+    """Get a single training run with full config and metrics."""
+    run = get_run(run_id)
+    if run is None:
+        raise HTTPException(status_code = 404, detail = f"Run {run_id} not found")
+
+    try:
+        config = json.loads(run.get("config_json", "{}"))
+    except (json.JSONDecodeError, TypeError):
+        logger.debug("Failed to parse config_json for run %s", run_id)
+        config = {}
+
+    metrics_data = get_run_metrics(run_id)
+
+    return TrainingRunDetailResponse(
+        run = TrainingRunSummary(**{k: v for k, v in run.items() if k != "config_json"}),
+        config = config,
+        metrics = TrainingRunMetrics(**metrics_data),
+    )
+
+
+@router.delete("/runs/{run_id}", response_model = TrainingRunDeleteResponse)
+async def delete_training_run(
+    run_id: str,
+    current_subject: str = Depends(get_current_subject),
+):
+    """Delete a training run and its metrics (CASCADE)."""
+    run = get_run(run_id)
+    if run is None:
+        raise HTTPException(status_code = 404, detail = f"Run {run_id} not found")
+    if run["status"] == "running":
+        raise HTTPException(
+            status_code = 409, detail = "Cannot delete a running training run"
+        )
+    logger.info("Deleting training run %s", run_id)
+    delete_run(run_id)
+    return TrainingRunDeleteResponse(
+        status = "deleted",
+        message = f"Run {run_id} deleted",
+    )
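
Note on the run detail route above: the config fallback tolerates a missing, null, or corrupt `config_json` column. A minimal sketch of that behavior follows. `split_run` is a hypothetical helper, and the row is assumed to be a plain dict as returned by `get_run`.

```python
import json
from typing import Any, Dict, Tuple


def split_run(run: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Return (summary_fields, config) from one stored run row."""
    try:
        config = json.loads(run.get("config_json", "{}"))
    except (json.JSONDecodeError, TypeError):
        # Corrupt JSON or a NULL column: fall back to an empty config.
        config = {}
    summary = {k: v for k, v in run.items() if k != "config_json"}
    return summary, config


if __name__ == "__main__":
    row = {"id": "run_1", "status": "completed", "config_json": '{"lr": 0.1}'}
    print(split_run(row))
```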
--- a/studio/backend/run.py
+++ b/studio/backend/run.py
@ -24,6 +24,7 @@ if str(backend_dir) not in sys.path:
 import _platform_compat  # noqa: F401

 from loggers import get_logger
+from startup_banner import print_studio_access_banner

 logger = get_logger(__name__)

@ -73,18 +74,79 @@ def _resolve_external_ip() -> str:
        return "0.0.0.0"


+def _get_pid_on_port(port: int) -> "tuple[int, str] | None":
+    """Return (pid, process_name) of the process listening on *port*, or None.
+
+    Uses psutil when available.  Falls back gracefully to None so callers
+    can still report the port conflict without process details.
+
+    Works on Windows, macOS, and Linux wherever psutil is installed.
+    """
+    try:
+        import psutil
+    except ImportError:
+        return None
+    try:
+        for conn in psutil.net_connections(kind = "tcp"):
+            if conn.status == "LISTEN" and conn.laddr.port == port:
+                if conn.pid is None:
+                    return None
+                try:
+                    proc = psutil.Process(conn.pid)
+                    return (conn.pid, proc.name())
+                except (psutil.NoSuchProcess, psutil.AccessDenied):
+                    return (conn.pid, "<unknown>")
+    except (psutil.AccessDenied, OSError) as e:
+        # psutil.net_connections() needs elevated privileges on some platforms
+        logger.debug("Failed to scan network connections for port %s: %s", port, e)
+    return None
+
+
 def _is_port_free(host: str, port: int) -> bool:
-    """Check if a port is available for binding."""
+    """Check if a port is available for binding.
+
+    When *host* is ``0.0.0.0`` (wildcard), we also check whether anything
+    is already listening on ``127.0.0.1`` (and ``::1`` when IPv6 is
+    available).  An SSH tunnel or similar process may hold the loopback
+    address while our wildcard bind still succeeds, making Unsloth Studio
+    unreachable via ``localhost``.
+
+    Works on Windows, macOS, and Linux.
+    """
    import socket

+    # 1. Can we bind to the requested address?
+    #    Use getaddrinfo so both IPv4 ("0.0.0.0") and IPv6 ("::") hosts
+    #    resolve to the correct address family automatically.
    try:
-        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
+        addr_info = socket.getaddrinfo(host, port, socket.AF_UNSPEC, socket.SOCK_STREAM)
+        family, socktype, proto, _, sockaddr = addr_info[0]
+        with socket.socket(family, socktype, proto) as s:
            s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
-            s.bind((host, port))
-            return True
+            s.bind(sockaddr)
    except OSError:
        return False

+    # 2. When binding to all interfaces, verify that localhost is not
+    #    already claimed by another process (e.g. an SSH -L tunnel).
+    #    We attempt a TCP connect -- if it succeeds something is listening.
+    if host in ("0.0.0.0", "::"):
+        for loopback, family in [
+            ("127.0.0.1", socket.AF_INET),
+            ("::1", socket.AF_INET6),
+        ]:
+            try:
+                with socket.socket(family, socket.SOCK_STREAM) as s:
+                    s.settimeout(1)
+                    if s.connect_ex((loopback, port)) == 0:
+                        # Connection succeeded -- port is taken on loopback
+                        return False
+            except OSError:
+                # IPv6 disabled or other OS-level restriction -- skip
+                continue
+
+    return True
+

 def _find_free_port(host: str, start: int, max_attempts: int = 20) -> int:
    """Find a free port starting from `start`, trying up to max_attempts ports."""
@ -97,6 +159,29 @@ def _find_free_port(host: str, start: int, max_attempts: int = 20) -> int:
    )


+_PID_FILE = Path.home() / ".unsloth" / "studio" / "studio.pid"
+
+
+def _write_pid_file():
+    """Write the current process PID to the studio PID file."""
+    try:
+        _PID_FILE.parent.mkdir(parents = True, exist_ok = True)
+        _PID_FILE.write_text(str(os.getpid()))
+    except OSError:
+        pass
+
+
+def _remove_pid_file():
+    """Remove the PID file if it belongs to this process."""
+    try:
+        if _PID_FILE.is_file():
+            stored = _PID_FILE.read_text().strip()
+            if stored == str(os.getpid()):
+                _PID_FILE.unlink(missing_ok = True)
+    except OSError:
+        pass
+
+
 def _graceful_shutdown(server = None):
    """Explicitly shut down all subprocess backends and the uvicorn server.

@ -104,6 +189,7 @@ def _graceful_shutdown(server = None):
    before the parent exits. This is critical on Windows where atexit
    handlers are unreliable after Ctrl+C.
    """
+    _remove_pid_file()
    logger.info("Graceful shutdown initiated — cleaning up subprocesses...")

    # 1. Shut down uvicorn server (releases the listening socket)
@ -149,11 +235,11 @@ def _graceful_shutdown(server = None):
    logger.info("All subprocesses cleaned up")


-# The uvicorn server instance — set by run_server(), used by callers
+# The uvicorn server instance -- set by run_server(), used by callers
 # that need to tell the server to exit (e.g. signal handlers).
 _server = None

-# Shutdown event — used to wake the main loop on signal
+# Shutdown event -- used to wake the main loop on signal
 _shutdown_event = None


@ -162,6 +248,7 @@ def run_server(
    port: int = 8888,
    frontend_path: Path = Path(__file__).resolve().parent.parent / "frontend" / "dist",
    silent: bool = False,
+    llama_parallel_slots: int = 1,
 ):
    """
    Start the FastAPI server.
@ -171,6 +258,7 @@ def run_server(
        port: Port to bind to (auto-increments if in use)
        frontend_path: Path to frontend build directory (optional)
        silent: Suppress startup messages
+        llama_parallel_slots: Number of parallel slots for llama-server

    Note:
        Signal handlers are NOT registered here so that embedders
@ -205,18 +293,31 @@ def run_server(
    # Auto-find free port if requested port is in use
    if not _is_port_free(host, port):
        original_port = port
-        port = _find_free_port(host, port)
+        blocker = _get_pid_on_port(port)
+        port = _find_free_port(host, port + 1)
        if not silent:
-            print(f"Port {original_port} is in use, using port {port} instead")
+            print("")
+            print("=" * 50)
+            if blocker:
+                pid, name = blocker
+                print(
+                    f"Port {original_port} is already in use by {name} (PID {pid})."
+                )
+            else:
+                print(f"Port {original_port} is already in use.")
+            print(f"Unsloth Studio will use port {port} instead.")
+            print(f"Open http://localhost:{port} in your browser.")
+            print("=" * 50)
+            print("")

    # Setup frontend if path provided
    if frontend_path:
        if setup_frontend(app, frontend_path):
            if not silent:
-                print(f"✅ Frontend loaded from {frontend_path}")
+                print(f"[OK] Frontend loaded from {frontend_path}")
        else:
            if not silent:
-                print(f"⚠️ Frontend not found at {frontend_path}")
+                print(f"[WARNING] Frontend not found at {frontend_path}")

    # Create the uvicorn server and expose it for signal handlers
    config = uvicorn.Config(
@ -225,6 +326,15 @@ def run_server(
    _server = uvicorn.Server(config)
    _shutdown_event = Event()

+    # Expose the actual bound port so request-handling code can build
+    # loopback URLs that point at the real backend, not whatever port a
+    # reverse proxy or tunnel exposed in the request URL. Only publish
+    # an explicit value when we know the concrete port; for ephemeral
+    # binds (port==0) leave it unset and let request handlers fall back
+    # to the ASGI request scope or request.base_url.
+    app.state.server_port = port if port and port > 0 else None
+    app.state.llama_parallel_slots = llama_parallel_slots
+
    # Run server in a daemon thread
    def _run():
        asyncio.run(_server.serve())
@ -233,21 +343,27 @@ def run_server(
    thread.start()
    time.sleep(3)

+    _write_pid_file()
+    import atexit
+
+    atexit.register(_remove_pid_file)
+
+    # Expose a shutdown callable via app.state so the /api/shutdown endpoint
+    # can trigger graceful shutdown without circular imports.
+    def _trigger_shutdown():
+        _graceful_shutdown(_server)
+        if _shutdown_event is not None:
+            _shutdown_event.set()
+
+    app.state.trigger_shutdown = _trigger_shutdown
+
    if not silent:
        display_host = _resolve_external_ip() if host == "0.0.0.0" else host
-
-        print("")
-        print("=" * 50)
-        print(f"🦥 Open your web browser, and enter http://localhost:{port}")
-        print("=" * 50)
-        print("")
-        print("=" * 50)
-        print(f"🦥 Unsloth Studio is running on port {port}")
-        print(f"   Local Access:          http://localhost:{port}")
-        print(f"   Worldwide Web Address: http://{display_host}:{port}")
-        print(f"   API:                   http://{display_host}:{port}/api")
-        print(f"   Health:                http://{display_host}:{port}/api/health")
-        print("=" * 50)
+        print_studio_access_banner(
+            port = port,
+            bind_host = host,
+            display_host = display_host,
+        )

    return app

@ -297,7 +413,7 @@ if __name__ == "__main__":
        sys.stderr.flush()
        sys.exit(1)

-    # ── Signal handler — ensures subprocess cleanup on Ctrl+C ────
+    # Signal handler -- ensures subprocess cleanup on Ctrl+C
    def _signal_handler(signum, frame):
        _graceful_shutdown(_server)
        _shutdown_event.set()
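
The PID-file cleanup in this file only unlinks the file when the stored PID matches the current process, so a stale file left by another instance is not touched. A minimal standalone sketch of that ownership check follows. `write_pid_file`, `remove_pid_file`, and the temporary path are hypothetical stand-ins, because `_write_pid_file` and `_PID_FILE` are defined outside this excerpt and may differ.

```python
import os
import tempfile
from pathlib import Path


def write_pid_file(pid_file: Path) -> None:
    """Record the current process ID (hypothetical helper, not the real one)."""
    pid_file.write_text(str(os.getpid()))


def remove_pid_file(pid_file: Path) -> bool:
    """Delete the PID file only if it was written by this process.

    Returns True when the file was removed, False otherwise.
    """
    try:
        if pid_file.is_file():
            stored = pid_file.read_text().strip()
            if stored == str(os.getpid()):
                pid_file.unlink(missing_ok=True)
                return True
    except OSError:
        # Best-effort cleanup: a vanished or unreadable file is not an error.
        pass
    return False


# Usage: our own file is removed, a file owned by another PID is kept.
with tempfile.TemporaryDirectory() as tmp:
    own = Path(tmp) / "own.pid"
    write_pid_file(own)
    removed_own = remove_pid_file(own)

    foreign = Path(tmp) / "foreign.pid"
    foreign.write_text(str(os.getpid() + 1))
    removed_foreign = remove_pid_file(foreign)
    foreign_still_exists = foreign.exists()
```

Registering the same function with `atexit` and calling it from the graceful-shutdown path is safe under this scheme, because the second call finds no file and returns without error.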
--- a/studio/backend/startup_banner.py
+++ b/studio/backend/startup_banner.py
@ -0,0 +1,123 @@
+# SPDX-License-Identifier: AGPL-3.0-only
+# Copyright 2026-present the Unsloth AI Inc. team. All rights reserved. See /studio/LICENSE.AGPL-3.0
+
+"""Terminal banner for Studio startup.
+
+Stdlib only — safe to import without the rest of the backend (no structlog/uvicorn).
+"""
+
+from __future__ import annotations
+
+import os
+import sys
+
+
+def stdout_supports_color() -> bool:
+    """True if we should emit ANSI colors."""
+    if os.environ.get("NO_COLOR", "").strip():
+        return False
+    if os.environ.get("FORCE_COLOR", "").strip():
+        return True
+    try:
+        return sys.stdout.isatty()
+    except (AttributeError, OSError, ValueError):
+        return False
+
+
+def print_port_in_use_notice(original_port: int, new_port: int) -> None:
+    """Message when the requested port is taken and another is chosen."""
+    msg = f"Port {original_port} is in use, using port {new_port} instead."
+    if stdout_supports_color():
+        print(f"\033[38;5;245m{msg}\033[0m")
+    else:
+        print(msg)
+
+
+def print_studio_access_banner(
+    *,
+    port: int,
+    bind_host: str,
+    display_host: str,
+) -> None:
+    """Pretty-print URLs after the server is listening (beginner-friendly)."""
+    use_color = stdout_supports_color()
+    dim = "\033[38;5;245m"
+    title = "\033[38;5;150m"
+    local_url_style = "\033[38;5;108;1m"
+    secondary = "\033[38;5;109m"
+    reset = "\033[0m"
+
+    def style(text: str, code: str) -> str:
+        return f"{code}{text}{reset}" if use_color else text
+
+    ipv6_bind = bind_host in ("::", "::1")
+    if ipv6_bind:
+        loopback_url = f"http://[::1]:{port}"
+        alt_local = f"http://localhost:{port}"
+    else:
+        loopback_url = f"http://127.0.0.1:{port}"
+        alt_local = f"http://localhost:{port}"
+    if ":" in display_host:
+        external_url = f"http://[{display_host}]:{port}"
+    else:
+        external_url = f"http://{display_host}:{port}"
+
+    listen_all = bind_host in ("0.0.0.0", "::")
+    loopback_bind = bind_host in ("127.0.0.1", "localhost", "::1")
+
+    # Use loopback URL only when the server is reachable on loopback;
+    # otherwise show the actual bound address.
+    primary_url = loopback_url if listen_all or loopback_bind else external_url
+    tip_url = alt_local if listen_all or loopback_bind else external_url
+    api_base = primary_url
+
+    lines: list[str] = [
+        "",
+        style("🦥 Unsloth Studio is running", title),
+        style("─" * 52, dim),
+        style("  On this machine -- open this in your browser:", dim),
+        style(f"    {primary_url}", local_url_style),
+    ]
+
+    if (listen_all or loopback_bind) and primary_url != alt_local:
+        lines.append(style(f"    (same as {alt_local})", dim))
+
+    if listen_all and display_host not in (
+        "127.0.0.1",
+        "localhost",
+        "::1",
+        "0.0.0.0",
+        "::",
+    ):
+        lines.extend(
+            [
+                "",
+                style("  From another device on your network / to share:", dim),
+                style(f"    {external_url}", secondary),
+            ]
+        )
+    elif not listen_all and not loopback_bind and external_url != primary_url:
+        lines.extend(
+            [
+                "",
+                style("  Bound address:", dim),
+                style(f"    {external_url}", secondary),
+            ]
+        )
+
+    lines.extend(
+        [
+            "",
+            style("  API & health:", dim),
+            style(f"    {api_base}/api", secondary),
+            style(f"    {api_base}/api/health", secondary),
+            style("─" * 52, dim),
+            style(
+                f"  Tip: if you are on this computer, open {tip_url}/ in your browser.",
+                dim,
+            ),
+            "",
+        ]
+    )
+
+    print("\n".join(lines))
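
The colour decision in `stdout_supports_color` checks `NO_COLOR` first, then `FORCE_COLOR`, and only then falls back to `isatty()`. The sketch below restates that precedence with the stream and environment passed in explicitly so it can be exercised without a terminal. `supports_color` and `style` here are illustrative names, not part of the module.

```python
import io
import os
import sys


def supports_color(stream=None, env=None) -> bool:
    """Apply the same precedence: NO_COLOR, then FORCE_COLOR, then isatty()."""
    env = os.environ if env is None else env
    stream = sys.stdout if stream is None else stream
    if env.get("NO_COLOR", "").strip():
        return False
    if env.get("FORCE_COLOR", "").strip():
        return True
    try:
        return stream.isatty()
    except (AttributeError, OSError, ValueError):
        # Replaced, closed, or None streams are treated as non-terminals.
        return False


def style(text: str, code: str, use_color: bool) -> str:
    """Wrap text in an ANSI escape sequence only when colour is enabled."""
    return f"{code}{text}\033[0m" if use_color else text


# A StringIO is never a TTY, so only FORCE_COLOR can turn colour on here.
buffer = io.StringIO()
plain = style("ready", "\033[38;5;150m", supports_color(buffer, {}))
forced = style("ready", "\033[38;5;150m", supports_color(buffer, {"FORCE_COLOR": "1"}))
```

Because `NO_COLOR` is tested first, setting both variables disables colour, and a whitespace-only value counts as unset because of the `.strip()` call.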
--- a/studio/backend/storage/__init__.py
+++ b/studio/backend/storage/__init__.py
@ -0,0 +1,2 @@
+# SPDX-License-Identifier: AGPL-3.0-only
+# Copyright 2026-present the Unsloth AI Inc. team. All rights reserved. See /studio/LICENSE.AGPL-3.0
--- a/studio/backend/storage/studio_db.py
+++ b/studio/backend/storage/studio_db.py
@ -0,0 +1,488 @@
+# SPDX-License-Identifier: AGPL-3.0-only
+# Copyright 2026-present the Unsloth AI Inc. team. All rights reserved. See /studio/LICENSE.AGPL-3.0
+
+"""
+SQLite storage for training run history and metrics.
+
+Follows the same pattern as auth/storage.py — module-level functions,
+raw sqlite3, per-function connections. Enhancements over auth:
+  - WAL mode for concurrent read/write access
+  - PRAGMA foreign_keys = ON for CASCADE deletes
+"""
+
+import json
+import logging
+import os
+import platform
+import sqlite3
+import threading
+from datetime import datetime, timezone
+
+from typing import Optional
+
+from utils.paths import studio_db_path, ensure_dir
+
+logger = logging.getLogger(__name__)
+
+
+def _denied_path_prefixes() -> list[str]:
+    """Platform-aware denylist of system directories."""
+    system = platform.system()
+    if system == "Linux":
+        return ["/proc", "/sys", "/dev", "/etc", "/boot", "/run"]
+    if system == "Darwin":
+        # realpath() resolves /etc -> /private/etc, /tmp -> /private/tmp on macOS,
+        # so include the /private variants to avoid bypasses.
+        return [
+            "/System",
+            "/Library",
+            "/dev",
+            "/etc",
+            "/private/etc",
+            "/tmp",
+            "/private/tmp",
+            "/var",
+            "/private/var",
+        ]
+    if system == "Windows":
+        win = os.environ.get("SystemRoot", r"C:\Windows")
+        pf = os.environ.get("ProgramFiles", r"C:\Program Files")
+        pf86 = os.environ.get("ProgramFiles(x86)", r"C:\Program Files (x86)")
+        return [os.path.normcase(p) for p in [win, pf, pf86]]
+    return []
+
+
+_schema_lock = threading.Lock()
+_schema_ready = False
+
+
+def _ensure_schema(conn: sqlite3.Connection) -> None:
+    """Create tables and indexes if they don't exist. Called once per process."""
+    conn.execute("PRAGMA journal_mode=WAL")
+    conn.execute(
+        """
+        CREATE TABLE IF NOT EXISTS training_runs (
+            id TEXT NOT NULL PRIMARY KEY,
+            status TEXT NOT NULL DEFAULT 'running',
+            model_name TEXT NOT NULL,
+            dataset_name TEXT NOT NULL,
+            config_json TEXT NOT NULL,
+            started_at TEXT NOT NULL,
+            ended_at TEXT,
+            total_steps INTEGER,
+            final_step INTEGER,
+            final_loss REAL,
+            output_dir TEXT,
+            error_message TEXT,
+            duration_seconds REAL,
+            loss_sparkline TEXT
+        )
+        """
+    )
+    conn.execute(
+        """
+        CREATE TABLE IF NOT EXISTS training_metrics (
+            id INTEGER PRIMARY KEY AUTOINCREMENT,
+            run_id TEXT NOT NULL REFERENCES training_runs(id) ON DELETE CASCADE,
+            step INTEGER NOT NULL,
+            loss REAL,
+            learning_rate REAL,
+            grad_norm REAL,
+            eval_loss REAL,
+            epoch REAL,
+            num_tokens INTEGER,
+            elapsed_seconds REAL,
+            UNIQUE(run_id, step)
+        )
+        """
+    )
+    conn.execute(
+        "CREATE INDEX IF NOT EXISTS idx_metrics_run_id ON training_metrics(run_id)"
+    )
+    # Use COLLATE NOCASE on Windows so C:\Models and c:\models dedup via the
+    # UNIQUE constraint.  On Linux/macOS (case-sensitive FS) keep the default
+    # BINARY collation so /Models and /models remain distinct.
+    collation = "COLLATE NOCASE" if platform.system() == "Windows" else ""
+    conn.execute(
+        f"""
+        CREATE TABLE IF NOT EXISTS scan_folders (
+            id INTEGER PRIMARY KEY AUTOINCREMENT,
+            path TEXT NOT NULL UNIQUE {collation},
+            created_at TEXT NOT NULL
+        )
+        """
+    )
+
+
+def get_connection() -> sqlite3.Connection:
+    """Open studio.db with WAL mode, create tables once per process, enable foreign keys."""
+    global _schema_ready
+    db_path = studio_db_path()
+    ensure_dir(db_path.parent)
+    conn = sqlite3.connect(str(db_path))
+    conn.row_factory = sqlite3.Row
+    # foreign_keys is session-scoped, must be set per connection
+    conn.execute("PRAGMA foreign_keys=ON")
+    if not _schema_ready:
+        with _schema_lock:
+            if not _schema_ready:
+                try:
+                    _ensure_schema(conn)
+                    _schema_ready = True
+                except Exception:
+                    conn.close()
+                    raise
+    return conn
+
+
+def create_run(
+    id: str,
+    model_name: str,
+    dataset_name: str,
+    config_json: str,
+    started_at: str,
+    total_steps: Optional[int],
+) -> None:
+    conn = get_connection()
+    try:
+        conn.execute(
+            """
+            INSERT INTO training_runs (id, model_name, dataset_name, config_json, started_at, total_steps)
+            VALUES (?, ?, ?, ?, ?, ?)
+            """,
+            (id, model_name, dataset_name, config_json, started_at, total_steps),
+        )
+        conn.commit()
+    finally:
+        conn.close()
+
+
+def update_run_total_steps(id: str, total_steps: int) -> None:
+    conn = get_connection()
+    try:
+        conn.execute(
+            "UPDATE training_runs SET total_steps = ? WHERE id = ?",
+            (total_steps, id),
+        )
+        conn.commit()
+    finally:
+        conn.close()
+
+
+def update_run_progress(
+    id: str, step: int, loss: Optional[float], duration_seconds: Optional[float]
+) -> None:
+    """Update current progress on a running training run (called on each metric flush)."""
+    conn = get_connection()
+    try:
+        conn.execute(
+            "UPDATE training_runs SET final_step = ?, final_loss = ?, duration_seconds = ? WHERE id = ?",
+            (step, loss, duration_seconds, id),
+        )
+        conn.commit()
+    finally:
+        conn.close()
+
+
+def finish_run(
+    id: str,
+    status: str,
+    ended_at: str,
+    final_step: Optional[int],
+    final_loss: Optional[float],
+    duration_seconds: Optional[float],
+    loss_sparkline: Optional[str] = None,
+    output_dir: Optional[str] = None,
+    error_message: Optional[str] = None,
+) -> None:
+    conn = get_connection()
+    try:
+        conn.execute(
+            """
+            UPDATE training_runs
+            SET status = ?, ended_at = ?, final_step = ?, final_loss = ?,
+                duration_seconds = ?, loss_sparkline = ?, output_dir = ?,
+                error_message = ?
+            WHERE id = ?
+            """,
+            (
+                status,
+                ended_at,
+                final_step,
+                final_loss,
+                duration_seconds,
+                loss_sparkline,
+                output_dir,
+                error_message,
+                id,
+            ),
+        )
+        conn.commit()
+    finally:
+        conn.close()
+
+
+def insert_metrics_batch(run_id: str, metrics: list[dict]) -> None:
+    if not metrics:
+        return
+    conn = get_connection()
+    try:
+        conn.executemany(
+            """
+            INSERT INTO training_metrics
+                (run_id, step, loss, learning_rate, grad_norm, eval_loss, epoch, num_tokens, elapsed_seconds)
+            VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)
+            ON CONFLICT(run_id, step) DO UPDATE SET
+                loss = COALESCE(excluded.loss, loss),
+                learning_rate = COALESCE(excluded.learning_rate, learning_rate),
+                grad_norm = COALESCE(excluded.grad_norm, grad_norm),
+                eval_loss = COALESCE(excluded.eval_loss, eval_loss),
+                epoch = COALESCE(excluded.epoch, epoch),
+                num_tokens = COALESCE(excluded.num_tokens, num_tokens),
+                elapsed_seconds = COALESCE(excluded.elapsed_seconds, elapsed_seconds)
+            """,
+            [
+                (
+                    run_id,
+                    m.get("step"),
+                    m.get("loss"),
+                    m.get("learning_rate"),
+                    m.get("grad_norm"),
+                    m.get("eval_loss"),
+                    m.get("epoch"),
+                    m.get("num_tokens"),
+                    m.get("elapsed_seconds"),
+                )
+                for m in metrics
+            ],
+        )
+        conn.commit()
+    finally:
+        conn.close()
+
+
+def list_runs(limit: int = 50, offset: int = 0) -> dict:
+    conn = get_connection()
+    try:
+        total = conn.execute("SELECT COUNT(*) FROM training_runs").fetchone()[0]
+        rows = conn.execute(
+            """
+            SELECT id, status, model_name, dataset_name, started_at, ended_at,
+                   total_steps, final_step, final_loss, output_dir,
+                   duration_seconds, error_message, loss_sparkline
+            FROM training_runs
+            ORDER BY started_at DESC
+            LIMIT ? OFFSET ?
+            """,
+            (limit, offset),
+        ).fetchall()
+        runs = []
+        for row in rows:
+            run = dict(row)
+            sparkline = run.get("loss_sparkline")
+            if sparkline:
+                try:
+                    run["loss_sparkline"] = json.loads(sparkline)
+                except (json.JSONDecodeError, TypeError):
+                    logger.debug(
+                        "Failed to parse loss_sparkline for run %s", run.get("id")
+                    )
+                    run["loss_sparkline"] = None
+            runs.append(run)
+        return {"runs": runs, "total": total}
+    finally:
+        conn.close()
+
+
+def get_run(id: str) -> Optional[dict]:
+    conn = get_connection()
+    try:
+        row = conn.execute("SELECT * FROM training_runs WHERE id = ?", (id,)).fetchone()
+        if row is None:
+            return None
+        run = dict(row)
+        sparkline = run.get("loss_sparkline")
+        if sparkline:
+            try:
+                run["loss_sparkline"] = json.loads(sparkline)
+            except (json.JSONDecodeError, TypeError):
+                logger.debug("Failed to parse loss_sparkline for run %s", id)
+                run["loss_sparkline"] = None
+        return run
+    finally:
+        conn.close()
+
+
+def get_run_metrics(id: str) -> dict:
+    """Return metric arrays for a run, using paired step arrays per metric."""
+    conn = get_connection()
+    try:
+        rows = conn.execute(
+            """
+            SELECT step, loss, learning_rate, grad_norm, eval_loss, epoch,
+                   num_tokens, elapsed_seconds
+            FROM training_metrics
+            WHERE run_id = ?
+            ORDER BY step
+            """,
+            (id,),
+        ).fetchall()
+
+        step_history: list[int] = []
+        loss_history: list[float] = []
+        loss_step_history: list[int] = []
+        lr_history: list[float] = []
+        lr_step_history: list[int] = []
+        grad_norm_history: list[float] = []
+        grad_norm_step_history: list[int] = []
+        eval_loss_history: list[float] = []
+        eval_step_history: list[int] = []
+        final_epoch: float | None = None
+        final_num_tokens: int | None = None
+
+        for row in rows:
+            step = row["step"]
+            step_history.append(step)
+            if step > 0 and row["loss"] is not None:
+                loss_history.append(row["loss"])
+                loss_step_history.append(step)
+            if step > 0 and row["learning_rate"] is not None:
+                lr_history.append(row["learning_rate"])
+                lr_step_history.append(step)
+            if step > 0 and row["grad_norm"] is not None:
+                grad_norm_history.append(row["grad_norm"])
+                grad_norm_step_history.append(step)
+            if step > 0 and row["eval_loss"] is not None:
+                eval_loss_history.append(row["eval_loss"])
+                eval_step_history.append(step)
+            if row["epoch"] is not None:
+                final_epoch = row["epoch"]
+            if row["num_tokens"] is not None:
+                final_num_tokens = row["num_tokens"]
+
+        return {
+            "step_history": step_history,
+            "loss_history": loss_history,
+            "loss_step_history": loss_step_history,
+            "lr_history": lr_history,
+            "lr_step_history": lr_step_history,
+            "grad_norm_history": grad_norm_history,
+            "grad_norm_step_history": grad_norm_step_history,
+            "eval_loss_history": eval_loss_history,
+            "eval_step_history": eval_step_history,
+            "final_epoch": final_epoch,
+            "final_num_tokens": final_num_tokens,
+        }
+    finally:
+        conn.close()
+
+
+def delete_run(id: str) -> None:
+    conn = get_connection()
+    try:
+        conn.execute("DELETE FROM training_runs WHERE id = ?", (id,))
+        conn.commit()
+    finally:
+        conn.close()
+
+
+def cleanup_orphaned_runs() -> None:
+    """Mark any 'running' rows as errored on startup (server restarted mid-training)."""
+    conn = get_connection()
+    try:
+        conn.execute(
+            """
+            UPDATE training_runs
+            SET status = 'error',
+                error_message = 'Server restarted during training',
+                ended_at = ?
+            WHERE status = 'running'
+            """,
+            (datetime.now(timezone.utc).isoformat(),),
+        )
+        conn.commit()
+    finally:
+        conn.close()
+
+
+def list_scan_folders() -> list[dict]:
+    conn = get_connection()
+    try:
+        rows = conn.execute(
+            "SELECT id, path, created_at FROM scan_folders ORDER BY created_at"
+        ).fetchall()
+        return [dict(row) for row in rows]
+    finally:
+        conn.close()
+
+
+def add_scan_folder(path: str) -> dict:
+    """Add a directory to the custom scan folder list. Returns the row."""
+    if not path or not path.strip():
+        raise ValueError("Path cannot be empty")
+    normalized = os.path.realpath(os.path.expanduser(path.strip()))
+
+    # Validate the path is an existing, readable directory before persisting.
+    if not os.path.exists(normalized):
+        raise ValueError("Path does not exist")
+    if not os.path.isdir(normalized):
+        raise ValueError("Path must be a directory, not a file")
+    if not os.access(normalized, os.R_OK | os.X_OK):
+        raise ValueError("Path is not readable")
+
+    # On Windows, use normcase for denylist comparison but store the
+    # original-cased path so downstream consumers see the native
+    # drive-letter casing the user expects (e.g. C:\Models, not c:\models).
+    is_win = platform.system() == "Windows"
+    check = os.path.normcase(normalized) if is_win else normalized
+    for prefix in _denied_path_prefixes():
+        if check == prefix or check.startswith(prefix + os.sep):
+            raise ValueError(f"Path under {prefix} is not allowed")
+
+    conn = get_connection()
+    try:
+        now = datetime.now(timezone.utc).isoformat()
+        # On Windows, use case-insensitive lookup so C:\Models and c:\models
+        # dedup correctly while preserving the originally-stored casing.
+        if is_win:
+            existing = conn.execute(
+                "SELECT id, path, created_at FROM scan_folders WHERE path = ? COLLATE NOCASE",
+                (normalized,),
+            ).fetchone()
+        else:
+            existing = conn.execute(
+                "SELECT id, path, created_at FROM scan_folders WHERE path = ?",
+                (normalized,),
+            ).fetchone()
+        if existing is not None:
+            return dict(existing)
+        try:
+            conn.execute(
+                "INSERT INTO scan_folders (path, created_at) VALUES (?, ?)",
+                (normalized, now),
+            )
+            conn.commit()
+        except sqlite3.IntegrityError:
+            pass  # duplicate -- fall through to SELECT
+        # Use the same collation as the pre-check so we find the row even
+        # when a concurrent writer stored it with different casing (Windows).
+        fallback_sql = (
+            "SELECT id, path, created_at FROM scan_folders WHERE path = ? COLLATE NOCASE"
+            if is_win
+            else "SELECT id, path, created_at FROM scan_folders WHERE path = ?"
+        )
+        row = conn.execute(fallback_sql, (normalized,)).fetchone()
+        if row is None:
+            raise ValueError("Folder was concurrently removed")
+        return dict(row)
+    finally:
+        conn.close()
+
+
+def remove_scan_folder(id: int) -> None:
+    conn = get_connection()
+    try:
+        conn.execute("DELETE FROM scan_folders WHERE id = ?", (id,))
+        conn.commit()
+    finally:
+        conn.close()
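
`insert_metrics_batch` relies on an upsert in which `COALESCE(excluded.col, col)` keeps the stored value whenever the incoming row carries `NULL` for that column. The sketch below shows the effect on a reduced, in-memory table. The table name `metrics` and its two metric columns are simplifications for illustration, and the `ON CONFLICT ... DO UPDATE` syntax assumes SQLite 3.24 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE metrics (
        run_id TEXT NOT NULL,
        step INTEGER NOT NULL,
        loss REAL,
        eval_loss REAL,
        UNIQUE(run_id, step)
    )
    """
)

UPSERT = """
    INSERT INTO metrics (run_id, step, loss, eval_loss)
    VALUES (?, ?, ?, ?)
    ON CONFLICT(run_id, step) DO UPDATE SET
        loss = COALESCE(excluded.loss, loss),
        eval_loss = COALESCE(excluded.eval_loss, eval_loss)
"""

# First flush reports the training loss for step 10; eval_loss is unknown.
conn.execute(UPSERT, ("run-1", 10, 0.5, None))
# A later flush for the same step reports only eval_loss.
conn.execute(UPSERT, ("run-1", 10, None, 0.7))
conn.commit()

rows = conn.execute(
    "SELECT loss, eval_loss FROM metrics WHERE run_id = ? AND step = ?",
    ("run-1", 10),
).fetchall()
merged = tuple(rows[0])
conn.close()
```

The second write does not erase the earlier `loss`, and the `UNIQUE(run_id, step)` constraint keeps the two flushes in a single row.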
--- a/studio/backend/tests/conftest.py
+++ b/studio/backend/tests/conftest.py
@ -3,14 +3,136 @@

 """
 Shared pytest configuration for the backend test suite.
-Ensures that the backend root is on sys.path so that
-`import utils.utils` (and similar flat imports) resolve correctly.
+
+Responsibilities:
+1. Put the backend root on sys.path so `from models.inference import ...`
+   (and similar flat imports) resolve in test modules — mirrors how the
+   app itself is launched.
+2. Provide a hybrid ``studio_server`` session fixture for end-to-end tests
+   (see ``test_studio_api.py``). The fixture supports two invocation modes:
+
+   a. **External server.** If ``UNSLOTH_E2E_BASE_URL`` is set, tests point
+      at an already-running Studio instance. ``UNSLOTH_E2E_API_KEY`` must
+      also be set. This is the fast-iteration mode: start the server once
+      with ``unsloth studio run ...``, then run pytest against it many
+      times with no per-run GGUF load cost.
+
+   b. **Fixture-managed server.** Otherwise, the fixture launches a fresh
+      server via ``_start_server`` and tears it down at session end. This
+      is the one-shot mode for CI or a clean-slate verification run.
+
+   The model / variant for mode (b) come from ``--unsloth-model`` /
+   ``--unsloth-gguf-variant`` pytest options, then ``UNSLOTH_E2E_MODEL`` /
+   ``UNSLOTH_E2E_VARIANT`` env vars, then the defaults in
+   ``test_studio_api.py``.
 """

+import os
 import sys
 from pathlib import Path

+import pytest
+
 # Add backend root to sys.path (mirrors how the app itself is launched)
 _backend_root = Path(__file__).resolve().parent.parent
 if str(_backend_root) not in sys.path:
    sys.path.insert(0, str(_backend_root))
+
+
+# ── Pytest CLI options ───────────────────────────────────────────────
+
+
+def pytest_addoption(parser):
+    group = parser.getgroup(
+        "unsloth-e2e",
+        "Unsloth Studio end-to-end test options",
+    )
+    group.addoption(
+        "--unsloth-model",
+        action = "store",
+        default = None,
+        help = (
+            "GGUF model id used when starting a server for e2e tests. "
+            "Ignored if UNSLOTH_E2E_BASE_URL is set. Overrides "
+            "UNSLOTH_E2E_MODEL env var. Defaults to test_studio_api.py's "
+            "DEFAULT_MODEL."
+        ),
+    )
+    group.addoption(
+        "--unsloth-gguf-variant",
+        action = "store",
+        default = None,
+        help = (
+            "GGUF variant used when starting a server for e2e tests. "
+            "Ignored if UNSLOTH_E2E_BASE_URL is set. Overrides "
+            "UNSLOTH_E2E_VARIANT env var. Defaults to test_studio_api.py's "
+            "DEFAULT_VARIANT."
+        ),
+    )
+
+
+# ── E2E server fixtures ──────────────────────────────────────────────
+
+
+@pytest.fixture(scope = "session")
+def studio_server(request):
+    """Yield ``(base_url, api_key)`` for e2e tests.
+
+    Resolution order:
+
+    1. If ``UNSLOTH_E2E_BASE_URL`` is set → point at that server,
+       require ``UNSLOTH_E2E_API_KEY`` alongside (skip if missing).
+    2. Otherwise → start a fresh ``unsloth studio run`` subprocess via
+       the existing ``_start_server`` helper in ``test_studio_api.py``
+       and tear it down on session teardown.
+
+    Session-scoped so the expensive GGUF load happens at most once per
+    pytest invocation. Lazily instantiated — tests that don't request
+    the fixture (e.g. the unit tests in ``test_anthropic_messages.py``
+    or ``test_help_output``) do not trigger server startup.
+    """
+    external_url = os.environ.get("UNSLOTH_E2E_BASE_URL")
+    if external_url:
+        api_key = os.environ.get("UNSLOTH_E2E_API_KEY")
+        if not api_key:
+            pytest.skip(
+                "UNSLOTH_E2E_BASE_URL is set but UNSLOTH_E2E_API_KEY is "
+                "missing — tests that require auth cannot run against an "
+                "external server without it.",
+            )
+        yield external_url, api_key
+        return
+
+    # Lazy import: pytest has already loaded test_studio_api into
+    # sys.modules by the time any test requests this fixture, so this
+    # is a cache hit, not a re-execution.
+    import test_studio_api as _e2e
+
+    model = (
+        request.config.getoption("--unsloth-model")
+        or os.environ.get("UNSLOTH_E2E_MODEL")
+        or _e2e.DEFAULT_MODEL
+    )
+    variant = (
+        request.config.getoption("--unsloth-gguf-variant")
+        or os.environ.get("UNSLOTH_E2E_VARIANT")
+        or _e2e.DEFAULT_VARIANT
+    )
+
+    proc, api_key = _e2e._start_server(model, variant)
+    try:
+        yield f"http://{_e2e.HOST}:{_e2e.PORT}", api_key
+    finally:
+        _e2e._kill_server(proc)
+
+
+@pytest.fixture
+def base_url(studio_server):
+    """Base URL for the e2e Studio server (from ``studio_server``)."""
+    return studio_server[0]
+
+
+@pytest.fixture
+def api_key(studio_server):
+    """API key for the e2e Studio server (from ``studio_server``)."""
+    return studio_server[1]
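
For the fixture-managed mode, the model and variant are resolved in a fixed order: pytest CLI option, then environment variable, then the default from `test_studio_api.py`. A small sketch of that fallback chain follows. `resolve_option` is a hypothetical helper written for illustration, and the model ids are placeholders, not real defaults.

```python
import os


def resolve_option(cli_value, env_name, default, env=None):
    """Return the first truthy value among CLI option, env var, and default."""
    env = os.environ if env is None else env
    return cli_value or env.get(env_name) or default


# The CLI option wins over the environment, which wins over the default.
fake_env = {"UNSLOTH_E2E_MODEL": "env-model"}
from_cli = resolve_option("cli-model", "UNSLOTH_E2E_MODEL", "default-model", fake_env)
from_env = resolve_option(None, "UNSLOTH_E2E_MODEL", "default-model", fake_env)
from_default = resolve_option(None, "UNSLOTH_E2E_MODEL", "default-model", {})
```

Chaining with `or` means an empty string, whether from the command line or the environment, is treated as unset and falls through to the next source.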
--- a/studio/backend/tests/test_anthropic_messages.py
+++ b/studio/backend/tests/test_anthropic_messages.py
@ -0,0 +1,774 @@
+# SPDX-License-Identifier: AGPL-3.0-only
+# Copyright 2026-present the Unsloth AI Inc. team. All rights reserved.
+
+"""
+Tests for the Anthropic Messages API schemas and translation layer.
+No running server or GPU required.
+"""
+
+import sys
+import os
+import json
+
+_backend = os.path.join(os.path.dirname(__file__), "..")
+sys.path.insert(0, _backend)
+
+from models.inference import (
+    AnthropicMessagesRequest,
+    AnthropicMessagesResponse,
+    AnthropicMessage,
+    AnthropicTextBlock,
+    AnthropicToolUseBlock,
+    AnthropicToolResultBlock,
+    AnthropicTool,
+    AnthropicUsage,
+    AnthropicResponseTextBlock,
+    AnthropicResponseToolUseBlock,
+)
+from core.inference.anthropic_compat import (
+    anthropic_messages_to_openai,
+    anthropic_tools_to_openai,
+    build_anthropic_sse_event,
+    AnthropicStreamEmitter,
+    AnthropicPassthroughEmitter,
+)
+
+
+# =====================================================================
+# Pydantic model tests
+# =====================================================================
+
+
+class TestAnthropicModels:
+    def test_minimal_request(self):
+        req = AnthropicMessagesRequest(
+            messages = [{"role": "user", "content": "Hi"}],
+        )
+        assert req.max_tokens is None
+        assert req.model == "default"
+        assert req.stream is False
+
+    def test_max_tokens_optional(self):
+        req = AnthropicMessagesRequest(
+            max_tokens = 100,
+            messages = [{"role": "user", "content": "Hi"}],
+        )
+        assert req.max_tokens == 100
+
+    def test_system_as_string(self):
+        req = AnthropicMessagesRequest(
+            max_tokens = 50,
+            messages = [{"role": "user", "content": "Hi"}],
+            system = "You are helpful.",
+        )
+        assert req.system == "You are helpful."
+
+    def test_tools_field_parses(self):
+        req = AnthropicMessagesRequest(
+            max_tokens = 100,
+            messages = [{"role": "user", "content": "Hi"}],
+            tools = [{"name": "web_search", "input_schema": {"type": "object"}}],
+        )
+        assert len(req.tools) == 1
+        assert req.tools[0].name == "web_search"
+
+    def test_extra_fields_accepted(self):
+        req = AnthropicMessagesRequest(
+            max_tokens = 100,
+            messages = [{"role": "user", "content": "Hi"}],
+            some_future_field = "hello",
+        )
+        assert req.max_tokens == 100
+
+    def test_stream_defaults_false(self):
+        req = AnthropicMessagesRequest(
+            max_tokens = 100,
+            messages = [{"role": "user", "content": "Hi"}],
+        )
+        assert req.stream is False
+
+    def test_enable_tools_shorthand(self):
+        req = AnthropicMessagesRequest(
+            messages = [{"role": "user", "content": "Hi"}],
+            enable_tools = True,
+            enabled_tools = ["web_search", "python"],
+            session_id = "my-session",
+        )
+        assert req.enable_tools is True
+        assert req.enabled_tools == ["web_search", "python"]
+        assert req.session_id == "my-session"
+
+    def test_extension_fields_default_none(self):
+        req = AnthropicMessagesRequest(
+            messages = [{"role": "user", "content": "Hi"}],
+        )
+        assert req.enable_tools is None
+        assert req.enabled_tools is None
+        assert req.session_id is None
+
+    def test_response_model_defaults(self):
+        resp = AnthropicMessagesResponse()
+        assert resp.type == "message"
+        assert resp.role == "assistant"
+        assert resp.id.startswith("msg_")
+        assert resp.content == []
+        assert resp.usage.input_tokens == 0
+
+
+# =====================================================================
+# Message translation tests
+# =====================================================================
+
+
+class TestAnthropicMessagesToOpenAI:
+    def test_simple_user_message(self):
+        msgs = [{"role": "user", "content": "Hello"}]
+        result = anthropic_messages_to_openai(msgs)
+        assert result == [{"role": "user", "content": "Hello"}]
+
+    def test_system_string_prepended(self):
+        msgs = [{"role": "user", "content": "Hello"}]
+        result = anthropic_messages_to_openai(msgs, system = "Be brief.")
+        assert result[0] == {"role": "system", "content": "Be brief."}
+        assert result[1] == {"role": "user", "content": "Hello"}
+
+    def test_system_as_block_list(self):
+        system = [
+            {"type": "text", "text": "Be brief."},
+            {"type": "text", "text": "Be accurate."},
+        ]
+        msgs = [{"role": "user", "content": "Hello"}]
+        result = anthropic_messages_to_openai(msgs, system = system)
+        assert result[0]["role"] == "system"
+        assert "Be brief." in result[0]["content"]
+        assert "Be accurate." in result[0]["content"]
+
+    def test_multi_turn_conversation(self):
+        msgs = [
+            {"role": "user", "content": "Hi"},
+            {"role": "assistant", "content": "Hello!"},
+            {"role": "user", "content": "How are you?"},
+        ]
+        result = anthropic_messages_to_openai(msgs)
+        assert len(result) == 3
+        assert result[0]["role"] == "user"
+        assert result[1]["role"] == "assistant"
+        assert result[2]["role"] == "user"
+
+    def test_assistant_tool_use_maps_to_tool_calls(self):
+        msgs = [
+            {
+                "role": "assistant",
+                "content": [
+                    {"type": "text", "text": "Let me search."},
+                    {
+                        "type": "tool_use",
+                        "id": "tu_1",
+                        "name": "web_search",
+                        "input": {"query": "test"},
+                    },
+                ],
+            }
+        ]
+        result = anthropic_messages_to_openai(msgs)
+        assert len(result) == 1
+        m = result[0]
+        assert m["role"] == "assistant"
+        assert m["content"] == "Let me search."
+        assert len(m["tool_calls"]) == 1
+        tc = m["tool_calls"][0]
+        assert tc["id"] == "tu_1"
+        assert tc["function"]["name"] == "web_search"
+        assert json.loads(tc["function"]["arguments"]) == {"query": "test"}
+
+    def test_tool_result_maps_to_tool_role(self):
+        msgs = [
+            {
+                "role": "user",
+                "content": [
+                    {
+                        "type": "tool_result",
+                        "tool_use_id": "tu_1",
+                        "content": "Result text",
+                    },
+                ],
+            }
+        ]
+        result = anthropic_messages_to_openai(msgs)
+        assert len(result) == 1
+        assert result[0]["role"] == "tool"
+        assert result[0]["tool_call_id"] == "tu_1"
+        assert result[0]["content"] == "Result text"
+
+    def test_mixed_text_and_tool_use_blocks(self):
+        msgs = [
+            {
+                "role": "assistant",
+                "content": [
+                    {"type": "text", "text": "Thinking..."},
+                    {
+                        "type": "tool_use",
+                        "id": "tu_1",
+                        "name": "python",
+                        "input": {"code": "1+1"},
+                    },
+                    {
+                        "type": "tool_use",
+                        "id": "tu_2",
+                        "name": "terminal",
+                        "input": {"command": "ls"},
+                    },
+                ],
+            }
+        ]
+        result = anthropic_messages_to_openai(msgs)
+        assert len(result) == 1
+        m = result[0]
+        assert m["content"] == "Thinking..."
+        assert len(m["tool_calls"]) == 2
+
+    def test_tool_result_with_list_content(self):
+        msgs = [
+            {
+                "role": "user",
+                "content": [
+                    {
+                        "type": "tool_result",
+                        "tool_use_id": "tu_1",
+                        "content": [
+                            {"type": "text", "text": "Line 1"},
+                            {"type": "text", "text": "Line 2"},
+                        ],
+                    },
+                ],
+            }
+        ]
+        result = anthropic_messages_to_openai(msgs)
+        assert result[0]["content"] == "Line 1 Line 2"
+
+
+# =====================================================================
+# Tool translation tests
+# =====================================================================
+
+
+class TestAnthropicToolsToOpenAI:
+    def test_single_tool(self):
+        tools = [
+            {
+                "name": "web_search",
+                "description": "Search",
+                "input_schema": {
+                    "type": "object",
+                    "properties": {"query": {"type": "string"}},
+                },
+            }
+        ]
+        result = anthropic_tools_to_openai(tools)
+        assert len(result) == 1
+        assert result[0]["type"] == "function"
+        assert result[0]["function"]["name"] == "web_search"
+        assert result[0]["function"]["parameters"]["type"] == "object"
+
+    def test_multiple_tools(self):
+        tools = [
+            {"name": "a", "description": "Tool A", "input_schema": {}},
+            {"name": "b", "description": "Tool B", "input_schema": {}},
+        ]
+        result = anthropic_tools_to_openai(tools)
+        assert len(result) == 2
+        assert result[0]["function"]["name"] == "a"
+        assert result[1]["function"]["name"] == "b"
+
+    def test_empty_list(self):
+        assert anthropic_tools_to_openai([]) == []
+
+    def test_pydantic_model_input(self):
+        tool = AnthropicTool(
+            name = "test", description = "desc", input_schema = {"type": "object"}
+        )
+        result = anthropic_tools_to_openai([tool])
+        assert result[0]["function"]["name"] == "test"
+
+
+# =====================================================================
+# SSE event helper tests
+# =====================================================================
+
+
+class TestBuildAnthropicSSEEvent:
+    def test_basic_event(self):
+        result = build_anthropic_sse_event("message_start", {"type": "message_start"})
+        assert result.startswith("event: message_start\n")
+        assert "data: " in result
+        assert result.endswith("\n\n")
+
+    def test_data_is_valid_json(self):
+        result = build_anthropic_sse_event("test", {"key": "value"})
+        data_line = result.split("\n")[1]
+        payload = json.loads(data_line.removeprefix("data: "))
+        assert payload == {"key": "value"}
+
+
+# =====================================================================
+# Stream emitter tests
+# =====================================================================
+
+
+class TestAnthropicStreamEmitter:
+    def test_start_emits_message_start_and_content_block_start(self):
+        e = AnthropicStreamEmitter()
+        events = e.start("msg_123", "test-model")
+        assert len(events) == 2
+        assert "message_start" in events[0]
+        assert "content_block_start" in events[1]
+        assert '"type": "text"' in events[1]
+
+    def test_content_delta_emits_text_delta(self):
+        e = AnthropicStreamEmitter()
+        e.start("msg_1", "m")
+        events = e.feed({"type": "content", "text": "Hello"})
+        assert len(events) == 1
+        parsed = json.loads(events[0].split("data: ")[1])
+        assert parsed["delta"]["type"] == "text_delta"
+        assert parsed["delta"]["text"] == "Hello"
+
+    def test_cumulative_content_diffs_correctly(self):
+        e = AnthropicStreamEmitter()
+        e.start("msg_1", "m")
+        e.feed({"type": "content", "text": "Hel"})
+        events = e.feed({"type": "content", "text": "Hello"})
+        parsed = json.loads(events[0].split("data: ")[1])
+        assert parsed["delta"]["text"] == "lo"
+
+    def test_empty_content_diff_no_event(self):
+        e = AnthropicStreamEmitter()
+        e.start("msg_1", "m")
+        e.feed({"type": "content", "text": "Hi"})
+        events = e.feed({"type": "content", "text": "Hi"})
+        assert events == []
+
+    def test_tool_start_closes_text_opens_tool_block(self):
+        e = AnthropicStreamEmitter()
+        e.start("msg_1", "m")
+        e.feed({"type": "content", "text": "Thinking"})
+        events = e.feed(
+            {
+                "type": "tool_start",
+                "tool_name": "web_search",
+                "tool_call_id": "tc_1",
+                "arguments": {"query": "test"},
+            }
+        )
+        # content_block_stop + content_block_start(tool_use) + content_block_delta(input_json)
+        assert len(events) == 3
+        assert "content_block_stop" in events[0]
+        assert "tool_use" in events[1]
+        assert "input_json_delta" in events[2]
+
+    def test_tool_end_closes_tool_opens_new_text_block(self):
+        e = AnthropicStreamEmitter()
+        e.start("msg_1", "m")
+        e.feed(
+            {
+                "type": "tool_start",
+                "tool_name": "t",
+                "tool_call_id": "tc_1",
+                "arguments": {},
+            }
+        )
+        events = e.feed(
+            {
+                "type": "tool_end",
+                "tool_name": "t",
+                "tool_call_id": "tc_1",
+                "result": "done",
+            }
+        )
+        # content_block_stop (tool) + tool_result + content_block_start (new text)
+        assert len(events) == 3
+        assert "content_block_stop" in events[0]
+        assert "tool_result" in events[1]
+        parsed = json.loads(events[1].split("data: ")[1])
+        assert parsed["content"] == "done"
+        assert parsed["tool_use_id"] == "tc_1"
+        assert "content_block_start" in events[2]
+        assert '"type": "text"' in events[2]
+
+    def test_finish_emits_stop_events(self):
+        e = AnthropicStreamEmitter()
+        e.start("msg_1", "m")
+        events = e.finish("end_turn")
+        # content_block_stop + message_delta + message_stop
+        assert len(events) == 3
+        assert "content_block_stop" in events[0]
+        assert "message_delta" in events[1]
+        assert "end_turn" in events[1]
+        assert "message_stop" in events[2]
+
+    def test_metadata_captured_in_finish_usage(self):
+        e = AnthropicStreamEmitter()
+        e.start("msg_1", "m")
+        e.feed(
+            {
+                "type": "metadata",
+                "usage": {"prompt_tokens": 10, "completion_tokens": 20},
+            }
+        )
+        events = e.finish("end_turn")
+        delta_event = [ev for ev in events if "message_delta" in ev][0]
+        parsed = json.loads(delta_event.split("data: ")[1])
+        assert parsed["usage"]["output_tokens"] == 20
+
+    def test_status_events_ignored(self):
+        e = AnthropicStreamEmitter()
+        e.start("msg_1", "m")
+        events = e.feed({"type": "status", "text": "Searching..."})
+        assert events == []
+
+    def test_no_tool_calls_simple_text_flow(self):
+        e = AnthropicStreamEmitter()
+        start_events = e.start("msg_1", "m")
+        content_events = e.feed({"type": "content", "text": "Hello world"})
+        meta_events = e.feed(
+            {"type": "metadata", "usage": {"prompt_tokens": 5, "completion_tokens": 2}}
+        )
+        end_events = e.finish("end_turn")
+
+        assert len(start_events) == 2
+        assert len(content_events) == 1
+        assert meta_events == []
+        assert len(end_events) == 3
+
+    def test_block_index_increments(self):
+        e = AnthropicStreamEmitter()
+        e.start("msg_1", "m")
+        assert e.block_index == 0
+        e.feed(
+            {
+                "type": "tool_start",
+                "tool_name": "t",
+                "tool_call_id": "tc_1",
+                "arguments": {},
+            }
+        )
+        assert e.block_index == 1
+        e.feed(
+            {
+                "type": "tool_end",
+                "tool_name": "t",
+                "tool_call_id": "tc_1",
+                "result": "ok",
+            }
+        )
+        assert e.block_index == 2
+
+    def test_text_after_tool_resets_prev_text(self):
+        e = AnthropicStreamEmitter()
+        e.start("msg_1", "m")
+        e.feed({"type": "content", "text": "Before tool"})
+        e.feed(
+            {
+                "type": "tool_start",
+                "tool_name": "t",
+                "tool_call_id": "tc_1",
+                "arguments": {},
+            }
+        )
+        e.feed(
+            {
+                "type": "tool_end",
+                "tool_name": "t",
+                "tool_call_id": "tc_1",
+                "result": "ok",
+            }
+        )
+        # After tool_end, prev_text should be reset
+        events = e.feed({"type": "content", "text": "After tool"})
+        parsed = json.loads(events[0].split("data: ")[1])
+        assert parsed["delta"]["text"] == "After tool"
+
+
+# =====================================================================
+# Pass-through emitter tests (client-side tool execution path)
+# =====================================================================
+
+
+class TestAnthropicPassthroughEmitter:
+    def _parse(self, event_str):
+        return json.loads(event_str.split("data: ")[1])
+
+    def test_start_emits_message_start_only(self):
+        e = AnthropicPassthroughEmitter()
+        events = e.start("msg_1", "test-model")
+        assert len(events) == 1
+        assert "message_start" in events[0]
+        parsed = self._parse(events[0])
+        assert parsed["message"]["id"] == "msg_1"
+        assert parsed["message"]["model"] == "test-model"
+
+    def test_text_chunk_opens_text_block_and_emits_delta(self):
+        e = AnthropicPassthroughEmitter()
+        e.start("msg_1", "m")
+        chunk = {"choices": [{"delta": {"content": "Hello"}}]}
+        events = e.feed_chunk(chunk)
+        # content_block_start + content_block_delta
+        assert len(events) == 2
+        assert "content_block_start" in events[0]
+        assert '"type": "text"' in events[0]
+        delta = self._parse(events[1])
+        assert delta["delta"]["type"] == "text_delta"
+        assert delta["delta"]["text"] == "Hello"
+
+    def test_sequential_text_chunks_single_block(self):
+        e = AnthropicPassthroughEmitter()
+        e.start("msg_1", "m")
+        events1 = e.feed_chunk({"choices": [{"delta": {"content": "Hello"}}]})
+        events2 = e.feed_chunk({"choices": [{"delta": {"content": " world"}}]})
+        # First chunk opens the block, second only emits delta
+        assert len(events1) == 2
+        assert len(events2) == 1
+        assert self._parse(events2[0])["delta"]["text"] == " world"
+
+    def test_tool_call_opens_tool_use_block(self):
+        e = AnthropicPassthroughEmitter()
+        e.start("msg_1", "m")
+        chunk = {
+            "choices": [
+                {
+                    "delta": {
+                        "tool_calls": [
+                            {
+                                "index": 0,
+                                "id": "call_1",
+                                "type": "function",
+                                "function": {"name": "Bash", "arguments": ""},
+                            }
+                        ]
+                    }
+                }
+            ]
+        }
+        events = e.feed_chunk(chunk)
+        assert len(events) == 1
+        parsed = self._parse(events[0])
+        assert parsed["type"] == "content_block_start"
+        assert parsed["content_block"]["type"] == "tool_use"
+        assert parsed["content_block"]["id"] == "call_1"
+        assert parsed["content_block"]["name"] == "Bash"
+
+    def test_tool_call_arguments_streamed_as_input_json_delta(self):
+        e = AnthropicPassthroughEmitter()
+        e.start("msg_1", "m")
+        # Open the tool call
+        e.feed_chunk(
+            {
+                "choices": [
+                    {
+                        "delta": {
+                            "tool_calls": [
+                                {
+                                    "index": 0,
+                                    "id": "c1",
+                                    "type": "function",
+                                    "function": {"name": "Bash", "arguments": ""},
+                                }
+                            ]
+                        }
+                    }
+                ]
+            }
+        )
+        # Stream argument fragments
+        events1 = e.feed_chunk(
+            {
+                "choices": [
+                    {
+                        "delta": {
+                            "tool_calls": [
+                                {"index": 0, "function": {"arguments": '{"cmd'}}
+                            ]
+                        }
+                    }
+                ]
+            }
+        )
+        events2 = e.feed_chunk(
+            {
+                "choices": [
+                    {
+                        "delta": {
+                            "tool_calls": [
+                                {"index": 0, "function": {"arguments": '": "ls"}'}}
+                            ]
+                        }
+                    }
+                ]
+            }
+        )
+        parsed1 = self._parse(events1[0])
+        parsed2 = self._parse(events2[0])
+        assert parsed1["delta"]["type"] == "input_json_delta"
+        assert parsed1["delta"]["partial_json"] == '{"cmd'
+        assert parsed2["delta"]["partial_json"] == '": "ls"}'
+
+    def test_text_then_tool_closes_text_block(self):
+        e = AnthropicPassthroughEmitter()
+        e.start("msg_1", "m")
+        e.feed_chunk({"choices": [{"delta": {"content": "Let me check."}}]})
+        events = e.feed_chunk(
+            {
+                "choices": [
+                    {
+                        "delta": {
+                            "tool_calls": [
+                                {
+                                    "index": 0,
+                                    "id": "c1",
+                                    "type": "function",
+                                    "function": {"name": "Bash", "arguments": ""},
+                                }
+                            ]
+                        }
+                    }
+                ]
+            }
+        )
+        # Should close text block and open tool_use block
+        assert "content_block_stop" in events[0]
+        assert "content_block_start" in events[1]
+        assert '"type": "tool_use"' in events[1]
+
+    def test_finish_reason_tool_calls_sets_tool_use_stop(self):
+        e = AnthropicPassthroughEmitter()
+        e.start("msg_1", "m")
+        e.feed_chunk(
+            {
+                "choices": [
+                    {
+                        "delta": {
+                            "tool_calls": [
+                                {
+                                    "index": 0,
+                                    "id": "c1",
+                                    "type": "function",
+                                    "function": {"name": "Bash", "arguments": "{}"},
+                                }
+                            ]
+                        }
+                    }
+                ]
+            }
+        )
+        e.feed_chunk({"choices": [{"delta": {}, "finish_reason": "tool_calls"}]})
+        events = e.finish()
+        delta_event = [ev for ev in events if "message_delta" in ev][0]
+        parsed = self._parse(delta_event)
+        assert parsed["delta"]["stop_reason"] == "tool_use"
+
+    def test_finish_reason_stop_sets_end_turn(self):
+        e = AnthropicPassthroughEmitter()
+        e.start("msg_1", "m")
+        e.feed_chunk({"choices": [{"delta": {"content": "Hi"}}]})
+        e.feed_chunk({"choices": [{"delta": {}, "finish_reason": "stop"}]})
+        events = e.finish()
+        delta_event = [ev for ev in events if "message_delta" in ev][0]
+        parsed = self._parse(delta_event)
+        assert parsed["delta"]["stop_reason"] == "end_turn"
+
+    def test_finish_reason_length_sets_max_tokens(self):
+        e = AnthropicPassthroughEmitter()
+        e.start("msg_1", "m")
+        e.feed_chunk({"choices": [{"delta": {"content": "Hi"}}]})
+        e.feed_chunk({"choices": [{"delta": {}, "finish_reason": "length"}]})
+        events = e.finish()
+        delta_event = [ev for ev in events if "message_delta" in ev][0]
+        parsed = self._parse(delta_event)
+        assert parsed["delta"]["stop_reason"] == "max_tokens"
+
+    def test_finish_closes_current_block(self):
+        e = AnthropicPassthroughEmitter()
+        e.start("msg_1", "m")
+        e.feed_chunk({"choices": [{"delta": {"content": "Hi"}}]})
+        events = e.finish()
+        assert "content_block_stop" in events[0]
+        assert "message_delta" in events[1]
+        assert "message_stop" in events[2]
+
+    def test_usage_chunk_captured(self):
+        e = AnthropicPassthroughEmitter()
+        e.start("msg_1", "m")
+        e.feed_chunk({"choices": [{"delta": {"content": "Hi"}}]})
+        e.feed_chunk(
+            {
+                "choices": [],
+                "usage": {"prompt_tokens": 10, "completion_tokens": 5},
+            }
+        )
+        events = e.finish()
+        delta_event = [ev for ev in events if "message_delta" in ev][0]
+        parsed = self._parse(delta_event)
+        assert parsed["usage"]["output_tokens"] == 5
+
+    def test_empty_chunk_returns_no_events(self):
+        e = AnthropicPassthroughEmitter()
+        e.start("msg_1", "m")
+        events = e.feed_chunk({"choices": []})
+        assert events == []
+
+    def test_no_blocks_at_all_still_produces_valid_finish(self):
+        e = AnthropicPassthroughEmitter()
+        e.start("msg_1", "m")
+        events = e.finish()
+        # No content_block_stop because no block was opened
+        assert not any("content_block_stop" in ev for ev in events)
+        assert any("message_delta" in ev for ev in events)
+        assert any("message_stop" in ev for ev in events)
+
+    def test_multiple_tool_calls_distinct_blocks(self):
+        e = AnthropicPassthroughEmitter()
+        e.start("msg_1", "m")
+        # First tool call
+        e.feed_chunk(
+            {
+                "choices": [
+                    {
+                        "delta": {
+                            "tool_calls": [
+                                {
+                                    "index": 0,
+                                    "id": "c1",
+                                    "type": "function",
+                                    "function": {"name": "Bash", "arguments": "{}"},
+                                }
+                            ]
+                        }
+                    }
+                ]
+            }
+        )
+        # Second tool call (different index)
+        events = e.feed_chunk(
+            {
+                "choices": [
+                    {
+                        "delta": {
+                            "tool_calls": [
+                                {
+                                    "index": 1,
+                                    "id": "c2",
+                                    "type": "function",
+                                    "function": {"name": "Read", "arguments": "{}"},
+                                }
+                            ]
+                        }
+                    }
+                ]
+            }
+        )
+        # Should close block 0, open block 1
+        assert "content_block_stop" in events[0]
+        assert "content_block_start" in events[1]
+        parsed = self._parse(events[1])
+        assert parsed["content_block"]["name"] == "Read"
+        assert parsed["content_block"]["id"] == "c2"
--- a/studio/backend/tests/test_browse_folders_route.py
+++ b/studio/backend/tests/test_browse_folders_route.py
@@ -0,0 +1,86 @@
+# SPDX-License-Identifier: AGPL-3.0-only
+# Copyright 2026-present the Unsloth AI Inc. team. All rights reserved. See /studio/LICENSE.AGPL-3.0
+
+import os
+import sys
+import types
+from pathlib import Path
+
+import pytest
+from fastapi import HTTPException
+
+# Keep this test runnable in lightweight environments where optional logging
+# deps are not installed.
+if "structlog" not in sys.modules:
+
+    class _DummyLogger:
+        def __getattr__(self, _name):
+            return lambda *args, **kwargs: None
+
+    sys.modules["structlog"] = types.SimpleNamespace(
+        BoundLogger = _DummyLogger,
+        get_logger = lambda *args, **kwargs: _DummyLogger(),
+    )
+
+import routes.models as models_route
+
+
+def test_resolve_browse_target_returns_allowed_directory(tmp_path):
+    allowed = tmp_path / "allowed"
+    target = allowed / "models" / "nested"
+    target.mkdir(parents = True)
+
+    resolved = models_route._resolve_browse_target(str(target), [allowed])
+
+    assert resolved == target.resolve()
+
+
+def test_resolve_browse_target_rejects_outside_allowlist(tmp_path):
+    allowed = tmp_path / "allowed"
+    disallowed = tmp_path / "disallowed"
+    allowed.mkdir()
+    disallowed.mkdir()
+
+    with pytest.raises(HTTPException) as exc_info:
+        models_route._resolve_browse_target(str(disallowed), [allowed])
+
+    assert exc_info.value.status_code == 403
+
+
+def test_resolve_browse_target_rejects_file_path(tmp_path):
+    allowed = tmp_path / "allowed"
+    allowed.mkdir()
+    model_file = allowed / "model.gguf"
+    model_file.write_text("gguf")
+
+    with pytest.raises(HTTPException) as exc_info:
+        models_route._resolve_browse_target(str(model_file), [allowed])
+
+    assert exc_info.value.status_code == 400
+
+
+def test_resolve_browse_target_allows_symlink_into_other_allowed_root(tmp_path):
+    home_root = tmp_path / "home"
+    scan_root = tmp_path / "scan"
+    target = scan_root / "nested"
+    home_root.mkdir()
+    target.mkdir(parents = True)
+    (home_root / "scan-link").symlink_to(scan_root, target_is_directory = True)
+
+    resolved = models_route._resolve_browse_target(
+        str(home_root / "scan-link" / "nested"),
+        [home_root, scan_root],
+    )
+
+    assert resolved == target.resolve()
+
+
+@pytest.mark.skipif(os.altsep is not None, reason = "POSIX-only path semantics")
+def test_resolve_browse_target_allows_backslash_in_posix_segment(tmp_path):
+    allowed = tmp_path / "allowed"
+    target = allowed / r"dir\name"
+    target.mkdir(parents = True)
+
+    resolved = models_route._resolve_browse_target(str(target), [allowed])
+
+    assert resolved == target.resolve()
--- a/studio/backend/tests/test_cache_case_resolution.py
+++ b/studio/backend/tests/test_cache_case_resolution.py
@@ -0,0 +1,120 @@
+# SPDX-License-Identifier: AGPL-3.0-only
+# Copyright 2026-present the Unsloth AI Inc. team. All rights reserved. See /studio/LICENSE.AGPL-3.0
+
+from pathlib import Path
+import sys
+import types
+
+# Keep this test runnable in lightweight environments where optional logging
+# deps are not installed.
+if "structlog" not in sys.modules:
+
+    class _DummyLogger:
+        def __getattr__(self, _name):
+            return lambda *args, **kwargs: None
+
+    sys.modules["structlog"] = types.SimpleNamespace(
+        BoundLogger = _DummyLogger,
+        get_logger = lambda *args, **kwargs: _DummyLogger(),
+    )
+
+from utils.paths.path_utils import (
+    resolve_cached_repo_id_case,
+    get_cache_case_resolution_stats,
+    reset_cache_case_resolution_state,
+)
+import utils.paths.path_utils as path_utils
+
+
+def _mk_cache_repo(cache_root: Path, repo_id: str) -> Path:
+    repo_dir = cache_root / f"models--{repo_id.replace('/', '--')}"
+    repo_dir.mkdir(parents = True, exist_ok = True)
+    return repo_dir
+
+
+def test_resolve_cached_repo_id_case_exact_hit(tmp_path, monkeypatch):
+    reset_cache_case_resolution_state()
+    _mk_cache_repo(tmp_path, "Org/Model")
+    monkeypatch.setattr(path_utils, "_hf_hub_cache_dir", lambda: tmp_path)
+
+    resolved = resolve_cached_repo_id_case("Org/Model")
+
+    assert resolved == "Org/Model"
+    stats = get_cache_case_resolution_stats()
+    assert stats["calls"] == 1
+    assert stats["exact_hits"] == 1
+    assert stats["variant_hits"] == 0
+
+
+def test_resolve_cached_repo_id_case_variant_hit(tmp_path, monkeypatch):
+    reset_cache_case_resolution_state()
+    _mk_cache_repo(tmp_path, "Org/Model")
+    monkeypatch.setattr(path_utils, "_hf_hub_cache_dir", lambda: tmp_path)
+
+    resolved = resolve_cached_repo_id_case("org/model")
+
+    assert resolved == "Org/Model"
+    stats = get_cache_case_resolution_stats()
+    assert stats["variant_hits"] == 1
+    assert stats["tie_breaks"] == 0
+
+
+def test_resolve_cached_repo_id_case_tie_break_deterministic(tmp_path, monkeypatch):
+    reset_cache_case_resolution_state()
+    _mk_cache_repo(tmp_path, "Org/Model")
+    _mk_cache_repo(tmp_path, "org/model")
+    monkeypatch.setattr(path_utils, "_hf_hub_cache_dir", lambda: tmp_path)
+
+    resolved = resolve_cached_repo_id_case("oRg/mOdEl")
+
+    # Deterministic rule: lexical sort of candidate repo ids.
+    assert resolved == "Org/Model"
+    stats = get_cache_case_resolution_stats()
+    assert stats["variant_hits"] == 1
+    assert stats["tie_breaks"] == 1
+
+
+def test_resolve_cached_repo_id_case_no_cache_fallback(tmp_path, monkeypatch):
+    reset_cache_case_resolution_state()
+    monkeypatch.setattr(path_utils, "_hf_hub_cache_dir", lambda: tmp_path)
+
+    resolved = resolve_cached_repo_id_case("Org/Missing")
+
+    assert resolved == "Org/Missing"
+    stats = get_cache_case_resolution_stats()
+    assert stats["fallbacks"] == 1
+    assert stats["variant_hits"] == 0
+    assert stats["exact_hits"] == 0
+
+
+def test_resolve_cached_repo_id_case_memoization(tmp_path, monkeypatch):
+    reset_cache_case_resolution_state()
+    _mk_cache_repo(tmp_path, "Org/Model")
+    monkeypatch.setattr(path_utils, "_hf_hub_cache_dir", lambda: tmp_path)
+
+    first = resolve_cached_repo_id_case("org/model")
+    second = resolve_cached_repo_id_case("org/model")
+
+    assert first == "Org/Model"
+    assert second == "Org/Model"
+    stats = get_cache_case_resolution_stats()
+    assert stats["calls"] == 2
+    assert stats["variant_hits"] == 1
+    assert stats["memo_hits"] == 1
+
+
+def test_resolve_cached_repo_id_case_late_cache_population(tmp_path, monkeypatch):
+    """Regression guard: memoized fallback should not hide a later cache variant."""
+    reset_cache_case_resolution_state()
+    monkeypatch.setattr(path_utils, "_hf_hub_cache_dir", lambda: tmp_path)
+
+    first = resolve_cached_repo_id_case("org/model")
+    assert first == "org/model"
+
+    # Simulate cache being populated after first miss (e.g. another code path/download).
+    _mk_cache_repo(tmp_path, "Org/Model")
+
+    second = resolve_cached_repo_id_case("org/model")
+
+    # Desired behavior: second lookup should pick up the now-existing variant.
+    assert second == "Org/Model"
--- a/studio/backend/tests/test_cached_gguf_routes.py
+++ b/studio/backend/tests/test_cached_gguf_routes.py
@@ -0,0 +1,398 @@
+# SPDX-License-Identifier: AGPL-3.0-only
+# Copyright 2026-present the Unsloth AI Inc. team. All rights reserved. See /studio/LICENSE.AGPL-3.0
+
+import asyncio
+import sys
+import types
+from pathlib import Path
+from types import SimpleNamespace
+
+# Keep this test runnable in lightweight environments where optional logging
+# deps are not installed.
+if "structlog" not in sys.modules:
+
+    class _DummyLogger:
+        def __getattr__(self, _name):
+            return lambda *args, **kwargs: None
+
+    sys.modules["structlog"] = types.SimpleNamespace(
+        BoundLogger = _DummyLogger,
+        get_logger = lambda *args, **kwargs: _DummyLogger(),
+    )
+
+import routes.models as models_route
+
+
+def _repo(
+    repo_id: str,
+    files: list[SimpleNamespace],
+    repo_path: Path,
+    *,
+    revisions: list[SimpleNamespace] | None = None,
+) -> SimpleNamespace:
+    return SimpleNamespace(
+        repo_id = repo_id,
+        repo_type = "model",
+        repo_path = repo_path,
+        revisions = revisions or [SimpleNamespace(files = files)],
+    )
+
+
+def _file(
+    name: str,
+    size_on_disk: int | None,
+    *,
+    blob_path: str | None = None,
+) -> SimpleNamespace:
+    return SimpleNamespace(
+        file_name = name,
+        size_on_disk = size_on_disk,
+        blob_path = blob_path,
+    )
+
+
+def test_iter_gguf_paths_matches_extension_case_insensitively(tmp_path):
+    nested = tmp_path / "snapshots" / "rev"
+    nested.mkdir(parents = True)
+    lower = nested / "Q4_K_M.gguf"
+    upper = nested / "Q8_0.GGUF"
+    other = nested / "README.md"
+    lower.write_text("a")
+    upper.write_text("b")
+    other.write_text("c")
+
+    result = sorted(path.name for path in models_route._iter_gguf_paths(tmp_path))
+
+    assert result == ["Q4_K_M.gguf", "Q8_0.GGUF"]
+
+
+def test_list_cached_gguf_includes_non_suffix_repo_when_cache_contains_gguf(
+    monkeypatch, tmp_path
+):
+    repo = _repo(
+        "HauhauCS/Gemma-4-E4B-Uncensored-HauhauCS-Aggressive",
+        [_file("Q4_K_M.gguf", 5_000), _file("README.md", 10)],
+        tmp_path / "models--HauhauCS--Gemma",
+    )
+    scan = SimpleNamespace(repos = [repo])
+
+    monkeypatch.setattr(models_route, "_all_hf_cache_scans", lambda: [scan])
+
+    result = asyncio.run(models_route.list_cached_gguf(current_subject = "test-user"))
+
+    assert result["cached"] == [
+        {
+            "repo_id": "HauhauCS/Gemma-4-E4B-Uncensored-HauhauCS-Aggressive",
+            "size_bytes": 5_000,
+            "cache_path": str(repo.repo_path),
+        }
+    ]
+
+
+def test_list_cached_gguf_matches_extension_case_insensitively(monkeypatch, tmp_path):
+    repo = _repo(
+        "Org/Model-Without-Suffix",
+        [_file("Q8_0.GGUF", 7_000)],
+        tmp_path / "models--Org--Model-Without-Suffix",
+    )
+    scan = SimpleNamespace(repos = [repo])
+
+    monkeypatch.setattr(models_route, "_all_hf_cache_scans", lambda: [scan])
+
+    result = asyncio.run(models_route.list_cached_gguf(current_subject = "test-user"))
+
+    assert result["cached"] == [
+        {
+            "repo_id": "Org/Model-Without-Suffix",
+            "size_bytes": 7_000,
+            "cache_path": str(repo.repo_path),
+        }
+    ]
+
+
+def test_list_cached_gguf_skips_repos_without_positive_gguf_size(monkeypatch, tmp_path):
+    missing = _repo(
+        "Org/ReadmeOnly",
+        [_file("README.md", 10)],
+        tmp_path / "models--Org--ReadmeOnly",
+    )
+    zero = _repo(
+        "Org/ZeroSize",
+        [_file("Q4_K_M.gguf", 0)],
+        tmp_path / "models--Org--ZeroSize",
+    )
+    scan = SimpleNamespace(repos = [missing, zero])
+
+    monkeypatch.setattr(models_route, "_all_hf_cache_scans", lambda: [scan])
+
+    result = asyncio.run(models_route.list_cached_gguf(current_subject = "test-user"))
+
+    assert result["cached"] == []
+
+
+def test_list_cached_gguf_keeps_largest_duplicate_repo_across_scans(
+    monkeypatch, tmp_path
+):
+    smaller = _repo(
+        "Org/Dupe",
+        [_file("Q4_K_M.gguf", 2_000)],
+        tmp_path / "models--Org--Dupe-a",
+    )
+    larger = _repo(
+        "org/dupe",
+        [_file("Q4_K_M.gguf", 5_000), _file("Q6_K.gguf", 1_000)],
+        tmp_path / "models--Org--Dupe-b",
+    )
+
+    monkeypatch.setattr(
+        models_route,
+        "_all_hf_cache_scans",
+        lambda: [
+            SimpleNamespace(repos = [smaller]),
+            SimpleNamespace(repos = [larger]),
+        ],
+    )
+
+    result = asyncio.run(models_route.list_cached_gguf(current_subject = "test-user"))
+
+    assert result["cached"] == [
+        {
+            "repo_id": "org/dupe",
+            "size_bytes": 6_000,
+            "cache_path": str(larger.repo_path),
+        }
+    ]
+
+
+def test_list_cached_gguf_dedupes_shared_blobs_across_revisions(monkeypatch, tmp_path):
+    shared = "blobs/shared-q4"
+    repo = _repo(
+        "Org/SharedBlobRepo",
+        [],
+        tmp_path / "models--Org--SharedBlobRepo",
+        revisions = [
+            SimpleNamespace(files = [_file("Q4_K_M.gguf", 5_000, blob_path = shared)]),
+            SimpleNamespace(files = [_file("Q4_K_M.gguf", 5_000, blob_path = shared)]),
+        ],
+    )
+
+    monkeypatch.setattr(
+        models_route,
+        "_all_hf_cache_scans",
+        lambda: [SimpleNamespace(repos = [repo])],
+    )
+
+    result = asyncio.run(models_route.list_cached_gguf(current_subject = "test-user"))
+
+    assert result["cached"] == [
+        {
+            "repo_id": "Org/SharedBlobRepo",
+            "size_bytes": 5_000,
+            "cache_path": str(repo.repo_path),
+        }
+    ]
+
+
+def test_list_cached_models_skips_non_suffix_repo_when_gguf_files_exist(
+    monkeypatch, tmp_path
+):
+    mixed = _repo(
+        "Org/MixedRepo",
+        [
+            _file("Q4_K_M.gguf", 5_000),
+            _file("model.safetensors", 10_000),
+        ],
+        tmp_path / "models--Org--MixedRepo",
+    )
+
+    monkeypatch.setattr(
+        models_route,
+        "_all_hf_cache_scans",
+        lambda: [SimpleNamespace(repos = [mixed])],
+    )
+
+    result = asyncio.run(models_route.list_cached_models(current_subject = "test-user"))
+
+    assert result["cached"] == []
+
+
+def test_list_cached_gguf_includes_mixed_repo_with_gguf_and_safetensors(
+    monkeypatch, tmp_path
+):
+    """Mirror of the _skips_ test: the mixed repo should still surface in
+    cached-gguf so the picker can show it as a GGUF download."""
+    mixed = _repo(
+        "Org/MixedRepo",
+        [
+            _file("Q4_K_M.gguf", 5_000),
+            _file("model.safetensors", 10_000),
+        ],
+        tmp_path / "models--Org--MixedRepo",
+    )
+
+    monkeypatch.setattr(
+        models_route,
+        "_all_hf_cache_scans",
+        lambda: [SimpleNamespace(repos = [mixed])],
+    )
+
+    result = asyncio.run(models_route.list_cached_gguf(current_subject = "test-user"))
+
+    assert result["cached"] == [
+        {
+            "repo_id": "Org/MixedRepo",
+            "size_bytes": 5_000,
+            "cache_path": str(mixed.repo_path),
+        }
+    ]
+
+
+def test_list_cached_gguf_handles_none_size_on_disk(monkeypatch, tmp_path):
+    """A partial/interrupted GGUF download has ``size_on_disk = None``. The
+    route must treat the unknown bytes as zero instead of raising TypeError
+    out of ``sum()`` and wiping the entire response."""
+    partial = _repo(
+        "Org/PartialDownload",
+        [_file("Q4_K_M.gguf", None), _file("Q6_K.gguf", 5_000)],
+        tmp_path / "models--Org--PartialDownload",
+    )
+
+    monkeypatch.setattr(
+        models_route,
+        "_all_hf_cache_scans",
+        lambda: [SimpleNamespace(repos = [partial])],
+    )
+
+    result = asyncio.run(models_route.list_cached_gguf(current_subject = "test-user"))
+
+    assert result["cached"] == [
+        {
+            "repo_id": "Org/PartialDownload",
+            "size_bytes": 5_000,
+            "cache_path": str(partial.repo_path),
+        }
+    ]
+
+
+def test_list_cached_gguf_skips_malformed_repo_without_wiping_response(
+    monkeypatch, tmp_path
+):
+    """One repo raising during classification must not poison the response
+    for every other repo in the scan."""
+
+    class _ExplodingRepo:
+        repo_id = "Org/Broken"
+        repo_type = "model"
+        repo_path = tmp_path / "models--Org--Broken"
+
+        @property
+        def revisions(self):
+            raise RuntimeError("boom")
+
+    healthy = _repo(
+        "Org/Healthy",
+        [_file("Q4_K_M.gguf", 5_000)],
+        tmp_path / "models--Org--Healthy",
+    )
+
+    monkeypatch.setattr(
+        models_route,
+        "_all_hf_cache_scans",
+        lambda: [SimpleNamespace(repos = [_ExplodingRepo(), healthy])],
+    )
+
+    result = asyncio.run(models_route.list_cached_gguf(current_subject = "test-user"))
+
+    assert result["cached"] == [
+        {
+            "repo_id": "Org/Healthy",
+            "size_bytes": 5_000,
+            "cache_path": str(healthy.repo_path),
+        }
+    ]
+
+
+def test_list_cached_gguf_skips_repo_with_only_mmproj_gguf(monkeypatch, tmp_path):
+    """A repo whose only ``.gguf`` artifact is an mmproj vision adapter
+    must not be classified as a GGUF repo: the variant selector filters
+    mmproj out and the picker would otherwise show zero variants."""
+    mmproj_only = _repo(
+        "Org/MmprojOnly",
+        [
+            _file("mmproj-Q8_0.gguf", 5_000),
+            _file("model.safetensors", 10_000),
+        ],
+        tmp_path / "models--Org--MmprojOnly",
+    )
+
+    monkeypatch.setattr(
+        models_route,
+        "_all_hf_cache_scans",
+        lambda: [SimpleNamespace(repos = [mmproj_only])],
+    )
+
+    result = asyncio.run(models_route.list_cached_gguf(current_subject = "test-user"))
+
+    assert result["cached"] == []
+
+
+def test_list_cached_models_includes_repo_with_only_mmproj_gguf(monkeypatch, tmp_path):
+    """Mirror of the cached-gguf skip: a safetensors repo with an
+    auxiliary mmproj vision adapter must still surface in cached-models
+    so the user can load it as a normal model."""
+    mmproj_aux = _repo(
+        "Org/MmprojAux",
+        [
+            _file("mmproj-Q8_0.gguf", 5_000),
+            _file("model.safetensors", 10_000),
+        ],
+        tmp_path / "models--Org--MmprojAux",
+    )
+
+    monkeypatch.setattr(
+        models_route,
+        "_all_hf_cache_scans",
+        lambda: [SimpleNamespace(repos = [mmproj_aux])],
+    )
+
+    result = asyncio.run(models_route.list_cached_models(current_subject = "test-user"))
+
+    assert result["cached"] == [
+        {
+            "repo_id": "Org/MmprojAux",
+            "size_bytes": 15_000,
+        }
+    ]
+
+
+def test_list_cached_gguf_includes_vision_repo_with_main_gguf_and_mmproj(
+    monkeypatch, tmp_path
+):
+    """A vision-capable GGUF repo (main weight + mmproj adapter) is still
+    a GGUF repo. The reported size is the main weight size; mmproj is
+    excluded from the GGUF-size accounting because it is filtered out at
+    classification time."""
+    vision_repo = _repo(
+        "Org/VisionGguf",
+        [
+            _file("Q4_K_M.gguf", 5_000),
+            _file("mmproj-Q8_0.gguf", 1_000),
+        ],
+        tmp_path / "models--Org--VisionGguf",
+    )
+
+    monkeypatch.setattr(
+        models_route,
+        "_all_hf_cache_scans",
+        lambda: [SimpleNamespace(repos = [vision_repo])],
+    )
+
+    result = asyncio.run(models_route.list_cached_gguf(current_subject = "test-user"))
+
+    assert result["cached"] == [
+        {
+            "repo_id": "Org/VisionGguf",
+            "size_bytes": 5_000,
+            "cache_path": str(vision_repo.repo_path),
+        }
+    ]
--- a/studio/backend/tests/test_export_log_cursor.py
+++ b/studio/backend/tests/test_export_log_cursor.py
@@ -0,0 +1,179 @@
+# SPDX-License-Identifier: AGPL-3.0-only
+# Copyright 2026-present the Unsloth AI Inc. team. All rights reserved. See /studio/LICENSE.AGPL-3.0
+
+"""
+Regression tests for the export log ring-buffer cursor semantics.
+
+Context: the live export log SSE stream has a race where the frontend
+opens the SSE connection AFTER the POST that starts the export. Any
+lines the worker subprocess emits during the gap between POST and SSE
+connect get buffered with seqs 1..k, and then the SSE default cursor
+`get_current_log_seq()` returns k -- so lines 1..k are forever
+unreachable to that client.
+
+Fix: `clear_logs()` snapshots the pre-run seq into `_run_start_seq`
+(exposed via `get_run_start_seq()`), and `routes/export.py` defaults
+the SSE cursor to that snapshot instead of the current seq. Every line
+appended during the current run has seq strictly greater than the
+snapshot, so the client sees the full run regardless of when it
+connects.
+
+These tests exercise the orchestrator-side contract only (no
+subprocess, no FastAPI, no frontend). The routes-level integration
+with get_run_start_seq() is a one-line edit covered by manual testing
+and the frontend build.
+"""
+
+from __future__ import annotations
+
+import sys
+import types
+from pathlib import Path
+
+import pytest
+
+
+# Backend root on sys.path so `from core.export.orchestrator import ...`
+# and friends resolve without the studio app bootstrap.
+_BACKEND_DIR = Path(__file__).resolve().parent.parent
+if str(_BACKEND_DIR) not in sys.path:
+    sys.path.insert(0, str(_BACKEND_DIR))
+
+# ExportOrchestrator imports structlog and a few heavy modules at the
+# top of orchestrator.py. Stub the ones we don't need in these unit
+# tests so the import succeeds on machines without the full studio
+# venv.
+_loggers_stub = types.ModuleType("loggers")
+_loggers_stub.get_logger = lambda name: __import__("logging").getLogger(name)
+sys.modules.setdefault("loggers", _loggers_stub)
+
+# structlog is only used for a module-level import; a bare stub is
+# enough because we never call into it in these tests.
+sys.modules.setdefault("structlog", types.ModuleType("structlog"))
+
+# utils.paths.outputs_root is only called inside scan_checkpoints which
+# we don't hit in these tests. Provide a stub module so the top-level
+# import in orchestrator.py resolves.
+_utils_pkg = types.ModuleType("utils")
+_utils_pkg.__path__ = []  # mark as package
+_utils_paths_stub = types.ModuleType("utils.paths")
+_utils_paths_stub.outputs_root = lambda: Path("/tmp")
+sys.modules.setdefault("utils", _utils_pkg)
+sys.modules.setdefault("utils.paths", _utils_paths_stub)
+
+
+@pytest.fixture
+def orchestrator():
+    """Fresh ExportOrchestrator with only the log-buffer state exercised."""
+    from core.export.orchestrator import ExportOrchestrator
+
+    return ExportOrchestrator()
+
+
+def _append(orch, line: str, stream: str = "stdout") -> None:
+    """Shortcut for simulating a worker log message."""
+    orch._append_log({"type": "log", "stream": stream, "line": line, "ts": 0.0})
+
+
+# ---------------------------------------------------------------------------
+# clear_logs() semantics
+# ---------------------------------------------------------------------------
+
+
+def test_run_start_seq_is_zero_before_any_logs(orchestrator) -> None:
+    """A brand-new orchestrator must report run_start_seq == 0 so a
+    first SSE connection picks up every line from seq 1 onward."""
+    assert orchestrator.get_run_start_seq() == 0
+
+
+def test_clear_logs_snapshots_current_seq(orchestrator) -> None:
+    """clear_logs() must capture _log_seq BEFORE clearing the buffer,
+    so subsequent runs can anchor their SSE cursor at the snapshot."""
+    _append(orchestrator, "old run line 1")
+    _append(orchestrator, "old run line 2")
+    _append(orchestrator, "old run line 3")
+    assert orchestrator.get_current_log_seq() == 3
+
+    orchestrator.clear_logs()
+
+    assert orchestrator.get_run_start_seq() == 3
+    assert orchestrator.get_current_log_seq() == 3  # seq counter preserved
+
+
+# ---------------------------------------------------------------------------
+# Race regression: SSE connects AFTER lines have been emitted
+# ---------------------------------------------------------------------------
+
+
+def test_sse_default_cursor_catches_all_current_run_lines(orchestrator) -> None:
+    """Simulate the POST-then-SSE race: worker starts emitting lines
+    immediately after clear_logs(), SSE connects several lines later.
+    Using get_run_start_seq() as the default cursor MUST return every
+    line emitted since clear_logs() ran.
+
+    Pre-fix, the SSE defaulted to get_current_log_seq() at connect
+    time, i.e. the latest seq, so every line already buffered was missed.
+    """
+    # Previous run leaves some buffered lines.
+    _append(orchestrator, "previous run line A")
+    _append(orchestrator, "previous run line B")
+
+    # New run starts: orchestrator clears the buffer and snapshots seq.
+    orchestrator.clear_logs()
+    run_start = orchestrator.get_run_start_seq()
+
+    # Worker emits early lines BEFORE the SSE connects.
+    _append(orchestrator, "Importing Unsloth...")
+    _append(orchestrator, "Loading checkpoint: /foo/bar")
+    _append(orchestrator, "Starting export...")
+
+    # SSE connects now and asks "give me everything after the run
+    # start cursor".
+    entries, new_cursor = orchestrator.get_logs_since(run_start)
+
+    # All three early lines must be present. Pre-fix this was [].
+    lines = [e["line"] for e in entries]
+    assert lines == [
+        "Importing Unsloth...",
+        "Loading checkpoint: /foo/bar",
+        "Starting export...",
+    ]
+    assert new_cursor == entries[-1]["seq"]
+
+
+def test_sse_default_cursor_excludes_previous_run(orchestrator) -> None:
+    """After clear_logs(), lines from the PREVIOUS run must not leak
+    into the new run's SSE stream. Pre-fix this worked correctly
+    (clear_logs cleared the deque); the fix must preserve it.
+    """
+    _append(orchestrator, "previous run line 1")
+    _append(orchestrator, "previous run line 2")
+    _append(orchestrator, "previous run line 3")
+    assert orchestrator.get_current_log_seq() == 3
+
+    orchestrator.clear_logs()
+    run_start = orchestrator.get_run_start_seq()
+
+    _append(orchestrator, "new run line")
+
+    entries, _ = orchestrator.get_logs_since(run_start)
+    assert [e["line"] for e in entries] == ["new run line"]
+
+
+def test_clear_logs_twice_advances_run_start(orchestrator) -> None:
+    """Successive clear_logs() calls (e.g. cleanup -> load ->
+    export in the same dialog session) must each re-anchor run_start
+    at the current seq, so successive runs each start with a fresh
+    low-water mark."""
+    _append(orchestrator, "run 1 line a")
+    _append(orchestrator, "run 1 line b")
+
+    orchestrator.clear_logs()
+    assert orchestrator.get_run_start_seq() == 2
+
+    _append(orchestrator, "run 2 line a")
+    _append(orchestrator, "run 2 line b")
+    _append(orchestrator, "run 2 line c")
+
+    orchestrator.clear_logs()
+    assert orchestrator.get_run_start_seq() == 5
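Note on the cursor semantics tested above: the contract is that the seq counter never resets, `clear_logs()` snapshots it before emptying the buffer, and a client anchored at that snapshot sees the whole current run. The sketch below is a hypothetical stand-in, not the actual `ExportOrchestrator`. The class name `LogBuffer`, its `append` method, and the `maxlen` default are assumptions, and locking is left out.

```python
from collections import deque


class LogBuffer:
    """Hypothetical stand-in for the orchestrator's log ring buffer."""

    def __init__(self, maxlen: int = 1000) -> None:
        self._entries = deque(maxlen = maxlen)
        self._log_seq = 0  # monotonically increasing, never reset
        self._run_start_seq = 0  # seq observed when the current run began

    def append(self, line: str) -> None:
        self._log_seq += 1
        self._entries.append({"seq": self._log_seq, "line": line})

    def clear_logs(self) -> None:
        # Snapshot BEFORE clearing: every line of the new run gets a seq
        # strictly greater than this value.
        self._run_start_seq = self._log_seq
        self._entries.clear()

    def get_run_start_seq(self) -> int:
        return self._run_start_seq

    def get_current_log_seq(self) -> int:
        return self._log_seq

    def get_logs_since(self, cursor: int):
        entries = [e for e in self._entries if e["seq"] > cursor]
        new_cursor = entries[-1]["seq"] if entries else cursor
        return entries, new_cursor


buf = LogBuffer()
buf.append("previous run line")
buf.clear_logs()
buf.append("Importing Unsloth...")
buf.append("Starting export...")

# A late client anchored at the current seq gets nothing ...
late, _ = buf.get_logs_since(buf.get_current_log_seq())
# ... while one anchored at the run-start snapshot gets the whole run.
full, cursor = buf.get_logs_since(buf.get_run_start_seq())
```

Anchoring at the snapshot rather than at zero is what keeps the previous run out of the stream even if the buffer had not been cleared, since old lines all carry a seq at or below the snapshot.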
--- a/studio/backend/tests/test_gpu_selection.py
+++ b/studio/backend/tests/test_gpu_selection.py
--- a/studio/backend/tests/test_gpu_selection_sandbox.py
+++ b/studio/backend/tests/test_gpu_selection_sandbox.py
@@ -0,0 +1,544 @@
+#!/usr/bin/env python3
+"""
+Sandbox test for multi-GPU selection logic.
+
+Tests the core GPU selection, memory estimation, and device_map logic
+in an isolated environment. Can be run on Linux, macOS, and Windows
+without requiring actual GPUs -- all hardware calls are mocked.
+
+Usage:
+    python -m pytest studio/backend/tests/test_gpu_selection_sandbox.py -v
+    # or directly:
+    python studio/backend/tests/test_gpu_selection_sandbox.py
+"""
+
+import os
+import sys
+import unittest
+from pathlib import Path
+from unittest.mock import patch, MagicMock
+
+# Ensure backend is on sys.path
+_backend_root = Path(__file__).resolve().parent.parent
+if str(_backend_root) not in sys.path:
+    sys.path.insert(0, str(_backend_root))
+
+
+def _make_fake_config(
+    vocab_size = 32000,
+    hidden_size = 4096,
+    intermediate_size = 11008,
+    num_hidden_layers = 32,
+    num_attention_heads = 32,
+    num_key_value_heads = 8,
+    tie_word_embeddings = False,
+):
+    """Create a fake HF config-like object for estimation tests."""
+    from types import SimpleNamespace
+
+    return SimpleNamespace(
+        vocab_size = vocab_size,
+        hidden_size = hidden_size,
+        intermediate_size = intermediate_size,
+        num_hidden_layers = num_hidden_layers,
+        num_attention_heads = num_attention_heads,
+        num_key_value_heads = num_key_value_heads,
+        tie_word_embeddings = tie_word_embeddings,
+    )
+
+
+class TestEstimateFP16ModelSizeFromConfig(unittest.TestCase):
+    """Test the config-based model size estimation."""
+
+    def test_llama_8b_size_reasonable(self):
+        from utils.hardware.hardware import _estimate_fp16_model_size_bytes_from_config
+
+        config = _make_fake_config(
+            vocab_size = 128256,
+            hidden_size = 4096,
+            intermediate_size = 14336,
+            num_hidden_layers = 32,
+            num_attention_heads = 32,
+            num_key_value_heads = 8,
+            tie_word_embeddings = False,
+        )
+        size = _estimate_fp16_model_size_bytes_from_config(config)
+        self.assertIsNotNone(size)
+        size_gb = size / (1024**3)
+        # Llama 3.1 8B should be ~15GB in fp16
+        self.assertGreater(size_gb, 12)
+        self.assertLess(size_gb, 20)
+
+    def test_small_model(self):
+        from utils.hardware.hardware import _estimate_fp16_model_size_bytes_from_config
+
+        config = _make_fake_config(
+            vocab_size = 32000,
+            hidden_size = 2048,
+            intermediate_size = 5504,
+            num_hidden_layers = 22,
+            num_attention_heads = 32,
+            num_key_value_heads = 4,
+        )
+        size = _estimate_fp16_model_size_bytes_from_config(config)
+        self.assertIsNotNone(size)
+        size_gb = size / (1024**3)
+        # ~1B model should be ~2GB in fp16
+        self.assertGreater(size_gb, 1)
+        self.assertLess(size_gb, 5)
+
+    def test_returns_none_for_incomplete_config(self):
+        from utils.hardware.hardware import _estimate_fp16_model_size_bytes_from_config
+        from types import SimpleNamespace
+
+        config = SimpleNamespace(vocab_size = 32000)  # Missing most fields
+        size = _estimate_fp16_model_size_bytes_from_config(config)
+        self.assertIsNone(size)
+
+    def test_moe_model(self):
+        from utils.hardware.hardware import _estimate_fp16_model_size_bytes_from_config
+        from types import SimpleNamespace
+
+        config = SimpleNamespace(
+            vocab_size = 152064,
+            hidden_size = 3584,
+            intermediate_size = 18944,
+            num_hidden_layers = 28,
+            num_attention_heads = 28,
+            num_key_value_heads = 4,
+            tie_word_embeddings = False,
+            num_local_experts = 64,
+            moe_intermediate_size = 2560,
+        )
+        size = _estimate_fp16_model_size_bytes_from_config(config)
+        self.assertIsNotNone(size)
+        size_gb = size / (1024**3)
+        # MoE model with 64 experts should be large
+        self.assertGreater(size_gb, 50)
+
+
+class TestEstimateRequiredModelMemory(unittest.TestCase):
+    """Test memory requirement estimation."""
+
+    def test_inference_fp16_uses_1_3x(self):
+        from utils.hardware.hardware import estimate_required_model_memory_gb
+
+        with patch(
+            "utils.hardware.hardware.estimate_fp16_model_size_bytes",
+            return_value = (10 * (1024**3), "config"),  # 10GB model
+        ):
+            required, meta = estimate_required_model_memory_gb(
+                "test/model",
+                training_type = None,  # inference
+                load_in_4bit = False,
+            )
+            self.assertIsNotNone(required)
+            self.assertAlmostEqual(required, 13.0, places = 0)
+            self.assertEqual(meta["mode"], "inference")
+
+    def test_inference_4bit_uses_reduced_estimate(self):
+        from utils.hardware.hardware import estimate_required_model_memory_gb
+
+        with patch(
+            "utils.hardware.hardware.estimate_fp16_model_size_bytes",
+            return_value = (30 * (1024**3), "config"),  # 30GB fp16 model
+        ):
+            required, meta = estimate_required_model_memory_gb(
+                "test/model",
+                training_type = None,  # inference
+                load_in_4bit = True,
+            )
+            self.assertIsNotNone(required)
+            # 4bit base = 30/3.2 = 9.375GB, required = 9.375 + max(9.375*0.3, 2) = 12.19GB
+            self.assertAlmostEqual(required, 12.2, places = 0)
+
+    def test_4bit_training_reduces_base(self):
+        from utils.hardware.hardware import estimate_required_model_memory_gb
+
+        with patch(
+            "utils.hardware.hardware.estimate_fp16_model_size_bytes",
+            return_value = (30 * (1024**3), "config"),  # 30GB fp16 model
+        ):
+            required, meta = estimate_required_model_memory_gb(
+                "test/model",
+                training_type = "LoRA/QLoRA",
+                load_in_4bit = True,
+            )
+            self.assertIsNotNone(required)
+            # fallback: base=30/3.2=9.375, lora=30*0.04=1.2, act=30*0.15=4.5, cuda=1.4
+            self.assertAlmostEqual(required, 16.5, places = 0)
+
+    def test_full_finetune_uses_3_5x(self):
+        from utils.hardware.hardware import estimate_required_model_memory_gb
+
+        with patch(
+            "utils.hardware.hardware.estimate_fp16_model_size_bytes",
+            return_value = (10 * (1024**3), "config"),  # 10GB model
+        ):
+            required, meta = estimate_required_model_memory_gb(
+                "test/model",
+                training_type = "Full Finetuning",
+            )
+            self.assertIsNotNone(required)
+            # fallback: 10 * 3.5 + 1.4 cuda overhead = 36.4
+            self.assertAlmostEqual(required, 36.4, places = 0)
+
+    def test_returns_none_when_unavailable(self):
+        from utils.hardware.hardware import estimate_required_model_memory_gb
+
+        with patch(
+            "utils.hardware.hardware.estimate_fp16_model_size_bytes",
+            return_value = (None, "unavailable"),
+        ):
+            required, meta = estimate_required_model_memory_gb("test/model")
+            self.assertIsNone(required)
+
+
+class TestAutoSelectGpuIds(unittest.TestCase):
+    """Test automatic GPU selection based on model size and free memory."""
+
+    def _make_utilization(self, devices):
+        """Create a fake utilization response."""
+        return {
+            "available": True,
+            "devices": [
+                {
+                    "index": idx,
+                    "vram_total_gb": total,
+                    "vram_used_gb": total - free,
+                }
+                for idx, total, free in devices
+            ],
+        }
+
+    def test_single_gpu_sufficient(self):
+        from utils.hardware.hardware import auto_select_gpu_ids
+        import utils.hardware.hardware as hw
+
+        with (
+            patch.object(hw, "get_device", return_value = hw.DeviceType.CUDA),
+            patch.object(
+                hw,
+                "estimate_required_model_memory_gb",
+                return_value = (
+                    10.0,
+                    {
+                        "mode": "inference",
+                        "required_gb": 10.0,
+                        "model_size_source": "config",
+                        "model_size_gb": 7.7,
+                    },
+                ),
+            ),
+            patch.object(
+                hw,
+                "_get_parent_visible_gpu_spec",
+                return_value = {
+                    "raw": "0,1,2,3",
+                    "numeric_ids": [0, 1, 2, 3],
+                    "supports_explicit_gpu_ids": True,
+                },
+            ),
+            patch.object(hw, "get_parent_visible_gpu_ids", return_value = [0, 1, 2, 3]),
+            patch.object(
+                hw,
+                "get_visible_gpu_utilization",
+                return_value = self._make_utilization(
+                    [
+                        (0, 80.0, 75.0),
+                        (1, 80.0, 78.0),
+                        (2, 80.0, 70.0),
+                        (3, 80.0, 72.0),
+                    ]
+                ),
+            ),
+        ):
+            selected, meta = auto_select_gpu_ids("test/model")
+            # Should pick GPU 1 (most free memory: 78GB) -- enough for 10GB
+            self.assertEqual(len(selected), 1)
+            self.assertEqual(selected[0], 1)
+
+    def test_two_gpus_needed(self):
+        from utils.hardware.hardware import auto_select_gpu_ids
+        import utils.hardware.hardware as hw
+
+        with (
+            patch.object(hw, "get_device", return_value = hw.DeviceType.CUDA),
+            patch.object(
+                hw,
+                "estimate_required_model_memory_gb",
+                return_value = (
+                    50.0,
+                    {
+                        "mode": "inference",
+                        "required_gb": 50.0,
+                        "model_size_source": "config",
+                        "model_size_gb": 38.0,
+                    },
+                ),
+            ),
+            patch.object(
+                hw,
+                "_get_parent_visible_gpu_spec",
+                return_value = {
+                    "raw": "0,1",
+                    "numeric_ids": [0, 1],
+                    "supports_explicit_gpu_ids": True,
+                },
+            ),
+            patch.object(hw, "get_parent_visible_gpu_ids", return_value = [0, 1]),
+            patch.object(
+                hw,
+                "get_visible_gpu_utilization",
+                return_value = self._make_utilization(
+                    [
+                        (0, 40.0, 30.0),  # 30GB free
+                        (1, 40.0, 35.0),  # 35GB free
+                    ]
+                ),
+            ),
+        ):
+            selected, meta = auto_select_gpu_ids("test/model")
+            # 35GB (first) + 30*0.85 (second) = 60.5GB > 50GB
+            self.assertEqual(len(selected), 2)
+
+    def test_non_cuda_returns_none(self):
+        from utils.hardware.hardware import auto_select_gpu_ids
+        import utils.hardware.hardware as hw
+
+        with patch.object(hw, "get_device", return_value = hw.DeviceType.CPU):
+            selected, meta = auto_select_gpu_ids("test/model")
+            self.assertIsNone(selected)
+            self.assertEqual(meta["selection_mode"], "non_cuda")
+
+
+class TestGetDeviceMap(unittest.TestCase):
+    """Test device_map string generation."""
+
+    def test_single_gpu_returns_sequential(self):
+        from utils.hardware.hardware import get_device_map
+        import utils.hardware.hardware as hw
+
+        with (
+            patch.object(hw, "get_device", return_value = hw.DeviceType.CUDA),
+            patch.object(
+                hw,
+                "_get_parent_visible_gpu_spec",
+                return_value = {
+                    "raw": "0",
+                    "numeric_ids": [0],
+                    "supports_explicit_gpu_ids": True,
+                },
+            ),
+            patch.object(hw, "get_visible_gpu_count", return_value = 1),
+        ):
+            dm = get_device_map(gpu_ids = [0])
+            self.assertEqual(dm, "sequential")
+
+    def test_multi_gpu_returns_balanced(self):
+        from utils.hardware.hardware import get_device_map
+        import utils.hardware.hardware as hw
+
+        with patch.object(hw, "get_device", return_value = hw.DeviceType.CUDA):
+            dm = get_device_map(gpu_ids = [0, 1])
+            self.assertEqual(dm, "balanced")
+
+    def test_cpu_returns_sequential(self):
+        from utils.hardware.hardware import get_device_map
+        import utils.hardware.hardware as hw
+
+        with patch.object(hw, "get_device", return_value = hw.DeviceType.CPU):
+            dm = get_device_map(gpu_ids = None)
+            self.assertEqual(dm, "sequential")
+
+
+class TestResolveRequestedGpuIds(unittest.TestCase):
+    """Test GPU ID validation."""
+
+    def test_none_returns_parent_visible(self):
+        from utils.hardware.hardware import resolve_requested_gpu_ids
+
+        with (
+            patch.dict(os.environ, {"CUDA_VISIBLE_DEVICES": "2,3"}, clear = False),
+            patch("utils.hardware.hardware.get_physical_gpu_count", return_value = 8),
+        ):
+            result = resolve_requested_gpu_ids(None)
+            self.assertEqual(result, [2, 3])
+
+    def test_empty_list_returns_parent_visible(self):
+        from utils.hardware.hardware import resolve_requested_gpu_ids
+
+        with (
+            patch.dict(os.environ, {"CUDA_VISIBLE_DEVICES": "2,3"}, clear = False),
+            patch("utils.hardware.hardware.get_physical_gpu_count", return_value = 8),
+        ):
+            result = resolve_requested_gpu_ids([])
+            self.assertEqual(result, [2, 3])
+
+    def test_duplicates_rejected(self):
+        from utils.hardware.hardware import resolve_requested_gpu_ids
+
+        with (
+            patch.dict(os.environ, {"CUDA_VISIBLE_DEVICES": "0,1,2"}, clear = False),
+            patch("utils.hardware.hardware.get_physical_gpu_count", return_value = 8),
+        ):
+            with self.assertRaises(ValueError):
+                resolve_requested_gpu_ids([1, 1])
+
+    def test_out_of_range_rejected(self):
+        from utils.hardware.hardware import resolve_requested_gpu_ids
+
+        with (
+            patch.dict(os.environ, {"CUDA_VISIBLE_DEVICES": "0,1"}, clear = False),
+            patch("utils.hardware.hardware.get_physical_gpu_count", return_value = 4),
+        ):
+            with self.assertRaises(ValueError):
+                resolve_requested_gpu_ids([5])
+
+    def test_uuid_env_var_rejects_explicit_ids(self):
+        from utils.hardware.hardware import resolve_requested_gpu_ids
+
+        with (
+            patch.dict(
+                os.environ, {"CUDA_VISIBLE_DEVICES": "GPU-abc,GPU-def"}, clear = False
+            ),
+            patch("utils.hardware.hardware.get_physical_gpu_count", return_value = 8),
+        ):
+            with self.assertRaises(ValueError):
+                resolve_requested_gpu_ids([0])
+
+
+class TestApplyGpuIds(unittest.TestCase):
+    """Test CUDA_VISIBLE_DEVICES environment variable setting."""
+
+    def test_apply_list(self):
+        from utils.hardware.hardware import apply_gpu_ids
+
+        with patch.dict(os.environ, {}, clear = False):
+            apply_gpu_ids([3, 5])
+            self.assertEqual(os.environ.get("CUDA_VISIBLE_DEVICES"), "3,5")
+
+    def test_apply_none_does_nothing(self):
+        from utils.hardware.hardware import apply_gpu_ids
+
+        original = os.environ.get("CUDA_VISIBLE_DEVICES")
+        apply_gpu_ids(None)
+        self.assertEqual(os.environ.get("CUDA_VISIBLE_DEVICES"), original)
+
+
+class TestMultiGpuOverheadAccounting(unittest.TestCase):
+    """Test that multi-GPU overhead is applied correctly.
+
+    The first GPU should keep its full free memory, and only
+    additional GPUs should have the overhead factor applied.
+    """
+
+    def _make_utilization(self, devices):
+        return {
+            "available": True,
+            "devices": [
+                {
+                    "index": idx,
+                    "vram_total_gb": total,
+                    "vram_used_gb": total - free,
+                }
+                for idx, total, free in devices
+            ],
+        }
+
+    def test_first_gpu_not_penalized(self):
+        """A model that just fits on 1 GPU should not require 2 GPUs."""
+        from utils.hardware.hardware import auto_select_gpu_ids
+        import utils.hardware.hardware as hw
+
+        # Model requires 79GB, GPU has 80GB free
+        with (
+            patch.object(hw, "get_device", return_value = hw.DeviceType.CUDA),
+            patch.object(
+                hw,
+                "estimate_required_model_memory_gb",
+                return_value = (
+                    79.0,
+                    {
+                        "mode": "inference",
+                        "required_gb": 79.0,
+                        "model_size_source": "config",
+                        "model_size_gb": 60.0,
+                    },
+                ),
+            ),
+            patch.object(
+                hw,
+                "_get_parent_visible_gpu_spec",
+                return_value = {
+                    "raw": "0,1",
+                    "numeric_ids": [0, 1],
+                    "supports_explicit_gpu_ids": True,
+                },
+            ),
+            patch.object(hw, "get_parent_visible_gpu_ids", return_value = [0, 1]),
+            patch.object(
+                hw,
+                "get_visible_gpu_utilization",
+                return_value = self._make_utilization(
+                    [
+                        (0, 80.0, 80.0),
+                        (1, 80.0, 80.0),
+                    ]
+                ),
+            ),
+        ):
+            selected, meta = auto_select_gpu_ids("test/model")
+            # Should fit on 1 GPU (80GB >= 79GB)
+            self.assertEqual(len(selected), 1)
+
+    def test_second_gpu_has_overhead(self):
+        """When 2 GPUs are needed, the second one's contribution is reduced."""
+        from utils.hardware.hardware import auto_select_gpu_ids
+        import utils.hardware.hardware as hw
+
+        # Model requires 110GB. First GPU has 80GB, second has 40GB.
+        # With overhead: 80 + 40*0.85 = 114GB -- just enough
+        with (
+            patch.object(hw, "get_device", return_value = hw.DeviceType.CUDA),
+            patch.object(
+                hw,
+                "estimate_required_model_memory_gb",
+                return_value = (
+                    110.0,
+                    {
+                        "mode": "inference",
+                        "required_gb": 110.0,
+                        "model_size_source": "config",
+                        "model_size_gb": 85.0,
+                    },
+                ),
+            ),
+            patch.object(
+                hw,
+                "_get_parent_visible_gpu_spec",
+                return_value = {
+                    "raw": "0,1",
+                    "numeric_ids": [0, 1],
+                    "supports_explicit_gpu_ids": True,
+                },
+            ),
+            patch.object(hw, "get_parent_visible_gpu_ids", return_value = [0, 1]),
+            patch.object(
+                hw,
+                "get_visible_gpu_utilization",
+                return_value = self._make_utilization(
+                    [
+                        (0, 80.0, 80.0),
+                        (1, 80.0, 40.0),
+                    ]
+                ),
+            ),
+        ):
+            selected, meta = auto_select_gpu_ids("test/model")
+            # Should use both GPUs
+            self.assertEqual(len(selected), 2)
+
+
+if __name__ == "__main__":
+    unittest.main()
--- a/studio/backend/tests/test_kv_cache_estimation.py
+++ b/studio/backend/tests/test_kv_cache_estimation.py
@@ -0,0 +1,929 @@
+# SPDX-License-Identifier: AGPL-3.0-only
+# Copyright 2026-present the Unsloth AI Inc. team. All rights reserved.
+
+"""Tests for 5-path architecture-aware KV cache VRAM estimation.
+
+Covers the GGUF metadata parser, _can_estimate_kv gate, all 5 estimation
+paths (MLA, Hybrid Mamba, Sliding Window, Standard GQA, Legacy), KV cache
+quantization, edge cases, and lifecycle (init/unload/reparse).
+
+Requires no GPU, network, or external libraries beyond pytest.
+Cross-platform: Linux, macOS, Windows, WSL.
+"""
+
+import io
+import struct
+import sys
+import types as _types
+from pathlib import Path
+
+import pytest
+
+# ---------------------------------------------------------------------------
+# Stub heavy / unavailable external dependencies before importing the
+# module under test.  Same pattern as test_native_context_length.py.
+# ---------------------------------------------------------------------------
+
+_BACKEND_DIR = str(Path(__file__).resolve().parent.parent)
+if _BACKEND_DIR not in sys.path:
+    sys.path.insert(0, _BACKEND_DIR)
+
+# loggers
+_loggers_stub = _types.ModuleType("loggers")
+_loggers_stub.get_logger = lambda name: __import__("logging").getLogger(name)
+sys.modules.setdefault("loggers", _loggers_stub)
+
+# structlog
+_structlog_stub = _types.ModuleType("structlog")
+sys.modules.setdefault("structlog", _structlog_stub)
+
+# httpx
+_httpx_stub = _types.ModuleType("httpx")
+for _exc_name in (
+    "ConnectError",
+    "TimeoutException",
+    "ReadTimeout",
+    "ReadError",
+    "RemoteProtocolError",
+    "CloseError",
+):
+    setattr(_httpx_stub, _exc_name, type(_exc_name, (Exception,), {}))
+
+
+class _FakeTimeout:
+    def __init__(self, *a, **kw):
+        pass
+
+
+_httpx_stub.Timeout = _FakeTimeout
+_httpx_stub.Client = type(
+    "Client",
+    (),
+    {
+        "__init__": lambda self, **kw: None,
+        "__enter__": lambda self: self,
+        "__exit__": lambda self, *a: None,
+    },
+)
+sys.modules.setdefault("httpx", _httpx_stub)
+
+from core.inference.llama_cpp import LlamaCppBackend
+
+# ---------------------------------------------------------------------------
+# Helpers
+# ---------------------------------------------------------------------------
+
+
+def _make_gguf_bytes(arch: str, kv_pairs: dict) -> bytes:
+    """Build a minimal GGUF v3 binary blob with the given KV metadata.
+
+    Only supports UINT32 (type 4), UINT64 (type 10), and STRING (type 8)
+    values, which is all the metadata parser reads.
+    """
+    buf = io.BytesIO()
+    # Header: magic, version, tensor_count, kv_count
+    buf.write(struct.pack("<I", 0x46554747))  # GGUF magic
+    buf.write(struct.pack("<I", 3))  # version 3
+    buf.write(struct.pack("<Q", 0))  # tensor_count
+    buf.write(struct.pack("<Q", len(kv_pairs)))
+
+    for key, val in kv_pairs.items():
+        key_bytes = key.encode("utf-8")
+        buf.write(struct.pack("<Q", len(key_bytes)))
+        buf.write(key_bytes)
+        if isinstance(val, str):
+            buf.write(struct.pack("<I", 8))  # STRING
+            val_bytes = val.encode("utf-8")
+            buf.write(struct.pack("<Q", len(val_bytes)))
+            buf.write(val_bytes)
+        elif isinstance(val, int):
+            if val <= 0xFFFFFFFF:
+                buf.write(struct.pack("<I", 4))  # UINT32
+                buf.write(struct.pack("<I", val))
+            else:
+                buf.write(struct.pack("<I", 10))  # UINT64
+                buf.write(struct.pack("<Q", val))
+        else:
+            raise TypeError(f"Unsupported value type: {type(val)}")
+    return buf.getvalue()
+
+
+def _backend_from_gguf(arch: str, fields: dict) -> LlamaCppBackend:
+    """Create a LlamaCppBackend with parsed GGUF metadata from given fields."""
+    kv = {"general.architecture": arch}
+    for k, v in fields.items():
+        kv[f"{arch}.{k}"] = v
+    import tempfile, os
+
+    data = _make_gguf_bytes(arch, kv)
+    fd, path = tempfile.mkstemp(suffix = ".gguf")
+    try:
+        os.write(fd, data)
+        os.close(fd)
+        b = LlamaCppBackend()
+        b._read_gguf_metadata(path)
+        return b
+    finally:
+        os.unlink(path)
+
+
+# ---------------------------------------------------------------------------
+# A. GGUF Parser Tests
+# ---------------------------------------------------------------------------
+
+
+class TestGGUFParserNewFields:
+    """Verify that the 8 new architecture-aware fields are correctly parsed."""
+
+    @pytest.mark.parametrize(
+        "field,gguf_key,value",
+        [
+            ("_kv_key_length", "attention.key_length", 128),
+            ("_kv_value_length", "attention.value_length", 128),
+            ("_sliding_window", "attention.sliding_window", 1024),
+            ("_full_attention_interval", "full_attention_interval", 4),
+            ("_kv_lora_rank", "attention.kv_lora_rank", 512),
+            ("_key_length_mla", "attention.key_length_mla", 256),
+            ("_ssm_inner_size", "ssm.inner_size", 6144),
+            ("_ssm_state_size", "ssm.state_size", 128),
+        ],
+    )
+    def test_field_parsed(self, field, gguf_key, value):
+        b = _backend_from_gguf("testarch", {gguf_key: value})
+        assert getattr(b, field) == value
+
+    def test_missing_fields_are_none(self):
+        b = _backend_from_gguf("testarch", {"block_count": 10})
+        for attr in [
+            "_kv_key_length",
+            "_kv_value_length",
+            "_sliding_window",
+            "_full_attention_interval",
+            "_kv_lora_rank",
+            "_key_length_mla",
+            "_ssm_inner_size",
+            "_ssm_state_size",
+        ]:
+            assert getattr(b, attr) is None
+
+    def test_all_13_fields_parsed_together(self):
+        fields = {
+            "context_length": 131072,
+            "block_count": 62,
+            "attention.head_count_kv": 16,
+            "attention.head_count": 32,
+            "embedding_length": 5376,
+            "attention.key_length": 128,
+            "attention.value_length": 128,
+            "attention.sliding_window": 1024,
+            "full_attention_interval": 6,
+            "attention.kv_lora_rank": 512,
+            "attention.key_length_mla": 256,
+            "ssm.inner_size": 4096,
+            "ssm.state_size": 128,
+        }
+        b = _backend_from_gguf("testarch", fields)
+        assert b._context_length == 131072
+        assert b._n_layers == 62
+        assert b._n_kv_heads == 16
+        assert b._n_heads == 32
+        assert b._embedding_length == 5376
+        assert b._kv_key_length == 128
+        assert b._kv_value_length == 128
+        assert b._sliding_window == 1024
+        assert b._full_attention_interval == 6
+        assert b._kv_lora_rank == 512
+        assert b._key_length_mla == 256
+        assert b._ssm_inner_size == 4096
+        assert b._ssm_state_size == 128
+
+
+class TestGGUFParserReset:
+    """Verify that fields are properly reset between parses."""
+
+    def test_reset_between_parses(self):
+        # First parse with all fields
+        b = _backend_from_gguf(
+            "arch1",
+            {
+                "block_count": 32,
+                "attention.key_length": 128,
+                "attention.kv_lora_rank": 512,
+                "ssm.inner_size": 4096,
+            },
+        )
+        assert b._kv_key_length == 128
+        assert b._kv_lora_rank == 512
+        assert b._ssm_inner_size == 4096
+
+        # Second parse without those fields -- they should be None
+        kv = {"general.architecture": "arch2", "arch2.block_count": 64}
+        import tempfile, os
+
+        data = _make_gguf_bytes("arch2", kv)
+        fd, path = tempfile.mkstemp(suffix = ".gguf")
+        try:
+            os.write(fd, data)
+            os.close(fd)
+            b._read_gguf_metadata(path)
+        finally:
+            os.unlink(path)
+        assert b._kv_key_length is None
+        assert b._kv_lora_rank is None
+        assert b._ssm_inner_size is None
+        assert b._n_layers == 64
+
+
+# ---------------------------------------------------------------------------
+# B. _can_estimate_kv Gate Tests
+# ---------------------------------------------------------------------------
+
+
+class TestCanEstimateKV:
+    """Verify gate logic across the supported field combinations."""
+
+    def test_no_layers_returns_false(self):
+        b = LlamaCppBackend()
+        b._n_layers = None
+        b._kv_key_length = 128
+        assert not b._can_estimate_kv()
+
+    def test_explicit_both_dims_sufficient(self):
+        b = LlamaCppBackend()
+        b._n_layers = 32
+        b._kv_key_length = 128
+        b._kv_value_length = 128
+        assert b._can_estimate_kv()
+
+    def test_key_length_alone_insufficient(self):
+        """key_length without value_length should NOT be enough."""
+        b = LlamaCppBackend()
+        b._n_layers = 32
+        b._kv_key_length = 128
+        assert not b._can_estimate_kv()
+
+    def test_kv_lora_rank_sufficient(self):
+        b = LlamaCppBackend()
+        b._n_layers = 61
+        b._kv_lora_rank = 512
+        assert b._can_estimate_kv()
+
+    def test_legacy_embed_plus_heads(self):
+        b = LlamaCppBackend()
+        b._n_layers = 28
+        b._embedding_length = 1024
+        b._n_heads = 16
+        assert b._can_estimate_kv()
+
+    def test_legacy_embed_plus_kv_heads(self):
+        b = LlamaCppBackend()
+        b._n_layers = 28
+        b._embedding_length = 1024
+        b._n_kv_heads = 8
+        assert b._can_estimate_kv()
+
+    def test_legacy_no_embed_returns_false(self):
+        b = LlamaCppBackend()
+        b._n_layers = 28
+        b._n_heads = 16
+        # No embedding_length, no new-style fields
+        assert not b._can_estimate_kv()
+
+    def test_fresh_backend_returns_false(self):
+        b = LlamaCppBackend()
+        assert not b._can_estimate_kv()
+
+
+# ---------------------------------------------------------------------------
+# C. Path 1: MLA Estimation
+# ---------------------------------------------------------------------------
+
+
+class TestMLAEstimation:
+    """MLA: K-only cache using compressed KV latent + RoPE."""
+
+    def _mla_backend(self, **overrides):
+        defaults = {
+            "_n_layers": 61,
+            "_n_kv_heads": 1,
+            "_n_heads": 128,
+            "_embedding_length": 7168,
+            "_kv_key_length": 576,
+            "_kv_value_length": 512,
+            "_kv_lora_rank": 512,
+            "_key_length_mla": 192,
+        }
+        defaults.update(overrides)
+        b = LlamaCppBackend()
+        for k, v in defaults.items():
+            setattr(b, k, v)
+        return b
+
+    def test_deepseek_v3_f16(self):
+        b = self._mla_backend()
+        # 61 layers * 163840 ctx * 1 head * 576 key_len * 2 bpe
+        expected = 61 * 163840 * 1 * 576 * 2
+        assert b._estimate_kv_cache_bytes(163840, "f16") == expected
+
+    def test_mla_ignores_value_length(self):
+        """MLA should NOT add value_length -- V is reconstructed from the latent."""
+        b = self._mla_backend()
+        result = b._estimate_kv_cache_bytes(1000, "f16")
+        # Should be n_layers * ctx * 1 * key_len(576) * 2
+        expected = 61 * 1000 * 1 * 576 * 2
+        assert result == expected
+
+    def test_mla_fallback_when_no_key_length(self):
+        """If key_length is missing, fall back to kv_lora_rank + key_length_mla."""
+        b = self._mla_backend(_kv_key_length = None)
+        # _key_length_mla=192 in default, so rope_dim=192
+        result = b._estimate_kv_cache_bytes(1000, "f16")
+        expected = 61 * 1000 * 1 * (512 + 192) * 2  # 704
+        assert result == expected
+
+    def test_mla_fallback_no_key_length_mla(self):
+        """If both key_length and key_length_mla are missing, fall back to +64."""
+        b = self._mla_backend(_kv_key_length = None, _key_length_mla = None)
+        result = b._estimate_kv_cache_bytes(1000, "f16")
+        expected = 61 * 1000 * 1 * (512 + 64) * 2  # 576
+        assert result == expected
+
+    def test_mla_defaults_n_kv_to_1_when_heads_absent(self):
+        """When n_kv_heads is None, MLA should default to n_kv=1, not n_heads."""
+        b = self._mla_backend(_n_kv_heads = None)  # n_heads=128 still set
+        result = b._estimate_kv_cache_bytes(1000, "f16")
+        # Should use n_kv_mla=1, NOT n_heads=128
+        expected = 61 * 1000 * 1 * 576 * 2
+        assert result == expected
+
+    def test_mla_q4_quantization(self):
+        b = self._mla_backend()
+        result_f16 = b._estimate_kv_cache_bytes(1000, "f16")
+        result_q4 = b._estimate_kv_cache_bytes(1000, "q4_0")
+        assert result_q4 < result_f16
+        # q4_0 bpe = 0.5625, f16 bpe = 2.0
+        assert result_q4 == int(61 * 1000 * 1 * 576 * 0.5625)
+
+
+# ---------------------------------------------------------------------------
+# D. Path 2: Hybrid Mamba Estimation
+# ---------------------------------------------------------------------------
+
+
+class TestHybridMambaEstimation:
+    """Hybrid Mamba: only attention layers (1 in N) need KV cache."""
+
+    def _hybrid_backend(self, **overrides):
+        defaults = {
+            "_n_layers": 64,
+            "_n_kv_heads": 4,
+            "_n_heads": 24,
+            "_embedding_length": 5120,
+            "_kv_key_length": 256,
+            "_kv_value_length": 256,
+            "_full_attention_interval": 4,
+            "_ssm_inner_size": 6144,
+            "_ssm_state_size": 128,
+        }
+        defaults.update(overrides)
+        b = LlamaCppBackend()
+        for k, v in defaults.items():
+            setattr(b, k, v)
+        return b
+
+    def test_qwen35_27b(self):
+        b = self._hybrid_backend()
+        # n_attn = 64 // 4 = 16
+        expected = 16 * 262144 * 4 * (256 + 256) * 2
+        assert b._estimate_kv_cache_bytes(262144, "f16") == expected
+
+    def test_qwen35_35b_a3b(self):
+        b = self._hybrid_backend(
+            _n_layers = 40,
+            _n_kv_heads = 2,
+            _n_heads = 16,
+            _embedding_length = 2048,
+            _ssm_inner_size = 4096,
+        )
+        # n_attn = 40 // 4 = 10
+        expected = 10 * 262144 * 2 * (256 + 256) * 2
+        assert b._estimate_kv_cache_bytes(262144, "f16") == expected
+
+    def test_hybrid_without_explicit_dims(self):
+        """Fall back to head_dim when key_length/value_length are missing."""
+        b = self._hybrid_backend(_kv_key_length = None, _kv_value_length = None)
+        head_dim = 5120 // 24  # 213
+        expected = 16 * 4096 * 4 * 2 * head_dim * 2
+        assert b._estimate_kv_cache_bytes(4096, "f16") == expected
+
+    def test_fai_zero_safety(self):
+        """full_attention_interval=0 should not cause ZeroDivisionError."""
+        b = self._hybrid_backend(_full_attention_interval = 0)
+        result = b._estimate_kv_cache_bytes(4096, "f16")
+        # fai=0 -> n_attn = n_layers (all layers)
+        expected = 64 * 4096 * 4 * (256 + 256) * 2
+        assert result == expected
+
+
+# ---------------------------------------------------------------------------
+# E. Path 3: Sliding Window Estimation
+# ---------------------------------------------------------------------------
+
+
+class TestSlidingWindowEstimation:
+    """SWA: n_layers // 4 global layers (full ctx), the rest sliding window."""
+
+    def _swa_backend(self, **overrides):
+        defaults = {
+            "_n_layers": 62,
+            "_n_kv_heads": 16,
+            "_n_heads": 32,
+            "_embedding_length": 5376,
+            "_kv_key_length": 128,
+            "_kv_value_length": 128,
+            "_sliding_window": 1024,
+        }
+        defaults.update(overrides)
+        b = LlamaCppBackend()
+        for k, v in defaults.items():
+            setattr(b, k, v)
+        return b
+
+    def test_gemma3(self):
+        b = self._swa_backend()
+        # 1/4 heuristic: 62 // 4 = 15 global, 47 SWA
+        n_global = max(1, 62 // 4)  # 15
+        n_swa = 62 - n_global  # 47
+        kv_per = 16 * (128 + 128) * 2
+        expected = int(n_global * 131072 * kv_per + n_swa * min(131072, 1024) * kv_per)
+        assert b._estimate_kv_cache_bytes(131072, "f16") == expected
+
+    def test_gpt_oss(self):
+        b = self._swa_backend(
+            _n_layers = 24,
+            _n_kv_heads = 8,
+            _n_heads = 64,
+            _embedding_length = 2880,
+            _kv_key_length = 64,
+            _kv_value_length = 64,
+            _sliding_window = 128,
+        )
+        # 1/4 heuristic: 24 // 4 = 6 global, 18 SWA
+        n_global = max(1, 24 // 4)  # 6
+        n_swa = 24 - n_global  # 18
+        kv_per = 8 * (64 + 64) * 2
+        expected = int(n_global * 131072 * kv_per + n_swa * min(131072, 128) * kv_per)
+        assert b._estimate_kv_cache_bytes(131072, "f16") == expected
+
+    def test_ctx_smaller_than_window(self):
+        """When context < sliding_window, SWA layers use full context anyway."""
+        b = self._swa_backend(_sliding_window = 8192)
+        n_global = max(1, 62 // 4)  # 15
+        n_swa = 62 - n_global  # 47
+        kv_per = 16 * (128 + 128) * 2
+        ctx = 4096
+        expected = int(n_global * ctx * kv_per + n_swa * min(ctx, 8192) * kv_per)
+        # min(4096, 8192) = 4096, so both pools use full ctx
+        assert b._estimate_kv_cache_bytes(ctx, "f16") == expected
+
+    def test_odd_layer_count(self):
+        """Odd layer count: n_global = max(1, n//4), n_swa = n - n_global."""
+        b = self._swa_backend(_n_layers = 63)
+        n_global = max(1, 63 // 4)  # 15
+        n_swa = 63 - n_global  # 48
+        kv_per = 16 * (128 + 128) * 2
+        expected = int(n_global * 1000 * kv_per + n_swa * min(1000, 1024) * kv_per)
+        assert b._estimate_kv_cache_bytes(1000, "f16") == expected
+
+
+# ---------------------------------------------------------------------------
+# F. Path 4: Standard GQA Estimation
+# ---------------------------------------------------------------------------
+
+
+class TestStandardGQAEstimation:
+    """Standard GQA with explicit key_length/value_length."""
+
+    def _gqa_backend(self, **overrides):
+        defaults = {
+            "_n_layers": 28,
+            "_n_kv_heads": 8,
+            "_n_heads": 16,
+            "_embedding_length": 1024,
+            "_kv_key_length": 128,
+            "_kv_value_length": 128,
+        }
+        defaults.update(overrides)
+        b = LlamaCppBackend()
+        for k, v in defaults.items():
+            setattr(b, k, v)
+        return b
+
+    def test_qwen3_06b(self):
+        b = self._gqa_backend()
+        expected = 28 * 40960 * 8 * (128 + 128) * 2
+        assert b._estimate_kv_cache_bytes(40960, "f16") == expected
+
+    def test_asymmetric_kv_dims(self):
+        """key_length != value_length (some architectures have this)."""
+        b = self._gqa_backend(_kv_key_length = 192, _kv_value_length = 64)
+        expected = 28 * 4096 * 8 * (192 + 64) * 2
+        assert b._estimate_kv_cache_bytes(4096, "f16") == expected
+
+    def test_differs_from_legacy(self):
+        """GQA path should differ from legacy when key_length != embed//n_heads."""
+        b = self._gqa_backend()
+        head_dim = 1024 // 16  # 64
+        gqa_result = b._estimate_kv_cache_bytes(4096, "f16")
+        # Legacy would use: 2 * 8 * 64 * 28 * 4096 * 2
+        legacy_result = int(2 * 8 * head_dim * 28 * 4096 * 2)
+        # GQA: 28 * 4096 * 8 * (128+128) * 2 -- uses actual key_length=128
+        assert gqa_result != legacy_result
+        assert gqa_result > legacy_result  # key_length (128) > head_dim (64)
+
+
+# ---------------------------------------------------------------------------
+# G. Path 5: Legacy Fallback Estimation
+# ---------------------------------------------------------------------------
+
+
+class TestLegacyEstimation:
+    """Legacy: embed // n_heads, for old GGUFs without new fields."""
+
+    def _legacy_backend(self, **overrides):
+        defaults = {
+            "_n_layers": 32,
+            "_n_kv_heads": 8,
+            "_n_heads": 32,
+            "_embedding_length": 4096,
+        }
+        defaults.update(overrides)
+        b = LlamaCppBackend()
+        for k, v in defaults.items():
+            setattr(b, k, v)
+        return b
+
+    def test_basic_legacy(self):
+        b = self._legacy_backend()
+        head_dim = 4096 // 32  # 128
+        expected = int(2 * 8 * head_dim * 32 * 4096 * 2)
+        assert b._estimate_kv_cache_bytes(4096, "f16") == expected
+
+    def test_legacy_with_only_n_heads(self):
+        """When n_kv_heads is None, the estimate falls back to n_heads."""
+        b = self._legacy_backend(_n_kv_heads = None)
+        head_dim = 4096 // 32
+        expected = int(2 * 32 * head_dim * 32 * 4096 * 2)
+        assert b._estimate_kv_cache_bytes(4096, "f16") == expected
+
+    def test_legacy_identical_to_old_formula(self):
+        """Confirm legacy path produces the same result as the pre-PR formula."""
+        b = self._legacy_backend()
+        n_layers = 32
+        n_kv_heads = 8
+        head_dim = 4096 // 32
+        n_ctx = 8192
+        bpe = 2.0
+        old_formula = int(2 * n_kv_heads * head_dim * n_layers * n_ctx * bpe)
+        assert b._estimate_kv_cache_bytes(n_ctx, "f16") == old_formula
+
+
+# ---------------------------------------------------------------------------
+# H. Path Priority (selection order)
+# ---------------------------------------------------------------------------
+
+
+class TestPathPriority:
+    """Confirm: MLA > Hybrid Mamba > SWA > GQA > Legacy."""
+
+    def test_mla_takes_priority_over_all(self):
+        """If kv_lora_rank is set, MLA path is used even if other fields are present."""
+        b = LlamaCppBackend()
+        b._n_layers = 61
+        b._n_kv_heads = 1
+        b._n_heads = 128
+        b._embedding_length = 7168
+        b._kv_key_length = 576
+        b._kv_value_length = 512
+        b._kv_lora_rank = 512
+        b._ssm_inner_size = 4096  # Would trigger Hybrid
+        b._full_attention_interval = 4
+        b._sliding_window = 1024  # Would trigger SWA
+
+        # MLA: 61 * 1000 * 1 * 576 * 2
+        expected_mla = int(61 * 1000 * 1 * 576 * 2)
+        assert b._estimate_kv_cache_bytes(1000, "f16") == expected_mla
+
+    def test_hybrid_over_swa(self):
+        """Hybrid takes priority over SWA when both fields present."""
+        b = LlamaCppBackend()
+        b._n_layers = 64
+        b._n_kv_heads = 4
+        b._n_heads = 24
+        b._embedding_length = 5120
+        b._kv_key_length = 256
+        b._kv_value_length = 256
+        b._ssm_inner_size = 6144
+        b._full_attention_interval = 4
+        b._sliding_window = 1024  # Would trigger SWA
+
+        n_attn = 64 // 4
+        expected_hybrid = int(n_attn * 1000 * 4 * (256 + 256) * 2)
+        assert b._estimate_kv_cache_bytes(1000, "f16") == expected_hybrid
+
+    def test_all_paths_produce_different_values(self):
+        """With carefully chosen params, each path should yield a distinct value."""
+        # Use embedding_length=768 so legacy head_dim (768//16=48) differs from
+        # key_length (256), and MLA key_len (256) != legacy K+V (2*48=96).
+        params = {
+            "_n_layers": 40,
+            "_n_kv_heads": 4,
+            "_n_heads": 16,
+            "_embedding_length": 768,
+            "_kv_key_length": 256,
+            "_kv_value_length": 256,
+        }
+        ctx = 4096
+
+        # Path 4: Standard GQA
+        b_gqa = LlamaCppBackend()
+        for k, v in params.items():
+            setattr(b_gqa, k, v)
+        gqa_val = b_gqa._estimate_kv_cache_bytes(ctx, "f16")
+
+        # Path 1: MLA
+        b_mla = LlamaCppBackend()
+        for k, v in params.items():
+            setattr(b_mla, k, v)
+        b_mla._kv_lora_rank = 512
+        mla_val = b_mla._estimate_kv_cache_bytes(ctx, "f16")
+
+        # Path 2: Hybrid Mamba
+        b_hybrid = LlamaCppBackend()
+        for k, v in params.items():
+            setattr(b_hybrid, k, v)
+        b_hybrid._ssm_inner_size = 4096
+        b_hybrid._full_attention_interval = 4
+        hybrid_val = b_hybrid._estimate_kv_cache_bytes(ctx, "f16")
+
+        # Path 3: SWA
+        b_swa = LlamaCppBackend()
+        for k, v in params.items():
+            setattr(b_swa, k, v)
+        b_swa._sliding_window = 512
+        swa_val = b_swa._estimate_kv_cache_bytes(ctx, "f16")
+
+        # Path 5: Legacy (no key_length/value_length)
+        b_legacy = LlamaCppBackend()
+        b_legacy._n_layers = 40
+        b_legacy._n_kv_heads = 4
+        b_legacy._n_heads = 16
+        b_legacy._embedding_length = 768
+        legacy_val = b_legacy._estimate_kv_cache_bytes(ctx, "f16")
+
+        values = [mla_val, hybrid_val, swa_val, gqa_val, legacy_val]
+        assert len(set(values)) == 5, f"Expected 5 distinct values, got {values}"
+
+
+# ---------------------------------------------------------------------------
+# I. KV Cache Quantization
+# ---------------------------------------------------------------------------
+
+
+class TestQuantization:
+    """Verify all supported cache_type_kv values produce correct scaling."""
+
+    @pytest.mark.parametrize(
+        "cache_type,expected_bpe",
+        [
+            ("f32", 4.0),
+            ("f16", 2.0),
+            ("bf16", 2.0),
+            ("q8_0", 34 / 32),
+            ("q5_1", 0.75),
+            ("q5_0", 0.6875),
+            ("q4_1", 0.625),
+            ("q4_0", 0.5625),
+            ("iq4_nl", 0.5625),
+            (None, 2.0),  # default is f16
+            ("unknown", 2.0),  # unknown falls back to f16
+        ],
+    )
+    def test_quantization_scaling(self, cache_type, expected_bpe):
+        b = LlamaCppBackend()
+        b._n_layers = 10
+        b._n_kv_heads = 1
+        b._n_heads = 8
+        b._embedding_length = 512
+        b._kv_key_length = 64
+        b._kv_value_length = 64
+        result = b._estimate_kv_cache_bytes(1000, cache_type)
+        expected = int(10 * 1000 * 1 * (64 + 64) * expected_bpe)
+        assert result == expected
+
+
+# ---------------------------------------------------------------------------
+# J. Edge Cases
+# ---------------------------------------------------------------------------
+
+
+class TestEdgeCases:
+    """Boundary conditions and degenerate inputs."""
+
+    def test_zero_context(self):
+        b = LlamaCppBackend()
+        b._n_layers = 32
+        b._kv_key_length = 128
+        assert b._estimate_kv_cache_bytes(0, "f16") == 0
+
+    def test_negative_context(self):
+        b = LlamaCppBackend()
+        b._n_layers = 32
+        b._kv_key_length = 128
+        assert b._estimate_kv_cache_bytes(-1, "f16") == 0
+
+    def test_context_of_one(self):
+        b = LlamaCppBackend()
+        b._n_layers = 10
+        b._n_kv_heads = 1
+        b._kv_key_length = 64
+        b._kv_value_length = 64
+        result = b._estimate_kv_cache_bytes(1, "f16")
+        assert result == int(10 * 1 * 1 * (64 + 64) * 2)
+
+    def test_very_large_context(self):
+        """1M context should not overflow or crash."""
+        b = LlamaCppBackend()
+        b._n_layers = 10
+        b._n_kv_heads = 1
+        b._kv_key_length = 128
+        b._kv_value_length = 128
+        result = b._estimate_kv_cache_bytes(1_000_000, "f16")
+        assert result > 0
+        assert isinstance(result, int)
+
+    def test_n_kv_heads_none_falls_to_n_heads(self):
+        b = LlamaCppBackend()
+        b._n_layers = 10
+        b._n_kv_heads = None
+        b._n_heads = 8
+        b._kv_key_length = 64
+        b._kv_value_length = 64
+        result = b._estimate_kv_cache_bytes(100, "f16")
+        expected = int(10 * 100 * 8 * (64 + 64) * 2)
+        assert result == expected
+
+    def test_both_heads_none_falls_to_one(self):
+        b = LlamaCppBackend()
+        b._n_layers = 10
+        b._n_kv_heads = None
+        b._n_heads = None
+        b._kv_key_length = 64
+        b._kv_value_length = 64
+        result = b._estimate_kv_cache_bytes(100, "f16")
+        expected = int(10 * 100 * 1 * (64 + 64) * 2)
+        assert result == expected
+
+
+# ---------------------------------------------------------------------------
+# K. Lifecycle Tests
+# ---------------------------------------------------------------------------
+
+
+class TestLifecycle:
+    """Init, unload, and reparse field management."""
+
+    def test_init_fields_none(self):
+        b = LlamaCppBackend()
+        for attr in [
+            "_kv_key_length",
+            "_kv_value_length",
+            "_sliding_window",
+            "_full_attention_interval",
+            "_kv_lora_rank",
+            "_key_length_mla",
+            "_ssm_inner_size",
+            "_ssm_state_size",
+        ]:
+            assert getattr(b, attr) is None
+
+    def test_unload_resets_fields(self):
+        b = LlamaCppBackend()
+        b._n_layers = 32
+        b._kv_key_length = 128
+        b._kv_lora_rank = 512
+        b._sliding_window = 1024
+        b._ssm_inner_size = 4096
+        b._full_attention_interval = 4
+        b.unload_model()
+        for attr in [
+            "_kv_key_length",
+            "_kv_value_length",
+            "_sliding_window",
+            "_full_attention_interval",
+            "_kv_lora_rank",
+            "_key_length_mla",
+            "_ssm_inner_size",
+            "_ssm_state_size",
+        ]:
+            assert getattr(b, attr) is None
+
+    def test_end_to_end_synthetic_mla(self):
+        """Full round-trip: write GGUF -> parse -> estimate."""
+        b = _backend_from_gguf(
+            "deepseek2",
+            {
+                "context_length": 163840,
+                "block_count": 61,
+                "attention.head_count_kv": 1,
+                "attention.head_count": 128,
+                "embedding_length": 7168,
+                "attention.key_length": 576,
+                "attention.value_length": 512,
+                "attention.kv_lora_rank": 512,
+                "attention.key_length_mla": 192,
+            },
+        )
+        assert b._can_estimate_kv()
+        result = b._estimate_kv_cache_bytes(163840, "f16")
+        expected = 61 * 163840 * 1 * 576 * 2
+        assert result == expected
+
+    def test_end_to_end_synthetic_hybrid(self):
+        b = _backend_from_gguf(
+            "qwen35",
+            {
+                "context_length": 262144,
+                "block_count": 64,
+                "attention.head_count_kv": 4,
+                "attention.head_count": 24,
+                "embedding_length": 5120,
+                "attention.key_length": 256,
+                "attention.value_length": 256,
+                "full_attention_interval": 4,
+                "ssm.inner_size": 6144,
+                "ssm.state_size": 128,
+            },
+        )
+        assert b._can_estimate_kv()
+        result = b._estimate_kv_cache_bytes(262144, "f16")
+        n_attn = 64 // 4
+        expected = n_attn * 262144 * 4 * (256 + 256) * 2
+        assert result == expected
+
+    def test_end_to_end_synthetic_swa(self):
+        b = _backend_from_gguf(
+            "gemma3",
+            {
+                "context_length": 131072,
+                "block_count": 62,
+                "attention.head_count_kv": 16,
+                "attention.head_count": 32,
+                "embedding_length": 5376,
+                "attention.key_length": 128,
+                "attention.value_length": 128,
+                "attention.sliding_window": 1024,
+            },
+        )
+        assert b._can_estimate_kv()
+        result = b._estimate_kv_cache_bytes(131072, "f16")
+        n_global = max(1, 62 // 4)  # 15
+        n_swa = 62 - n_global  # 47
+        kv_per = 16 * 256 * 2  # n_kv_heads * (key_len + value_len) * f16 bytes
+        expected = int(n_global * 131072 * kv_per + n_swa * 1024 * kv_per)
+        assert result == expected
+
+    def test_end_to_end_synthetic_gqa(self):
+        b = _backend_from_gguf(
+            "qwen3",
+            {
+                "context_length": 40960,
+                "block_count": 28,
+                "attention.head_count_kv": 8,
+                "attention.head_count": 16,
+                "embedding_length": 1024,
+                "attention.key_length": 128,
+                "attention.value_length": 128,
+            },
+        )
+        assert b._can_estimate_kv()
+        result = b._estimate_kv_cache_bytes(40960, "f16")
+        expected = 28 * 40960 * 8 * 256 * 2
+        assert result == expected
+
+    def test_end_to_end_synthetic_legacy(self):
+        b = _backend_from_gguf(
+            "llama",
+            {
+                "context_length": 4096,
+                "block_count": 32,
+                "attention.head_count_kv": 8,
+                "attention.head_count": 32,
+                "embedding_length": 4096,
+            },
+        )
+        assert b._can_estimate_kv()
+        result = b._estimate_kv_cache_bytes(4096, "f16")
+        head_dim = 4096 // 32
+        expected = int(2 * 8 * head_dim * 32 * 4096 * 2)
+        assert result == expected
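The GQA and legacy expectations asserted in the end-to-end tests above reduce to simple products. A minimal sketch, assuming two hypothetical standalone helpers (`gqa_kv_bytes` and `legacy_kv_bytes` are illustrative names, not part of `LlamaCppBackend`); the arithmetic is copied from the test expectations, not from the backend's implementation:

```python
def gqa_kv_bytes(n_layers, n_ctx, n_kv_heads, key_len, value_len, bpe=2.0):
    """Hypothetical helper: K+V bytes per token, summed over all layers."""
    if n_ctx <= 0:
        return 0
    return int(n_layers * n_ctx * n_kv_heads * (key_len + value_len) * bpe)


def legacy_kv_bytes(n_layers, n_ctx, n_kv_heads, n_heads, embedding_length, bpe=2.0):
    """Hypothetical helper: no explicit key/value lengths, so head_dim is
    derived as embedding_length // n_heads and counted twice (K and V)."""
    if n_ctx <= 0:
        return 0
    head_dim = embedding_length // n_heads
    return int(2 * n_kv_heads * head_dim * n_layers * n_ctx * bpe)


# Same numbers as the synthetic "qwen3" and "llama" cases in the tests.
qwen3_like = gqa_kv_bytes(28, 40960, 8, 128, 128)
llama_like = legacy_kv_bytes(32, 4096, 8, 32, 4096)
```

When `key_len == value_len == embedding_length // n_heads`, the two helpers agree, which is why the legacy path can act as a fallback for files without explicit key/value lengths.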
--- a/studio/backend/tests/test_llama_cpp_cache_aware_disk_check.py
+++ b/studio/backend/tests/test_llama_cpp_cache_aware_disk_check.py
@@ -0,0 +1,243 @@
+# SPDX-License-Identifier: AGPL-3.0-only
+# Copyright 2026-present the Unsloth AI Inc. team. All rights reserved. See /studio/LICENSE.AGPL-3.0
+
+"""Tests for the cache-aware disk-space preflight in
+``LlamaCppBackend.load_model``.
+
+The preflight used to compare the repo's total GGUF download size against
+free disk without accounting for bytes already present in the Hugging
+Face cache. That made re-loading a cached large model (e.g.
+``unsloth/MiniMax-M2.7-GGUF`` at 131 GB) fail cold whenever free disk was
+below the full weight footprint, even though nothing needed
+downloading.
+
+These tests exercise the preflight arithmetic in isolation by driving
+``get_paths_info`` and ``try_to_load_from_cache`` through ``mock.patch``.
+No network, GPU, or subprocess use.
+
+Cross-platform: Linux, macOS, Windows, WSL.
+"""
+
+from __future__ import annotations
+
+import sys
+import tempfile
+import types as _types
+from pathlib import Path
+from unittest.mock import patch
+
+import pytest
+
+# ---------------------------------------------------------------------------
+# Stub heavy / unavailable external dependencies before importing the
+# module under test.  Same pattern as test_kv_cache_estimation.py.
+# ---------------------------------------------------------------------------
+
+_BACKEND_DIR = str(Path(__file__).resolve().parent.parent)
+if _BACKEND_DIR not in sys.path:
+    sys.path.insert(0, _BACKEND_DIR)
+
+# loggers
+_loggers_stub = _types.ModuleType("loggers")
+_loggers_stub.get_logger = lambda name: __import__("logging").getLogger(name)
+sys.modules.setdefault("loggers", _loggers_stub)
+
+# structlog
+_structlog_stub = _types.ModuleType("structlog")
+sys.modules.setdefault("structlog", _structlog_stub)
+
+# httpx
+_httpx_stub = _types.ModuleType("httpx")
+for _exc_name in (
+    "ConnectError",
+    "TimeoutException",
+    "ReadTimeout",
+    "ReadError",
+    "RemoteProtocolError",
+    "CloseError",
+):
+    setattr(_httpx_stub, _exc_name, type(_exc_name, (Exception,), {}))
+
+
+class _FakeTimeout:
+    def __init__(self, *a, **kw):
+        pass
+
+
+_httpx_stub.Timeout = _FakeTimeout
+_httpx_stub.Client = type(
+    "Client",
+    (),
+    {
+        "__init__": lambda self, **kw: None,
+        "__enter__": lambda self: self,
+        "__exit__": lambda self, *a: None,
+    },
+)
+sys.modules.setdefault("httpx", _httpx_stub)
+
+
+# ---------------------------------------------------------------------------
+# Helpers
+# ---------------------------------------------------------------------------
+
+GIB = 1024**3
+
+
+class _FakePathInfo:
+    """Mimics huggingface_hub's RepoFile-ish return type from get_paths_info."""
+
+    def __init__(self, path: str, size: int):
+        self.path = path
+        self.size = size
+
+
+def _preflight(
+    repo_files,
+    cached_files,
+    free_bytes,
+    hf_repo = "unsloth/Example-GGUF",
+    hf_token = None,
+):
+    """Run the preflight arithmetic as written in llama_cpp.py and return
+    the decision outcome as a dict.
+
+    ``repo_files``: list of (filename, remote_bytes).
+    ``cached_files``: dict {filename: on_disk_bytes} for files already in cache.
+    ``free_bytes``: value returned by shutil.disk_usage(cache_dir).free.
+    """
+    import os
+    import shutil
+
+    path_infos = [_FakePathInfo(name, size) for name, size in repo_files]
+
+    with tempfile.TemporaryDirectory() as tmp:
+        # Create SPARSE files for the cached ones so os.path.exists /
+        # os.path.getsize pass without actually allocating bytes on disk.
+        # This is critical when simulating multi-GB models.
+        cache_paths = {}
+        for name, sz in cached_files.items():
+            p = Path(tmp) / name.replace("/", "_")
+            with open(p, "wb") as fh:
+                if sz > 0:
+                    fh.truncate(sz)  # sparse allocation: no data blocks written
+            cache_paths[name] = str(p)
+
+        def fake_try_to_load_from_cache(repo_id, filename):
+            return cache_paths.get(filename)
+
+        # Mirror the same variable names and control flow as the real code
+        # so behavioral drift is caught immediately.
+        total_bytes = sum((p.size or 0) for p in path_infos)
+        already_cached_bytes = 0
+        for p in path_infos:
+            if not p.size:
+                continue
+            cached_path = fake_try_to_load_from_cache(hf_repo, p.path)
+            if isinstance(cached_path, str) and os.path.exists(cached_path):
+                try:
+                    on_disk = os.path.getsize(cached_path)
+                except OSError:
+                    on_disk = 0
+                if on_disk >= p.size:
+                    already_cached_bytes += p.size
+
+        total_download_bytes = max(0, total_bytes - already_cached_bytes)
+        needed_download = total_download_bytes > free_bytes
+        return {
+            "total_bytes": total_bytes,
+            "already_cached_bytes": already_cached_bytes,
+            "total_download_bytes": total_download_bytes,
+            "would_raise_disk_error": (needed_download and total_download_bytes > 0),
+        }
+
+
+# ---------------------------------------------------------------------------
+# Tests
+# ---------------------------------------------------------------------------
+
+
+class TestCacheAwarePreflight:
+    def test_fully_cached_model_does_not_require_disk(self):
+        """The MiniMax-style case: all weights cached (modelled here as
+        4 x 35 GiB shards), only 36 GiB free. Preflight must not raise."""
+        shards = [(f"UD-Q4_K_XL/shard-{i}.gguf", 35 * GIB) for i in range(4)]
+        cached = {name: size for name, size in shards}
+        out = _preflight(
+            repo_files = shards,
+            cached_files = cached,
+            free_bytes = 36 * GIB,
+        )
+        assert out["total_download_bytes"] == 0
+        assert out["already_cached_bytes"] == 140 * GIB
+        assert out["would_raise_disk_error"] is False
+
+    def test_partial_cache_only_counts_remaining_bytes(self):
+        """Two of four shards cached: preflight against remaining 70 GB."""
+        shards = [(f"UD-Q4_K_XL/shard-{i}.gguf", 35 * GIB) for i in range(4)]
+        cached = {
+            shards[0][0]: shards[0][1],
+            shards[1][0]: shards[1][1],
+        }
+        out = _preflight(
+            repo_files = shards,
+            cached_files = cached,
+            free_bytes = 80 * GIB,
+        )
+        assert out["already_cached_bytes"] == 70 * GIB
+        assert out["total_download_bytes"] == 70 * GIB
+        assert out["would_raise_disk_error"] is False
+
+    def test_partial_cache_insufficient_disk_for_rest_still_raises(self):
+        """Two of four shards cached; remaining 70 GB still bigger than
+        free disk -> preflight correctly wants to raise."""
+        shards = [(f"UD-Q4_K_XL/shard-{i}.gguf", 35 * GIB) for i in range(4)]
+        cached = {
+            shards[0][0]: shards[0][1],
+            shards[1][0]: shards[1][1],
+        }
+        out = _preflight(
+            repo_files = shards,
+            cached_files = cached,
+            free_bytes = 50 * GIB,
+        )
+        assert out["total_download_bytes"] == 70 * GIB
+        assert out["would_raise_disk_error"] is True
+
+    def test_nothing_cached_preserves_existing_behavior(self):
+        """Cold-cache path still compares full download vs free disk."""
+        shards = [("UD-Q4_K_XL/shard-0.gguf", 40 * GIB)]
+        out = _preflight(
+            repo_files = shards,
+            cached_files = {},
+            free_bytes = 50 * GIB,
+        )
+        assert out["already_cached_bytes"] == 0
+        assert out["total_download_bytes"] == 40 * GIB
+        assert out["would_raise_disk_error"] is False
+
+    def test_incomplete_cached_blob_is_not_credited(self):
+        """A partial file on disk (e.g. interrupted download) is not
+        counted as cached -- we still require bytes for it."""
+        shards = [("UD-Q4_K_XL/shard-0.gguf", 40 * GIB)]
+        partial = {"UD-Q4_K_XL/shard-0.gguf": 10 * GIB}
+        out = _preflight(
+            repo_files = shards,
+            cached_files = partial,
+            free_bytes = 50 * GIB,
+        )
+        assert out["already_cached_bytes"] == 0
+        assert out["total_download_bytes"] == 40 * GIB
+        assert out["would_raise_disk_error"] is False
+
+    def test_zero_size_path_infos_do_not_crash(self):
+        """A path_info with size=0 should not be credited or break the
+        arithmetic."""
+        shards = [("mmproj.gguf", 0), ("UD-Q4_K_XL/shard-0.gguf", 40 * GIB)]
+        out = _preflight(
+            repo_files = shards,
+            cached_files = {},
+            free_bytes = 50 * GIB,
+        )
+        assert out["already_cached_bytes"] == 0
+        assert out["total_bytes"] == 40 * GIB
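The `_preflight` helper above relies on `truncate()` to give a file a large apparent size without writing its contents. A minimal sketch of that trick in isolation, assuming a hypothetical `make_sparse_file` helper (not part of the test module) and a deliberately small size so it stays cheap even on filesystems that do not support sparse files:

```python
import os
import tempfile
from pathlib import Path


def make_sparse_file(directory, name, size):
    """Hypothetical helper: create a file whose reported size is `size`
    bytes by extending it with truncate() instead of writing data."""
    path = Path(directory) / name
    with open(path, "wb") as fh:
        if size > 0:
            fh.truncate(size)  # extends the file; no payload is written
    return path


with tempfile.TemporaryDirectory() as tmp:
    shard = make_sparse_file(tmp, "shard-0.gguf", 5 * 1024**2)
    apparent_size = os.path.getsize(shard)
    empty = make_sparse_file(tmp, "empty.gguf", 0)
    empty_size = os.path.getsize(empty)
```

`os.path.getsize` reports the apparent size, which is all the preflight comparison (`on_disk >= p.size`) looks at. Whether the filesystem actually stores the file sparsely depends on the platform.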
--- a/studio/backend/tests/test_llama_cpp_context_fit.py
+++ b/studio/backend/tests/test_llama_cpp_context_fit.py
@@ -0,0 +1,389 @@
+# SPDX-License-Identifier: AGPL-3.0-only
+# Copyright 2026-present the Unsloth AI Inc. team. All rights reserved. See /studio/LICENSE.AGPL-3.0
+
+"""Tests for the GGUF load-time context auto-fit decision.
+
+Guards two regressions in ``LlamaCppBackend.load_model``:
+
+1. **Auto mode on weights-exceed-VRAM** (``n_ctx == 0``): when the model
+   weights alone exceed 90% of every GPU subset's free memory, the
+   auto-pick loop used to exit without matching, leaving
+   ``effective_ctx`` at the model's native context (e.g. 196608 for
+   MiniMax-M2.7). The intended default per Studio's UI spec is 4096 so
+   the slider lands on a usable value; the user can still drag higher
+   and trigger ``--fit on`` with a warning.
+
+2. **Explicit ctx silently shrunk when KV overflows**: with fittable
+   weights but a requested ctx whose KV cache pushes total memory over
+   90% of VRAM, the old code binary-searched a smaller ctx and emitted
+   ``-c <capped> -ngl -1`` without informing the caller. The UI had
+   already surfaced its "might be slower" warning and expects the user's
+   explicit ctx to be honored with ``--fit on`` flexing ``-ngl`` instead.
+
+Tests avoid GPU probing, subprocess spawning, and GGUF I/O by driving the
+post-metadata decision block directly against a stubbed instance.
+
+Requires no GPU, network, or external libraries beyond pytest.
+Cross-platform: Linux, macOS, Windows, WSL.
+"""
+
+from __future__ import annotations
+
+import sys
+import types as _types
+from pathlib import Path
+
+import pytest
+
+# ---------------------------------------------------------------------------
+# Stub heavy / unavailable external dependencies before importing the
+# module under test.  Same pattern as test_kv_cache_estimation.py.
+# ---------------------------------------------------------------------------
+
+_BACKEND_DIR = str(Path(__file__).resolve().parent.parent)
+if _BACKEND_DIR not in sys.path:
+    sys.path.insert(0, _BACKEND_DIR)
+
+# loggers
+_loggers_stub = _types.ModuleType("loggers")
+_loggers_stub.get_logger = lambda name: __import__("logging").getLogger(name)
+sys.modules.setdefault("loggers", _loggers_stub)
+
+# structlog
+_structlog_stub = _types.ModuleType("structlog")
+sys.modules.setdefault("structlog", _structlog_stub)
+
+# httpx
+_httpx_stub = _types.ModuleType("httpx")
+for _exc_name in (
+    "ConnectError",
+    "TimeoutException",
+    "ReadTimeout",
+    "ReadError",
+    "RemoteProtocolError",
+    "CloseError",
+):
+    setattr(_httpx_stub, _exc_name, type(_exc_name, (Exception,), {}))
+
+
+class _FakeTimeout:
+    def __init__(self, *a, **kw):
+        pass
+
+
+_httpx_stub.Timeout = _FakeTimeout
+_httpx_stub.Client = type(
+    "Client",
+    (),
+    {
+        "__init__": lambda self, **kw: None,
+        "__enter__": lambda self: self,
+        "__exit__": lambda self, *a: None,
+    },
+)
+sys.modules.setdefault("httpx", _httpx_stub)
+
+from core.inference.llama_cpp import LlamaCppBackend
+
+
+# ---------------------------------------------------------------------------
+# Helpers
+# ---------------------------------------------------------------------------
+
+GIB = 1024**3
+FALLBACK_CTX = 4096
+
+
+def _make_backend(
+    native_ctx = 131072,
+    n_layers = 80,
+    n_kv_heads = 8,
+    n_heads = 64,
+    kv_key_length = 128,
+    kv_value_length = 128,
+):
+    """Create a LlamaCppBackend instance with GGUF metadata fields set and
+    the helpers used by the decision block stubbed out."""
+    inst = LlamaCppBackend.__new__(LlamaCppBackend)
+    inst._context_length = native_ctx
+    inst._n_layers = n_layers
+    inst._n_kv_heads = n_kv_heads
+    inst._n_heads = n_heads
+    inst._embedding_length = 8192
+    inst._kv_key_length = kv_key_length
+    inst._kv_value_length = kv_value_length
+    inst._kv_lora_rank = None
+    inst._sliding_window = None
+    inst._ssm_inner_size = None
+    inst._full_attention_interval = None
+    inst._key_length_mla = None
+    return inst
+
+
+def _drive(
+    n_ctx,
+    model_gib,
+    gpus,
+    native_ctx = 131072,
+    kv_per_token_bytes = 325_000,
+    can_estimate_kv = True,
+):
+    """Drive the post-metadata portion of load_model with stubbed inputs.
+
+    Mirrors the decision block at llama_cpp.py:1137-1296 so we can assert
+    the command that would be built, without subprocesses or GPU probes.
+    """
+    inst = _make_backend(native_ctx = native_ctx)
+    model_size = int(model_gib * GIB)
+    cache_type_kv = None
+
+    def fake_estimate(n_ctx_, _type = None):
+        return 0 if n_ctx_ <= 0 else n_ctx_ * kv_per_token_bytes
+
+    inst._estimate_kv_cache_bytes = fake_estimate
+    inst._can_estimate_kv = lambda: can_estimate_kv
+
+    context_length = inst._context_length
+
+    # Resolve the context to plan for: an explicit n_ctx wins, then the
+    # model's native context, else 0 (unknown).
+    if n_ctx > 0:
+        effective_ctx = n_ctx
+    elif context_length is not None:
+        effective_ctx = context_length
+    else:
+        effective_ctx = 0
+    original_ctx = effective_ctx
+    max_available_ctx = context_length or effective_ctx
+
+    gpu_indices, use_fit = None, True
+    explicit_ctx = n_ctx > 0
+
+    if gpus and inst._can_estimate_kv() and effective_ctx > 0:
+        native_ctx_for_cap = context_length or effective_ctx
+        if native_ctx_for_cap > 0:
+            ranked_for_cap = sorted(gpus, key = lambda g: g[1], reverse = True)
+            best_cap = 0
+            for n_gpus in range(1, len(ranked_for_cap) + 1):
+                subset = ranked_for_cap[:n_gpus]
+                pool_mib = sum(free for _, free in subset)
+                capped = inst._fit_context_to_vram(
+                    native_ctx_for_cap,
+                    pool_mib,
+                    model_size,
+                    cache_type_kv,
+                )
+                kv = inst._estimate_kv_cache_bytes(capped, cache_type_kv)
+                total_mib = (model_size + kv) / (1024 * 1024)
+                if total_mib <= pool_mib * 0.90:
+                    best_cap = max(best_cap, capped)
+            if best_cap > 0:
+                max_available_ctx = best_cap
+
+        if explicit_ctx:
+            requested_total = model_size + inst._estimate_kv_cache_bytes(
+                effective_ctx, cache_type_kv
+            )
+            gpu_indices, use_fit = inst._select_gpus(requested_total, gpus)
+        else:
+            ranked = sorted(gpus, key = lambda g: g[1], reverse = True)
+            matched = False
+            for n_gpus in range(1, len(ranked) + 1):
+                subset = ranked[:n_gpus]
+                pool_mib = sum(free for _, free in subset)
+                capped = inst._fit_context_to_vram(
+                    effective_ctx,
+                    pool_mib,
+                    model_size,
+                    cache_type_kv,
+                )
+                kv = inst._estimate_kv_cache_bytes(capped, cache_type_kv)
+                total_mib = (model_size + kv) / (1024 * 1024)
+                if total_mib <= pool_mib * 0.90:
+                    effective_ctx = capped
+                    gpu_indices = sorted(idx for idx, _ in subset)
+                    use_fit = False
+                    matched = True
+                    break
+            if not matched:
+                effective_ctx = min(FALLBACK_CTX, effective_ctx)
+    elif gpus:
+        gpu_indices, use_fit = inst._select_gpus(model_size, gpus)
+        if use_fit and not explicit_ctx:
+            effective_ctx = (
+                min(FALLBACK_CTX, effective_ctx) if effective_ctx > 0 else FALLBACK_CTX
+            )
+
+    return {
+        "c_arg": effective_ctx if effective_ctx > 0 else 0,
+        "use_fit": use_fit,
+        "gpu_indices": gpu_indices,
+        "max_available_ctx": max_available_ctx,
+        "original_ctx": original_ctx,
+    }
+
+
+# ---------------------------------------------------------------------------
+# Auto mode, model weights exceed VRAM  (Bug A guard)
+# ---------------------------------------------------------------------------
+
+
+class TestAutoModeWeightsExceedVRAM:
+    """``n_ctx == 0`` on a model whose weights don't fit anywhere."""
+
+    def test_minimax_like_single_gpu(self):
+        plan = _drive(
+            n_ctx = 0,
+            model_gib = 131,
+            gpus = [(0, 97_000)],
+            native_ctx = 196608,
+        )
+        assert plan["c_arg"] == FALLBACK_CTX
+        assert plan["use_fit"] is True
+        assert plan["gpu_indices"] is None
+        # UI slider ceiling stays at native: user can still drag higher
+        # and get the "might be slower" path.
+        assert plan["max_available_ctx"] == 196608
+
+    def test_multi_gpu_all_subsets_fail(self):
+        plan = _drive(
+            n_ctx = 0,
+            model_gib = 400,
+            gpus = [(0, 80_000), (1, 80_000), (2, 80_000), (3, 80_000)],
+            native_ctx = 131072,
+        )
+        assert plan["c_arg"] == FALLBACK_CTX
+        assert plan["use_fit"] is True
+        assert plan["gpu_indices"] is None
+
+    def test_no_kv_metadata_auto(self):
+        """File-size-only fallback path also defaults to 4096."""
+        plan = _drive(
+            n_ctx = 0,
+            model_gib = 131,
+            gpus = [(0, 97_000)],
+            native_ctx = 196608,
+            can_estimate_kv = False,
+        )
+        assert plan["c_arg"] == FALLBACK_CTX
+        assert plan["use_fit"] is True
+
+
+# ---------------------------------------------------------------------------
+# Explicit ctx, KV overflows fittable weights  (Bug B guard)
+# ---------------------------------------------------------------------------
+
+
+class TestExplicitCtxRespectsUser:
+    """``n_ctx > 0`` must never be silently shrunk."""
+
+    def test_fittable_weights_oversized_kv(self):
+        # 8 GiB weights + 131k ctx KV on 24,000 MiB of free VRAM.
+        # Budget = 21,600 MiB (90%); KV at 131k is ~40,600 MiB, far more than
+        # the ~13,400 MiB left after weights, so _select_gpus flips use_fit=True.
+        plan = _drive(
+            n_ctx = 131072,
+            model_gib = 8,
+            gpus = [(0, 24_000)],
+            native_ctx = 131072,
+        )
+        assert plan["c_arg"] == 131072
+        assert plan["use_fit"] is True
+        assert plan["gpu_indices"] is None
+
+    def test_explicit_that_fits_uses_ngl(self):
+        plan = _drive(
+            n_ctx = 8192,
+            model_gib = 8,
+            gpus = [(0, 24_000)],
+            native_ctx = 131072,
+        )
+        assert plan["c_arg"] == 8192
+        assert plan["use_fit"] is False
+        assert plan["gpu_indices"] == [0]
+
+    def test_explicit_on_weights_exceed_vram(self):
+        # User drags the slider to 32k on a too-big model: honored.
+        plan = _drive(
+            n_ctx = 32768,
+            model_gib = 131,
+            gpus = [(0, 97_000)],
+            native_ctx = 196608,
+        )
+        assert plan["c_arg"] == 32768
+        assert plan["use_fit"] is True
+
+    def test_explicit_at_fallback_on_too_big(self):
+        plan = _drive(
+            n_ctx = FALLBACK_CTX,
+            model_gib = 131,
+            gpus = [(0, 97_000)],
+            native_ctx = 196608,
+        )
+        assert plan["c_arg"] == FALLBACK_CTX
+        assert plan["use_fit"] is True
+
+    def test_explicit_below_floor_honored(self):
+        # 2048 is below --fit-ctx default; still honored since user set it.
+        plan = _drive(
+            n_ctx = 2048,
+            model_gib = 8,
+            gpus = [(0, 24_000)],
+        )
+        assert plan["c_arg"] == 2048
+
+
+# ---------------------------------------------------------------------------
+# Non-regression: fittable + auto still auto-picks largest fitting ctx
+# ---------------------------------------------------------------------------
+
+
+class TestFittableAutoPickRegressions:
+    def test_small_model_one_gpu(self):
+        plan = _drive(
+            n_ctx = 0,
+            model_gib = 8,
+            gpus = [(0, 24_000)],
+            native_ctx = 131072,
+            kv_per_token_bytes = 8192,
+        )
+        assert plan["use_fit"] is False
+        assert plan["gpu_indices"] == [0]
+        assert plan["c_arg"] > FALLBACK_CTX
+
+    def test_medium_model_needs_multi_gpu(self):
+        plan = _drive(
+            n_ctx = 0,
+            model_gib = 60,
+            gpus = [(0, 40_000), (1, 40_000)],
+            native_ctx = 131072,
+            kv_per_token_bytes = 8192,
+        )
+        assert plan["use_fit"] is False
+        assert plan["gpu_indices"] == [0, 1]
+
+    def test_no_kv_metadata_fittable_auto(self):
+        plan = _drive(
+            n_ctx = 0,
+            model_gib = 8,
+            gpus = [(0, 24_000)],
+            native_ctx = 131072,
+            can_estimate_kv = False,
+        )
+        assert plan["use_fit"] is False
+        assert plan["gpu_indices"] == [0]
+
+
+# ---------------------------------------------------------------------------
+# Platform-agnostic input shape
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.parametrize("platform_tag", ["linux", "windows", "mac", "rocm"])
+def test_identical_decision_across_platforms(platform_tag):
+    """The decision function takes ``[(gpu_idx, free_mib), ...]`` regardless
+    of how upstream (nvidia-smi / nvidia-smi.exe / Metal / rocm-smi) produced
+    it. Identical inputs must yield identical plans."""
+    plan_a = _drive(n_ctx = 0, model_gib = 8, gpus = [(0, 24_000)])
+    plan_b = _drive(n_ctx = 0, model_gib = 8, gpus = [(0, 24_000)])
+    assert plan_a == plan_b, platform_tag
--- a/studio/backend/tests/test_llama_cpp_load_progress.py
+++ b/studio/backend/tests/test_llama_cpp_load_progress.py
@ -0,0 +1,258 @@
+# SPDX-License-Identifier: AGPL-3.0-only
+# Copyright 2026-present the Unsloth AI Inc. team. All rights reserved. See /studio/LICENSE.AGPL-3.0
+
+"""Tests for ``LlamaCppBackend.load_progress()``.
+
+The chat settings flow and the training overlay both show a generic
+"Starting model..." spinner during the window after a GGUF download
+finishes and before llama-server reports healthy. For small models
+that window is a second or two and nobody notices. For large MoE GGUFs
+(MiniMax-M2.7, Qwen3.5-397B-A17B, etc.) the llama-server process spends
+minutes in kernel state D, paging tens or hundreds of GB of shards
+into the page cache. The UI has no way to show a real progress bar,
+rate, or ETA during that window.
+
+``load_progress()`` samples ``/proc/<pid>/status VmRSS`` (what the
+kernel has actually paged in) against the total shard file size on
+disk, so the frontend can render a real bar plus rate/ETA. This
+module pins that contract:
+
+  * returns ``None`` when no load is in flight
+  * returns ``{"phase": "mmap", ...}`` while the subprocess is alive
+    but ``_healthy`` is False
+  * returns ``{"phase": "ready", ...}`` once ``_healthy`` flips
+  * ``bytes_total`` is derived from the resolved on-disk path
+    (which the paired fix assigns to ``self._gguf_path`` on both the
+    local-GGUF and HF-download code paths)
+  * ``bytes_loaded`` is VmRSS in bytes, capped by total, rounded
+  * ``fraction`` is clamped to 0..1 and rounded to 4 decimal places
+
+Linux-only via ``/proc``. On platforms without ``/proc`` the method
+returns ``None`` instead of raising.
+The tests fake ``/proc`` by patching ``builtins.open``, so they also run
+on macOS / Windows, where ``/proc`` is not available.
+"""
+
+from __future__ import annotations
+
+import os
+import sys
+import tempfile
+import types as _types
+from pathlib import Path
+from unittest.mock import patch
+
+import pytest
+
+# ---------------------------------------------------------------------------
+# Stub heavy / unavailable external dependencies before importing the
+# module under test. Same pattern as test_kv_cache_estimation.py.
+# ---------------------------------------------------------------------------
+
+_BACKEND_DIR = str(Path(__file__).resolve().parent.parent)
+if _BACKEND_DIR not in sys.path:
+    sys.path.insert(0, _BACKEND_DIR)
+
+_loggers_stub = _types.ModuleType("loggers")
+_loggers_stub.get_logger = lambda name: __import__("logging").getLogger(name)
+sys.modules.setdefault("loggers", _loggers_stub)
+
+_structlog_stub = _types.ModuleType("structlog")
+sys.modules.setdefault("structlog", _structlog_stub)
+
+_httpx_stub = _types.ModuleType("httpx")
+for _exc_name in (
+    "ConnectError",
+    "TimeoutException",
+    "ReadTimeout",
+    "ReadError",
+    "RemoteProtocolError",
+    "CloseError",
+):
+    setattr(_httpx_stub, _exc_name, type(_exc_name, (Exception,), {}))
+
+
+class _FakeTimeout:
+    def __init__(self, *a, **kw):
+        pass
+
+
+_httpx_stub.Timeout = _FakeTimeout
+_httpx_stub.Client = type(
+    "Client",
+    (),
+    {
+        "__init__": lambda self, **kw: None,
+        "__enter__": lambda self: self,
+        "__exit__": lambda self, *a: None,
+    },
+)
+sys.modules.setdefault("httpx", _httpx_stub)
+
+from core.inference.llama_cpp import LlamaCppBackend
+
+
+# ---------------------------------------------------------------------------
+# Helpers
+# ---------------------------------------------------------------------------
+
+
+def _make_instance():
+    inst = LlamaCppBackend.__new__(LlamaCppBackend)
+    inst._process = None
+    inst._gguf_path = None
+    inst._healthy = False
+    return inst
+
+
+class _FakeProc:
+    """Minimal stand-in for subprocess.Popen that just carries a pid."""
+
+    def __init__(self, pid: int):
+        self.pid = pid
+
+
+def _write_sparse_file(path: Path, size_bytes: int) -> None:
+    """Create a sparse file of the given size without allocating blocks."""
+    with open(path, "wb") as fh:
+        if size_bytes > 0:
+            fh.truncate(size_bytes)
+
+
+# ---------------------------------------------------------------------------
+# Tests
+# ---------------------------------------------------------------------------
+
+
+class TestLoadProgressEmptyStates:
+    def test_returns_none_when_no_process(self):
+        inst = _make_instance()
+        assert inst.load_progress() is None
+
+    def test_returns_none_when_process_has_no_pid(self):
+        inst = _make_instance()
+        inst._process = _FakeProc(pid = None)  # type: ignore[arg-type]
+        assert inst.load_progress() is None
+
+
+class TestLoadProgressSingleShard:
+    def test_mmap_phase_for_alive_but_unhealthy(self, tmp_path):
+        """VmRSS below total -> phase='mmap', fraction reflects progress."""
+        gguf = tmp_path / "model.gguf"
+        _write_sparse_file(gguf, 40 * 1024**3)  # 40 GB
+
+        inst = _make_instance()
+        inst._process = _FakeProc(pid = os.getpid())  # use our own pid
+        inst._gguf_path = str(gguf)
+        inst._healthy = False
+
+        # Patch /proc read to claim 10 GB RSS.
+        def fake_open(path, *args, **kwargs):
+            if str(path).startswith("/proc/"):
+                import io
+
+                return io.StringIO(f"Name:\ttest\nVmRSS:\t{10 * 1024 ** 2}\tkB\n")
+            return open(path, *args, **kwargs)  # fall through
+
+        with patch("builtins.open", side_effect = fake_open):
+            out = inst.load_progress()
+
+        assert out is not None
+        assert out["phase"] == "mmap"
+        assert out["bytes_total"] == 40 * 1024**3
+        assert out["bytes_loaded"] == 10 * 1024**3
+        assert 0.24 < out["fraction"] < 0.26  # ~25%
+
+    def test_ready_phase_when_healthy(self, tmp_path):
+        gguf = tmp_path / "model.gguf"
+        _write_sparse_file(gguf, 8 * 1024**3)
+
+        inst = _make_instance()
+        inst._process = _FakeProc(pid = os.getpid())
+        inst._gguf_path = str(gguf)
+        inst._healthy = True
+
+        def fake_open(path, *args, **kwargs):
+            if str(path).startswith("/proc/"):
+                import io
+
+                return io.StringIO(f"VmRSS:\t{8 * 1024 ** 2}\tkB\n")
+            return open(path, *args, **kwargs)
+
+        with patch("builtins.open", side_effect = fake_open):
+            out = inst.load_progress()
+
+        assert out is not None
+        assert out["phase"] == "ready"
+        assert out["bytes_total"] == 8 * 1024**3
+        assert out["bytes_loaded"] == 8 * 1024**3
+        assert out["fraction"] == 1.0
+
+
+class TestLoadProgressMultiShard:
+    """Shard-aware total: for ``*-00001-of-00004.gguf`` primaries the
+    method sums sibling files with the same prefix."""
+
+    def test_sharded_total_aggregates_siblings(self, tmp_path):
+        for i in range(1, 5):
+            _write_sparse_file(
+                tmp_path / f"model-{i:05d}-of-00004.gguf",
+                size_bytes = 20 * 1024**3,
+            )
+        # Drop an unrelated .gguf in the same folder -- must not be counted.
+        _write_sparse_file(tmp_path / "mmproj-BF16.gguf", 2 * 1024**3)
+
+        inst = _make_instance()
+        inst._process = _FakeProc(pid = os.getpid())
+        inst._gguf_path = str(tmp_path / "model-00001-of-00004.gguf")
+        inst._healthy = False
+
+        def fake_open(path, *args, **kwargs):
+            if str(path).startswith("/proc/"):
+                import io
+
+                return io.StringIO("VmRSS:\t0\tkB\n")
+            return open(path, *args, **kwargs)
+
+        with patch("builtins.open", side_effect = fake_open):
+            out = inst.load_progress()
+
+        assert out is not None
+        assert out["bytes_total"] == 80 * 1024**3  # 4 x 20 GB, no mmproj
+
+
+class TestLoadProgressDegradation:
+    """Broken / unusual inputs never raise; they produce best-effort output."""
+
+    def test_missing_gguf_path_still_reports_rss(self, tmp_path):
+        inst = _make_instance()
+        inst._process = _FakeProc(pid = os.getpid())
+        inst._gguf_path = None
+        inst._healthy = False
+
+        def fake_open(path, *args, **kwargs):
+            if str(path).startswith("/proc/"):
+                import io
+
+                return io.StringIO("VmRSS:\t1024\tkB\n")
+            return open(path, *args, **kwargs)
+
+        with patch("builtins.open", side_effect = fake_open):
+            out = inst.load_progress()
+
+        assert out is not None
+        assert out["phase"] == "mmap"
+        assert out["bytes_total"] == 0
+        assert out["bytes_loaded"] == 1024 * 1024
+        assert out["fraction"] == 0.0
+
+    def test_unreadable_proc_returns_none(self, tmp_path):
+        inst = _make_instance()
+        # Pid that doesn't exist -> /proc read fails.
+        inst._process = _FakeProc(pid = 999_999_999)
+        inst._gguf_path = str(tmp_path / "model.gguf")  # doesn't need to exist
+        inst._healthy = False
+
+        out = inst.load_progress()
+        # FileNotFoundError on /proc path -> load_progress returns None.
+        assert out is None
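
The contract pinned by the tests in this file can be summarised with a small sketch. This is not the code in `core/inference/llama_cpp.py`: the helper names `parse_vmrss_bytes` and `sketch_progress` are hypothetical, and the regex-based parsing is an assumption chosen only to reproduce the behaviour the assertions describe (missing `VmRSS` line counts as 0, a malformed value yields `None`, `bytes_loaded` is capped by the total, `fraction` is clamped and rounded to 4 places).

```python
from __future__ import annotations

import re

# Hypothetical helper names; the real implementation may be structured differently.
_VMRSS_RE = re.compile(r"^VmRSS:\s+(\S+)\s+kB", re.MULTILINE)


def parse_vmrss_bytes(status_text: str) -> int | None:
    """Return VmRSS in bytes, 0 if the line is absent, None if malformed."""
    match = _VMRSS_RE.search(status_text)
    if match is None:
        return 0  # zombie / kthread style status without a VmRSS line
    try:
        return int(match.group(1)) * 1024  # the kernel reports kB
    except ValueError:
        return None  # non-integer value: the sample is unusable


def sketch_progress(status_text: str, bytes_total: int, healthy: bool) -> dict | None:
    """Build a progress dict with the same keys the tests assert on."""
    rss = parse_vmrss_bytes(status_text)
    if rss is None:
        return None
    if bytes_total > 0:
        loaded = min(rss, bytes_total)  # cap by the on-disk total
        fraction = round(min(loaded / bytes_total, 1.0), 4)
    else:
        loaded = rss  # total unknown: still report RSS
        fraction = 0.0
    return {
        "phase": "ready" if healthy else "mmap",
        "bytes_total": bytes_total,
        "bytes_loaded": loaded,
        "fraction": fraction,
    }
```

Splitting the parse step from the dict construction keeps the `/proc` format handling testable without a live process, which is the same separation the mocked tests rely on.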
--- a/studio/backend/tests/test_llama_cpp_load_progress_live.py
+++ b/studio/backend/tests/test_llama_cpp_load_progress_live.py
@ -0,0 +1,202 @@
+# SPDX-License-Identifier: AGPL-3.0-only
+# Copyright 2026-present the Unsloth AI Inc. team. All rights reserved. See /studio/LICENSE.AGPL-3.0
+
+"""Live, no-mock integration test for ``LlamaCppBackend.load_progress()``.
+
+The companion files (``test_llama_cpp_load_progress.py`` and
+``test_llama_cpp_load_progress_matrix.py``) patch ``builtins.open`` to
+feed synthetic VmRSS values. This file is the opposite: it uses **real**
+subprocesses, **real** file sizes, and the **real** ``/proc``
+interface. It is the sanity check that the contract we keep in the
+mocked tests still maps to what the kernel actually returns on a live
+Linux system.
+
+Why both: the mocked tests can be fooled by a buggy implementation that
+parses ``/proc`` output in a format the kernel no longer uses, or that
+makes assumptions about ``Path.stat()`` vs ``os.path.getsize``. This
+file hits the real APIs so any format drift gets caught.
+
+Skipped cleanly on non-Linux (no ``/proc``).
+"""
+
+from __future__ import annotations
+
+import os
+import subprocess
+import sys
+import time
+import types as _types
+from pathlib import Path
+
+import pytest
+
+# ---------------------------------------------------------------------------
+# Same stubs as the matrix file (keep self-contained so the file can be
+# run standalone as well as via the full suite).
+# ---------------------------------------------------------------------------
+
+_BACKEND_DIR = str(Path(__file__).resolve().parent.parent)
+if _BACKEND_DIR not in sys.path:
+    sys.path.insert(0, _BACKEND_DIR)
+
+_loggers_stub = _types.ModuleType("loggers")
+_loggers_stub.get_logger = lambda name: __import__("logging").getLogger(name)
+sys.modules.setdefault("loggers", _loggers_stub)
+_structlog_stub = _types.ModuleType("structlog")
+sys.modules.setdefault("structlog", _structlog_stub)
+_httpx_stub = _types.ModuleType("httpx")
+for _exc in (
+    "ConnectError",
+    "TimeoutException",
+    "ReadTimeout",
+    "ReadError",
+    "RemoteProtocolError",
+    "CloseError",
+):
+    setattr(_httpx_stub, _exc, type(_exc, (Exception,), {}))
+_httpx_stub.Timeout = type("Timeout", (), {"__init__": lambda self, *a, **k: None})
+_httpx_stub.Client = type(
+    "Client",
+    (),
+    {
+        "__init__": lambda self, **kw: None,
+        "__enter__": lambda self: self,
+        "__exit__": lambda self, *a: None,
+    },
+)
+sys.modules.setdefault("httpx", _httpx_stub)
+
+from core.inference.llama_cpp import LlamaCppBackend
+
+
+pytestmark = pytest.mark.skipif(
+    not Path("/proc").exists(),
+    reason = "live /proc test is Linux-only",
+)
+
+
+def _make_backend(pid: int, gguf_path: str, healthy: bool = False):
+    inst = LlamaCppBackend.__new__(LlamaCppBackend)
+    inst._process = type("P", (), {"pid": pid})()
+    inst._gguf_path = gguf_path
+    inst._healthy = healthy
+    return inst
+
+
+def test_live_rss_matches_kernel_vmrss(tmp_path):
+    """Spawn a real child, let it allocate real bytes, confirm
+    ``bytes_loaded`` tracks the kernel's VmRSS within a sane tolerance."""
+    # Child that allocates ~100 MB of zero'd bytes and then idles.
+    script = tmp_path / "burn.py"
+    script.write_text(
+        "import time, sys\n"
+        "buf = bytearray(100 * 1024 * 1024)\n"  # 100 MB
+        "# touch every page so RSS actually grows\n"
+        "for i in range(0, len(buf), 4096):\n"
+        "    buf[i] = 1\n"
+        "sys.stdout.write('ready\\n')\n"
+        "sys.stdout.flush()\n"
+        "time.sleep(10)\n"
+    )
+    proc = subprocess.Popen(
+        [sys.executable, str(script)],
+        stdout = subprocess.PIPE,
+        stderr = subprocess.PIPE,
+    )
+    try:
+        # Wait for the child to finish touching pages.
+        ready = proc.stdout.readline()
+        assert ready.strip() == b"ready"
+
+        # Create a fake 200 MB sparse gguf so bytes_total is concrete.
+        gguf = tmp_path / "model.gguf"
+        with open(gguf, "wb") as f:
+            f.truncate(200 * 1024 * 1024)
+
+        inst = _make_backend(proc.pid, str(gguf), healthy = False)
+        out = inst.load_progress()
+
+        assert out is not None, "load_progress returned None for live pid"
+        assert out["phase"] == "mmap"
+        assert out["bytes_total"] == 200 * 1024 * 1024
+        # VmRSS for the Python child includes the interpreter + the 100MB
+        # buffer, so 50 MB is a safe floor; bytes_loaded is capped at the 200 MB total.
+        assert (
+            out["bytes_loaded"] >= 50 * 1024 * 1024
+        ), f"bytes_loaded unexpectedly low: {out['bytes_loaded']}"
+        assert out["bytes_loaded"] <= 200 * 1024 * 1024
+        assert 0.0 < out["fraction"] <= 1.0
+    finally:
+        proc.terminate()
+        try:
+            proc.wait(timeout = 5)
+        except subprocess.TimeoutExpired:
+            proc.kill()
+
+
+def test_live_ready_phase_when_healthy(tmp_path):
+    gguf = tmp_path / "m.gguf"
+    with open(gguf, "wb") as f:
+        f.truncate(1 * 1024 * 1024)
+
+    inst = _make_backend(os.getpid(), str(gguf), healthy = True)
+    out = inst.load_progress()
+    assert out is not None
+    assert out["phase"] == "ready"
+    assert out["bytes_total"] == 1 * 1024 * 1024
+    # Self-pid RSS is well above 1 MiB for CPython; fraction caps at 1.
+    assert out["fraction"] == 1.0
+
+
+def test_live_dead_pid_returns_none(tmp_path):
+    """A recently-dead pid may linger in /proc for ms; use a clearly
+    invalid id so the read reliably fails."""
+    gguf = tmp_path / "m.gguf"
+    gguf.touch()
+
+    inst = _make_backend(9_999_999_999, str(gguf), healthy = False)
+    out = inst.load_progress()
+    assert out is None
+
+
+def test_live_shard_aggregation_counts_real_files(tmp_path):
+    """With 4 real sibling shards on disk, ``bytes_total`` equals their
+    summed size to the byte."""
+    shard_size = 7 * 1024 * 1024  # 7 MB each
+    for i in range(1, 5):
+        f = tmp_path / f"model-{i:05d}-of-00004.gguf"
+        with open(f, "wb") as fh:
+            fh.truncate(shard_size)
+    # Unrelated file in same dir -- must not be counted.
+    with open(tmp_path / "config.json", "wb") as fh:
+        fh.truncate(123)
+
+    inst = _make_backend(
+        os.getpid(),
+        str(tmp_path / "model-00001-of-00004.gguf"),
+        healthy = False,
+    )
+    out = inst.load_progress()
+    assert out is not None
+    assert out["bytes_total"] == 4 * shard_size
+
+
+def test_live_repeated_polling_stays_sane(tmp_path):
+    """Sampling the same backend 20 times should not raise or produce
+    non-numeric output, even under normal kernel RSS jitter."""
+    gguf = tmp_path / "m.gguf"
+    with open(gguf, "wb") as f:
+        f.truncate(500 * 1024 * 1024)
+
+    inst = _make_backend(os.getpid(), str(gguf), healthy = False)
+    seen = []
+    for _ in range(20):
+        out = inst.load_progress()
+        assert out is not None
+        assert isinstance(out["bytes_loaded"], int)
+        assert isinstance(out["bytes_total"], int)
+        assert 0.0 <= out["fraction"] <= 1.0
+        seen.append(out["bytes_loaded"])
+        time.sleep(0.01)
+    # RSS of a healthy Python process stays above ~5 MB, so 1 MiB is a safe floor.
+    assert min(seen) > 1 * 1024 * 1024
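
The shard-aware total exercised by the aggregation tests can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: `sketch_total_bytes` and the `-NNNNN-of-NNNNN.gguf` regex are hypothetical, picked to match the file names used in the tests (siblings share the primary's prefix and shard count; `mmproj-*.gguf`, `.incomplete` files, and other series are ignored; missing files and broken symlinks contribute 0).

```python
from __future__ import annotations

import re
from pathlib import Path

# Assumed naming scheme, taken from the test fixtures: <prefix>-00001-of-00004.gguf
_SHARD_RE = re.compile(r"^(?P<prefix>.+)-\d{5}-of-(?P<count>\d{5})\.gguf$")


def sketch_total_bytes(gguf_path: str | None) -> int:
    """Sum the on-disk size of the primary GGUF and its shard siblings."""
    if not gguf_path:
        return 0
    primary = Path(gguf_path)
    match = _SHARD_RE.match(primary.name)
    if match is None:
        candidates = [primary]  # non-sharded model: count the primary only
    else:
        candidates = []
        try:
            for sibling in primary.parent.iterdir():
                m = _SHARD_RE.match(sibling.name)
                # Same prefix and same shard count -> same series.
                if m and m["prefix"] == match["prefix"] and m["count"] == match["count"]:
                    candidates.append(sibling)
        except OSError:
            return 0  # directory missing or unreadable
    total = 0
    for candidate in candidates:
        try:
            total += candidate.stat().st_size  # stat() follows symlinks
        except OSError:
            continue  # broken symlink or file vanished mid-scan
    return total
```

Matching siblings with the same regex as the primary, rather than a glob, avoids accidentally counting a second quant series or an `mmproj` file that happens to live in the same directory.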
--- a/studio/backend/tests/test_llama_cpp_load_progress_matrix.py
+++ b/studio/backend/tests/test_llama_cpp_load_progress_matrix.py
@ -0,0 +1,473 @@
+# SPDX-License-Identifier: AGPL-3.0-only
+# Copyright 2026-present the Unsloth AI Inc. team. All rights reserved. See /studio/LICENSE.AGPL-3.0
+
+"""Extended test matrix for ``LlamaCppBackend.load_progress()``.
+
+Companion to ``test_llama_cpp_load_progress.py`` (which pins the basic
+contract). This file widens coverage to the edge cases that bit users
+or were hypothesized to bite them on cross-platform installs:
+
+  * Platform matrix — macOS/Windows simulation via ``/proc`` absence.
+  * ``VmRSS`` parsing — tab vs space delimiter, missing line, malformed
+    integer.
+  * Filesystem edges — HF-cache symlinks, broken symlinks, nonexistent
+    paths, relative paths.
+  * Shard aggregation — partial multi-shard downloads where some shards
+    are still ``.incomplete``, two shard series in the same dir,
+    ``mmproj-*.gguf`` sibling exclusion for non-sharded primaries,
+    single-file models.
+  * Lifecycle races — process set before ``_gguf_path`` is assigned,
+    process dead mid-sample, ``_healthy`` flipped to True.
+  * Concurrent sampling — 10 threads × 50 iterations against a single
+    backend, hitting real ``/proc`` (no mocks — see the note in
+    ``TestConcurrentSampling`` for why).
+  * Fraction bounds — capped at 1.0 when RSS exceeds total; 0.0 when
+    total is zero.
+
+Tests that read the real ``/proc`` need Linux; the rest stub ``/proc``
+via ``builtins.open``. The stable subset runs in well under a second.
+"""
+
+from __future__ import annotations
+
+import io
+import os
+import sys
+import threading
+import types as _types
+from pathlib import Path
+from unittest.mock import patch
+
+import pytest
+
+# ---------------------------------------------------------------------------
+# Stub heavy / unavailable external dependencies before importing the
+# module under test. Same pattern as test_llama_cpp_load_progress.py.
+# ---------------------------------------------------------------------------
+
+_BACKEND_DIR = str(Path(__file__).resolve().parent.parent)
+if _BACKEND_DIR not in sys.path:
+    sys.path.insert(0, _BACKEND_DIR)
+
+_loggers_stub = _types.ModuleType("loggers")
+_loggers_stub.get_logger = lambda name: __import__("logging").getLogger(name)
+sys.modules.setdefault("loggers", _loggers_stub)
+
+_structlog_stub = _types.ModuleType("structlog")
+sys.modules.setdefault("structlog", _structlog_stub)
+
+_httpx_stub = _types.ModuleType("httpx")
+for _exc_name in (
+    "ConnectError",
+    "TimeoutException",
+    "ReadTimeout",
+    "ReadError",
+    "RemoteProtocolError",
+    "CloseError",
+):
+    setattr(_httpx_stub, _exc_name, type(_exc_name, (Exception,), {}))
+
+
+class _FakeTimeout:
+    def __init__(self, *a, **kw):
+        pass
+
+
+_httpx_stub.Timeout = _FakeTimeout
+_httpx_stub.Client = type(
+    "Client",
+    (),
+    {
+        "__init__": lambda self, **kw: None,
+        "__enter__": lambda self: self,
+        "__exit__": lambda self, *a: None,
+    },
+)
+sys.modules.setdefault("httpx", _httpx_stub)
+
+from core.inference.llama_cpp import LlamaCppBackend
+
+
+# ---------------------------------------------------------------------------
+# Helpers
+# ---------------------------------------------------------------------------
+
+
+def _make():
+    inst = LlamaCppBackend.__new__(LlamaCppBackend)
+    inst._process = None
+    inst._gguf_path = None
+    inst._healthy = False
+    return inst
+
+
+class _Proc:
+    def __init__(self, pid):
+        self.pid = pid
+
+
+def _sparse(path, size):
+    with open(path, "wb") as f:
+        if size > 0:
+            f.truncate(size)
+
+
+def _fake_proc_reader(rss_kb):
+    """Return an ``open()`` replacement that fakes /proc reads with a VmRSS line."""
+
+    def fake_open(path, *args, **kwargs):
+        if str(path).startswith("/proc/"):
+            return io.StringIO(f"VmRSS:\t{rss_kb}\tkB\n")
+        return open(path, *args, **kwargs)
+
+    return fake_open
+
+
+# ---------------------------------------------------------------------------
+# A. Platform matrix
+# ---------------------------------------------------------------------------
+
+
+class TestPlatformMatrix:
+    """The method is Linux-first via /proc. On macOS/Windows it must
+    degrade to None rather than crash."""
+
+    def test_linux_live_proc_is_self_pid(self, tmp_path):
+        """Self-pid /proc read uses the real kernel interface."""
+        gguf = tmp_path / "m.gguf"
+        _sparse(gguf, 1 * 1024**3)
+        inst = _make()
+        inst._process = _Proc(os.getpid())
+        inst._gguf_path = str(gguf)
+        inst._healthy = False
+        out = inst.load_progress()
+        assert out is not None
+        assert out["phase"] == "mmap"
+        assert out["bytes_total"] == 1 * 1024**3
+        # Our Python process has some RSS -- just sanity-check positive.
+        assert out["bytes_loaded"] > 0
+
+    def test_macos_no_proc_returns_none(self, tmp_path):
+        """Simulate macOS: /proc open fails with FileNotFoundError."""
+        gguf = tmp_path / "m.gguf"
+        _sparse(gguf, 1 * 1024**3)
+        inst = _make()
+        inst._process = _Proc(pid = 12345)
+        inst._gguf_path = str(gguf)
+
+        def fake_open(path, *args, **kwargs):
+            if str(path).startswith("/proc/"):
+                raise FileNotFoundError(f"No such file: {path}")
+            return open(path, *args, **kwargs)
+
+        with patch("builtins.open", side_effect = fake_open):
+            out = inst.load_progress()
+        assert out is None
+
+    def test_windows_no_proc_returns_none(self, tmp_path):
+        """Simulate Windows: opening /proc raises PermissionError (an OSError subclass)."""
+        gguf = tmp_path / "m.gguf"
+        _sparse(gguf, 1 * 1024**3)
+        inst = _make()
+        inst._process = _Proc(pid = 4567)
+        inst._gguf_path = str(gguf)
+
+        def fake_open(path, *args, **kwargs):
+            if str(path).startswith("/proc/"):
+                raise PermissionError("access denied")
+            return open(path, *args, **kwargs)
+
+        with patch("builtins.open", side_effect = fake_open):
+            out = inst.load_progress()
+        assert out is None
+
+
+# ---------------------------------------------------------------------------
+# B. VmRSS parsing edge cases
+# ---------------------------------------------------------------------------
+
+
+class TestVmRSSParsing:
+    def test_standard_tab_delimited(self, tmp_path):
+        gguf = tmp_path / "m.gguf"
+        _sparse(gguf, 4 * 1024**3)
+        inst = _make()
+        inst._process = _Proc(os.getpid())
+        inst._gguf_path = str(gguf)
+        with patch("builtins.open", side_effect = _fake_proc_reader(2 * 1024**2)):
+            out = inst.load_progress()
+        assert out["bytes_loaded"] == 2 * 1024**3
+
+    def test_space_separated_fallback(self, tmp_path):
+        """Some kernels emit single-space rather than tab."""
+        gguf = tmp_path / "m.gguf"
+        _sparse(gguf, 4 * 1024**3)
+        inst = _make()
+        inst._process = _Proc(os.getpid())
+        inst._gguf_path = str(gguf)
+
+        def fake_open(path, *a, **kw):
+            if str(path).startswith("/proc/"):
+                return io.StringIO("VmRSS: 4194304 kB\n")
+            return open(path, *a, **kw)
+
+        with patch("builtins.open", side_effect = fake_open):
+            out = inst.load_progress()
+        assert out["bytes_loaded"] == 4 * 1024**3
+
+    def test_missing_vmrss_line(self, tmp_path):
+        """Kernel with VmRSS stripped (zombie / kthread) -> 0."""
+        gguf = tmp_path / "m.gguf"
+        _sparse(gguf, 1 * 1024**3)
+        inst = _make()
+        inst._process = _Proc(os.getpid())
+        inst._gguf_path = str(gguf)
+
+        def fake_open(path, *a, **kw):
+            if str(path).startswith("/proc/"):
+                return io.StringIO("Name:\ttest\nState:\tZ (zombie)\n")
+            return open(path, *a, **kw)
+
+        with patch("builtins.open", side_effect = fake_open):
+            out = inst.load_progress()
+        assert out is not None
+        assert out["bytes_loaded"] == 0
+        assert out["fraction"] == 0.0
+
+    def test_malformed_vmrss_value(self, tmp_path):
+        """Non-integer VmRSS value makes the sample unusable: the
+        ValueError from int() is caught and the method returns None."""
+        gguf = tmp_path / "m.gguf"
+        _sparse(gguf, 1 * 1024**3)
+        inst = _make()
+        inst._process = _Proc(os.getpid())
+        inst._gguf_path = str(gguf)
+
+        def fake_open(path, *a, **kw):
+            if str(path).startswith("/proc/"):
+                return io.StringIO("VmRSS:\tXXXX\tkB\n")
+            return open(path, *a, **kw)
+
+        with patch("builtins.open", side_effect = fake_open):
+            out = inst.load_progress()
+        # The implementation catches ValueError on int() and returns None.
+        assert out is None
+
+
+# ---------------------------------------------------------------------------
+# C. Filesystem edge cases
+# ---------------------------------------------------------------------------
+
+
+class TestFilesystemEdges:
+    def test_symlink_primary_follows_to_blob(self, tmp_path):
+        """HF cache stores blobs under blobs/ and symlinks them from
+        snapshots/. The method must follow the symlink."""
+        blob = tmp_path / "blob"
+        _sparse(blob, 12 * 1024**3)
+        snap = tmp_path / "snap"
+        snap.mkdir()
+        link = snap / "m.gguf"
+        link.symlink_to(blob)
+
+        inst = _make()
+        inst._process = _Proc(os.getpid())
+        inst._gguf_path = str(link)
+        with patch("builtins.open", side_effect = _fake_proc_reader(6 * 1024**2)):
+            out = inst.load_progress()
+        assert out["bytes_total"] == 12 * 1024**3
+
+    def test_broken_symlink_skipped(self, tmp_path):
+        snap = tmp_path / "snap"
+        snap.mkdir()
+        link = snap / "m.gguf"
+        link.symlink_to(tmp_path / "missing-blob")
+        inst = _make()
+        inst._process = _Proc(os.getpid())
+        inst._gguf_path = str(link)
+        with patch("builtins.open", side_effect = _fake_proc_reader(1024)):
+            out = inst.load_progress()
+        assert out["bytes_total"] == 0
+        assert out["bytes_loaded"] == 1024 * 1024
+
+    def test_nonexistent_path_skipped(self, tmp_path):
+        inst = _make()
+        inst._process = _Proc(os.getpid())
+        inst._gguf_path = str(tmp_path / "ghost.gguf")
+        with patch("builtins.open", side_effect = _fake_proc_reader(1024)):
+            out = inst.load_progress()
+        assert out["bytes_total"] == 0
+
+    def test_relative_gguf_path(self, tmp_path):
+        """Relative paths shouldn't crash; behaviour depends on CWD but
+        the method must not raise."""
+        cwd = os.getcwd()
+        try:
+            os.chdir(tmp_path)
+            _sparse(Path("rel.gguf"), 8 * 1024**3)
+            inst = _make()
+            inst._process = _Proc(os.getpid())
+            inst._gguf_path = "rel.gguf"
+            with patch("builtins.open", side_effect = _fake_proc_reader(0)):
+                out = inst.load_progress()
+            assert out is not None
+            assert out["bytes_total"] == 8 * 1024**3
+        finally:
+            os.chdir(cwd)
+
+
+# ---------------------------------------------------------------------------
+# D. Shard aggregation
+# ---------------------------------------------------------------------------
+
+
+class TestShardAggregation:
+    def test_partial_multi_shard_download(self, tmp_path):
+        """Shards 1 and 2 have arrived; the rest are missing or still
+        ``.incomplete``. Sums only the fully-arrived ``.gguf`` files."""
+        _sparse(tmp_path / "m-00001-of-00004.gguf", 30 * 1024**3)
+        _sparse(tmp_path / "m-00002-of-00004.gguf", 30 * 1024**3)
+        # Shard 3 is still an .incomplete file; shard 4 has not started.
+        _sparse(tmp_path / "m-00003-of-00004.gguf.incomplete", 5 * 1024**3)
+        inst = _make()
+        inst._process = _Proc(os.getpid())
+        inst._gguf_path = str(tmp_path / "m-00001-of-00004.gguf")
+        with patch("builtins.open", side_effect = _fake_proc_reader(0)):
+            out = inst.load_progress()
+        assert out["bytes_total"] == 60 * 1024**3  # only the .gguf siblings
+
+    def test_two_shard_series_in_same_dir(self, tmp_path):
+        """Defensive: if two quant series share a dir, prefix filter
+        only sums siblings of the chosen primary."""
+        for i in range(1, 3):
+            _sparse(tmp_path / f"m_q4-{i:05d}-of-00002.gguf", 10 * 1024**3)
+            _sparse(tmp_path / f"m_q8-{i:05d}-of-00002.gguf", 20 * 1024**3)
+        inst = _make()
+        inst._process = _Proc(os.getpid())
+        inst._gguf_path = str(tmp_path / "m_q8-00001-of-00002.gguf")
+        with patch("builtins.open", side_effect = _fake_proc_reader(0)):
+            out = inst.load_progress()
+        assert out["bytes_total"] == 40 * 1024**3  # just q8 series
+
+    def test_mmproj_sibling_not_counted(self, tmp_path):
+        """Vision models drop an ``mmproj-*.gguf`` alongside. For a
+        single-file (non-sharded) primary we only count the primary."""
+        _sparse(tmp_path / "m.gguf", 8 * 1024**3)
+        _sparse(tmp_path / "mmproj-BF16.gguf", 2 * 1024**3)
+        inst = _make()
+        inst._process = _Proc(os.getpid())
+        inst._gguf_path = str(tmp_path / "m.gguf")
+        with patch("builtins.open", side_effect = _fake_proc_reader(0)):
+            out = inst.load_progress()
+        # Non-sharded primary: only the primary is counted.
+        assert out["bytes_total"] == 8 * 1024**3
+
+    def test_single_file_model(self, tmp_path):
+        """Non-sharded model: primary only."""
+        _sparse(tmp_path / "small.gguf", 4 * 1024**3)
+        inst = _make()
+        inst._process = _Proc(os.getpid())
+        inst._gguf_path = str(tmp_path / "small.gguf")
+        with patch("builtins.open", side_effect = _fake_proc_reader(2 * 1024**2)):
+            out = inst.load_progress()
+        assert out["bytes_total"] == 4 * 1024**3
+        assert out["bytes_loaded"] == 2 * 1024**3
+
+
+# ---------------------------------------------------------------------------
+# E. Lifecycle races
+# ---------------------------------------------------------------------------
+
+
+class TestLifecycleRaces:
+    def test_process_set_but_gguf_path_not_yet(self, tmp_path):
+        """Moment between Popen and self._gguf_path=model_path."""
+        inst = _make()
+        inst._process = _Proc(os.getpid())
+        inst._gguf_path = None
+        with patch("builtins.open", side_effect = _fake_proc_reader(1024)):
+            out = inst.load_progress()
+        assert out is not None
+        assert out["phase"] == "mmap"
+        assert out["bytes_total"] == 0
+        assert out["bytes_loaded"] == 1024 * 1024
+
+    def test_process_died_mid_sample(self, tmp_path):
+        """/proc/<pid> disappears -> None."""
+        _sparse(tmp_path / "m.gguf", 1 * 1024**3)
+        inst = _make()
+        inst._process = _Proc(pid = 999_999_999)
+        inst._gguf_path = str(tmp_path / "m.gguf")
+        assert inst.load_progress() is None
+
+    def test_healthy_true_ready_phase(self, tmp_path):
+        _sparse(tmp_path / "m.gguf", 1 * 1024**3)
+        inst = _make()
+        inst._process = _Proc(os.getpid())
+        inst._gguf_path = str(tmp_path / "m.gguf")
+        inst._healthy = True
+        with patch("builtins.open", side_effect = _fake_proc_reader(1024)):
+            out = inst.load_progress()
+        assert out["phase"] == "ready"
+
+
+# ---------------------------------------------------------------------------
+# F. Concurrent sampling  (simulates multiple browser tabs polling)
+# ---------------------------------------------------------------------------
+
+
+class TestConcurrentSampling:
+    def test_parallel_invocations_never_raise(self, tmp_path):
+        """Many concurrent samplers hitting the same backend must not raise.
+
+        We intentionally do NOT patch ``builtins.open`` here because
+        ``unittest.mock.patch`` is not thread-safe: interleaved
+        enter/exit across threads can leak a Mock into ``builtins.open``
+        and poison every subsequent test in the session. Instead, we
+        let each thread hit the real ``/proc/self/status`` of the test
+        process, which is exactly the code path that matters in prod.
+        """
+        _sparse(tmp_path / "m.gguf", 1 * 1024**3)
+        inst = _make()
+        inst._process = _Proc(os.getpid())
+        inst._gguf_path = str(tmp_path / "m.gguf")
+        errors = []
+
+        def run():
+            try:
+                for _ in range(50):
+                    inst.load_progress()
+            except Exception as e:  # pragma: no cover
+                errors.append(e)
+
+        threads = [threading.Thread(target = run) for _ in range(10)]
+        for t in threads:
+            t.start()
+        for t in threads:
+            t.join()
+        assert not errors, errors
+
+
+# ---------------------------------------------------------------------------
+# G. Fraction bounds
+# ---------------------------------------------------------------------------
+
+
+class TestFractionBounds:
+    def test_fraction_capped_at_one(self, tmp_path):
+        _sparse(tmp_path / "m.gguf", 1 * 1024**3)
+        inst = _make()
+        inst._process = _Proc(os.getpid())
+        inst._gguf_path = str(tmp_path / "m.gguf")
+        # RSS > total (post-paged-in + extra structures)
+        with patch("builtins.open", side_effect = _fake_proc_reader(2 * 1024**2)):
+            out = inst.load_progress()
+        assert 0.0 <= out["fraction"] <= 1.0
+
+    def test_fraction_zero_when_total_zero(self):
+        inst = _make()
+        inst._process = _Proc(os.getpid())
+        inst._gguf_path = None
+        with patch("builtins.open", side_effect = _fake_proc_reader(1024**2)):
+            out = inst.load_progress()
+        assert out["fraction"] == 0.0
--- a/studio/backend/tests/test_llama_cpp_max_context_threshold.py
+++ b/studio/backend/tests/test_llama_cpp_max_context_threshold.py
@ -0,0 +1,244 @@
+# SPDX-License-Identifier: AGPL-3.0-only
+# Copyright 2026-present the Unsloth AI Inc. team. All rights reserved. See /studio/LICENSE.AGPL-3.0
+
+"""Tests for the ``max_context_length`` warning-threshold semantics.
+
+``/api/inference/status.max_context_length`` is what the ctx slider in
+the chat settings sheet reads to decide when to render the "Exceeds
+estimated VRAM capacity. The model may use system RAM." warning:
+
+    ctxDisplayValue > ggufMaxContextLength → show warning
+
+For models whose weights fit on some GPU subset, the warning threshold
+is the largest ctx that fits fully in VRAM (the binary-search cap from
+``_fit_context_to_vram``). For models whose weights exceed 90% of every
+GPU subset's free memory, the warning must fire as soon as the user
+drags above the 4096 spec default (otherwise a user loading e.g.
+MiniMax-M2.7 on a 97 GB GPU sees a slider up to 196608 with no
+indication that any value above 4096 will trigger ``--fit on`` and
+degrade performance).
+
+These tests pin both cases. No GPU probing, no subprocess, no GGUF I/O.
+Cross-platform: Linux, macOS, Windows, WSL.
+"""
+
+from __future__ import annotations
+
+import sys
+import types as _types
+from pathlib import Path
+
+import pytest
+
+# ---------------------------------------------------------------------------
+# Stub heavy / unavailable external dependencies before importing the
+# module under test.  Same pattern as test_kv_cache_estimation.py.
+# ---------------------------------------------------------------------------
+
+_BACKEND_DIR = str(Path(__file__).resolve().parent.parent)
+if _BACKEND_DIR not in sys.path:
+    sys.path.insert(0, _BACKEND_DIR)
+
+# loggers
+_loggers_stub = _types.ModuleType("loggers")
+_loggers_stub.get_logger = lambda name: __import__("logging").getLogger(name)
+sys.modules.setdefault("loggers", _loggers_stub)
+
+# structlog
+_structlog_stub = _types.ModuleType("structlog")
+sys.modules.setdefault("structlog", _structlog_stub)
+
+# httpx
+_httpx_stub = _types.ModuleType("httpx")
+for _exc_name in (
+    "ConnectError",
+    "TimeoutException",
+    "ReadTimeout",
+    "ReadError",
+    "RemoteProtocolError",
+    "CloseError",
+):
+    setattr(_httpx_stub, _exc_name, type(_exc_name, (Exception,), {}))
+
+
+class _FakeTimeout:
+    def __init__(self, *a, **kw):
+        pass
+
+
+_httpx_stub.Timeout = _FakeTimeout
+_httpx_stub.Client = type(
+    "Client",
+    (),
+    {
+        "__init__": lambda self, **kw: None,
+        "__enter__": lambda self: self,
+        "__exit__": lambda self, *a: None,
+    },
+)
+sys.modules.setdefault("httpx", _httpx_stub)
+
+from core.inference.llama_cpp import LlamaCppBackend
+
+
+# ---------------------------------------------------------------------------
+# Helpers
+# ---------------------------------------------------------------------------
+
+GIB = 1024**3
+
+
+def _make_backend(native_ctx = 131072):
+    inst = LlamaCppBackend.__new__(LlamaCppBackend)
+    inst._context_length = native_ctx
+    inst._n_layers = 80
+    inst._n_kv_heads = 8
+    inst._n_heads = 64
+    inst._embedding_length = 8192
+    inst._kv_key_length = 128
+    inst._kv_value_length = 128
+    inst._kv_lora_rank = None
+    inst._sliding_window = None
+    inst._ssm_inner_size = None
+    inst._full_attention_interval = None
+    inst._key_length_mla = None
+    return inst
+
+
+def _compute_max_available_ctx(native_ctx, model_gib, gpus, kv_per_token_bytes = 325_000):
+    """Mirror the ceiling-probe block from ``load_model`` (replicated
+    here rather than invoked) and return the final ``max_available_ctx``
+    value the backend would assign to ``_max_context_length``.
+    """
+    inst = _make_backend(native_ctx = native_ctx)
+    model_size = int(model_gib * GIB)
+
+    inst._estimate_kv_cache_bytes = (
+        lambda n, _t = None: 0 if n <= 0 else n * kv_per_token_bytes
+    )
+    inst._can_estimate_kv = lambda: True
+
+    context_length = inst._context_length
+    effective_ctx = context_length
+    max_available_ctx = context_length
+
+    cache_type_kv = None
+    native_ctx_for_cap = context_length
+
+    ranked_for_cap = sorted(gpus, key = lambda g: g[1], reverse = True)
+    best_cap = 0
+    for n_gpus in range(1, len(ranked_for_cap) + 1):
+        subset = ranked_for_cap[:n_gpus]
+        pool_mib = sum(free for _, free in subset)
+        capped = inst._fit_context_to_vram(
+            native_ctx_for_cap,
+            pool_mib,
+            model_size,
+            cache_type_kv,
+        )
+        kv = inst._estimate_kv_cache_bytes(capped, cache_type_kv)
+        total_mib = (model_size + kv) / (1024 * 1024)
+        if total_mib <= pool_mib * 0.90:
+            best_cap = max(best_cap, capped)
+    if best_cap > 0:
+        max_available_ctx = best_cap
+    else:
+        max_available_ctx = min(4096, native_ctx_for_cap)
+
+    return max_available_ctx
+
+
+# ---------------------------------------------------------------------------
+# Weights exceed every GPU subset's VRAM  (MiniMax-M2.7-like)
+# ---------------------------------------------------------------------------
+
+
+class TestMaxContextLengthForWeightsExceedVRAM:
+    """The UI ``max_context_length`` threshold must fall back to 4096 so
+    the warning fires as soon as the user drags above the spec default.
+    """
+
+    def test_minimax_like(self):
+        """131 GB weights, single 97 GB GPU, native ctx 196608."""
+        got = _compute_max_available_ctx(
+            native_ctx = 196608,
+            model_gib = 131,
+            gpus = [(0, 97_000)],
+        )
+        assert got == 4096
+
+    def test_multi_gpu_all_subsets_fail(self):
+        """400 GB weights across a 4x80 GB pool (320 GB total, still too small)."""
+        got = _compute_max_available_ctx(
+            native_ctx = 131072,
+            model_gib = 400,
+            gpus = [(0, 80_000), (1, 80_000), (2, 80_000), (3, 80_000)],
+        )
+        assert got == 4096
+
+    def test_native_below_fallback_is_preserved(self):
+        """If the model's native ctx is itself smaller than 4096, do not
+        advertise a larger value than the model supports."""
+        got = _compute_max_available_ctx(
+            native_ctx = 2048,
+            model_gib = 200,
+            gpus = [(0, 80_000)],
+        )
+        assert got == 2048
+
+
+# ---------------------------------------------------------------------------
+# Fittable models (regression guard)
+# ---------------------------------------------------------------------------
+
+
+class TestMaxContextLengthForFittableModels:
+    """The existing best-cap behaviour must be unchanged."""
+
+    def test_small_model_fits_easily(self):
+        """8 GB model on 24 GB GPU: should auto-pick a large ctx."""
+        got = _compute_max_available_ctx(
+            native_ctx = 131072,
+            model_gib = 8,
+            gpus = [(0, 24_000)],
+            kv_per_token_bytes = 8192,
+        )
+        assert got > 4096
+        assert got <= 131072
+
+    def test_medium_model_multi_gpu(self):
+        """60 GB model split across 2 GPUs: picks a fitting ctx."""
+        got = _compute_max_available_ctx(
+            native_ctx = 131072,
+            model_gib = 60,
+            gpus = [(0, 40_000), (1, 40_000)],
+            kv_per_token_bytes = 8192,
+        )
+        assert got > 4096
+
+    def test_tiny_model_on_huge_gpu_near_native(self):
+        """2 GB model, 80 GB GPU, negligible KV: should approach native."""
+        got = _compute_max_available_ctx(
+            native_ctx = 131072,
+            model_gib = 2,
+            gpus = [(0, 80_000)],
+            kv_per_token_bytes = 64,
+        )
+        assert got >= 131072 - 256  # rounded to 256 boundary
+
+
+# ---------------------------------------------------------------------------
+# Property plumbing
+# ---------------------------------------------------------------------------
+
+
+class TestMaxContextLengthProperty:
+    def test_falls_back_to_native_when_unset(self):
+        inst = _make_backend(native_ctx = 131072)
+        inst._max_context_length = None
+        assert inst.max_context_length == 131072
+
+    def test_returns_stored_value_when_set(self):
+        inst = _make_backend(native_ctx = 131072)
+        inst._max_context_length = 4096
+        assert inst.max_context_length == 4096
--- a/studio/backend/tests/test_llama_cpp_no_context_shift.py
+++ b/studio/backend/tests/test_llama_cpp_no_context_shift.py
@ -0,0 +1,137 @@
+# SPDX-License-Identifier: AGPL-3.0-only
+# Copyright 2026-present the Unsloth AI Inc. team. All rights reserved. See /studio/LICENSE.AGPL-3.0
+
+"""``--no-context-shift`` launch-flag contract.
+
+When llama-server runs with its default context-shift behavior, the UI
+has no way to tell the user that the KV cache has been rotated --
+earlier turns silently vanish from the conversation. The Studio
+backend always passes ``--no-context-shift`` so the server returns a
+clean error instead, and the chat adapter can point the user at the
+``Context Length`` input in the settings panel.
+
+This file is a static read of the launch command: we ask
+``LlamaCppBackend`` to assemble its ``cmd`` list and assert the flag
+is always present. Testing via the real subprocess would require an
+actual GGUF on disk, which is out of scope for the fast test suite.
+"""
+
+from __future__ import annotations
+
+import inspect
+import sys
+import types as _types
+from pathlib import Path
+
+import pytest
+
+# ---------------------------------------------------------------------------
+# Same external-dep stubs as the other llama_cpp tests.
+# ---------------------------------------------------------------------------
+
+_BACKEND_DIR = str(Path(__file__).resolve().parent.parent)
+if _BACKEND_DIR not in sys.path:
+    sys.path.insert(0, _BACKEND_DIR)
+
+_loggers_stub = _types.ModuleType("loggers")
+_loggers_stub.get_logger = lambda name: __import__("logging").getLogger(name)
+sys.modules.setdefault("loggers", _loggers_stub)
+
+_structlog_stub = _types.ModuleType("structlog")
+sys.modules.setdefault("structlog", _structlog_stub)
+
+_httpx_stub = _types.ModuleType("httpx")
+for _exc in (
+    "ConnectError",
+    "TimeoutException",
+    "ReadTimeout",
+    "ReadError",
+    "RemoteProtocolError",
+    "CloseError",
+):
+    setattr(_httpx_stub, _exc, type(_exc, (Exception,), {}))
+_httpx_stub.Timeout = type("T", (), {"__init__": lambda s, *a, **k: None})
+_httpx_stub.Client = type(
+    "C",
+    (),
+    {
+        "__init__": lambda s, **kw: None,
+        "__enter__": lambda s: s,
+        "__exit__": lambda s, *a: None,
+    },
+)
+sys.modules.setdefault("httpx", _httpx_stub)
+
+from core.inference import llama_cpp as llama_cpp_module
+
+
+def _load_model_source() -> str:
+    """Return the source of ``LlamaCppBackend.load_model``.
+
+    Using ``inspect.getsource`` instead of reading the file directly
+    scopes the assertions to the function that actually launches
+    llama-server, so neither the presence check nor the location check
+    can be fooled by a stray occurrence of ``"--no-context-shift"``
+    elsewhere in the module.
+    """
+    return inspect.getsource(llama_cpp_module.LlamaCppBackend.load_model)
+
+
+def test_no_context_shift_is_in_load_model():
+    """The flag is part of the static launch-command template.
+
+    We check the source of ``load_model`` rather than mocking the whole
+    call chain (GPU probing, GGUF stat, etc.): the flag is written as
+    a literal in one place and any regression has to delete it, which
+    a text search will catch.
+    """
+    assert '"--no-context-shift"' in _load_model_source(), (
+        "llama-server must be launched with --no-context-shift so the "
+        "UI can surface a clean 'context full' error instead of silently "
+        "losing old turns to a KV-cache rotation."
+    )
+
+
+def test_flag_sits_inside_the_base_cmd_list():
+    """Pin the flag's location so a future refactor can't accidentally
+    move it into a branch that only fires on some code paths.
+
+    We slice from ``cmd = [`` to the first subsequent line that holds
+    only ``]`` (at any indentation). ``inspect.getsource`` returns the
+    function as its own string, so there are no sibling functions to
+    worry about. Matching a whole-line ``]`` rather than the first
+    ``]`` character keeps the slice from ending early if an entry in
+    the list ever contains a bracket of its own, e.g. an index.
+    """
+    source = _load_model_source()
+    start = source.find("cmd = [")
+    assert start >= 0, "could not find the base cmd = [...] block"
+    # Find the first line containing only ``]`` (possibly indented).
+    # Works for any indentation style the formatter picks.
+    rest = source[start:]
+    end_rel = -1
+    for line_start, line in _iter_lines_with_offset(rest):
+        if line_start == 0:
+            # Skip the opening ``cmd = [`` line itself.
+            continue
+        if line.strip() == "]":
+            end_rel = line_start
+            break
+    assert end_rel > 0, "could not find end of cmd = [...] block"
+    block = rest[:end_rel]
+    assert '"--no-context-shift"' in block, (
+        "--no-context-shift must be in the base cmd list, not in a "
+        "conditional branch -- otherwise some code paths would still "
+        "run with silent context shift enabled."
+    )
+    # Also pin that it shares the block with -c and --flash-attn so the grouping makes sense.
+    assert '"-c"' in block
+    assert '"--flash-attn"' in block
+
+
+def _iter_lines_with_offset(text: str):
+    """Yield (offset, line) pairs over ``text`` without losing offsets."""
+    offset = 0
+    for line in text.splitlines(keepends = True):
+        yield offset, line
+        offset += len(line)
--- a/studio/backend/tests/test_models_get_model_config_case_resolution.py
+++ b/studio/backend/tests/test_models_get_model_config_case_resolution.py
@ -0,0 +1,81 @@
+# SPDX-License-Identifier: AGPL-3.0-only
+# Copyright 2026-present the Unsloth AI Inc. team. All rights reserved. See /studio/LICENSE.AGPL-3.0
+
+import asyncio
+import sys
+import types
+
+# Keep this test runnable in lightweight environments where optional logging
+# deps are not installed.
+if "structlog" not in sys.modules:
+
+    class _DummyLogger:
+        def __getattr__(self, _name):
+            return lambda *args, **kwargs: None
+
+    sys.modules["structlog"] = types.SimpleNamespace(
+        BoundLogger = _DummyLogger,
+        get_logger = lambda *args, **kwargs: _DummyLogger(),
+    )
+
+import routes.models as models_route
+import utils.models.model_config as model_config_module
+
+
+def test_get_model_config_resolves_cached_case_before_model_checks(monkeypatch):
+    calls: dict[str, str] = {}
+
+    class _DummyModelConfig:
+        is_lora = False
+        base_model = None
+
+    def _record_load(model_name):
+        calls["load_model_defaults"] = model_name
+        return {}
+
+    def _record_vision(model_name, hf_token = None):
+        calls["is_vision_model"] = model_name
+        return False
+
+    def _record_embedding(model_name, hf_token = None):
+        calls["is_embedding_model"] = model_name
+        return False
+
+    def _record_audio(model_name, hf_token = None):
+        calls["detect_audio_type"] = model_name
+        return None
+
+    def _record_from_identifier(cls, model_name):
+        calls["from_identifier"] = model_name
+        return _DummyModelConfig()
+
+    monkeypatch.setattr(models_route, "is_local_path", lambda _: False)
+    monkeypatch.setattr(
+        models_route, "resolve_cached_repo_id_case", lambda _: "Org/Model"
+    )
+    monkeypatch.setattr(models_route, "load_model_defaults", _record_load)
+    monkeypatch.setattr(models_route, "is_vision_model", _record_vision)
+    monkeypatch.setattr(models_route, "is_embedding_model", _record_embedding)
+    monkeypatch.setattr(model_config_module, "detect_audio_type", _record_audio)
+    monkeypatch.setattr(
+        models_route.ModelConfig,
+        "from_identifier",
+        classmethod(_record_from_identifier),
+    )
+    monkeypatch.setattr(models_route, "_get_max_position_embeddings", lambda _: 4096)
+    monkeypatch.setattr(models_route, "_get_model_size_bytes", lambda *_args, **_kw: 0)
+
+    result = asyncio.run(
+        models_route.get_model_config(
+            model_name = "org/model",
+            hf_token = None,
+            current_subject = "test-subject",
+        )
+    )
+
+    assert result.model_name == "Org/Model"
+    assert calls["load_model_defaults"] == "Org/Model"
+    assert calls["is_vision_model"] == "Org/Model"
+    assert calls["is_embedding_model"] == "Org/Model"
+    assert calls["detect_audio_type"] == "Org/Model"
+    assert calls["from_identifier"] == "Org/Model"
--- a/studio/backend/tests/test_native_context_length.py
+++ b/studio/backend/tests/test_native_context_length.py
@ -0,0 +1,518 @@
+# SPDX-License-Identifier: AGPL-3.0-only
+# Copyright 2026-present the Unsloth AI Inc. team. All rights reserved. See /studio/LICENSE.AGPL-3.0
+
+"""Tests for the native_context_length feature (PR #4746).
+
+Verifies that the new `native_context_length` property on LlamaCppBackend
+and the corresponding Pydantic model fields work correctly.  The raw GGUF
+`_context_length` must never be overwritten by VRAM-capping logic.
+
+Requires no GPU, network, or external libraries beyond pytest and pydantic.
+"""
+
+import io
+import json
+import struct
+import sys
+import types as _types
+from pathlib import Path
+from unittest.mock import patch
+
+import pytest
+
+# ---------------------------------------------------------------------------
+# Stub heavy / unavailable external dependencies before importing the
+# module under test.  Same pattern as test_kv_cache_estimation.py.
+# ---------------------------------------------------------------------------
+
+_BACKEND_DIR = str(Path(__file__).resolve().parent.parent)
+if _BACKEND_DIR not in sys.path:
+    sys.path.insert(0, _BACKEND_DIR)
+
+# loggers
+_loggers_stub = _types.ModuleType("loggers")
+_loggers_stub.get_logger = lambda name: __import__("logging").getLogger(name)
+sys.modules.setdefault("loggers", _loggers_stub)
+
+# structlog
+_structlog_stub = _types.ModuleType("structlog")
+sys.modules.setdefault("structlog", _structlog_stub)
+
+# httpx -- stub only the names referenced at import / class-definition time
+_httpx_stub = _types.ModuleType("httpx")
+for _exc_name in (
+    "ConnectError",
+    "TimeoutException",
+    "ReadTimeout",
+    "ReadError",
+    "RemoteProtocolError",
+    "CloseError",
+):
+    setattr(_httpx_stub, _exc_name, type(_exc_name, (Exception,), {}))
+
+
+class _FakeTimeout:
+    def __init__(self, *a, **kw):
+        pass
+
+
+_httpx_stub.Timeout = _FakeTimeout
+_httpx_stub.Client = type(
+    "Client",
+    (),
+    {
+        "__init__": lambda self, **kw: None,
+        "__enter__": lambda self: self,
+        "__exit__": lambda self, *a: None,
+    },
+)
+sys.modules.setdefault("httpx", _httpx_stub)
+
+from core.inference.llama_cpp import LlamaCppBackend
+from models.inference import LoadResponse, InferenceStatusResponse
+
+
+# ── Helpers ──────────────────────────────────────────────────────────
+
+
+def _write_kv(buf: io.BytesIO, key: str, value, vtype: int) -> None:
+    """Append a single GGUF KV pair to *buf*."""
+    key_bytes = key.encode("utf-8")
+    buf.write(struct.pack("<Q", len(key_bytes)))
+    buf.write(key_bytes)
+    buf.write(struct.pack("<I", vtype))
+    if vtype == 4:  # UINT32
+        buf.write(struct.pack("<I", value))
+    elif vtype == 10:  # UINT64
+        buf.write(struct.pack("<Q", value))
+    elif vtype == 8:  # STRING
+        val_bytes = value.encode("utf-8")
+        buf.write(struct.pack("<Q", len(val_bytes)))
+        buf.write(val_bytes)
+    else:
+        raise ValueError(f"Unsupported vtype in test helper: {vtype}")
+
+
+def make_gguf(
+    tmp_path: Path,
+    arch: str,
+    kvs: list,
+    *,
+    arch_first: bool = True,
+    filename: str = "test.gguf",
+) -> str:
+    """Create a minimal valid GGUF v3 binary in *tmp_path*."""
+    buf = io.BytesIO()
+    buf.write(struct.pack("<I", 0x46554747))  # GGUF magic
+    buf.write(struct.pack("<I", 3))  # version 3
+    buf.write(struct.pack("<Q", 0))  # tensor count = 0
+
+    ordered = []
+    arch_entry = ("general.architecture", arch, 8)
+
+    if arch_first:
+        ordered.append(arch_entry)
+    for suffix, val, vt in kvs:
+        ordered.append((f"{arch}.{suffix}", val, vt))
+    if not arch_first:
+        ordered.append(arch_entry)
+
+    buf.write(struct.pack("<Q", len(ordered)))
+    for key, val, vt in ordered:
+        _write_kv(buf, key, val, vt)
+
+    path = tmp_path / filename
+    path.write_bytes(buf.getvalue())
+    return str(path)
+
+
+@pytest.fixture
+def backend():
+    """Create a fresh LlamaCppBackend with side effects disabled."""
+    with patch.object(LlamaCppBackend, "_kill_orphaned_servers"):
+        with patch("atexit.register"):
+            return LlamaCppBackend()
+
+
+# =====================================================================
+# A. TestNativeContextLengthProperty -- the new property
+# =====================================================================
+
+
+class TestNativeContextLengthProperty:
+    """Tests the new `native_context_length` property on LlamaCppBackend."""
+
+    def test_none_on_fresh_backend(self, backend):
+        """Returns None when no model loaded."""
+        assert backend.native_context_length is None
+
+    def test_returns_raw_gguf_value(self, backend):
+        """Directly returns _context_length when set."""
+        backend._context_length = 131072
+        assert backend.native_context_length == 131072
+
+    def test_not_capped_by_effective(self, backend):
+        """native_context_length ignores _effective_context_length."""
+        backend._context_length = 131072
+        backend._effective_context_length = 32768
+        assert backend.native_context_length == 131072
+
+    def test_not_capped_by_max(self, backend):
+        """native_context_length ignores _max_context_length."""
+        backend._context_length = 131072
+        backend._max_context_length = 65536
+        assert backend.native_context_length == 131072
+
+    def test_none_after_unload(self, backend):
+        """After unload_model(), returns None."""
+        backend._context_length = 131072
+        assert backend.native_context_length == 131072
+        backend.unload_model()
+        assert backend.native_context_length is None
+
+    def test_after_gguf_parse(self, tmp_path, backend):
+        """Synthetic GGUF with context_length=16384 populates the property."""
+        path = make_gguf(
+            tmp_path,
+            "llama",
+            [("context_length", 16384, 4)],
+        )
+        backend._read_gguf_metadata(path)
+        assert backend.native_context_length == 16384
+
+    def test_resets_between_parses(self, tmp_path, backend):
+        """Second GGUF without context_length resets native to None."""
+        path_a = make_gguf(
+            tmp_path,
+            "llama",
+            [("context_length", 16384, 4)],
+            filename = "a.gguf",
+        )
+        backend._read_gguf_metadata(path_a)
+        assert backend.native_context_length == 16384
+
+        path_b = make_gguf(
+            tmp_path,
+            "gpt2",
+            [("block_count", 12, 4)],
+            filename = "b.gguf",
+        )
+        backend._read_gguf_metadata(path_b)
+        assert backend.native_context_length is None
+
+
+# =====================================================================
+# B. TestContextValueSeparation -- core invariant
+# =====================================================================
+
+
+class TestContextValueSeparation:
+    """_context_length is never overwritten by VRAM logic."""
+
+    def test_preserved_after_effective_set(self, backend):
+        """Setting _effective_context_length does not change _context_length."""
+        backend._context_length = 131072
+        backend._effective_context_length = 32768
+        assert backend._context_length == 131072
+        assert backend.native_context_length == 131072
+
+    def test_ordering_when_capped(self, backend):
+        """native >= max >= effective holds when VRAM-capped."""
+        backend._context_length = 131072
+        backend._max_context_length = 65536
+        backend._effective_context_length = 32768
+        assert backend.native_context_length >= backend.max_context_length
+        assert backend.max_context_length >= backend.context_length
+
+    def test_all_equal_when_uncapped(self, backend):
+        """All three equal when no VRAM constraint."""
+        backend._context_length = 8192
+        # No effective or max set -- properties fall back to _context_length
+        assert backend.native_context_length == 8192
+        assert backend.max_context_length == 8192
+        assert backend.context_length == 8192
+
+    def test_fit_context_does_not_modify(self, backend):
+        """_fit_context_to_vram() does not touch _context_length."""
+        backend._context_length = 131072
+        backend._n_layers = 32
+        backend._n_kv_heads = 8
+        backend._n_heads = 32
+        backend._embedding_length = 4096
+        original = backend._context_length
+
+        # Simulate a very small VRAM budget that forces capping
+        result = backend._fit_context_to_vram(
+            requested_ctx = 131072,
+            available_mib = 512,  # very small
+            model_size_bytes = 0,
+        )
+        # _fit_context_to_vram returns the capped value, not modifying _context_length
+        assert backend._context_length == original
+        assert backend.native_context_length == original
+        # The returned capped value should be <= requested
+        assert result <= 131072
+
+    def test_native_gt_context_when_capped(self, backend):
+        """native_context_length > context_length after VRAM capping."""
+        backend._context_length = 131072
+        backend._effective_context_length = 16384
+        assert backend.native_context_length > backend.context_length
+
+
+# =====================================================================
+# C. TestPydanticModels -- LoadResponse & InferenceStatusResponse
+# =====================================================================
+
+
+class TestPydanticModels:
+    """Tests native_context_length field on Pydantic models."""
+
+    def test_load_response_has_field(self):
+        """Field exists in LoadResponse.model_fields."""
+        assert "native_context_length" in LoadResponse.model_fields
+
+    def test_load_response_defaults_none(self):
+        """Omitting native_context_length defaults to None."""
+        resp = LoadResponse(
+            status = "loaded",
+            model = "test",
+            display_name = "Test",
+            inference = {},
+        )
+        assert resp.native_context_length is None
+
+    def test_load_response_accepts_int(self):
+        """native_context_length=131072 stores correctly."""
+        resp = LoadResponse(
+            status = "loaded",
+            model = "test",
+            display_name = "Test",
+            inference = {},
+            native_context_length = 131072,
+        )
+        assert resp.native_context_length == 131072
+
+    def test_load_response_json_null(self):
+        """None serializes to JSON null."""
+        resp = LoadResponse(
+            status = "loaded",
+            model = "test",
+            display_name = "Test",
+            inference = {},
+        )
+        data = json.loads(resp.model_dump_json())
+        assert data["native_context_length"] is None
+
+    def test_load_response_json_int(self):
+        """131072 serializes to JSON number."""
+        resp = LoadResponse(
+            status = "loaded",
+            model = "test",
+            display_name = "Test",
+            inference = {},
+            native_context_length = 131072,
+        )
+        data = json.loads(resp.model_dump_json())
+        assert data["native_context_length"] == 131072
+
+    def test_status_response_has_field(self):
+        """Field exists in InferenceStatusResponse.model_fields."""
+        assert "native_context_length" in InferenceStatusResponse.model_fields
+
+    def test_status_response_defaults_none(self):
+        """Omitting native_context_length defaults to None."""
+        resp = InferenceStatusResponse()
+        assert resp.native_context_length is None
+
+    def test_roundtrip_preserves_value(self):
+        """model_validate_json(model_dump_json()) round-trips."""
+        resp = LoadResponse(
+            status = "loaded",
+            model = "test",
+            display_name = "Test",
+            inference = {},
+            native_context_length = 131072,
+        )
+        roundtripped = LoadResponse.model_validate_json(resp.model_dump_json())
+        assert roundtripped.native_context_length == 131072
+
+
+# =====================================================================
+# D. TestRouteCompleteness -- source-level verification
+# =====================================================================
+
+
+class TestRouteCompleteness:
+    """All response construction sites in routes/inference.py include native_context_length."""
+
+    @pytest.fixture(autouse = True)
+    def _load_source(self):
+        """Read routes/inference.py source before each test."""
+        routes_path = Path(__file__).resolve().parent.parent / "routes" / "inference.py"
+        self._source = routes_path.read_text()
+
+    def _find_construction_blocks(self, class_name: str) -> list[str]:
+        """Extract all code blocks that construct a given response class."""
+        blocks = []
+        idx = 0
+        while True:
+            start = self._source.find(f"{class_name}(", idx)
+            if start == -1:
+                break
+            # Find matching closing paren (simple depth counter)
+            depth = 0
+            end = len(self._source)  # fallback so idx advances if parens never balance
+            for i, ch in enumerate(self._source[start:], start):
+                if ch == "(":
+                    depth += 1
+                elif ch == ")":
+                    depth -= 1
+                    if depth == 0:
+                        end = i + 1
+                        break
+            blocks.append(self._source[start:end])
+            idx = end
+        return blocks
+
+    def test_gguf_load_responses_have_field(self):
+        """Every GGUF LoadResponse (is_gguf = True) includes native_context_length."""
+        blocks = self._find_construction_blocks("LoadResponse")
+        gguf_blocks = [
+            b for b in blocks if "is_gguf = True" in b or "is_gguf=True" in b
+        ]
+        assert (
+            len(gguf_blocks) >= 2
+        ), f"Expected at least 2 GGUF LoadResponse blocks, found {len(gguf_blocks)}"
+        for i, block in enumerate(gguf_blocks):
+            assert (
+                "native_context_length" in block
+            ), f"GGUF LoadResponse block #{i} missing native_context_length:\n{block[:200]}"
+
+    def test_non_gguf_load_responses_omit_field(self):
+        """Non-GGUF LoadResponse blocks do not set native_context_length (defaults to None)."""
+        blocks = self._find_construction_blocks("LoadResponse")
+        non_gguf = [
+            b for b in blocks if "is_gguf = True" not in b and "is_gguf=True" not in b
+        ]
+        # Non-GGUF paths should not reference native_context_length
+        # (Pydantic defaults it to None, so not setting it is correct)
+        for block in non_gguf:
+            assert (
+                "native_context_length" not in block
+            ), f"Non-GGUF LoadResponse should not set native_context_length:\n{block[:200]}"
+
+    def test_status_path(self):
+        """InferenceStatusResponse construction with llama_backend has the field."""
+        blocks = self._find_construction_blocks("InferenceStatusResponse")
+        found = False
+        for block in blocks:
+            if "llama_backend" in block and "native_context_length" in block:
+                found = True
+                break
+        assert found, "No InferenceStatusResponse block with llama_backend has native_context_length"
+
+
+# =====================================================================
+# E. TestNativeContextEdgeCases
+# =====================================================================
+
+
+class TestNativeContextEdgeCases:
+    """Edge cases for native_context_length."""
+
+    def test_context_length_zero(self, tmp_path, backend):
+        """GGUF context_length=0 returns 0, not None."""
+        path = make_gguf(tmp_path, "llama", [("context_length", 0, 4)])
+        backend._read_gguf_metadata(path)
+        assert backend.native_context_length == 0
+
+    def test_context_length_uint32_max(self, tmp_path, backend):
+        """2^32 - 1 survives without truncation."""
+        val = 2**32 - 1
+        path = make_gguf(tmp_path, "llama", [("context_length", val, 4)])
+        backend._read_gguf_metadata(path)
+        assert backend.native_context_length == val
+
+    def test_context_length_uint64(self, tmp_path, backend):
+        """UINT64 type context_length parsed correctly."""
+        val = 2**33  # exceeds UINT32 range
+        path = make_gguf(tmp_path, "llama", [("context_length", val, 10)])
+        backend._read_gguf_metadata(path)
+        assert backend.native_context_length == val
+
+    def test_no_context_length_in_gguf(self, tmp_path, backend):
+        """GGUF without context_length key yields None."""
+        path = make_gguf(tmp_path, "llama", [("block_count", 32, 4)])
+        backend._read_gguf_metadata(path)
+        assert backend.native_context_length is None
+
+    def test_native_equals_context_when_uncapped(self, backend):
+        """Both equal when no VRAM cap applied."""
+        backend._context_length = 8192
+        assert backend.native_context_length == backend.context_length
+
+    def test_native_survives_parse_then_cap(self, tmp_path, backend):
+        """Parse then set effective cap: native unchanged."""
+        path = make_gguf(
+            tmp_path,
+            "llama",
+            [
+                ("context_length", 131072, 4),
+                ("block_count", 32, 4),
+                ("attention.head_count", 32, 4),
+                ("attention.head_count_kv", 8, 4),
+                ("embedding_length", 4096, 4),
+            ],
+        )
+        backend._read_gguf_metadata(path)
+        assert backend.native_context_length == 131072
+
+        # Simulate VRAM capping by setting effective and max
+        backend._effective_context_length = 16384
+        backend._max_context_length = 32768
+        assert backend.native_context_length == 131072
+
+
+# =====================================================================
+# F. TestCrossPlatform -- binary I/O and serialization
+# =====================================================================
+
+
+class TestCrossPlatform:
+    """Binary I/O and serialization correctness across platforms."""
+
+    def test_le_uint32_context_length(self, tmp_path, backend):
+        """Little-endian UINT32 parsed correctly."""
+        path = make_gguf(tmp_path, "llama", [("context_length", 16384, 4)])
+        backend._read_gguf_metadata(path)
+        assert backend.native_context_length == 16384
+
+    def test_le_uint64_context_length(self, tmp_path, backend):
+        """Little-endian UINT64 parsed correctly."""
+        path = make_gguf(tmp_path, "llama", [("context_length", 16384, 10)])
+        backend._read_gguf_metadata(path)
+        assert backend.native_context_length == 16384
+
+    def test_gguf_magic_le_byte_order(self, tmp_path):
+        """Magic 0x46554747 matches GGUF spec (little-endian 'GGUF')."""
+        path = tmp_path / "magic_check.gguf"
+        buf = io.BytesIO()
+        buf.write(struct.pack("<I", 0x46554747))
+        raw = buf.getvalue()
+        # 'G' = 0x47, 'G' = 0x47, 'U' = 0x55, 'F' = 0x46
+        assert raw == b"GGUF"
+
+    def test_json_serialization_deterministic(self):
+        """model_dump_json() is consistent across calls."""
+        resp = LoadResponse(
+            status = "loaded",
+            model = "test",
+            display_name = "Test",
+            inference = {},
+            native_context_length = 131072,
+        )
+        json1 = resp.model_dump_json()
+        json2 = resp.model_dump_json()
+        assert json1 == json2
+        assert '"native_context_length":131072' in json1
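The byte-level checks in TestCrossPlatform and the edge-case class can be illustrated outside the test suite. The sketch below is not the project's `make_gguf` helper or `_read_gguf_metadata`; `encode_kv` and `decode_kv` are hypothetical names, and the key name `llama.context_length` is an assumption about how the architecture prefix is joined to the key. It assumes a metadata entry is laid out as a uint64 length-prefixed key, a uint32 value-type ID, then the value, all little-endian, with type IDs 4 (UINT32) and 10 (UINT64) as used in the tests above.

```python
import struct

# Little-endian packing of this constant spells b"GGUF" (bytes 0x47 0x47 0x55 0x46).
GGUF_MAGIC = 0x46554747

# Value-type IDs taken from the tests above: 4 = UINT32, 10 = UINT64.
_FORMATS = {4: "<I", 10: "<Q"}


def encode_kv(key: str, value: int, vtype: int) -> bytes:
    """Hypothetical helper: serialize one integer metadata entry."""
    raw_key = key.encode("utf-8")
    return (
        struct.pack("<Q", len(raw_key))  # key length as uint64
        + raw_key
        + struct.pack("<I", vtype)  # value-type ID as uint32
        + struct.pack(_FORMATS[vtype], value)
    )


def decode_kv(buf: bytes) -> tuple[str, int]:
    """Hypothetical helper: parse the entry written by encode_kv."""
    (key_len,) = struct.unpack_from("<Q", buf, 0)
    key = buf[8 : 8 + key_len].decode("utf-8")
    (vtype,) = struct.unpack_from("<I", buf, 8 + key_len)
    (value,) = struct.unpack_from(_FORMATS[vtype], buf, 12 + key_len)
    return key, value
```

A value such as 2**33 only fits the UINT64 form; packing it with type 4 would raise `struct.error`, which is why the tests exercise both type IDs separately.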
--- a/studio/backend/tests/test_openai_tool_passthrough.py
+++ b/studio/backend/tests/test_openai_tool_passthrough.py
@@ -0,0 +1,465 @@
+# SPDX-License-Identifier: AGPL-3.0-only
+# Copyright 2026-present the Unsloth AI Inc. team. All rights reserved.
+
+"""
+Tests for the OpenAI /v1/chat/completions client-side tool pass-through.
+
+Covers:
+- ChatCompletionRequest accepts standard OpenAI `tools` / `tool_choice` / `stop`.
+- ChatMessage accepts role="tool" with `tool_call_id` and role="assistant"
+  with `content: None` + `tool_calls`.
+- ChatCompletionRequest carries unknown fields via `extra="allow"`.
+- anthropic_tool_choice_to_openai() covers all four Anthropic shapes.
+- _build_passthrough_payload() honors a caller-supplied tool_choice and
+  defaults to "auto" when unset.
+- _friendly_error() maps httpx transport errors to a "Lost connection"
+  message so passthrough failures are legible instead of bare 500s.
+
+No running server or GPU required.
+"""
+
+import os
+import sys
+
+_backend = os.path.join(os.path.dirname(__file__), "..")
+sys.path.insert(0, _backend)
+
+import httpx
+import pytest
+from pydantic import ValidationError
+
+from models.inference import (
+    ChatCompletionRequest,
+    ChatMessage,
+)
+from core.inference.anthropic_compat import (
+    anthropic_tool_choice_to_openai,
+)
+from routes.inference import _build_passthrough_payload, _friendly_error
+
+
+# =====================================================================
+# ChatMessage — tool role, tool_calls, optional content
+# =====================================================================
+
+
+class TestChatMessageToolRoles:
+    def test_tool_role_with_tool_call_id(self):
+        msg = ChatMessage(
+            role = "tool",
+            tool_call_id = "call_abc123",
+            content = '{"temperature": 72}',
+        )
+        assert msg.role == "tool"
+        assert msg.tool_call_id == "call_abc123"
+        assert msg.content == '{"temperature": 72}'
+
+    def test_tool_role_with_name(self):
+        msg = ChatMessage(
+            role = "tool",
+            tool_call_id = "call_abc123",
+            name = "get_weather",
+            content = '{"temperature": 72}',
+        )
+        assert msg.name == "get_weather"
+
+    def test_assistant_with_tool_calls_no_content(self):
+        msg = ChatMessage(
+            role = "assistant",
+            content = None,
+            tool_calls = [
+                {
+                    "id": "call_1",
+                    "type": "function",
+                    "function": {
+                        "name": "get_weather",
+                        "arguments": '{"city": "Paris"}',
+                    },
+                }
+            ],
+        )
+        assert msg.role == "assistant"
+        assert msg.content is None
+        assert msg.tool_calls is not None
+        assert len(msg.tool_calls) == 1
+        assert msg.tool_calls[0]["function"]["name"] == "get_weather"
+
+    def test_assistant_with_content_and_tool_calls(self):
+        msg = ChatMessage(
+            role = "assistant",
+            content = "Let me check the weather.",
+            tool_calls = [
+                {
+                    "id": "call_1",
+                    "type": "function",
+                    "function": {"name": "get_weather", "arguments": "{}"},
+                }
+            ],
+        )
+        assert msg.content == "Let me check the weather."
+        assert msg.tool_calls[0]["id"] == "call_1"
+
+    def test_plain_user_message_still_works(self):
+        msg = ChatMessage(role = "user", content = "Hello")
+        assert msg.role == "user"
+        assert msg.tool_call_id is None
+        assert msg.tool_calls is None
+        assert msg.name is None
+
+    def test_invalid_role_rejected(self):
+        with pytest.raises(ValidationError):
+            ChatMessage(role = "function", content = "x")
+
+    def test_content_absent_on_assistant_tool_call_defaults_to_none(self):
+        # Assistant messages that carry only tool_calls are the one
+        # documented case where `content=None` is permitted.
+        msg = ChatMessage(
+            role = "assistant",
+            tool_calls = [
+                {
+                    "id": "call_1",
+                    "type": "function",
+                    "function": {"name": "f", "arguments": "{}"},
+                }
+            ],
+        )
+        assert msg.content is None
+
+    def test_tool_role_missing_tool_call_id_rejected(self):
+        # Per OpenAI spec, role="tool" messages must carry tool_call_id so
+        # upstream backends can associate the result with its prior call.
+        # Pin the boundary-level rejection so a malformed tool-result
+        # message never reaches the passthrough path.
+        with pytest.raises(ValidationError) as exc_info:
+            ChatMessage(role = "tool", content = '{"temperature": 72}')
+        assert "tool_call_id" in str(exc_info.value)
+
+    def test_tool_role_empty_tool_call_id_rejected(self):
+        with pytest.raises(ValidationError):
+            ChatMessage(
+                role = "tool",
+                tool_call_id = "",
+                content = '{"temperature": 72}',
+            )
+
+    # ── Role-aware content requirements ────────────────────────────
+
+    def test_user_empty_content_rejected(self):
+        with pytest.raises(ValidationError):
+            ChatMessage(role = "user", content = "")
+
+    def test_system_empty_content_rejected(self):
+        with pytest.raises(ValidationError):
+            ChatMessage(role = "system", content = "")
+
+    def test_user_empty_list_content_rejected(self):
+        with pytest.raises(ValidationError):
+            ChatMessage(role = "user", content = [])
+
+    def test_tool_empty_content_rejected(self):
+        with pytest.raises(ValidationError) as exc_info:
+            ChatMessage(role = "tool", tool_call_id = "call_1", content = "")
+        assert "content" in str(exc_info.value)
+
+    def test_assistant_without_content_or_tool_calls_rejected(self):
+        with pytest.raises(ValidationError) as exc_info:
+            ChatMessage(role = "assistant")
+        assert "content" in str(exc_info.value) or "tool_calls" in str(exc_info.value)
+
+    # ── Role-constrained tool-call metadata ────────────────────────
+
+    def test_tool_calls_on_user_rejected(self):
+        with pytest.raises(ValidationError) as exc_info:
+            ChatMessage(
+                role = "user",
+                content = "Hi",
+                tool_calls = [
+                    {
+                        "id": "c1",
+                        "type": "function",
+                        "function": {"name": "f", "arguments": "{}"},
+                    }
+                ],
+            )
+        assert "tool_calls" in str(exc_info.value)
+
+    def test_tool_call_id_on_user_rejected(self):
+        with pytest.raises(ValidationError) as exc_info:
+            ChatMessage(role = "user", content = "Hi", tool_call_id = "call_1")
+        assert "tool_call_id" in str(exc_info.value)
+
+    def test_name_on_user_rejected(self):
+        with pytest.raises(ValidationError) as exc_info:
+            ChatMessage(role = "user", content = "Hi", name = "get_weather")
+        assert "name" in str(exc_info.value)
+
+
+# =====================================================================
+# ChatCompletionRequest — standard OpenAI tool fields
+# =====================================================================
+
+
+class TestChatCompletionRequestToolFields:
+    def _make(self, **kwargs):
+        base = {"messages": [{"role": "user", "content": "Hi"}]}
+        base.update(kwargs)
+        return ChatCompletionRequest(**base)
+
+    def test_tools_parses(self):
+        req = self._make(
+            tools = [
+                {
+                    "type": "function",
+                    "function": {
+                        "name": "get_weather",
+                        "description": "Return the weather in a city",
+                        "parameters": {
+                            "type": "object",
+                            "properties": {"city": {"type": "string"}},
+                            "required": ["city"],
+                        },
+                    },
+                }
+            ],
+        )
+        assert req.tools is not None
+        assert len(req.tools) == 1
+        assert req.tools[0]["function"]["name"] == "get_weather"
+
+    def test_tool_choice_string_auto(self):
+        assert self._make(tool_choice = "auto").tool_choice == "auto"
+
+    def test_tool_choice_string_required(self):
+        assert self._make(tool_choice = "required").tool_choice == "required"
+
+    def test_tool_choice_string_none(self):
+        assert self._make(tool_choice = "none").tool_choice == "none"
+
+    def test_tool_choice_named_function(self):
+        tc = {"type": "function", "function": {"name": "get_weather"}}
+        assert self._make(tool_choice = tc).tool_choice == tc
+
+    def test_stop_string(self):
+        assert self._make(stop = "\nUser:").stop == "\nUser:"
+
+    def test_stop_list(self):
+        assert self._make(stop = ["\nUser:", "\nAssistant:"]).stop == [
+            "\nUser:",
+            "\nAssistant:",
+        ]
+
+    def test_tools_default_none(self):
+        req = self._make()
+        assert req.tools is None
+        assert req.tool_choice is None
+        assert req.stop is None
+
+    def test_extra_fields_accepted(self):
+        # `frequency_penalty`, `seed`, `response_format` are not yet
+        # explicitly declared but must survive Pydantic parsing now that
+        # extra="allow" is set.
+        req = self._make(
+            frequency_penalty = 0.5,
+            seed = 42,
+            response_format = {"type": "json_object"},
+        )
+        # Extras land in model_extra
+        assert req.model_extra is not None
+        assert req.model_extra.get("frequency_penalty") == 0.5
+        assert req.model_extra.get("seed") == 42
+        assert req.model_extra.get("response_format") == {"type": "json_object"}
+
+    def test_unsloth_extensions_still_work(self):
+        req = self._make(
+            enable_tools = True,
+            enabled_tools = ["web_search", "python"],
+            session_id = "abc",
+        )
+        assert req.enable_tools is True
+        assert req.enabled_tools == ["web_search", "python"]
+        assert req.session_id == "abc"
+
+    def test_stream_defaults_false_matching_openai_spec(self):
+        # OpenAI's /v1/chat/completions spec defaults `stream` to false.
+        # Studio previously defaulted to true, which broke naive curl
+        # clients that omit `stream` (they expect a JSON blob, got SSE).
+        # Pin the corrected default so it can't silently regress.
+        req = self._make()
+        assert req.stream is False
+
+    def test_multiturn_tool_loop_messages(self):
+        req = ChatCompletionRequest(
+            messages = [
+                {"role": "user", "content": "What's the weather in Paris?"},
+                {
+                    "role": "assistant",
+                    "content": None,
+                    "tool_calls": [
+                        {
+                            "id": "call_1",
+                            "type": "function",
+                            "function": {
+                                "name": "get_weather",
+                                "arguments": '{"city": "Paris"}',
+                            },
+                        }
+                    ],
+                },
+                {
+                    "role": "tool",
+                    "tool_call_id": "call_1",
+                    "content": '{"temperature": 14, "unit": "celsius"}',
+                },
+            ],
+            tools = [
+                {
+                    "type": "function",
+                    "function": {
+                        "name": "get_weather",
+                        "parameters": {"type": "object"},
+                    },
+                }
+            ],
+        )
+        assert len(req.messages) == 3
+        assert req.messages[1].role == "assistant"
+        assert req.messages[1].content is None
+        assert req.messages[1].tool_calls[0]["id"] == "call_1"
+        assert req.messages[2].role == "tool"
+        assert req.messages[2].tool_call_id == "call_1"
+
+
+# =====================================================================
+# anthropic_tool_choice_to_openai — pure translation helper
+# =====================================================================
+
+
+class TestAnthropicToolChoiceToOpenAI:
+    def test_auto(self):
+        assert anthropic_tool_choice_to_openai({"type": "auto"}) == "auto"
+
+    def test_any_becomes_required(self):
+        assert anthropic_tool_choice_to_openai({"type": "any"}) == "required"
+
+    def test_none(self):
+        assert anthropic_tool_choice_to_openai({"type": "none"}) == "none"
+
+    def test_tool_named(self):
+        result = anthropic_tool_choice_to_openai(
+            {"type": "tool", "name": "get_weather"}
+        )
+        assert result == {
+            "type": "function",
+            "function": {"name": "get_weather"},
+        }
+
+    def test_tool_missing_name_returns_none(self):
+        assert anthropic_tool_choice_to_openai({"type": "tool"}) is None
+
+    def test_none_input_returns_none(self):
+        assert anthropic_tool_choice_to_openai(None) is None
+
+    def test_unrecognized_shape_returns_none(self):
+        assert anthropic_tool_choice_to_openai({"type": "wibble"}) is None
+        assert anthropic_tool_choice_to_openai("auto") is None
+        assert anthropic_tool_choice_to_openai(42) is None
+
+
+# =====================================================================
+# _build_passthrough_payload — tool_choice propagation
+# =====================================================================
+
+
+class TestBuildPassthroughPayloadToolChoice:
+    def _args(self):
+        return dict(
+            openai_messages = [{"role": "user", "content": "Hi"}],
+            openai_tools = [
+                {
+                    "type": "function",
+                    "function": {"name": "f", "parameters": {"type": "object"}},
+                }
+            ],
+            temperature = 0.6,
+            top_p = 0.95,
+            top_k = 20,
+            max_tokens = 128,
+            stream = False,
+        )
+
+    def test_default_tool_choice_is_auto(self):
+        body = _build_passthrough_payload(**self._args())
+        assert body["tool_choice"] == "auto"
+
+    def test_override_tool_choice_required(self):
+        body = _build_passthrough_payload(**self._args(), tool_choice = "required")
+        assert body["tool_choice"] == "required"
+
+    def test_override_tool_choice_none(self):
+        body = _build_passthrough_payload(**self._args(), tool_choice = "none")
+        assert body["tool_choice"] == "none"
+
+    def test_override_tool_choice_named_function(self):
+        tc = {"type": "function", "function": {"name": "f"}}
+        body = _build_passthrough_payload(**self._args(), tool_choice = tc)
+        assert body["tool_choice"] == tc
+
+    def test_stream_adds_include_usage(self):
+        args = self._args()
+        args["stream"] = True
+        body = _build_passthrough_payload(**args)
+        assert body.get("stream_options") == {"include_usage": True}
+
+    def test_repetition_penalty_renamed(self):
+        body = _build_passthrough_payload(**self._args(), repetition_penalty = 1.1)
+        assert body.get("repeat_penalty") == 1.1
+        assert "repetition_penalty" not in body
+
+
+# =====================================================================
+# _friendly_error — httpx transport failures
+# =====================================================================
+
+
+class TestFriendlyErrorHttpx:
+    """The async pass-through helpers talk to llama-server via httpx.
+    When the subprocess is down, httpx raises RequestError subclasses
+    whose string form (``"All connection attempts failed"``, ``"[Errno 111]
+    Connection refused"``, ...) does NOT contain the substring
+    ``"Lost connection to llama-server"`` the sync path uses, so the
+    previous substring-only `_friendly_error` returned a useless generic
+    message. These tests pin the new isinstance-based mapping.
+    """
+
+    def _req(self):
+        return httpx.Request("POST", "http://127.0.0.1:65535/v1/chat/completions")
+
+    def test_connect_error_mapped(self):
+        exc = httpx.ConnectError("All connection attempts failed", request = self._req())
+        assert "Lost connection" in _friendly_error(exc)
+
+    def test_read_error_mapped(self):
+        exc = httpx.ReadError("EOF", request = self._req())
+        assert "Lost connection" in _friendly_error(exc)
+
+    def test_remote_protocol_error_mapped(self):
+        exc = httpx.RemoteProtocolError("peer closed", request = self._req())
+        assert "Lost connection" in _friendly_error(exc)
+
+    def test_read_timeout_mapped(self):
+        exc = httpx.ReadTimeout("timed out", request = self._req())
+        assert "Lost connection" in _friendly_error(exc)
+
+    def test_non_httpx_unchanged(self):
+        # Non-httpx exceptions still fall through to the existing substring
+        # heuristics — a context-size message must still produce the
+        # "Message too long" path.
+        ctx_msg = (
+            "request (4096 tokens) exceeds the available context size (2048 tokens)"
+        )
+        assert "Message too long" in _friendly_error(ValueError(ctx_msg))
+
+    def test_generic_exception_returns_generic_message(self):
+        assert (
+            _friendly_error(RuntimeError("unrelated")) == "An internal error occurred"
+        )
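The behaviour pinned by TestAnthropicToolChoiceToOpenAI can be summarised as a small pure function. The sketch below is a hypothetical stand-in written only from the assertions in that class; it is not the real `anthropic_tool_choice_to_openai` in `core.inference.anthropic_compat`, whose implementation is not shown here, and the name `tool_choice_sketch` is invented for illustration.

```python
from typing import Any, Optional, Union


def tool_choice_sketch(tool_choice: Any) -> Optional[Union[str, dict]]:
    """Hypothetical translation of an Anthropic tool_choice to the OpenAI shape."""
    # Anything that is not a dict (None, a bare string, a number) is unrecognized.
    if not isinstance(tool_choice, dict):
        return None
    kind = tool_choice.get("type")
    if kind == "auto":
        return "auto"
    if kind == "any":
        # Anthropic "any" means "must call some tool", i.e. OpenAI "required".
        return "required"
    if kind == "none":
        return "none"
    if kind == "tool" and tool_choice.get("name"):
        # A named tool becomes the nested Chat Completions function selector.
        return {"type": "function", "function": {"name": tool_choice["name"]}}
    # Unknown types, or "tool" without a name, yield no preference.
    return None
```

Returning `None` for unrecognized input leaves the caller free to apply its own default, which matches the `"auto"` fallback asserted for `_build_passthrough_payload` in the tests above.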