unsloth/studio/backend/models/models.py
Daniel Han f0d03655e8
Studio: add folder browser modal for Custom Folders (#5035)
* Studio: add folder browser modal for Custom Folders

The Custom Folders row in the model picker currently only accepts a
typed path. On a remote-served Studio (Colab, shared workstation) that
means the user has to guess or paste the exact server-side absolute
path. A native browser folder picker can't solve this: HTML
`<input type="file" webkitdirectory>` hides the absolute path for
security, and the File System Access API (Chrome/Edge only) returns
handles rather than strings, neither of which the server can act on.

This PR adds a small in-app directory browser that lists paths on the
server and hands the chosen string back to the existing
`POST /api/models/scan-folders` flow.

## Backend

* New endpoint `GET /api/models/browse-folders`:
  * `path` query param (expands `~`, accepts relative or absolute; empty
    defaults to the user's home directory).
  * `show_hidden` boolean to include dotfiles/dotdirs.
  * Returns `{current, parent, entries[], suggestions[]}`. `parent` is
    null at the filesystem root.
  * Immediate subdirectories only (no recursion); files are never
    returned.
  * `entries[].has_models` is a cheap hint: the directory looks like it
    holds models if it is named `models--*` (HF hub cache layout) or
    one of the first 64 children is a .gguf/.safetensors/config.json/
    adapter_config.json or another `models--*` subfolder.
  * Sort order: model-bearing dirs, then plain, then hidden; case-
    insensitive alphabetical within each bucket.
  * Suggestions auto-populate from HOME, the HF cache root, and any
    already-registered scan folders, deduplicated.
  * Error surface: 404 for missing path, 400 for non-directory, 403 on
    permission errors. Auth-required like the other models routes.

* New Pydantic schemas `BrowseEntry` and `BrowseFoldersResponse` in
  `studio/backend/models/models.py`.
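
The sort order listed for the endpoint can be pictured with a small sort key. This is only a sketch: `Entry` and `browse_sort_key` are hypothetical names, and putting hidden directories in the last bucket even when they hold models is an assumption, not something this message confirms.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """Hypothetical stand-in for one row of the folder listing."""

    name: str
    has_models: bool = False
    hidden: bool = False


def browse_sort_key(entry: Entry) -> tuple:
    """Bucket 0 = model-bearing, 1 = plain, 2 = hidden (assumed precedence)."""
    if entry.hidden:
        bucket = 2
    elif entry.has_models:
        bucket = 0
    else:
        bucket = 1
    # Case-insensitive alphabetical order inside each bucket.
    return (bucket, entry.name.lower())


if __name__ == "__main__":
    rows = [
        Entry("zeta"),
        Entry(".cache", hidden = True),
        Entry("Alpha"),
        Entry("models--org--m", has_models = True),
        Entry("beta"),
    ]
    print([e.name for e in sorted(rows, key = browse_sort_key)])
    # -> ['models--org--m', 'Alpha', 'beta', 'zeta', '.cache']
```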

## Frontend

* New `FolderBrowser` component
  (`studio/frontend/src/components/assistant-ui/model-selector/folder-browser.tsx`)
  using the existing `Dialog` primitive. Features:
  * Clickable breadcrumb with a `..` row for parent navigation.
  * Quick-pick chips for the server-provided suggestions.
  * `Show hidden` checkbox.
  * In-flight fetch cancellation via AbortController so rapid
    navigation doesn't flash stale results.
  * Badges model-bearing directories inline.

* `chat-api.ts` gains `browseFolders(path?, showHidden?)` and matching
  types.

* `pickers.tsx` adds a folder-magnifier icon next to the existing `Add`
  button. Opening the browser seeds it with whatever the user has
  already typed; confirming fills the text input, leaving the existing
  validation and save flow unchanged.

## What it does NOT change

* The existing text-input flow still works; the browser is additive.
* No new permissions or escalation; the endpoint reads only directories
  the server process is already allowed to read.
* No model scanning or filesystem mutation happens from the browser
  itself -- it just returns basenames for render.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Studio: cap folder-browser entries and expose truncated flag

Pointing the folder browser at a huge directory (``/usr/lib``,
``/proc``, or a synthetic tree with thousands of subfolders) previously
walked the whole listing and stat-probed every child via
``_looks_like_model_dir``. That is both a DoS shape for the server
process and a large-payload surprise for the client.

Introduce a hard cap of 2000 subdirectory entries and a
``truncated: bool`` field on the response. The frontend renders a small
hint below the list when it fires, prompting the user to narrow the
path. Below-cap directories are unchanged.

Verified end-to-end against the live backend with a synthetic tree of
2050 directories: response lands at 2000 entries, ``truncated=true``,
listing finishes in sub-second time (versus tens of seconds if we were
stat-storming).
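
A minimal sketch of a capped listing with a truncation flag, assuming a plain `os.scandir` walk. `list_subdirs_capped` is a hypothetical name, and the sketch caps the number of returned directories, as this commit describes; it is not the route's actual code.

```python
import os


def list_subdirs_capped(path: str, cap: int = 2000):
    """Return (sorted subdirectory names, truncated flag) for `path`."""
    names = []
    truncated = False
    with os.scandir(path) as it:
        for entry in it:
            # Files are never returned, only directories.
            if not entry.is_dir():
                continue
            if len(names) >= cap:
                # One more directory exists beyond the cap: stop early.
                truncated = True
                break
            names.append(entry.name)
    return sorted(names, key = str.lower), truncated
```

Stopping at the first directory past the cap keeps `truncated` false for a directory that holds exactly `cap` subfolders.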

* Studio: suggest LM Studio / Ollama dirs + 2-level model probe

Three improvements to the folder-browser, driven by actually dropping
an LM Studio-style install (publisher/model/weights.gguf) into the
sandbox and walking the UX:

## 1. Quick-pick chips for other local-LLM tools

`well_known_model_dirs()` (new) returns paths commonly used by
adjacent tools. Only paths that exist are returned so the UI never
shows dead chips.

* LM Studio current + legacy roots + user-configured
  `downloadsFolder` from its `settings.json` (reuses the existing
  `lmstudio_model_dirs()` helper).
* Ollama: `$OLLAMA_MODELS` env override, then `~/.ollama/models`,
  `/usr/share/ollama/.ollama/models`, and `/var/lib/ollama/.ollama/models`
  (the systemd-service install path surfaced in the upstream "where is
  everything?" issue).
* Generic user-choice locations: `~/models`, `~/Models`.

Dedup is stable across all sources.
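
The "only existing paths, stable dedup" behaviour could look roughly like the sketch below. `existing_unique_dirs` and `ollama_candidates` are hypothetical helpers written for illustration; the real `well_known_model_dirs()` may be structured differently and also covers the LM Studio and generic locations.

```python
import os
from pathlib import Path


def ollama_candidates(env = None):
    """Candidate Ollama roots in priority order (env override first)."""
    env = os.environ if env is None else env
    return [
        env.get("OLLAMA_MODELS"),
        "~/.ollama/models",
        "/usr/share/ollama/.ollama/models",
        "/var/lib/ollama/.ollama/models",
    ]


def existing_unique_dirs(candidates):
    """Keep candidates that exist as directories, first occurrence wins."""
    seen = set()
    result = []
    for raw in candidates:
        if not raw:
            continue  # unset env override
        path = Path(raw).expanduser()
        if not path.is_dir():
            continue  # never surface a dead chip
        key = str(path.resolve())
        if key not in seen:
            seen.add(key)
            result.append(key)
    return result
```

Calling `existing_unique_dirs(ollama_candidates())` on a machine without Ollama installed simply yields an empty list.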

## 2. Two-level model-bearing probe

LM Studio and Ollama both use `root/publisher/model/weights.gguf`.
The previous `has_models` heuristic only probed one level, so the
publisher dir (whose immediate children are model dirs, not weight
files) was always marked as non-model-bearing. Pulled the direct-
signal logic into `_has_direct_model_signal` and added a grandchild
probe so the classic layout is now recognised.

Still O(PROBE^2) worst-case, still returns immediately for
`models--*` names (HF cache layout) and for any direct weight file.
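
A sketch of the two-level probe under stated assumptions: `PROBE_LIMIT` mirrors the "first 64 children" wording from the original heuristic, and the function names and exact checks are illustrative rather than copied from the route module.

```python
import os
from itertools import islice

PROBE_LIMIT = 64  # assumed to match the "first 64 children" rule
MODEL_SUFFIXES = (".gguf", ".safetensors")
MODEL_MARKERS = {"config.json", "adapter_config.json"}


def has_direct_model_signal(path: str) -> bool:
    """True if one of the first PROBE_LIMIT children looks like a model."""
    try:
        with os.scandir(path) as it:
            for child in islice(it, PROBE_LIMIT):
                if child.is_file():
                    name = child.name
                    if name.lower().endswith(MODEL_SUFFIXES) or name in MODEL_MARKERS:
                        return True
                elif child.is_dir() and child.name.startswith("models--"):
                    return True
    except OSError:
        return False
    return False


def looks_like_model_dir(path: str) -> bool:
    """Name check, then direct children, then one level of grandchildren."""
    if os.path.basename(os.path.normpath(path)).startswith("models--"):
        return True  # HF cache layout: no filesystem access needed
    if has_direct_model_signal(path):
        return True
    try:
        with os.scandir(path) as it:
            for child in islice(it, PROBE_LIMIT):
                # Grandchild probe: publisher/model/weights.gguf
                if child.is_dir() and has_direct_model_signal(child.path):
                    return True
    except OSError:
        return False
    return False
```

Each level inspects at most `PROBE_LIMIT` children, which is where the quadratic worst case comes from.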

## 3. model_files_here hint on response body

A leaf model dir (just GGUFs, no subdirs) previously rendered as
`(empty directory)` in the modal, confusing users into thinking the
folder wasn't scannable. Added a `model_files_here` count on the
response (capped at 200) and a small hint row in the modal: `N model
files in this folder. Click "Use this folder" to scan it.`

## Verification

Simulated an LM Studio install by downloading the real 84 MB
`unsloth/SmolLM2-135M-Instruct-Q2_K.gguf` into
`~/.lmstudio/models/unsloth/SmolLM2-135M-Instruct-GGUF/`. Confirmed
end-to-end:

* Home listing suggests `~/.lmstudio/models` as a chip.
* Browsing `~/.lmstudio/models` flags `unsloth` (publisher) as
  `has_models=true` via the 2-level probe.
* Browsing the publisher flags `SmolLM2-135M-Instruct-GGUF` (model
  dir) as `has_models=true`.
* Browsing the model dir returns empty entries but
  `model_files_here=1`, and the frontend renders a hint telling the
  user it is a valid target.

* Studio: one-click scan-folder add + prominent remove + plain search icon

Three small Custom Folders UX fixes after real-use walkthrough:

* **One-click add from the folder browser**. Confirming `Use this
  folder` now submits the path directly to
  `POST /api/models/scan-folders` instead of just populating the text
  input. `handleAddFolder` takes an optional explicit path so the
  submit lands in the same tick as `setFolderInput`, avoiding a
  state-flush race. The typed-path + `Add` button flow is unchanged.

* **Prominent remove X on scan folders**. The per-folder delete
  button was `text-muted-foreground/40` and hidden entirely on
  desktop until hovered (`md:opacity-0 md:group-hover:opacity-100`).
  Dropped the hover-only cloak, bumped color to `text-foreground/70`,
  added a red hover/focus background, and sized the icon up from
  `size-2.5` to `size-3`. Always visible on every viewport.

* **Plain search icon for the Browse button**. `FolderSearchIcon`
  replaced with `Search01Icon` so it reads as a simple "find a
  folder" action alongside the existing `Add01Icon`.

* Studio: align Custom Folders + and X buttons on the same right edge

The Custom Folders header used `px-2.5` with a `p-0.5` icon button,
while each folder row used `px-3` with a `p-1` button. That put the
X icon 4px further from the right edge than the +. Normalised both
rows to `px-2.5` with `p-1` so the two icons share a column.

* Studio: empty-state button opens the folder browser directly

The first-run empty state for Custom Folders was a text link reading
"+ Add a folder to scan for local models" whose click toggled the
text input. That's the wrong default: a user hitting the empty state
usually doesn't know what absolute path to type, which is exactly
what the folder browser is for.

* Reword to "Browse for a models folder" with a search-icon
  affordance so the label matches what the click does.
* Click opens the folder browser modal directly. The typed-path +
  Add button flow is still available via the + icon in the
  section header, so users who know their path keep that option.
* Slightly bump the muted foreground opacity (70 -> hover:foreground)
  so the button reads as a primary empty-state action rather than a
  throwaway hint.

* Studio: Custom Folders header gets a dedicated search + add button pair

The Custom Folders section header had a single toggle button that
flipped between + and X. That put the folder-browser entry point
behind the separate empty-state link. Cleaner layout: two buttons in
the header, search first, then add.

* Search icon (left) opens the folder browser modal directly.
* Plus icon (right) toggles the text-path input (unchanged).
* The first-run empty-state link is removed -- the two header icons
  cover both flows on every state.

Both buttons share the same padding / icon size so they line up with
each other and with the per-folder remove X.

* Studio: sandbox folder browser + bound caps + UX recoveries

PR review fixes for the Custom Folders folder browser. Closes the
high-severity CodeQL path-traversal alert and addresses the codex /
gemini P2 findings.

Backend (studio/backend/routes/models.py):

* New _build_browse_allowlist + _is_path_inside_allowlist sandbox.
  browse_folders now refuses any target that doesn't resolve under
  HOME, HF cache, Studio dirs, registered scan folders, or the
  well-known third-party model dirs. realpath() is used so symlink
  traversal cannot escape the sandbox. Also gates the parent crumb
  so the up-row hides instead of 403'ing.
* _BROWSE_ENTRY_CAP now bounds *visited* iterdir entries, not
  *appended* entries. Dirs full of files (or hidden subdirs when
  show_hidden is False) used to defeat the cap.
* _count_model_files gets the same visited-count fix.
* PermissionError no longer swallowed silently inside the
  enumeration / counter loops -- now logged at debug.
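
The containment check behind the sandbox can be sketched as follows. This is an illustration only: the function name matches the helper mentioned above, but the body is an assumption built on `os.path.realpath` and `os.path.commonpath`, not the actual implementation.

```python
import os


def is_path_inside_allowlist(candidate: str, allowed_roots) -> bool:
    """True if `candidate` resolves to a path under one of `allowed_roots`."""
    # realpath() follows symlinks and collapses "..", so a link that
    # points outside an allowed root is judged by where it really lands.
    real = os.path.realpath(candidate)
    for root in allowed_roots:
        real_root = os.path.realpath(root)
        try:
            if os.path.commonpath([real, real_root]) == real_root:
                return True
        except ValueError:
            # Raised e.g. for paths on different Windows drives.
            continue
    return False
```

Comparing with `commonpath` rather than a string prefix avoids treating `/home/user2` as being inside `/home/user`.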

Frontend (folder-browser.tsx, pickers.tsx, chat-api.ts):

* splitBreadcrumb stops mangling literal backslashes inside POSIX
  filenames; only Windows-style absolute paths trigger separator
  normalization. The Windows drive crumb value is now C:/ (drive
  root) instead of C: (drive-relative CWD-on-C).
* browseFolders accepts and forwards an AbortSignal so cancelled
  navigations actually cancel the in-flight backend enumeration.
* On initial-path fetch error, FolderBrowser now falls back to HOME
  instead of leaving the modal as an empty dead end.
* When the auto-add path (one-click "Use this folder") fails, the
  failure now surfaces via toast in addition to the inline
  paragraph (which is hidden when the typed-input panel is closed).

* Studio: rebuild browse target from trusted root for CodeQL clean dataflow

CodeQL's py/path-injection rule kept flagging the post-validation
filesystem operations because the sandbox check lived inside a
helper function (_is_path_inside_allowlist) and CodeQL only does
intra-procedural taint tracking by default. The user-derived
``target`` was still flowing into ``target.exists`` /
``target.is_dir`` / ``target.iterdir``.

The fix: after resolving the user-supplied ``candidate_path``,
locate the matching trusted root from the allowlist and rebuild
``target`` by appending each individually-validated segment to
that trusted root. Each segment is rejected if it isn't a single
safe path component (no separators, no ``..``, no empty/dot).
The downstream filesystem ops now operate on a Path constructed
entirely from ``allowed_roots`` (trusted) plus those validated
segments, so CodeQL's dataflow no longer sees a tainted source.

Behavior is unchanged for all valid inputs -- only the
construction of ``target`` is restructured. Live + unit tests
all pass (58 selected, 7 deselected for Playwright env).
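
The rebuild step could be pictured like this. It is a sketch under assumptions: `is_safe_segment` and `rebuild_from_trusted_root` are hypothetical names, and the real route may validate segments differently.

```python
import os
from pathlib import Path


def is_safe_segment(segment: str) -> bool:
    """A single path component: no separators, no '..', not empty or '.'."""
    if segment in ("", ".", ".."):
        return False
    for sep in (os.sep, os.altsep):
        if sep and sep in segment:
            return False
    return "\x00" not in segment


def rebuild_from_trusted_root(candidate: str, allowed_roots):
    """Return a Path built from a trusted root plus validated segments.

    Returns None when the candidate is outside every allowed root.
    """
    resolved = Path(os.path.realpath(candidate))
    for root in allowed_roots:
        trusted = Path(os.path.realpath(root))
        try:
            relative = resolved.relative_to(trusted)
        except ValueError:
            continue  # not under this root, try the next one
        target = trusted
        for segment in relative.parts:
            if not is_safe_segment(segment):
                return None
            target = target / segment
        return target
    return None
```

The returned path starts from the allowlist entry and only ever gains segments that passed `is_safe_segment`, which is the shape the paragraph above describes.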

* Studio: walk browse paths from trusted roots for CodeQL

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Ubuntu <ubuntu@h100-8-cheapest.us-east5-a.c.unsloth.internal>
2026-04-15 08:04:33 -07:00


# SPDX-License-Identifier: AGPL-3.0-only
# Copyright 2026-present the Unsloth AI Inc. team. All rights reserved. See /studio/LICENSE.AGPL-3.0
"""
Pydantic schemas for Model Management API
"""

from pydantic import BaseModel, Field
from typing import Optional, List, Dict, Any, Literal


ModelType = Literal["text", "vision", "audio", "embeddings"]


class CheckpointInfo(BaseModel):
    """Information about a discovered checkpoint directory."""

    display_name: str = Field(
        ..., description = "User-friendly checkpoint name (folder name)"
    )
    path: str = Field(..., description = "Full path to the checkpoint directory")
    loss: Optional[float] = Field(None, description = "Training loss at this checkpoint")


class ModelCheckpoints(BaseModel):
    """A training run and its associated checkpoints."""

    name: str = Field(..., description = "Training run folder name")
    checkpoints: List[CheckpointInfo] = Field(
        default_factory = list,
        description = "List of checkpoints for this training run (final + intermediate)",
    )
    base_model: Optional[str] = Field(
        None,
        description = "Base model name from adapter_config.json or config.json",
    )
    peft_type: Optional[str] = Field(
        None,
        description = "PEFT type (e.g. LORA) if adapter training, None for full fine-tune",
    )
    lora_rank: Optional[int] = Field(
        None,
        description = "LoRA rank (r) if applicable",
    )
    is_quantized: bool = Field(
        False,
        description = "Whether the model uses BNB quantization (e.g. bnb-4bit)",
    )


class CheckpointListResponse(BaseModel):
    """Response for listing available checkpoints in an outputs directory."""

    outputs_dir: str = Field(..., description = "Directory that was scanned")
    models: List[ModelCheckpoints] = Field(
        default_factory = list,
        description = "List of training runs with their checkpoints",
    )


class ModelDetails(BaseModel):
    """Detailed model configuration and metadata - can be used for both list and detail views"""

    id: str = Field(..., description = "Model identifier")
    model_name: Optional[str] = Field(
        None, description = "Model identifier (alias for id, for backward compatibility)"
    )
    name: Optional[str] = Field(None, description = "Display name for the model")
    config: Optional[Dict[str, Any]] = Field(
        None, description = "Model configuration dictionary"
    )
    is_vision: bool = Field(False, description = "Whether model is a vision model")
    is_embedding: bool = Field(
        False, description = "Whether model is an embedding/sentence-transformer model"
    )
    is_lora: bool = Field(False, description = "Whether model is a LoRA adapter")
    is_gguf: bool = Field(
        False, description = "Whether model is a GGUF model (llama.cpp format)"
    )
    is_audio: bool = Field(False, description = "Whether model is a TTS audio model")
    audio_type: Optional[str] = Field(
        None, description = "Audio codec type: snac, csm, bicodec, dac"
    )
    has_audio_input: bool = Field(
        False, description = "Whether model accepts audio input (ASR)"
    )
    model_type: Optional[ModelType] = Field(
        None, description = "Collapsed model modality: text, vision, audio, or embeddings"
    )
    base_model: Optional[str] = Field(
        None, description = "Base model if this is a LoRA adapter"
    )
    max_position_embeddings: Optional[int] = Field(
        None, description = "Maximum context length supported by the model"
    )
    model_size_bytes: Optional[int] = Field(
        None, description = "Total size of model weight files in bytes"
    )


class LoRAInfo(BaseModel):
    """LoRA adapter or exported model information"""

    display_name: str = Field(..., description = "Display name for the LoRA")
    adapter_path: str = Field(
        ..., description = "Path to the LoRA adapter or exported model"
    )
    base_model: Optional[str] = Field(None, description = "Base model identifier")
    source: Optional[str] = Field(None, description = "'training' or 'exported'")
    export_type: Optional[str] = Field(
        None, description = "'lora', 'merged', or 'gguf' (for exports)"
    )


class LoRAScanResponse(BaseModel):
    """Response schema for scanning trained LoRA adapters"""

    loras: List[LoRAInfo] = Field(
        default_factory = list, description = "List of found LoRA adapters"
    )
    outputs_dir: str = Field(..., description = "Directory that was scanned")


class ModelListResponse(BaseModel):
    """Response schema for listing models"""

    models: List[ModelDetails] = Field(
        default_factory = list, description = "List of models"
    )
    default_models: List[str] = Field(
        default_factory = list, description = "List of default model IDs"
    )


class GgufVariantDetail(BaseModel):
    """A single GGUF quantization variant in a HuggingFace repo."""

    filename: str = Field(
        ..., description = "GGUF filename (e.g., 'gemma-3-4b-it-Q4_K_M.gguf')"
    )
    quant: str = Field(..., description = "Quantization label (e.g., 'Q4_K_M')")
    size_bytes: int = Field(0, description = "File size in bytes")
    downloaded: bool = Field(
        False, description = "Whether this variant is already in the local HF cache"
    )


class GgufVariantsResponse(BaseModel):
    """Response for listing GGUF quantization variants in a HuggingFace repo."""

    repo_id: str = Field(..., description = "HuggingFace repo ID")
    variants: List[GgufVariantDetail] = Field(
        default_factory = list, description = "Available GGUF variants"
    )
    has_vision: bool = Field(
        False, description = "Whether the model has vision support (mmproj files)"
    )
    default_variant: Optional[str] = Field(
        None, description = "Recommended default quantization variant"
    )


class LocalModelInfo(BaseModel):
    """Discovered local model candidate."""

    id: str = Field(..., description = "Identifier to use for loading/training")
    display_name: str = Field(..., description = "Display label")
    path: str = Field(..., description = "Local path where model data was discovered")
    source: Literal["models_dir", "hf_cache", "lmstudio", "custom"] = Field(
        ...,
        description = "Discovery source",
    )
    model_id: Optional[str] = Field(
        None,
        description = "HF repo id for cached models, e.g. org/model",
    )
    updated_at: Optional[float] = Field(
        None,
        description = "Unix timestamp of latest observed update",
    )


class LocalModelListResponse(BaseModel):
    """Response schema for listing local/cached models."""

    models_dir: str = Field(
        ..., description = "Directory scanned for custom local models"
    )
    hf_cache_dir: Optional[str] = Field(
        None,
        description = "HF cache root that was scanned",
    )
    lmstudio_dirs: List[str] = Field(
        default_factory = list,
        description = "LM Studio model directories that were scanned",
    )
    models: List[LocalModelInfo] = Field(
        default_factory = list,
        description = "Discovered local/cached models",
    )


class AddScanFolderRequest(BaseModel):
    """Request body for adding a custom scan folder."""

    path: str = Field(
        ..., description = "Absolute or relative directory path to scan for models"
    )


class ScanFolderInfo(BaseModel):
    """A registered custom model scan folder."""

    id: int = Field(..., description = "Database row ID")
    path: str = Field(..., description = "Normalized absolute path")
    created_at: str = Field(..., description = "ISO 8601 creation timestamp")


class BrowseEntry(BaseModel):
    """A directory entry surfaced by the folder browser."""

    name: str = Field(..., description = "Entry name (basename, not full path)")
    has_models: bool = Field(
        False,
        description = (
            "Hint that the directory likely contains models "
            "(*.gguf, *.safetensors, config.json, or HF-style "
            "`models--*` subfolders). Used by the UI to highlight "
            "promising candidates; the scanner itself is authoritative."
        ),
    )
    hidden: bool = Field(
        False,
        description = "Name starts with a dot (e.g. `.cache`)",
    )


class BrowseFoldersResponse(BaseModel):
    """Response schema for the folder browser endpoint."""

    current: str = Field(..., description = "Absolute path of the directory just listed")
    parent: Optional[str] = Field(
        None,
        description = (
            "Parent directory of `current`, or null if `current` is the "
            "filesystem root. The frontend uses this to render an `Up` row."
        ),
    )
    entries: List[BrowseEntry] = Field(
        default_factory = list,
        description = (
            "Subdirectories of `current`. Sorted with model-bearing "
            "directories first, then alphabetically case-insensitive; "
            "hidden entries come last within each group."
        ),
    )
    suggestions: List[str] = Field(
        default_factory = list,
        description = (
            "Handy starting points (home, HF cache, already-registered "
            "scan folders). Rendered as quick-pick chips above the list."
        ),
    )
    truncated: bool = Field(
        False,
        description = (
            "True when the listing was capped because the directory had "
            "more subfolders than the server is willing to enumerate in "
            "one request. The UI should show a hint telling the user to "
            "narrow their path."
        ),
    )
    model_files_here: int = Field(
        0,
        description = (
            "Count of GGUF/safetensors files immediately inside "
            "``current``. Used by the UI to surface a hint on leaf "
            "model directories (which otherwise look `empty` because "
            "they contain only files, no subdirectories)."
        ),
    )