* Studio: add API key authentication for programmatic access
External users want to hit the Studio API (chat completions with tool
calling, training, export, etc.) without going through the browser
login flow. This adds sk-unsloth- prefixed API keys that work as a
drop-in replacement for JWTs in the Authorization: Bearer header.
Backend:
- New api_keys table in SQLite (storage.py)
- create/list/revoke/validate functions with SHA-256 hashed storage
- API key detection in _get_current_subject before the JWT path
- POST/GET/DELETE /api/auth/api-keys endpoints on the auth router
Frontend:
- /api-keys page with create form, one-time key reveal, keys table
- API Keys link in desktop and mobile navbar
- Route registered with requireAuth guard
Zero changes to any existing route handler -- every endpoint that uses
Depends(get_current_subject) automatically works with API keys.
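The dispatch described above can be sketched as a tiny helper (hypothetical name; the real logic lives inside _get_current_subject):

```python
# Sketch of the dual-credential dispatch: the shared auth dependency
# inspects the Bearer token and routes sk-unsloth- prefixed keys to
# API-key validation before falling back to the JWT path.
API_KEY_PREFIX = "sk-unsloth-"

def classify_bearer_token(token: str) -> str:
    """Return "api_key" for Studio API keys, "jwt" for everything else."""
    return "api_key" if token.startswith(API_KEY_PREFIX) else "jwt"
```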
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Use actual origin in API key usage examples
The examples on /api-keys were hardcoded to localhost:8888, which is
wrong for remote users. Use window.location.origin so the examples
show the correct URL regardless of where the user is connecting from.
* Add `unsloth studio run` CLI command for one-liner model serving
Adds a `run` subcommand that starts Studio, loads a model, creates an
API key, and prints a ready-to-use curl command -- similar to
`ollama run` or `vllm serve`.
Usage: unsloth studio run -m unsloth/Qwen3-1.7B-GGUF --gguf-variant UD-Q4_K_XL
* Add end-to-end tests for `unsloth studio run` and API key usage
Tests the 4 usage examples from the API Keys page:
1. curl basic (non-streaming) chat completions
2. curl streaming (SSE) chat completions
3. OpenAI Python SDK streaming completions
4. curl with tools (web_search + python)
Also tests --help output, invalid key rejection, and no-key rejection.
All 7 tests pass against Qwen3-1.7B-GGUF.
* Add /v1/completions, /v1/embeddings, /v1/responses endpoints and --parallel support
- llama_cpp.py: accept n_parallel param, pass to llama-server --parallel
- run.py: plumb llama_parallel_slots through to app.state
- inference.py: add /completions and /embeddings as transparent proxies to
llama-server, add /responses as application-level endpoint that converts
to ChatCompletionRequest; thread n_parallel through load_model
- studio.py: set llama_parallel_slots=4 for `unsloth studio run` path
* Make /v1/responses endpoint match OpenAI Responses API format
The existing /v1/responses shim returned Chat Completions format, which
broke OpenAI SDK clients using openai.responses.create(). This commit
replaces the endpoint with a proper implementation that:
- Returns `output` array with `output_text` content parts instead of
`choices` with `message`
- Uses `input_tokens`/`output_tokens` instead of `prompt_tokens`/
`completion_tokens` in usage
- Sets `object: "response"` and `id: "resp_..."`
- Emits named SSE events for streaming (response.created,
response.output_text.delta, response.completed, etc.)
- Accepts all OpenAI Responses API fields (tools, store, metadata,
previous_response_id) without erroring -- silently ignored
- Maps `developer` role to `system` and `input_text`/`input_image`
content parts to the internal Chat format
Adds Pydantic schemas for request/response models and 23 unit tests
covering schema validation, input normalisation, and response format.
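A simplified, hypothetical sketch of the non-streaming field mapping (the real implementation uses the Pydantic schemas and also handles streaming events, tool output, and the ignored request fields):

```python
def chat_to_responses(chat: dict) -> dict:
    """Map a Chat Completions response dict to the Responses API shape:
    `output` with `output_text` parts instead of `choices` with `message`,
    and input_tokens/output_tokens instead of prompt/completion tokens."""
    usage = chat.get("usage", {})
    return {
        "object": "response",
        "id": "resp_" + chat.get("id", ""),
        "output": [
            {
                "type": "message",
                "role": "assistant",
                "content": [
                    {
                        "type": "output_text",
                        "text": chat["choices"][0]["message"]["content"],
                    }
                ],
            }
        ],
        "usage": {
            "input_tokens": usage.get("prompt_tokens", 0),
            "output_tokens": usage.get("completion_tokens", 0),
        },
    }
```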
* Studio: add Anthropic-compatible /v1/messages endpoint (#4981)
* Add Anthropic-compatible /v1/messages endpoint with tool support
Translate Anthropic Messages API format to/from internal OpenAI format
and reuse the existing server-side agentic tool loop. Supports streaming
SSE (message_start, content_block_delta, etc.) and non-streaming JSON.
Includes offline unit tests and e2e tests in test_studio_run.py.
* Add enable_tools, enabled_tools, session_id to /v1/messages endpoint
Support the same shorthand as /v1/chat/completions: enable_tools=true
with an optional enabled_tools list uses built-in server tools without
requiring full Anthropic tool definitions. session_id is passed through
for sandbox isolation. max_tokens is now optional.
* Strip leaked tool-call XML from Anthropic endpoint content
Apply _TOOL_XML_RE to content events in both streaming and
non-streaming tool paths, matching the OpenAI endpoint behavior.
* Emit custom tool_result SSE event in Anthropic stream
Adds a non-standard tool_result event between the tool_use block close
and the next text block, so clients can see server-side tool execution
results. Anthropic SDKs ignore unknown event types.
* Split /v1/messages into server-side and client-side tool paths
enable_tools=true runs the existing server-side agentic loop with
built-in tools (web_search/python/terminal). A bare tools=[...] field
now triggers a client-side pass-through: client-provided tools are
forwarded to llama-server and any tool_use output is returned to the
caller with stop_reason=tool_use for client execution.
This fixes Claude Code (and any Anthropic SDK client), which sends
tools=[...] expecting client-side execution but was previously routed
through execute_tool() and failed with 'Unknown tool'.
Adds AnthropicPassthroughEmitter to convert llama-server OpenAI SSE
chunks into Anthropic SSE events, plus unit tests covering text
blocks, tool_use blocks, mixed, stop reasons, and usage.
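A reduced sketch of the translation (illustrative only; the real AnthropicPassthroughEmitter also tracks block indices, tool_use blocks, and usage):

```python
# Assumed mapping of OpenAI finish reasons to Anthropic stop_reason values.
STOP_REASON_MAP = {
    "stop": "end_turn",
    "length": "max_tokens",
    "tool_calls": "tool_use",
}

def text_delta_event(text: str, index: int = 0) -> dict:
    """Wrap an OpenAI SSE text delta as an Anthropic
    content_block_delta event (text blocks only)."""
    return {
        "type": "content_block_delta",
        "index": index,
        "delta": {"type": "text_delta", "text": text},
    }
```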
* Fix httpcore GeneratorExit in /v1/messages passthrough stream
Explicitly aclose aiter_lines() before the surrounding async with
blocks unwind, mirroring the prior fix in external_provider.py
(a41160d3) and cc757b78's RuntimeError suppression.
* Wire stop_sequences through /v1/messages; warn on tool_choice
Plumb payload.stop_sequences to all three code paths (server-side
tool loop, no-tool plain, client-side passthrough) so Anthropic SDK
clients setting stop_sequences get the behavior they expect. The
llama_cpp backend already accepted `stop` on both
generate_chat_completion and generate_chat_completion_with_tools;
the Anthropic handler simply wasn't passing it.
tool_choice remains declared on the request model for Anthropic SDK
compatibility (the SDK often sets it by default) but is not yet
honored. Log a structured warning on each request carrying a
non-null tool_choice so the silent drop is visible to operators.
* Wire min_p / repetition_penalty / presence_penalty through /v1/messages
Align the Anthropic endpoint's sampling surface with /v1/chat/completions.
Adds the three fields as x-unsloth extensions on AnthropicMessagesRequest
and threads them through all three code paths: server-side tool loop,
no-tool plain, and client-side passthrough.
The passthrough builder emits "repeat_penalty" (not "repetition_penalty")
because that is llama-server's field name; the backend methods already
apply the same rename internally.
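The rename can be sketched as follows (hypothetical helper name):

```python
def build_passthrough_sampling(min_p=None, repetition_penalty=None,
                               presence_penalty=None) -> dict:
    """Collect the x-unsloth sampling extensions into a llama-server
    payload fragment, applying the repeat_penalty rename."""
    params = {}
    if min_p is not None:
        params["min_p"] = min_p
    if repetition_penalty is not None:
        # llama-server's field is "repeat_penalty", not "repetition_penalty"
        params["repeat_penalty"] = repetition_penalty
    if presence_penalty is not None:
        params["presence_penalty"] = presence_penalty
    return params
```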
* Fix block ordering and prev_text reset in non-streaming tool path
_anthropic_tool_non_streaming was building the response by appending
all tool_use blocks first, then a single concatenated text block at
the end — losing generation order and merging pre-tool and post-tool
text into one block. It also never reset prev_text between synthesis
turns, so the first N characters of each post-tool turn were dropped
(where N = length of the prior turn's final cumulative text).
Rewrite to build content_blocks incrementally in generation order,
matching the streaming emitter's behavior: deltas within a turn are
merged into the trailing text block, tool_use blocks interrupt the
text sequence, and prev_text is reset on tool_end so turn N+1 diffs
against an empty baseline.
Caught by gemini-code-assist[bot] review on #4981.
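The corrected accumulation can be sketched with two small helpers (names are illustrative; the real code lives in _anthropic_tool_non_streaming):

```python
def append_text_delta(blocks: list, cumulative: str, prev_text: str) -> str:
    """Merge the new suffix of the cumulative text into the trailing
    text block, creating a fresh block if a tool_use block interrupted
    the sequence. Returns the new prev_text baseline."""
    delta = cumulative[len(prev_text):]
    if delta:
        if blocks and blocks[-1]["type"] == "text":
            blocks[-1]["text"] += delta
        else:
            blocks.append({"type": "text", "text": delta})
    return cumulative

def append_tool_use(blocks: list, tool_block: dict) -> str:
    """Record a tool_use block in generation order and reset the
    prev_text baseline so turn N+1 diffs against an empty string."""
    blocks.append(tool_block)
    return ""
```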
* Make test_studio_run.py e2e tests pytest-compatible
Add a hybrid session-scoped studio_server fixture in conftest.py that
feeds base_url / api_key into the existing e2e test functions. Three
invocation modes are now supported:
1. Script mode (unchanged) — python tests/test_studio_run.py
2. Pytest + external server — point at a running instance via
UNSLOTH_E2E_BASE_URL / UNSLOTH_E2E_API_KEY env vars, no per-run
GGUF load cost
3. Pytest + fixture-managed server — pytest drives _start_server /
_kill_server itself via --unsloth-model / --unsloth-gguf-variant,
CI-friendly
The existing _start_server / _kill_server helpers and main() stay
untouched so the script entry point keeps working exactly as before.
Test function signatures are unchanged — the (base_url, api_key)
parameters now resolve via the new fixtures when running under
pytest.
* Rename test_studio_run.py -> test_studio_api.py
The file is entirely about HTTP API endpoint testing (OpenAI-compatible
/v1/chat/completions, Anthropic-compatible /v1/messages, API key auth,
plus a CLI --help sanity check on the command that runs the API). None
of its tests cover training, export, chat-UI, or internal-Python-API
concerns.
The old name misleadingly suggested "tests for the unsloth studio run
CLI subcommand" — the new name reflects the actual scope.
Updates:
- git mv the file (rename tracked, history preserved)
- Rewrite opening docstring to state the API surface focus and call
out what is explicitly out of scope
- Update all 4 Usage-block path references to the new filename
- LOG_FILE renamed to test_studio_api.log
- conftest.py fixture import rewritten from test_studio_run to
test_studio_api, plus 7 docstring/comment references updated
No functional changes to test logic, signatures, or main().
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Fix httpcore asyncgen cleanup in /v1/messages and /v1/completions
The earlier fix in 985e92a9 was incomplete: it closed aiter_lines()
explicitly but still used `async with httpx.AsyncClient()` /
`async with client.stream()` inside the generator. When the generator
is orphaned (e.g. client disconnects mid-stream and Starlette drops
the StreamingResponse iterator without explicitly calling aclose()),
Python's asyncgen finalizer runs the cleanup in a DIFFERENT task than
the one that originally entered the httpx context managers. The
`async with` exits then trigger httpcore's HTTP11ConnectionByteStream
.aclose(), which enters anyio.CancelScope.__exit__ with a mismatched
task and raises RuntimeError("Attempted to exit cancel scope in a
different task"). That error escapes any user-owned try/except
because it happens during GC finalization.
Replace `async with` with manual client/response lifecycle in both
/v1/messages passthrough and /v1/completions proxy. Close the
response and client in a finally block wrapped in
`try: ... except Exception: pass`. This suppresses RuntimeError (and
other Exception subclasses) from the anyio cleanup noise while
letting GeneratorExit (a BaseException, not Exception) propagate
cleanly so the generator terminates as Python expects.
Traceback observed in user report:
File ".../httpcore/_async/connection_pool.py", line 404, in __aiter__
yield part
RuntimeError: async generator ignored GeneratorExit
...
File ".../anyio/_backends/_asyncio.py", line 455, in __exit__
raise RuntimeError(
RuntimeError: Attempted to exit cancel scope in a different task
* Expand unsloth studio run banner with SDK base URL and more curl examples
Add an explicit "OpenAI / Anthropic SDK base URL" line inside the info
box so SDK users don't accidentally copy the bare server URL (without
/v1) into their OpenAI/Anthropic SDK constructors and hit 404s.
Replace the single /v1/chat/completions curl example with three
labeled blocks: chat/completions, Anthropic /messages, and OpenAI
Responses. The Anthropic example includes max_tokens (Anthropic SDKs
require it even though Studio accepts None).
All examples derived from a computed sdk_base_url so the /v1 prefix
stays in sync if the public path ever changes.
* Hash API keys with HMAC-SHA256 + persistent server secret
Stores the HMAC secret in a new app_secrets singleton table. Fixes
CodeQL py/weak-sensitive-data-hashing alert on storage.py:74-76,
394-395. Refresh tokens stay on plain SHA-256 (unchanged _hash_token)
so existing user sessions survive upgrade — API keys are new on this
branch so there is no migration.
* Use PBKDF2 for API key hashing per CodeQL recommendation
HMAC-SHA256 was still flagged by py/weak-sensitive-data-hashing.
Switch to hashlib.pbkdf2_hmac, which is in CodeQL's recommended
allowlist (Argon2/scrypt/bcrypt/PBKDF2). Persistent server-side
salt stays in app_secrets for defense-in-depth. 100k iterations to
match auth/hashing.py's password hasher.
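The scheme can be sketched as follows (illustrative function names; the real salt is persisted in the app_secrets table):

```python
import hashlib
import hmac

PBKDF2_ITERATIONS = 100_000  # matches auth/hashing.py's password hasher

def hash_api_key(raw_key: str, salt: bytes) -> bytes:
    """PBKDF2-HMAC-SHA256 over the raw key with the persistent
    server-side salt; only this digest is stored, never the key."""
    return hashlib.pbkdf2_hmac("sha256", raw_key.encode("utf-8"), salt,
                               PBKDF2_ITERATIONS)

def verify_api_key(raw_key: str, salt: bytes, stored_digest: bytes) -> bool:
    """Constant-time comparison against the stored digest."""
    return hmac.compare_digest(hash_api_key(raw_key, salt), stored_digest)
```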
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Roland Tannous <115670425+rolandtannous@users.noreply.github.com>
Co-authored-by: Roland Tannous <rolandtannous@gravityq.ai>