unsloth/studio
Roland Tannous ac2daf8b7a
Studio: forward standard OpenAI tools / tool_choice to llama-server (#5099)
* fix(studio): forward OpenAI tools/tool_choice to llama-server (#4999)

Studio's /v1/chat/completions silently stripped standard OpenAI `tools`
and `tool_choice` fields, so clients using standard function calling
(opencode, Claude Code, Cursor, Continue, ...) never got structured
tool_calls back. Adds a client-side pass-through path mirroring the
existing Anthropic /v1/messages flow: when `tools` is present without
Studio's `enable_tools` shorthand, the request is forwarded to
llama-server verbatim so the client sees native id, finish_reason
("tool_calls"), delta.tool_calls, and accurate usage tokens.

Also wires Anthropic tool_choice forwarding: /v1/messages previously
accepted tool_choice on the request model but silently dropped it with
a warning. Translate the four Anthropic shapes to OpenAI format and
forward them so agentic clients can actually enforce tool use.

- ChatCompletionRequest: add tools, tool_choice, stop; extra="allow"
- ChatMessage: accept role="tool", optional tool_call_id / tool_calls /
  name; content is now optional (assistant with only tool_calls)
- routes/inference.py: _openai_passthrough_stream /
  _openai_passthrough_non_streaming helpers, routing branch in
  openai_chat_completions, vision+tools via content-parts injection
- _build_passthrough_payload: tool_choice parameter (default "auto")
- anthropic_compat: anthropic_tool_choice_to_openai() translator
- tests/test_openai_tool_passthrough.py: Pydantic + translator unit tests
- tests/test_studio_api.py: 5 new E2E tests (non-stream, stream,
  multi-turn, OpenAI SDK, Anthropic tool_choice=any regression)
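The translator's four-way mapping can be sketched as follows. The function name mirrors the commit; the body is an assumption based on the documented Anthropic tool_choice shapes and their OpenAI equivalents:

```python
def anthropic_tool_choice_to_openai(tc: dict) -> object:
    """Translate an Anthropic tool_choice dict to the OpenAI shape.

    Anthropic shapes: {"type": "auto"}, {"type": "any"},
    {"type": "tool", "name": ...}, {"type": "none"}.
    """
    kind = tc.get("type")
    if kind == "auto":
        return "auto"          # model decides whether to call a tool
    if kind == "any":
        return "required"      # OpenAI spelling of "must call some tool"
    if kind == "tool":
        # Force one specific function by name.
        return {"type": "function", "function": {"name": tc["name"]}}
    if kind == "none":
        return "none"
    raise ValueError(f"unknown Anthropic tool_choice type: {kind!r}")
```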

* fix(studio): surface httpx transport errors from OpenAI passthrough

When the managed llama-server subprocess crashes mid-request, the
async pass-through helpers in routes/inference.py used to return a
bare 500 (non-streaming) or an "An internal error occurred" SSE chunk
(streaming) because _friendly_error only recognized the sync path's
"Lost connection to llama-server" substring -- httpx transport
failures (ConnectError / ReadError / RemoteProtocolError /
ReadTimeout) stringify differently and fell through to the generic
case.

- _friendly_error: map any httpx.RequestError subclass to the same
  "Lost connection to the model server" message the sync chat path
  emits. Placed before the substring heuristics so the streaming path
  automatically picks it up via its existing except Exception catch.
- _openai_passthrough_non_streaming: wrap the httpx.AsyncClient.post
  in a try/except httpx.RequestError and re-raise as HTTPException
  502 with the friendly detail.
- tests/test_openai_tool_passthrough.py: new TestFriendlyErrorHttpx
  class pinning the mapping for ConnectError, ReadError,
  RemoteProtocolError, ReadTimeout, and confirming non-httpx paths
  (context-size heuristic, generic fallback) are unchanged.

* fix(studio): close aiter_bytes/aiter_lines explicitly in passthroughs

The httpcore asyncgen cleanup fix in 5cedd9a5 is incomplete on Python
3.13 + httpcore 1.0.x: it switched to manual client/response lifecycle
but still used anonymous `async for raw_line in resp.aiter_lines():`
patterns in all three streaming paths. Python's async for does NOT
auto-close the iterator on break/return, so the aiter_lines /
aiter_bytes async generator remains alive, reachable only from the
surrounding coroutine frame. Once `_stream()` returns, the frame is
GC'd and the orphaned asyncgen is finalized on a LATER GC pass in a
DIFFERENT asyncio task, where httpcore's
HTTP11ConnectionByteStream.aclose() enters anyio.CancelScope.__exit__
with a mismatched task and prints "Exception ignored in: <async
generator>" / "async generator ignored GeneratorExit" / "Attempted
to exit cancel scope in a different task" to the server log.

User observed this on /v1/messages after successful (status 200)
requests, with the traceback pointing at HTTP11ConnectionByteStream
.__aiter__ / .aclose inside httpcore.

Fix: save resp.aiter_lines() / resp.aiter_bytes() in a variable and
explicitly `await iter.aclose()` in the finally block BEFORE
resp.aclose() / client.aclose(). This closes the asyncgen inside the
current task's event loop, so the internal httpcore byte stream is
cleaned up before Python's asyncgen GC hook has anything orphaned to
finalize. Each aclose is wrapped in try/except Exception so nested
anyio cleanup noise can't bubble out.

Applied to all three streaming passthrough paths:
- _anthropic_passthrough_stream (/v1/messages client-side tool path)
- _openai_passthrough_stream (/v1/chat/completions client-side tool
  path, new in this PR)
- openai_completions (/v1/completions bytes proxy from PR #4956)
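The close-before-return pattern in isolation, using a plain async generator as a stand-in for resp.aiter_lines() (all names here are illustrative, not Studio's):

```python
import asyncio

closed = []

async def fake_aiter_lines():
    """Stand-in for resp.aiter_lines(): an async generator whose
    cleanup must run in the *current* task, not a later GC pass."""
    try:
        for line in ("data: 1", "data: [DONE]"):
            yield line
    finally:
        closed.append(True)  # httpcore's byte-stream cleanup happens here

async def _stream():
    seen = []
    # Save the iterator so it can be closed explicitly; a bare
    # `async for ... in resp.aiter_lines():` orphans it on break.
    it = fake_aiter_lines()
    try:
        async for line in it:
            seen.append(line)
            if line.endswith("[DONE]"):
                break  # async for does NOT aclose() on break
    finally:
        try:
            await it.aclose()  # close inside the current task, BEFORE
        except Exception:      # resp.aclose() / client.aclose()
            pass
    return seen

lines = asyncio.run(_stream())
```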

* fix(studio): default ChatCompletionRequest.stream to false per OpenAI spec

OpenAI's /v1/chat/completions spec defaults `stream` to false, so
clients that omit the field (naive curl, minimal integrations) expect
a single JSON response back. Studio was defaulting to true, silently
switching those clients into SSE and breaking any parser that didn't
also handle streaming. ResponsesRequest and AnthropicMessagesRequest
already default to false correctly; only ChatCompletionRequest was
wrong.

Studio's own frontend always sets `stream` explicitly on every
chat-adapter / chat-api / runtime-provider call site, so the flip has
no UI impact. SDK users (OpenAI Python/JS SDK, opencode, Claude Code,
Cursor, Continue) also always pass `stream` explicitly, so they're
unaffected. The only clients feeling the change are raw-curl users
who were relying on the wrong default -- those get the correct OpenAI
behavior now.

Added a regression test pinning the default so it can't silently
flip back.
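The change itself is one field default; a trimmed stand-in (stdlib dataclass here rather than the real Pydantic model, field set assumed):

```python
from dataclasses import dataclass

@dataclass
class ChatCompletionRequestSketch:
    """Trimmed stand-in for Studio's Pydantic ChatCompletionRequest."""
    model: str = "default"
    stream: bool = False  # OpenAI spec: omitted `stream` means one JSON response
```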

* fix(studio): reject images in OpenAI tool passthrough for text-only GGUFs

The new tool passthrough branch runs before _extract_content_parts,
skipping the existing not is_vision guard. Requests combining tools
with an image on a text-only tool-capable GGUF were forwarded to
llama-server, producing opaque upstream errors instead of the
pre-existing clear 400. Restore the guard inline at the dispatch
point, checking both legacy image_base64 and inline image_url parts.
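The guard's detection half might look like this sketch (helper name and message shapes are assumptions; the real check sits inline at the dispatch point):

```python
def request_has_images(messages: list[dict]) -> bool:
    """Detect both the legacy image_base64 field and inline
    image_url content parts on any message."""
    for msg in messages:
        if msg.get("image_base64"):
            return True
        content = msg.get("content")
        if isinstance(content, list):
            for part in content:
                if isinstance(part, dict) and part.get("type") == "image_url":
                    return True
    return False
```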

* fix(studio): require tool_call_id on role=tool chat messages

Enforce the OpenAI spec rule that role="tool" messages must carry a
tool_call_id. Without it, upstream backends cannot associate a tool
result with the assistant's prior tool_calls entry and the request
fails in non-obvious ways through the passthrough path. Reject at the
request boundary with a 422 instead.
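The boundary rule, reduced to a plain function (the real check is a Pydantic validator that FastAPI surfaces as a 422; names here are assumptions):

```python
def validate_tool_message(msg: dict) -> None:
    """Reject role='tool' messages that lack tool_call_id."""
    if msg.get("role") == "tool" and not msg.get("tool_call_id"):
        # Without the id, upstream backends cannot pair this result
        # with the assistant's earlier tool_calls entry.
        raise ValueError("role='tool' messages must include tool_call_id")
```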

* fix(studio): harden OpenAI tool passthrough validation and error surfacing

Three related fixes called out by the PR review:

1. Preserve upstream status codes in the streaming passthrough. The
   httpx request is now dispatched before the StreamingResponse is
   constructed. Non-200 upstream responses and httpx RequestError
   transport failures raise HTTPException with the real status
   instead of being buried inside a 200 SSE error frame, so OpenAI
   SDK clients see APIError/BadRequestError/... as expected.

2. Require non-empty content on user/system/tool messages. Per the
   OpenAI spec, content may only be omitted on assistant messages
   that carry tool_calls; enforce that at the request boundary so
   malformed messages never reach the passthrough path.

3. Role-constrain tool-call metadata. tool_calls is only valid on
   role=assistant, tool_call_id and name only on role=tool. Without
   this, a user/system message with tool_calls would flip the
   passthrough branch on and be forwarded to llama-server, surfacing
   as an opaque upstream error.

* fix(studio): normalize image mode and passthrough JSON verbatim

Two Gemini-code-assist review findings on PR #5099:

1. Unconditionally convert decoded images to RGB before PNG encoding.
   The prior code only handled RGBA, letting CMYK/I/F images crash
   at img.save(format="PNG") and surface as opaque 400s. Applied to
   both the passthrough helper and the non-passthrough GGUF path
   that originally carried this pattern, keeping the two sites in
   sync.

2. Return the upstream JSON body as raw bytes via Response rather
   than parse-then-re-serialize with JSONResponse. Matches the
   passthrough helper's "verbatim" contract and drops a redundant
   round-trip.
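Finding 1 boils down to one unconditional PIL call; a sketch with an assumed helper name:

```python
from io import BytesIO
from PIL import Image

def to_png_bytes(img: Image.Image) -> bytes:
    """Normalize any decoded image mode before PNG encoding."""
    # PNG encoding chokes on modes like CMYK/I/F; the prior code only
    # special-cased RGBA. Converting to RGB up front handles them all
    # (and is a cheap copy when the image is already RGB).
    rgb = img.convert("RGB")
    buf = BytesIO()
    rgb.save(buf, format="PNG")
    return buf.getvalue()
```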

---------

Co-authored-by: Lee Jackson <130007945+Imagineer99@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
2026-04-18 12:53:23 +04:00
backend Studio: forward standard OpenAI tools / tool_choice to llama-server (#5099) 2026-04-18 12:53:23 +04:00
frontend Add Qwen3.6 inference defaults for Studio (#5065) 2026-04-16 11:42:42 -07:00
__init__.py Final cleanup 2026-03-12 18:28:04 +00:00
install_llama_prebuilt.py fix(rocm): tighten gfx regex to ignore generic ISA lines (#5033) 2026-04-15 05:24:41 -07:00
install_python_stack.py fix(rocm): tighten gfx regex to ignore generic ISA lines (#5033) 2026-04-15 05:24:41 -07:00
LICENSE.AGPL-3.0 Add AGPL-3.0 license to studio folder 2026-03-09 19:36:25 +00:00
setup.bat Final cleanup 2026-03-12 18:28:04 +00:00
setup.ps1 Trim verbose comments in PATH helpers 2026-04-16 12:01:01 +00:00
setup.sh Add AMD ROCm/HIP support across installer and hardware detection (#4720) 2026-04-10 01:56:12 -07:00
Unsloth_Studio_Colab.ipynb Allow install_python_stack to run on Colab (#4633) 2026-03-27 00:29:27 +04:00