Studio: forward standard OpenAI tools / tool_choice to llama-server (#5099)

* fix(studio): forward OpenAI tools/tool_choice to llama-server (#4999)

Studio's /v1/chat/completions silently stripped standard OpenAI `tools`
and `tool_choice` fields, so clients using standard function calling
(opencode, Claude Code, Cursor, Continue, ...) never got structured
tool_calls back. Adds a client-side pass-through path mirroring the
existing Anthropic /v1/messages flow: when `tools` is present without
Studio's `enable_tools` shorthand, the request is forwarded to
llama-server verbatim so the client sees native id, finish_reason
("tool_calls"), delta.tool_calls, and accurate usage tokens.

Also wires Anthropic tool_choice forwarding: /v1/messages previously
accepted tool_choice on the request model but silently dropped it with
a warning. Translate the four Anthropic shapes to OpenAI format and
forward them so agentic clients can actually enforce tool use.
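
Concretely, the four Anthropic shapes map as below. This is a self-contained sketch of the translator this PR adds (same logic, stripped of module context):

```python
from typing import Any

def anthropic_tool_choice_to_openai(tc: Any) -> Any:
    """Map Anthropic tool_choice shapes onto OpenAI equivalents."""
    if not isinstance(tc, dict):
        return None  # caller falls back to its own default ("auto")
    t = tc.get("type")
    if t == "auto":
        return "auto"
    if t == "any":
        return "required"
    if t == "none":
        return "none"
    if t == "tool" and tc.get("name"):
        return {"type": "function", "function": {"name": tc["name"]}}
    return None  # unrecognized shape

print(anthropic_tool_choice_to_openai({"type": "any"}))  # required
```

Unrecognized shapes deliberately return None rather than raising, so the caller's "auto" fallback keeps malformed SDK input from failing the whole request.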

- ChatCompletionRequest: add tools, tool_choice, stop; extra="allow"
- ChatMessage: accept role="tool", optional tool_call_id / tool_calls /
  name; content is now optional (assistant with only tool_calls)
- routes/inference.py: _openai_passthrough_stream /
  _openai_passthrough_non_streaming helpers, routing branch in
  openai_chat_completions, vision+tools via content-parts injection
- _build_passthrough_payload: tool_choice parameter (default "auto")
- anthropic_compat: anthropic_tool_choice_to_openai() translator
- tests/test_openai_tool_passthrough.py: Pydantic + translator unit tests
- tests/test_studio_api.py: 5 new E2E tests (non-stream, stream,
  multi-turn, OpenAI SDK, Anthropic tool_choice=any regression)

* fix(studio): surface httpx transport errors from OpenAI passthrough

When the managed llama-server subprocess crashes mid-request, the
async pass-through helpers in routes/inference.py used to return a
bare 500 (non-streaming) or an "An internal error occurred" SSE chunk
(streaming) because _friendly_error only recognized the sync path's
"Lost connection to llama-server" substring -- httpx transport
failures (ConnectError / ReadError / RemoteProtocolError /
ReadTimeout) stringify differently and fell through to the generic
case.

- _friendly_error: map any httpx.RequestError subclass to the same
  "Lost connection to the model server" message the sync chat path
  emits. Placed before the substring heuristics so the streaming path
  automatically picks it up via its existing except Exception catch.
- _openai_passthrough_non_streaming: wrap the httpx.AsyncClient.post
  in a try/except httpx.RequestError and re-raise as HTTPException
  502 with the friendly detail.
- tests/test_openai_tool_passthrough.py: new TestFriendlyErrorHttpx
  class pinning the mapping for ConnectError, ReadError,
  RemoteProtocolError, ReadTimeout, and confirming non-httpx paths
  (context-size heuristic, generic fallback) are unchanged.
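
A minimal, library-free sketch of the dispatch order this commit relies on. `RequestError` here stands in for `httpx.RequestError` (the base class of ConnectError, ReadError, RemoteProtocolError, ReadTimeout, ...); the non-httpx branch messages are illustrative placeholders, not the production strings:

```python
class RequestError(Exception):
    """Stand-in for httpx.RequestError."""

class ConnectError(RequestError):
    """Stand-in for httpx.ConnectError."""

def friendly_error(exc: Exception) -> str:
    # Class check FIRST, before any substring heuristics, so every
    # transport-failure subclass maps to the same message regardless
    # of how it stringifies.
    if isinstance(exc, RequestError):
        return "Lost connection to the model server. It may have crashed -- try reloading the model."
    if "exceeds the available context size" in str(exc):
        return "context-size error"  # placeholder for the real heuristic
    return "An internal error occurred."  # generic fallback
```

Because streaming paths already funnel mid-stream failures through a broad `except Exception` into this helper, putting the class check first means they pick up the friendly message with no changes of their own.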

* fix(studio): close aiter_bytes/aiter_lines explicitly in passthroughs

The httpcore asyncgen cleanup fix in 5cedd9a5 is incomplete on Python
3.13 + httpcore 1.0.x: it switched to manual client/response lifecycle
but still used anonymous `async for raw_line in resp.aiter_lines():`
patterns in all three streaming paths. Python's async for does NOT
auto-close the iterator on break/return, so the aiter_lines /
aiter_bytes async generator remains alive, reachable only from the
surrounding coroutine frame. Once `_stream()` returns the frame is
GC'd and the orphaned asyncgen is finalized on a LATER GC pass in a
DIFFERENT asyncio task, where httpcore's
HTTP11ConnectionByteStream.aclose() enters anyio.CancelScope.__exit__
with a mismatched task and prints "Exception ignored in: <async
generator>" / "async generator ignored GeneratorExit" / "Attempted
to exit cancel scope in a different task" to the server log.

User observed this on /v1/messages after successful (status 200)
requests, with the traceback pointing at HTTP11ConnectionByteStream
.__aiter__ / .aclose inside httpcore.

Fix: save resp.aiter_lines() / resp.aiter_bytes() as a variable and
explicitly `await iter.aclose()` in the finally block BEFORE
resp.aclose() / client.aclose(). This closes the asyncgen inside the
current task's event loop, so the internal httpcore byte stream is
cleaned up before Python's asyncgen GC hook has anything orphaned to
finalize. Each aclose is wrapped in try/except Exception so nested
anyio cleanup noise can't bubble out.

Applied to all three streaming passthrough paths:
- _anthropic_passthrough_stream (/v1/messages client-side tool path)
- _openai_passthrough_stream (/v1/chat/completions client-side tool
  path, new in this PR)
- openai_completions (/v1/completions bytes proxy from PR #4956)
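
The pattern, reduced to a runnable toy (a stand-in async generator instead of a real httpx response):

```python
import asyncio

closed = []

async def fake_aiter_lines():
    """Stands in for resp.aiter_lines(): cleanup must run in the
    CURRENT task, not on a later GC pass in a different task."""
    try:
        for i in range(5):
            yield f"data: {i}"
    finally:
        closed.append(True)  # runs when the generator is closed

async def stream():
    lines_iter = fake_aiter_lines()  # save a handle; no anonymous iterator
    out = []
    try:
        async for line in lines_iter:
            out.append(line)
            if len(out) == 2:
                break  # `async for` does NOT aclose() the iterator on break
    finally:
        try:
            await lines_iter.aclose()  # explicit close inside this task
        except Exception:
            pass  # swallow nested cleanup noise, as the fix does
    return out

result = asyncio.run(stream())
print(result, closed)  # ['data: 0', 'data: 1'] [True]
```

Without the explicit `aclose()`, the generator's `finally` block would run only when Python's asyncgen finalizer gets to it, which is exactly the deferred, wrong-task cleanup the fix avoids.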

* fix(studio): default ChatCompletionRequest.stream to false per OpenAI spec

OpenAI's /v1/chat/completions spec defaults `stream` to false, so
clients that omit the field (naive curl, minimal integrations) expect
a single JSON response back. Studio was defaulting to true, silently
switching those clients into SSE and breaking any parser that didn't
also handle streaming. ResponsesRequest and AnthropicMessagesRequest
already default to false correctly; only ChatCompletionRequest was
wrong.

Studio's own frontend always sets `stream` explicitly on every
chat-adapter / chat-api / runtime-provider call site, so the flip has
no UI impact. SDK users (OpenAI Python/JS SDK, opencode, Claude Code,
Cursor, Continue) also always pass `stream` explicitly, so they're
unaffected. The only clients feeling the change are raw-curl users
who were relying on the wrong default -- those get the correct OpenAI
behavior now.

Added a regression test pinning the default so it can't silently
flip back.
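
As a toy illustration of the pinned behavior (a plain dataclass standing in for the Pydantic request model):

```python
from dataclasses import dataclass

@dataclass
class ChatCompletionRequestSketch:
    """Sketch of ChatCompletionRequest's stream default."""
    model: str = "default"
    stream: bool = False  # matches the OpenAI spec default

# A client that omits `stream` now gets a single JSON response.
print(ChatCompletionRequestSketch().stream)  # False
```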

* fix(studio): reject images in OpenAI tool passthrough for text-only GGUFs

The new tool passthrough branch runs before _extract_content_parts,
skipping the existing not is_vision guard. Requests combining tools
with an image on a text-only tool-capable GGUF were forwarded to
llama-server, producing opaque upstream errors instead of the
pre-existing clear 400. Restore the guard inline at the dispatch
point, checking both legacy image_base64 and inline image_url parts.

* fix(studio): require tool_call_id on role=tool chat messages

Enforce the OpenAI spec rule that role="tool" messages must carry a
tool_call_id. Without it, upstream backends cannot associate a tool
result with the assistant's prior tool_calls entry and the request
fails in non-obvious ways through the passthrough path. Reject at the
request boundary with a 422 instead.

* fix(studio): harden OpenAI tool passthrough validation and error surfacing

Three related fixes called out by the PR review:

1. Preserve upstream status codes in the streaming passthrough. The
   httpx request is now dispatched before the StreamingResponse is
   constructed. Non-200 upstream responses and httpx RequestError
   transport failures raise HTTPException with the real status
   instead of being buried inside a 200 SSE error frame, so OpenAI
   SDK clients see APIError/BadRequestError/... as expected.

2. Require non-empty content on user/system/tool messages. Per the
   OpenAI spec, content may only be omitted on assistant messages
   that carry tool_calls; enforce that at the request boundary so
   malformed messages never reach the passthrough path.

3. Role-constrain tool-call metadata. tool_calls is only valid on
   role=assistant, tool_call_id and name only on role=tool. Without
   this, a user/system message with tool_calls would flip the
   passthrough branch on and be forwarded to llama-server, surfacing
   as an opaque upstream error.

* fix(studio): normalize image mode and passthrough JSON verbatim

Two Gemini-code-assist review findings on PR #5099:

1. Unconditionally convert decoded images to RGB before PNG encoding.
   The prior code only handled RGBA, letting CMYK/I/F images crash
   at img.save(format="PNG") and surface as opaque 400s. Applied to
   both the passthrough helper and the non-passthrough GGUF path
   that originally carried this pattern, keeping the two sites in
   sync.

2. Return the upstream JSON body as raw bytes via Response rather
   than parse-then-re-serialize with JSONResponse. Matches the
   passthrough helper's "verbatim" contract and drops a redundant
   round-trip.

---------

Co-authored-by: Lee Jackson <130007945+Imagineer99@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
Commit ac2daf8b7a by Roland Tannous, 2026-04-18 12:53:23 +04:00; committed by GitHub (parent 7d0d2f256c).
No known key found for this signature in database. GPG key ID: B5690EEEBB952194
5 changed files with 1327 additions and 72 deletions


@@ -114,6 +114,39 @@ def anthropic_tools_to_openai(tools: list) -> list[dict]:
return result
def anthropic_tool_choice_to_openai(tc: Any) -> Any:
"""Translate Anthropic `tool_choice` into OpenAI `tool_choice`.
Anthropic formats (all dict shapes with a ``type`` discriminator):
- ``{"type": "auto"}`` → ``"auto"``
- ``{"type": "any"}`` → ``"required"``
- ``{"type": "none"}`` → ``"none"``
- ``{"type": "tool", "name": "get_weather"}`` →
``{"type": "function", "function": {"name": "get_weather"}}``
Returns ``None`` for ``None`` or any unrecognized shape (caller may
then fall back to its own default, typically ``"auto"``).
"""
if tc is None:
return None
if not isinstance(tc, dict):
return None
t = tc.get("type")
if t == "auto":
return "auto"
if t == "any":
return "required"
if t == "none":
return "none"
if t == "tool":
name = tc.get("name")
if not name:
return None
return {"type": "function", "function": {"name": name}}
return None
def build_anthropic_sse_event(event_type: str, data: dict) -> str:
"""Format a single Anthropic SSE event."""
return f"event: {event_type}\ndata: {json.dumps(data)}\n\n"


@@ -11,7 +11,7 @@ import time
import uuid
from typing import Annotated, Any, Dict, Literal, Optional, List, Union
from pydantic import BaseModel, Discriminator, Field, Tag, model_validator
class LoadRequest(BaseModel):
@@ -338,14 +338,68 @@ class ChatMessage(BaseModel):
``content`` may be a plain string (text-only) or a list of
content parts for multimodal messages (OpenAI vision format).
Assistant messages that only contain tool calls may set ``content``
to ``None`` with ``tool_calls`` populated. ``role="tool"`` messages
carry the result of a client-executed tool call and require
``tool_call_id`` per the OpenAI spec.
"""
role: Literal["system", "user", "assistant", "tool"] = Field(
..., description = "Message role"
)
content: Optional[Union[str, list[ContentPart]]] = Field(
None, description = "Message content (string or multimodal parts)"
)
tool_call_id: Optional[str] = Field(
None,
description = "OpenAI tool-result messages: id of the tool call this result belongs to.",
)
tool_calls: Optional[list[dict]] = Field(
None,
description = "OpenAI assistant messages: structured tool calls the model decided to make.",
)
name: Optional[str] = Field(
None,
description = "OpenAI tool-result messages: name of the tool whose result this is.",
)
@model_validator(mode = "after")
def _validate_role_shape(self) -> "ChatMessage":
# Enforce the per-role OpenAI spec shape at the request boundary.
# Without this, malformed messages (e.g. user entries with no
# content, tool_calls on a user/system role, role="tool" without
# tool_call_id) would be silently forwarded to llama-server via
# the passthrough path, surfacing as opaque upstream errors or
# broken tool-call reconciliation downstream.
# Tool-call metadata must appear only on the appropriate role.
if self.tool_calls is not None and self.role != "assistant":
raise ValueError('"tool_calls" is only valid on role="assistant" messages.')
if self.tool_call_id is not None and self.role != "tool":
raise ValueError('"tool_call_id" is only valid on role="tool" messages.')
if self.name is not None and self.role != "tool":
raise ValueError('"name" is only valid on role="tool" messages.')
# Per-role content requirements.
if self.role == "tool":
if not self.tool_call_id:
raise ValueError(
'role="tool" messages require "tool_call_id" per the OpenAI spec.'
)
if not self.content:
raise ValueError('role="tool" messages require non-empty "content".')
elif self.role == "assistant":
# Assistant messages may omit content when tool_calls is set.
if not self.content and not self.tool_calls:
raise ValueError(
'role="assistant" messages require either "content" or "tool_calls".'
)
else: # "user" | "system"
if not self.content:
raise ValueError(
f'role="{self.role}" messages require non-empty "content".'
)
return self
class ChatCompletionRequest(BaseModel):
@@ -355,18 +409,49 @@ class ChatCompletionRequest(BaseModel):
Extensions (non-OpenAI fields) are marked with 'x-unsloth'.
"""
# Accept unknown fields defensively so future OpenAI fields (seed,
# response_format, logprobs, frequency_penalty, etc.) don't get
# silently dropped by Pydantic before route code runs. Mirrors
# AnthropicMessagesRequest and ResponsesRequest.
model_config = {"extra": "allow"}
model: str = Field(
"default",
description = "Model identifier (informational; the active model is used)",
)
messages: list[ChatMessage] = Field(..., description = "Conversation messages")
stream: bool = Field(
False,
description = (
"Whether to stream the response via SSE. Default matches OpenAI's "
"spec (`false`); opt into streaming by sending `stream: true`."
),
)
temperature: float = Field(0.6, ge = 0.0, le = 2.0)
top_p: float = Field(0.95, ge = 0.0, le = 1.0)
max_tokens: Optional[int] = Field(
None, ge = 1, description = "Maximum tokens to generate (None = until EOS)"
)
presence_penalty: float = Field(0.0, ge = 0.0, le = 2.0, description = "Presence penalty")
stop: Optional[Union[str, list[str]]] = Field(
None,
description = "OpenAI stop sequences: a single string or list of strings at which generation halts.",
)
tools: Optional[list[dict]] = Field(
None,
description = (
"OpenAI function-tool definitions. When provided without `enable_tools=true`, "
"Studio forwards the tools to the backend so the model returns structured "
"tool_calls for the client to execute (standard OpenAI function calling)."
),
)
tool_choice: Optional[Union[str, dict]] = Field(
None,
description = (
"OpenAI tool choice: 'auto' | 'required' | 'none' | "
"{'type': 'function', 'function': {'name': ...}}"
),
)
# ── Unsloth extensions (ignored by standard OpenAI clients) ──
top_k: int = Field(20, ge = -1, le = 100, description = "[x-unsloth] Top-k sampling")


@@ -29,6 +29,14 @@ from utils.models import extract_model_size_b as _extract_model_size_b
def _friendly_error(exc: Exception) -> str:
"""Extract a user-friendly message from known llama-server errors."""
# httpx transport-layer failures reaching the managed llama-server —
# raised by the async pass-through helpers that talk to llama-server
# directly. Treat any RequestError subclass (ConnectError, ReadError,
# RemoteProtocolError, WriteError, PoolTimeout, ...) as "the upstream
# subprocess is unreachable", which for Studio always means the
# llama-server subprocess crashed or is still coming up.
if isinstance(exc, httpx.RequestError):
return "Lost connection to the model server. It may have crashed -- try reloading the model."
msg = str(exc)
m = _re.search(
r"request \((\d+) tokens?\) exceeds the available context size \((\d+) tokens?\)",
@@ -106,6 +114,7 @@ from models.inference import (
from core.inference.anthropic_compat import (
anthropic_messages_to_openai,
anthropic_tools_to_openai,
anthropic_tool_choice_to_openai,
AnthropicStreamEmitter,
AnthropicPassthroughEmitter,
)
@@ -1122,6 +1131,56 @@ async def openai_chat_completions(
)
return JSONResponse(content = response.model_dump())
# ── Standard OpenAI function-calling pass-through (GGUF only) ────
# When a client (opencode / Claude Code via OpenAI compat / Cursor /
# Continue / ...) sends standard OpenAI `tools` without Studio's
# `enable_tools` shorthand, forward the request to llama-server
# verbatim so structured `tool_calls` flow back to the client. This
# branch runs BEFORE `_extract_content_parts` because that helper is
# unaware of `role="tool"` messages and assistant messages that only
# carry `tool_calls` (content=None) — both of which are valid in
# multi-turn client-side tool loops.
_has_tool_messages = any(m.role == "tool" or m.tool_calls for m in payload.messages)
if (
using_gguf
and llama_backend.supports_tools
and not payload.enable_tools
and ((payload.tools and len(payload.tools) > 0) or _has_tool_messages)
):
# Preserve the vision guard that would otherwise run in the
# non-passthrough path below: text-only tool-capable GGUFs
# should return a clear 400 here rather than forwarding the
# image to llama-server and surfacing an opaque upstream error.
if not llama_backend.is_vision and (
payload.image_base64
or any(
isinstance(m.content, list)
and any(isinstance(p, ImageContentPart) for p in m.content)
for m in payload.messages
)
):
raise HTTPException(
status_code = 400,
detail = "Image provided but current GGUF model does not support vision.",
)
cancel_event = threading.Event()
completion_id = f"chatcmpl-{uuid.uuid4().hex[:12]}"
if payload.stream:
return await _openai_passthrough_stream(
request,
cancel_event,
llama_backend,
payload,
model_name,
completion_id,
)
return await _openai_passthrough_non_streaming(
llama_backend,
payload,
model_name,
)
# ── Parse messages (handles multimodal content parts) ─────
system_prompt, chat_messages, extracted_image_b64 = _extract_content_parts(
payload.messages
@@ -1151,9 +1210,11 @@ async def openai_chat_completions(
from PIL import Image as _Image
raw = _b64.b64decode(image_b64)
# Normalize to RGB so PNG encoding succeeds regardless of
# source mode (RGBA, P, L, CMYK, I, F, ...). Previously
# we only converted RGBA, which left CMYK/I/F to raise at
# img.save(PNG).
img = _Image.open(_BytesIO(raw)).convert("RGB")
buf = _BytesIO()
img.save(buf, format = "PNG")
image_b64 = _b64.b64encode(buf.getvalue()).decode("ascii")
@@ -1933,25 +1994,32 @@ async def openai_completions(
if is_stream:
async def _stream():
# Manual httpx client/response lifecycle AND explicit
# aiter_bytes() iterator close — see _anthropic_passthrough_stream
# for the full rationale. Saving `bytes_iter = resp.aiter_bytes()`
# and `await bytes_iter.aclose()` in the finally block is the
# part that matters for avoiding the Python 3.13 + httpcore
# 1.0.x "Exception ignored in: <async_generator>" / anyio
# cancel-scope trace: an anonymous async for leaves the
# iterator unclosed, so Python's asyncgen GC finalizer runs
# cleanup on a later pass in a different asyncio task.
client = httpx.AsyncClient(timeout = 600)
resp = None
bytes_iter = None
try:
req = client.build_request("POST", target_url, json = body)
resp = await client.send(req, stream = True)
bytes_iter = resp.aiter_bytes()
async for chunk in bytes_iter:
yield chunk
except Exception as e:
logger.error("openai_completions stream error: %s", e)
finally:
if bytes_iter is not None:
try:
await bytes_iter.aclose()
except Exception:
pass
if resp is not None:
try:
await resp.aclose()
@@ -2339,22 +2407,12 @@ async def anthropic_messages(
)
stop = payload.stop_sequences or None
# Translate Anthropic tool_choice to OpenAI format for forwarding to
# llama-server. Falls back to "auto" when unset or unrecognized, which
# matches the prior hardcoded behavior.
openai_tool_choice = anthropic_tool_choice_to_openai(payload.tool_choice)
if openai_tool_choice is None:
openai_tool_choice = "auto"
cancel_event = threading.Event()
@@ -2392,6 +2450,7 @@ async def anthropic_messages(
min_p = min_p,
repetition_penalty = repetition_penalty,
presence_penalty = presence_penalty,
tool_choice = openai_tool_choice,
)
return await _anthropic_passthrough_non_streaming(
llama_backend,
@@ -2407,6 +2466,7 @@ async def anthropic_messages(
min_p = min_p,
repetition_penalty = repetition_penalty,
presence_penalty = presence_penalty,
tool_choice = openai_tool_choice,
)
if server_tools:
@@ -2750,11 +2810,12 @@ def _build_passthrough_payload(
min_p = None,
repetition_penalty = None,
presence_penalty = None,
tool_choice = "auto",
):
body = {
"messages": openai_messages,
"tools": openai_tools,
"tool_choice": tool_choice,
"temperature": temperature,
"top_p": top_p,
"top_k": top_k,
@@ -2792,6 +2853,7 @@ async def _anthropic_passthrough_stream(
min_p = None,
repetition_penalty = None,
presence_penalty = None,
tool_choice = "auto",
):
"""Streaming client-side pass-through: forward tools to llama-server and
translate its streaming response to Anthropic SSE without executing anything."""
@@ -2808,6 +2870,7 @@ async def _anthropic_passthrough_stream(
min_p = min_p,
repetition_penalty = repetition_penalty,
presence_penalty = presence_penalty,
tool_choice = tool_choice,
)
async def _stream():
@@ -2815,33 +2878,42 @@ async def _anthropic_passthrough_stream(
for line in emitter.start(message_id, model_name):
yield line
# Manage the httpx client, response, AND the aiter_lines() async
# generator MANUALLY — no `async with`, no anonymous iterator.
#
# On Python 3.13 + httpcore 1.0.x, `async for raw_line in
# resp.aiter_lines():` creates an anonymous async generator. When
# the loop exits via `break` (or the generator is orphaned when a
# client disconnects mid-stream), Python's `async for` protocol
# does NOT auto-close the iterator the way a sync `for` loop
# would. The iterator remains reachable only from the current
# coroutine frame; once `_stream()` returns, the frame is GC'd
# and the iterator becomes unreachable. Python's asyncgen
# finalizer hook then runs its aclose() on a LATER GC pass in a
# DIFFERENT asyncio task, where httpcore's
# `HTTP11ConnectionByteStream.aclose()` enters
# `anyio.CancelScope.__exit__` with a mismatched task and prints
# `RuntimeError: Attempted to exit cancel scope in a different
# task` / `RuntimeError: async generator ignored GeneratorExit`
# as "Exception ignored in:" unraisable warnings.
#
# The fix: save `resp.aiter_lines()` as `lines_iter`, and in the
# finally block explicitly `await lines_iter.aclose()` BEFORE
# `resp.aclose()` / `client.aclose()`. This closes the iterator
# inside our own task's event loop, so the internal httpcore
# byte-stream is cleaned up before Python's asyncgen finalizer
# has anything orphaned to finalize. Each aclose is wrapped in
# `try: ... except Exception: pass` so anyio cleanup noise from
# nested aclose paths can't bubble out.
client = httpx.AsyncClient(timeout = 600)
resp = None
lines_iter = None
try:
req = client.build_request("POST", target_url, json = body)
resp = await client.send(req, stream = True)
lines_iter = resp.aiter_lines()
async for raw_line in lines_iter:
if await request.is_disconnected():
cancel_event.set()
break
@@ -2859,6 +2931,11 @@ async def _anthropic_passthrough_stream(
except Exception as e:
logger.error("anthropic_messages passthrough stream error: %s", e)
finally:
if lines_iter is not None:
try:
await lines_iter.aclose()
except Exception:
pass
if resp is not None:
try:
await resp.aclose()
@@ -2897,6 +2974,7 @@ async def _anthropic_passthrough_non_streaming(
min_p = None,
repetition_penalty = None,
presence_penalty = None,
tool_choice = "auto",
):
"""Non-streaming client-side pass-through."""
target_url = f"{llama_backend.base_url}/v1/chat/completions"
@@ -2912,6 +2990,7 @@ async def _anthropic_passthrough_non_streaming(
min_p = min_p,
repetition_penalty = repetition_penalty,
presence_penalty = presence_penalty,
tool_choice = tool_choice,
)
async with httpx.AsyncClient() as client:
@@ -2969,3 +3048,265 @@ async def _anthropic_passthrough_non_streaming(
),
)
return JSONResponse(content = resp_obj.model_dump())
# =====================================================================
# Client-side tool pass-through (OpenAI-native /v1/chat/completions)
# =====================================================================
def _openai_messages_for_passthrough(payload) -> list[dict]:
"""Build OpenAI-format message dicts for the /v1/chat/completions
passthrough path.
Messages from ``payload.messages`` are dumped through Pydantic (dropping
unset optional fields) so they are already in standard OpenAI format
including ``role="tool"`` tool-result messages and assistant messages
that carry structured ``tool_calls``. Content-parts images already in
the message list are left untouched.
When a client uses Studio's legacy ``image_base64`` top-level field, the
image is re-encoded to PNG (llama-server's stb_image has limited format
support) and spliced into the last user message as an OpenAI
``image_url`` content part so vision + function-calling requests work
transparently.
"""
messages = [m.model_dump(exclude_none = True) for m in payload.messages]
if not payload.image_base64:
return messages
try:
import base64 as _b64
from io import BytesIO as _BytesIO
from PIL import Image as _Image
raw = _b64.b64decode(payload.image_base64)
img = _Image.open(_BytesIO(raw)).convert("RGB")
buf = _BytesIO()
img.save(buf, format = "PNG")
png_b64 = _b64.b64encode(buf.getvalue()).decode("ascii")
except Exception as e:
raise HTTPException(
status_code = 400,
detail = f"Failed to process image: {e}",
)
data_url = f"data:image/png;base64,{png_b64}"
image_part = {"type": "image_url", "image_url": {"url": data_url}}
for msg in reversed(messages):
if msg.get("role") != "user":
continue
existing = msg.get("content")
if isinstance(existing, str):
msg["content"] = [{"type": "text", "text": existing}, image_part]
elif isinstance(existing, list):
existing.append(image_part)
else:
msg["content"] = [image_part]
break
else:
messages.append({"role": "user", "content": [image_part]})
return messages
def _build_openai_passthrough_body(payload) -> dict:
"""Assemble the llama-server request body from a ChatCompletionRequest.
Only explicitly-known OpenAI / llama-server fields are forwarded so that
Studio-specific extensions (``enable_tools``, ``enabled_tools``,
``session_id``, ...) never leak to the backend.
"""
messages = _openai_messages_for_passthrough(payload)
tool_choice = payload.tool_choice if payload.tool_choice is not None else "auto"
return _build_passthrough_payload(
messages,
payload.tools,
payload.temperature,
payload.top_p,
payload.top_k,
payload.max_tokens,
payload.stream,
stop = payload.stop,
min_p = payload.min_p,
repetition_penalty = payload.repetition_penalty,
presence_penalty = payload.presence_penalty,
tool_choice = tool_choice,
)
async def _openai_passthrough_stream(
request,
cancel_event,
llama_backend,
payload,
model_name,
completion_id,
):
"""Streaming client-side pass-through for /v1/chat/completions.
Forwards the client's OpenAI function-calling request to llama-server and
relays the SSE stream back verbatim. This preserves llama-server's
native response ``id``, ``finish_reason`` (including ``"tool_calls"``),
``delta.tool_calls``, and the trailing ``usage`` chunk so the client
observes a standard OpenAI response.
"""
target_url = f"{llama_backend.base_url}/v1/chat/completions"
body = _build_openai_passthrough_body(payload)
# Dispatch the upstream request BEFORE returning StreamingResponse so
# transport errors and non-200 upstream statuses surface as real HTTP
# errors to the client. OpenAI SDKs rely on status codes to raise
# ``APIError``/``BadRequestError``/...; burying the failure inside a
# 200 SSE ``error`` frame silently breaks their error handling.
client = httpx.AsyncClient(timeout = 600)
resp = None
try:
req = client.build_request("POST", target_url, json = body)
resp = await client.send(req, stream = True)
except httpx.RequestError as e:
# llama-server subprocess crashed / still starting / unreachable.
logger.error("openai passthrough stream: upstream unreachable: %s", e)
if resp is not None:
try:
await resp.aclose()
except Exception:
pass
try:
await client.aclose()
except Exception:
pass
raise HTTPException(
status_code = 502,
detail = _friendly_error(e),
)
if resp.status_code != 200:
err_bytes = await resp.aread()
err_text = err_bytes.decode("utf-8", errors = "replace")
logger.error(
"openai passthrough upstream error: status=%s body=%s",
resp.status_code,
err_text[:500],
)
upstream_status = resp.status_code
try:
await resp.aclose()
except Exception:
pass
try:
await client.aclose()
except Exception:
pass
raise HTTPException(
status_code = upstream_status,
detail = f"llama-server error: {err_text[:500]}",
)
async def _stream():
# Same httpx lifecycle pattern as _anthropic_passthrough_stream:
# avoid `async with` on the client/response AND explicitly save
# resp.aiter_lines() so we can close it ourselves in the finally
# block. See the long comment there for the full rationale on
# why the anonymous `async for raw_line in resp.aiter_lines():`
# pattern leaks an unclosed async generator that Python's
# asyncgen GC hook then finalizes in a different asyncio task,
# producing "Exception ignored in:" / "async generator ignored
# GeneratorExit" / anyio cancel-scope traces on Python 3.13 +
# httpcore 1.0.x.
lines_iter = None
try:
lines_iter = resp.aiter_lines()
async for raw_line in lines_iter:
if await request.is_disconnected():
cancel_event.set()
break
if not raw_line:
continue
if not raw_line.startswith("data: "):
continue
# Relay the llama-server SSE chunk verbatim so the client
# sees its native `id`, `finish_reason`, `delta.tool_calls`,
# and final `usage` unchanged.
yield raw_line + "\n\n"
if raw_line[6:].strip() == "[DONE]":
break
except Exception as e:
# Mid-stream failures still have to be reported inside the SSE
# body because the 200 response headers have already been
# committed by the time the first chunk flushes.
logger.error("openai passthrough stream error: %s", e)
err = {
"error": {
"message": _friendly_error(e),
"type": "server_error",
},
}
yield f"data: {json.dumps(err)}\n\n"
finally:
if lines_iter is not None:
try:
await lines_iter.aclose()
except Exception:
pass
try:
await resp.aclose()
except Exception:
pass
try:
await client.aclose()
except Exception:
pass
return StreamingResponse(
_stream(),
media_type = "text/event-stream",
headers = {
"Cache-Control": "no-cache",
"Connection": "keep-alive",
"X-Accel-Buffering": "no",
},
)
async def _openai_passthrough_non_streaming(
llama_backend,
payload,
model_name,
):
"""Non-streaming client-side pass-through for /v1/chat/completions.
Returns llama-server's JSON response verbatim (via JSONResponse) so the
client sees the native response ``id``, ``finish_reason`` (including
``"tool_calls"``), structured ``tool_calls``, and accurate ``usage``
token counts.
"""
target_url = f"{llama_backend.base_url}/v1/chat/completions"
body = _build_openai_passthrough_body(payload)
try:
async with httpx.AsyncClient() as client:
resp = await client.post(target_url, json = body, timeout = 600)
except httpx.RequestError as e:
# llama-server subprocess crashed / still starting / unreachable.
# Surface the same friendly message the sync chat path emits so
# operators don't see a bare 500 with no diagnostic.
logger.error("openai passthrough non-streaming: upstream unreachable: %s", e)
raise HTTPException(
status_code = 502,
detail = _friendly_error(e),
)
if resp.status_code != 200:
raise HTTPException(
status_code = resp.status_code,
detail = f"llama-server error: {resp.text[:500]}",
)
# Pass the upstream body through as raw bytes — skips a redundant
# parse+re-serialize round-trip and keeps the response truly
# verbatim (matches the docstring). Status is guaranteed 200 by
# the check above.
return Response(content = resp.content, media_type = "application/json")
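The Anthropic-to-OpenAI `tool_choice` translation described in the PR summary is small enough to sketch in full. The following is a hypothetical standalone version consistent with the behavior pinned by the unit tests in `tests/test_openai_tool_passthrough.py`; the real helper lives in `core/inference/anthropic_compat.py` and may differ in detail.

```python
# Hypothetical standalone sketch of anthropic_tool_choice_to_openai();
# the real implementation lives in core/inference/anthropic_compat.py.
def anthropic_tool_choice_to_openai(tool_choice):
    """Translate the four Anthropic tool_choice shapes to OpenAI format.

    Returns None for anything unrecognized so callers can fall back to
    their default ("auto") instead of forwarding a bad value upstream.
    """
    if not isinstance(tool_choice, dict):
        return None
    kind = tool_choice.get("type")
    if kind == "auto":
        return "auto"
    if kind == "any":
        # Anthropic "any" (the model MUST call some tool) maps to
        # OpenAI "required".
        return "required"
    if kind == "none":
        return "none"
    if kind == "tool":
        # Anthropic's named-tool form is only translatable with a name.
        name = tool_choice.get("name")
        if not name:
            return None
        return {"type": "function", "function": {"name": name}}
    return None
```

The deliberate None-on-unknown return keeps the pass-through conservative: a shape the translator does not recognize degrades to the default rather than sending an invalid `tool_choice` to llama-server.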


@ -0,0 +1,465 @@
# SPDX-License-Identifier: AGPL-3.0-only
# Copyright 2026-present the Unsloth AI Inc. team. All rights reserved.
"""
Tests for the OpenAI /v1/chat/completions client-side tool pass-through.
Covers:
- ChatCompletionRequest accepts standard OpenAI `tools` / `tool_choice` / `stop`.
- ChatMessage accepts role="tool" with `tool_call_id` and role="assistant"
with `content: None` + `tool_calls`.
- ChatCompletionRequest carries unknown fields via `extra="allow"`.
- anthropic_tool_choice_to_openai() covers all four Anthropic shapes.
- _build_passthrough_payload() honors a caller-supplied tool_choice and
defaults to "auto" when unset.
- _friendly_error() maps httpx transport errors to a "Lost connection"
message so passthrough failures are legible instead of bare 500s.
No running server or GPU required.
"""
import os
import sys
_backend = os.path.join(os.path.dirname(__file__), "..")
sys.path.insert(0, _backend)
import httpx
import pytest
from pydantic import ValidationError
from models.inference import (
ChatCompletionRequest,
ChatMessage,
)
from core.inference.anthropic_compat import (
anthropic_tool_choice_to_openai,
)
from routes.inference import _build_passthrough_payload, _friendly_error
# =====================================================================
# ChatMessage — tool role, tool_calls, optional content
# =====================================================================
class TestChatMessageToolRoles:
def test_tool_role_with_tool_call_id(self):
msg = ChatMessage(
role = "tool",
tool_call_id = "call_abc123",
content = '{"temperature": 72}',
)
assert msg.role == "tool"
assert msg.tool_call_id == "call_abc123"
assert msg.content == '{"temperature": 72}'
def test_tool_role_with_name(self):
msg = ChatMessage(
role = "tool",
tool_call_id = "call_abc123",
name = "get_weather",
content = '{"temperature": 72}',
)
assert msg.name == "get_weather"
def test_assistant_with_tool_calls_no_content(self):
msg = ChatMessage(
role = "assistant",
content = None,
tool_calls = [
{
"id": "call_1",
"type": "function",
"function": {
"name": "get_weather",
"arguments": '{"city": "Paris"}',
},
}
],
)
assert msg.role == "assistant"
assert msg.content is None
assert msg.tool_calls is not None
assert len(msg.tool_calls) == 1
assert msg.tool_calls[0]["function"]["name"] == "get_weather"
def test_assistant_with_content_and_tool_calls(self):
msg = ChatMessage(
role = "assistant",
content = "Let me check the weather.",
tool_calls = [
{
"id": "call_1",
"type": "function",
"function": {"name": "get_weather", "arguments": "{}"},
}
],
)
assert msg.content == "Let me check the weather."
assert msg.tool_calls[0]["id"] == "call_1"
def test_plain_user_message_still_works(self):
msg = ChatMessage(role = "user", content = "Hello")
assert msg.role == "user"
assert msg.tool_call_id is None
assert msg.tool_calls is None
assert msg.name is None
def test_invalid_role_rejected(self):
with pytest.raises(ValidationError):
ChatMessage(role = "function", content = "x")
def test_content_absent_on_assistant_tool_call_defaults_to_none(self):
# Assistant messages that carry only tool_calls are the one
# documented case where `content=None` is permitted.
msg = ChatMessage(
role = "assistant",
tool_calls = [
{
"id": "call_1",
"type": "function",
"function": {"name": "f", "arguments": "{}"},
}
],
)
assert msg.content is None
def test_tool_role_missing_tool_call_id_rejected(self):
# Per OpenAI spec, role="tool" messages must carry tool_call_id so
# upstream backends can associate the result with its prior call.
# Pin the boundary-level rejection so a malformed tool-result
# message never reaches the passthrough path.
with pytest.raises(ValidationError) as exc_info:
ChatMessage(role = "tool", content = '{"temperature": 72}')
assert "tool_call_id" in str(exc_info.value)
def test_tool_role_empty_tool_call_id_rejected(self):
with pytest.raises(ValidationError):
ChatMessage(
role = "tool",
tool_call_id = "",
content = '{"temperature": 72}',
)
# ── Role-aware content requirements ────────────────────────────
def test_user_empty_content_rejected(self):
with pytest.raises(ValidationError):
ChatMessage(role = "user", content = "")
def test_system_empty_content_rejected(self):
with pytest.raises(ValidationError):
ChatMessage(role = "system", content = "")
def test_user_empty_list_content_rejected(self):
with pytest.raises(ValidationError):
ChatMessage(role = "user", content = [])
def test_tool_empty_content_rejected(self):
with pytest.raises(ValidationError) as exc_info:
ChatMessage(role = "tool", tool_call_id = "call_1", content = "")
assert "content" in str(exc_info.value)
def test_assistant_without_content_or_tool_calls_rejected(self):
with pytest.raises(ValidationError) as exc_info:
ChatMessage(role = "assistant")
assert "content" in str(exc_info.value) or "tool_calls" in str(exc_info.value)
# ── Role-constrained tool-call metadata ────────────────────────
def test_tool_calls_on_user_rejected(self):
with pytest.raises(ValidationError) as exc_info:
ChatMessage(
role = "user",
content = "Hi",
tool_calls = [
{
"id": "c1",
"type": "function",
"function": {"name": "f", "arguments": "{}"},
}
],
)
assert "tool_calls" in str(exc_info.value)
def test_tool_call_id_on_user_rejected(self):
with pytest.raises(ValidationError) as exc_info:
ChatMessage(role = "user", content = "Hi", tool_call_id = "call_1")
assert "tool_call_id" in str(exc_info.value)
def test_name_on_user_rejected(self):
with pytest.raises(ValidationError) as exc_info:
ChatMessage(role = "user", content = "Hi", name = "get_weather")
assert "name" in str(exc_info.value)
# =====================================================================
# ChatCompletionRequest — standard OpenAI tool fields
# =====================================================================
class TestChatCompletionRequestToolFields:
def _make(self, **kwargs):
base = {"messages": [{"role": "user", "content": "Hi"}]}
base.update(kwargs)
return ChatCompletionRequest(**base)
def test_tools_parses(self):
req = self._make(
tools = [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Return the weather in a city",
"parameters": {
"type": "object",
"properties": {"city": {"type": "string"}},
"required": ["city"],
},
},
}
],
)
assert req.tools is not None
assert len(req.tools) == 1
assert req.tools[0]["function"]["name"] == "get_weather"
def test_tool_choice_string_auto(self):
assert self._make(tool_choice = "auto").tool_choice == "auto"
def test_tool_choice_string_required(self):
assert self._make(tool_choice = "required").tool_choice == "required"
def test_tool_choice_string_none(self):
assert self._make(tool_choice = "none").tool_choice == "none"
def test_tool_choice_named_function(self):
tc = {"type": "function", "function": {"name": "get_weather"}}
assert self._make(tool_choice = tc).tool_choice == tc
def test_stop_string(self):
assert self._make(stop = "\nUser:").stop == "\nUser:"
def test_stop_list(self):
assert self._make(stop = ["\nUser:", "\nAssistant:"]).stop == [
"\nUser:",
"\nAssistant:",
]
def test_tools_default_none(self):
req = self._make()
assert req.tools is None
assert req.tool_choice is None
assert req.stop is None
def test_extra_fields_accepted(self):
# `frequency_penalty`, `seed`, `response_format` are not yet
# explicitly declared but must survive Pydantic parsing now that
# extra="allow" is set.
req = self._make(
frequency_penalty = 0.5,
seed = 42,
response_format = {"type": "json_object"},
)
# Extras land in model_extra
assert req.model_extra is not None
assert req.model_extra.get("frequency_penalty") == 0.5
assert req.model_extra.get("seed") == 42
assert req.model_extra.get("response_format") == {"type": "json_object"}
def test_unsloth_extensions_still_work(self):
req = self._make(
enable_tools = True,
enabled_tools = ["web_search", "python"],
session_id = "abc",
)
assert req.enable_tools is True
assert req.enabled_tools == ["web_search", "python"]
assert req.session_id == "abc"
def test_stream_defaults_false_matching_openai_spec(self):
# OpenAI's /v1/chat/completions spec defaults `stream` to false.
# Studio previously defaulted to true, which broke naive curl
# clients that omit `stream` (they expect a JSON blob, got SSE).
# Pin the corrected default so it can't silently regress.
req = self._make()
assert req.stream is False
def test_multiturn_tool_loop_messages(self):
req = ChatCompletionRequest(
messages = [
{"role": "user", "content": "What's the weather in Paris?"},
{
"role": "assistant",
"content": None,
"tool_calls": [
{
"id": "call_1",
"type": "function",
"function": {
"name": "get_weather",
"arguments": '{"city": "Paris"}',
},
}
],
},
{
"role": "tool",
"tool_call_id": "call_1",
"content": '{"temperature": 14, "unit": "celsius"}',
},
],
tools = [
{
"type": "function",
"function": {
"name": "get_weather",
"parameters": {"type": "object"},
},
}
],
)
assert len(req.messages) == 3
assert req.messages[1].role == "assistant"
assert req.messages[1].content is None
assert req.messages[1].tool_calls[0]["id"] == "call_1"
assert req.messages[2].role == "tool"
assert req.messages[2].tool_call_id == "call_1"
# =====================================================================
# anthropic_tool_choice_to_openai — pure translation helper
# =====================================================================
class TestAnthropicToolChoiceToOpenAI:
def test_auto(self):
assert anthropic_tool_choice_to_openai({"type": "auto"}) == "auto"
def test_any_becomes_required(self):
assert anthropic_tool_choice_to_openai({"type": "any"}) == "required"
def test_none(self):
assert anthropic_tool_choice_to_openai({"type": "none"}) == "none"
def test_tool_named(self):
result = anthropic_tool_choice_to_openai(
{"type": "tool", "name": "get_weather"}
)
assert result == {
"type": "function",
"function": {"name": "get_weather"},
}
def test_tool_missing_name_returns_none(self):
assert anthropic_tool_choice_to_openai({"type": "tool"}) is None
def test_none_input_returns_none(self):
assert anthropic_tool_choice_to_openai(None) is None
def test_unrecognized_shape_returns_none(self):
assert anthropic_tool_choice_to_openai({"type": "wibble"}) is None
assert anthropic_tool_choice_to_openai("auto") is None
assert anthropic_tool_choice_to_openai(42) is None
# =====================================================================
# _build_passthrough_payload — tool_choice propagation
# =====================================================================
class TestBuildPassthroughPayloadToolChoice:
def _args(self):
return dict(
openai_messages = [{"role": "user", "content": "Hi"}],
openai_tools = [
{
"type": "function",
"function": {"name": "f", "parameters": {"type": "object"}},
}
],
temperature = 0.6,
top_p = 0.95,
top_k = 20,
max_tokens = 128,
stream = False,
)
def test_default_tool_choice_is_auto(self):
body = _build_passthrough_payload(**self._args())
assert body["tool_choice"] == "auto"
def test_override_tool_choice_required(self):
body = _build_passthrough_payload(**self._args(), tool_choice = "required")
assert body["tool_choice"] == "required"
def test_override_tool_choice_none(self):
body = _build_passthrough_payload(**self._args(), tool_choice = "none")
assert body["tool_choice"] == "none"
def test_override_tool_choice_named_function(self):
tc = {"type": "function", "function": {"name": "f"}}
body = _build_passthrough_payload(**self._args(), tool_choice = tc)
assert body["tool_choice"] == tc
def test_stream_adds_include_usage(self):
args = self._args()
args["stream"] = True
body = _build_passthrough_payload(**args)
assert body.get("stream_options") == {"include_usage": True}
def test_repetition_penalty_renamed(self):
body = _build_passthrough_payload(**self._args(), repetition_penalty = 1.1)
assert body.get("repeat_penalty") == 1.1
assert "repetition_penalty" not in body
# =====================================================================
# _friendly_error — httpx transport failures
# =====================================================================
class TestFriendlyErrorHttpx:
"""The async pass-through helpers talk to llama-server via httpx.
When the subprocess is down, httpx raises RequestError subclasses
whose string form (``"All connection attempts failed"``, ``"[Errno 111]
Connection refused"``, ...) does NOT contain the substring
``"Lost connection to llama-server"`` the sync path uses, so the
previous substring-only `_friendly_error` returned a useless generic
message. These tests pin the new isinstance-based mapping.
"""
def _req(self):
return httpx.Request("POST", "http://127.0.0.1:65535/v1/chat/completions")
def test_connect_error_mapped(self):
exc = httpx.ConnectError("All connection attempts failed", request = self._req())
assert "Lost connection" in _friendly_error(exc)
def test_read_error_mapped(self):
exc = httpx.ReadError("EOF", request = self._req())
assert "Lost connection" in _friendly_error(exc)
def test_remote_protocol_error_mapped(self):
exc = httpx.RemoteProtocolError("peer closed", request = self._req())
assert "Lost connection" in _friendly_error(exc)
def test_read_timeout_mapped(self):
exc = httpx.ReadTimeout("timed out", request = self._req())
assert "Lost connection" in _friendly_error(exc)
def test_non_httpx_unchanged(self):
# Non-httpx exceptions still fall through to the existing substring
# heuristics — a context-size message must still produce the
# "Message too long" path.
ctx_msg = (
"request (4096 tokens) exceeds the available context size (2048 tokens)"
)
assert "Message too long" in _friendly_error(ValueError(ctx_msg))
def test_generic_exception_returns_generic_message(self):
assert (
_friendly_error(RuntimeError("unrelated")) == "An internal error occurred"
)


@ -11,11 +11,16 @@ authentication and the CLI's ``--help`` output:
1. curl -- basic chat completions (non-streaming)
2. curl -- streaming chat completions
3. Python OpenAI SDK -- streaming completions
4. curl -- Studio server-side tools (enable_tools=true)
5. curl -- Standard OpenAI function calling (non-streaming)
6. curl -- Standard OpenAI function calling (streaming)
7. curl -- Standard OpenAI function calling (multi-turn tool loop)
8. OpenAI Python SDK -- Standard function calling
9. Anthropic Messages API -- basic non-streaming
10. Anthropic Messages API -- streaming SSE
11. Anthropic Python SDK -- non-streaming
12. Anthropic Messages API -- streaming with tools
13. Anthropic Messages API -- tool_choice={"type":"any"} honored
Training, export, fine-tuning, and chat-UI concerns are out of scope;
see the unit suites elsewhere under ``studio/backend/tests/`` for those.
@ -266,6 +271,250 @@ def test_curl_with_tools(base_url: str, api_key: str):
print(f" PASS curl with tools: {len(chunks)} chunks, {len(full)} chars content")
# ── Standard OpenAI function-calling pass-through tests ─────────────
#
# Regression coverage for unslothai/unsloth#4999: Studio's
# /v1/chat/completions used to silently strip standard OpenAI `tools`
# and `tool_choice` fields, so clients (opencode, Claude Code, Cursor,
# Continue, ...) could never get structured tool_calls back. These
# tests exercise the client-side pass-through path that forwards those
# fields to llama-server verbatim.
#
# They require a tool-capable GGUF (``supports_tools=True`` — e.g.
# Qwen3, Qwen2.5-Coder, Llama-3.1-Instruct). The default test model
# ``unsloth/Qwen3-1.7B-GGUF`` advertises tool support via its chat
# template metadata.
_WEATHER_TOOL = {
"type": "function",
"function": {
"name": "get_weather",
"description": "Look up the current weather for a given city.",
"parameters": {
"type": "object",
"properties": {
"city": {
"type": "string",
"description": "The name of the city, e.g. 'Paris'.",
},
},
"required": ["city"],
},
},
}
def _collect_streamed_tool_calls(chunks: list[dict]) -> list[dict]:
"""Reassemble OpenAI streaming delta.tool_calls into full tool calls.
OpenAI streams partial tool calls across chunks: the first chunk for
a given index carries ``id`` + ``function.name``, and subsequent
chunks append fragments to ``function.arguments``.
"""
by_index: dict[int, dict] = {}
for c in chunks:
choices = c.get("choices") or []
if not choices:
continue
delta = choices[0].get("delta") or {}
tool_calls = delta.get("tool_calls") or []
for tc in tool_calls:
idx = tc.get("index", 0)
slot = by_index.setdefault(
idx,
{
"id": None,
"type": "function",
"function": {"name": None, "arguments": ""},
},
)
if tc.get("id"):
slot["id"] = tc["id"]
fn = tc.get("function") or {}
if fn.get("name"):
slot["function"]["name"] = fn["name"]
if fn.get("arguments"):
slot["function"]["arguments"] += fn["arguments"]
return [by_index[i] for i in sorted(by_index)]
def _final_finish_reason(chunks: list[dict]) -> str | None:
for c in reversed(chunks):
choices = c.get("choices") or []
if not choices:
continue
fr = choices[0].get("finish_reason")
if fr is not None:
return fr
return None
def test_openai_tools_nonstream(base_url: str, api_key: str):
"""Standard OpenAI function calling, non-streaming, tool_choice='required'.
Regression: before the fix, Studio silently stripped `tools` and the
model returned plain text with finish_reason='stop'. After the fix,
llama-server's response is forwarded verbatim so the client sees
finish_reason='tool_calls' with a structured tool_calls array and
non-zero usage.prompt_tokens.
"""
status, text = _http(
"POST",
f"{base_url}/v1/chat/completions",
body = {
"messages": [{"role": "user", "content": "What is the weather in Paris?"}],
"tools": [_WEATHER_TOOL],
"tool_choice": "required",
"stream": False,
},
headers = {"Authorization": f"Bearer {api_key}"},
timeout = 120,
)
assert status == 200, f"Expected 200, got {status}: {text[:500]}"
data = json.loads(text)
assert "choices" in data, f"Missing 'choices': {text[:300]}"
choice = data["choices"][0]
assert (
choice["finish_reason"] == "tool_calls"
), f"Expected finish_reason='tool_calls', got {choice['finish_reason']!r}"
msg = choice["message"]
tool_calls = msg.get("tool_calls") or []
assert len(tool_calls) >= 1, f"No tool_calls in response: {msg}"
first = tool_calls[0]
assert first["type"] == "function"
assert (
first["function"]["name"] == "get_weather"
), f"Wrong tool name: {first['function']['name']!r}"
# arguments must be valid JSON
parsed = json.loads(first["function"]["arguments"])
assert "city" in parsed, f"Tool call missing required 'city' arg: {parsed}"
# Usage must be non-zero (was 0 before the fix)
usage = data.get("usage") or {}
assert (
usage.get("prompt_tokens", 0) > 0
), f"Expected non-zero prompt_tokens; got {usage}"
assert data.get("id"), "Missing response id"
print(
f" PASS openai tools non-stream: "
f"tool={first['function']['name']}, args={parsed}, "
f"prompt_tokens={usage['prompt_tokens']}"
)
def test_openai_tools_stream(base_url: str, api_key: str):
"""Standard OpenAI function calling, streaming, tool_choice='required'."""
status, chunks = _stream_http(
f"{base_url}/v1/chat/completions",
body = {
"messages": [{"role": "user", "content": "What is the weather in Tokyo?"}],
"tools": [_WEATHER_TOOL],
"tool_choice": "required",
"stream": True,
},
headers = {"Authorization": f"Bearer {api_key}"},
timeout = 120,
)
assert status == 200, f"Expected 200, got {status}"
assert len(chunks) > 0, "No SSE chunks received"
assert _final_finish_reason(chunks) == "tool_calls", (
f"Expected final finish_reason='tool_calls', got "
f"{_final_finish_reason(chunks)!r}"
)
assembled = _collect_streamed_tool_calls(chunks)
assert len(assembled) >= 1, "No tool_calls reassembled from stream"
first = assembled[0]
assert first["function"]["name"] == "get_weather"
parsed = json.loads(first["function"]["arguments"])
assert "city" in parsed
print(
f" PASS openai tools stream: {len(chunks)} chunks, "
f"tool={first['function']['name']}, args={parsed}"
)
def test_openai_tools_multiturn(base_url: str, api_key: str):
"""Multi-turn client-side tool loop: validates that role='tool' result
messages and assistant messages carrying tool_calls are accepted.
Regression: before the fix, ChatMessage.role was restricted to
{system,user,assistant} and rejected role='tool' at the Pydantic
validation stage. This test sends a full round trip so the model
receives the simulated tool result and responds with final text.
"""
status, text = _http(
"POST",
f"{base_url}/v1/chat/completions",
body = {
"messages": [
{"role": "user", "content": "What is the weather in Paris?"},
{
"role": "assistant",
"content": None,
"tool_calls": [
{
"id": "call_test_1",
"type": "function",
"function": {
"name": "get_weather",
"arguments": '{"city": "Paris"}',
},
}
],
},
{
"role": "tool",
"tool_call_id": "call_test_1",
"content": '{"temperature_c": 14, "condition": "cloudy"}',
},
],
"tools": [_WEATHER_TOOL],
"stream": False,
},
headers = {"Authorization": f"Bearer {api_key}"},
timeout = 120,
)
assert status == 200, f"Expected 200, got {status}: {text[:500]}"
data = json.loads(text)
msg = data["choices"][0]["message"]
# The model should respond with text now that it has the tool result
content = msg.get("content") or ""
assert len(content) > 0 or msg.get(
"tool_calls"
), f"Expected text or follow-up tool call, got empty message: {msg}"
print(f" PASS openai tools multiturn: {content[:80]!r}")
def test_openai_sdk_tool_calling(base_url: str, api_key: str):
"""OpenAI Python SDK round trip — the real client shape opencode et al. use."""
try:
from openai import OpenAI
except ImportError:
print(" SKIP openai SDK not installed")
return
client = OpenAI(base_url = f"{base_url}/v1", api_key = api_key)
resp = client.chat.completions.create(
model = "current",
messages = [{"role": "user", "content": "What's the weather in Berlin?"}],
tools = [_WEATHER_TOOL],
tool_choice = "required",
stream = False,
)
assert resp.choices[0].finish_reason == "tool_calls", (
f"Expected finish_reason='tool_calls', got "
f"{resp.choices[0].finish_reason!r}"
)
tool_calls = resp.choices[0].message.tool_calls
assert tool_calls and len(tool_calls) >= 1, "No tool_calls from SDK"
tc = tool_calls[0]
assert tc.function.name == "get_weather"
parsed = json.loads(tc.function.arguments)
assert "city" in parsed
print(
f" PASS openai SDK tool calling: " f"tool={tc.function.name}, args={parsed}"
)
def test_invalid_key_rejected(base_url: str):
"""Requests with a bad API key should be rejected."""
status, _text = _http(
@ -464,6 +713,73 @@ def test_anthropic_with_tools(base_url: str, api_key: str):
)
def test_anthropic_tool_choice_any(base_url: str, api_key: str):
"""Anthropic Messages API: ``tool_choice: {"type": "any"}`` must be
honored (forwarded as OpenAI ``tool_choice: "required"`` to
llama-server). Regression for the secondary fix bundled with #4999 —
previously this field was accepted on the request model but silently
dropped with a warning log, so the model was free to answer from
memory instead of using the tool.
"""
status, events = _stream_anthropic_http(
f"{base_url}/v1/messages",
body = {
"model": "default",
"max_tokens": 256,
"messages": [
# A question the model could easily answer from memory if
# tool_choice were not enforced.
{
"role": "user",
"content": "What is the weather in London right now?",
}
],
"tools": [
{
"name": "get_weather",
"description": "Look up current weather for a city.",
"input_schema": {
"type": "object",
"properties": {
"city": {"type": "string"},
},
"required": ["city"],
},
}
],
"tool_choice": {"type": "any"},
"stream": True,
},
headers = {"Authorization": f"Bearer {api_key}"},
timeout = 120,
)
assert status == 200, f"Expected 200, got {status}"
assert len(events) > 0, "No SSE events received"
# With tool_choice=any, stop_reason must be tool_use (not end_turn)
stop_reason = None
for etype, data in events:
if etype == "message_delta":
stop_reason = data.get("delta", {}).get("stop_reason") or stop_reason
assert stop_reason == "tool_use", (
f"Expected stop_reason='tool_use' with tool_choice=any, got "
f"{stop_reason!r} — tool_choice may not be forwarded to llama-server."
)
# And at least one tool_use content block must be emitted
tool_use_starts = [
e
for e in events
if e[0] == "content_block_start"
and e[1].get("content_block", {}).get("type") == "tool_use"
]
assert len(tool_use_starts) >= 1, "No tool_use content block emitted"
print(
f" PASS anthropic tool_choice=any honored: "
f"{len(tool_use_starts)} tool_use blocks, stop_reason={stop_reason}"
)
# ── Server lifecycle ─────────────────────────────────────────────────
@ -578,10 +894,10 @@ def main():
print(f" ERROR {fn.__name__}: {type(exc).__name__}: {exc}")
# ── 1. Test --help (no server needed) ────────────────────────────
print("\n[1/11] Testing --help output")
print("\n[1/16] Testing --help output")
run_test(test_help_output)
# ── 2-16. Start server and run API tests ─────────────────────────
print(
f"\nStarting server: {args.model} (variant={args.gguf_variant}) on port {PORT}..."
)
@ -591,39 +907,54 @@ def main():
base_url = f"http://{HOST}:{PORT}"
print(f"Server ready. API Key: {api_key[:20]}...\n")
print("[2/11] Testing curl basic (non-streaming)")
print("[2/16] Testing curl basic (non-streaming)")
run_test(test_curl_basic, base_url, api_key)
print("[3/11] Testing curl streaming")
print("[3/16] Testing curl streaming")
run_test(test_curl_streaming, base_url, api_key)
print("[4/11] Testing OpenAI Python SDK (streaming)")
print("[4/16] Testing OpenAI Python SDK (streaming)")
run_test(test_openai_sdk, base_url, api_key)
print("[5/11] Testing curl with tools")
print("[5/16] Testing curl with tools (server-side enable_tools)")
run_test(test_curl_with_tools, base_url, api_key)
print("[6/11] Testing invalid API key rejection")
print("[6/16] Testing OpenAI standard tools (non-streaming)")
run_test(test_openai_tools_nonstream, base_url, api_key)
print("[7/16] Testing OpenAI standard tools (streaming)")
run_test(test_openai_tools_stream, base_url, api_key)
print("[8/16] Testing OpenAI standard tools (multi-turn)")
run_test(test_openai_tools_multiturn, base_url, api_key)
print("[9/16] Testing OpenAI SDK tool calling")
run_test(test_openai_sdk_tool_calling, base_url, api_key)
print("[10/16] Testing invalid API key rejection")
run_test(test_invalid_key_rejected, base_url)
print("[7/11] Testing no API key rejection")
print("[11/16] Testing no API key rejection")
run_test(test_no_key_rejected, base_url)
print("[8/11] Testing Anthropic basic (non-streaming)")
print("[12/16] Testing Anthropic basic (non-streaming)")
run_test(test_anthropic_basic, base_url, api_key)
print("[9/11] Testing Anthropic streaming")
print("[13/16] Testing Anthropic streaming")
run_test(test_anthropic_streaming, base_url, api_key)
print("[10/11] Testing Anthropic Python SDK")
print("[14/16] Testing Anthropic Python SDK")
run_test(test_anthropic_sdk, base_url, api_key)
print("[11/11] Testing Anthropic with tools")
print("[15/16] Testing Anthropic with tools")
run_test(test_anthropic_with_tools, base_url, api_key)
print("[16/16] Testing Anthropic tool_choice=any honored")
run_test(test_anthropic_tool_choice_any, base_url, api_key)
except RuntimeError as exc:
print(f"\nFATAL: Server failed to start: {exc}")
failed += 16 # count remaining tests as failed
finally:
if proc:
print("\nStopping server...")