unsloth

mirror of https://github.com/unslothai/unsloth synced 2026-04-21 13:37:39 +00:00

Roland Tannous 21e9a91a57 Studio: forward standard OpenAI tools / tool_choice on /v1/responses (Codex compat) (#5122)	2026-04-21 13:17:20 +04:00

* Studio: forward standard OpenAI tools / tool_choice on /v1/responses

Mirrors the /v1/chat/completions client-side tool pass-through from #5099 so clients (OpenAI Codex CLI, OpenAI Python SDK, ...) that target the Responses API receive structured function_call output items instead of plain text with tool-call tokens leaking into content.

- ResponsesRequest: type tools/tool_choice properly, add parallel_tool_calls; accept function_call and function_call_output input items for multi-turn
- Translate flat Responses tool / tool_choice shape to the nested Chat Completions shape before forwarding to llama-server
- _normalise_responses_input: map function_call_output -> role="tool", function_call -> assistant tool_calls (preserving call_id)
- Non-streaming: map returned tool_calls -> top-level function_call output items keyed by call_id
- Streaming: emit response.output_item.added (function_call), response.function_call_arguments.delta/.done, and response.output_item.done per tool call while keeping the text message at output_index 0
- Pytest coverage: tools/tool_choice translation, multi-turn input mapping, non-streaming tool_calls mapping, response round-trip

* Studio: merge system messages and close inner stream on /v1/responses

Fixes two issues surfacing when OpenAI Codex CLI drives /v1/responses against a GGUF with a strict chat template (gpt-oss harmony, Qwen3, ...).

1. "System message must be at the beginning" upstream errors

   Codex sends `instructions` AND a `role:"developer"` message in `input`, producing two separate system-role messages. Strict templates raise when a second system message exists or when one appears after a user turn. _normalise_responses_input now hoists all instructions / system / developer content into a single merged system message at the top of the Chat Completions message list.

2. "async generator ignored GeneratorExit" / "Attempted to exit cancel scope in a different task"

   _responses_stream consumed the inner chat-completions body_iterator without an explicit aclose() in a finally block. On client disconnect (Codex frequently cancels mid-stream), Python 3.13 finalized the inner async generator on a different task, tripping anyio's cancel-scope check. Mirrored the same try/finally + aclose pattern used by the /v1/messages, /v1/chat/completions, and /v1/completions passthroughs.

Tests: hoisting of instructions + developer, developer mid-conversation, multiple system messages in input, no-system passthrough.

* Studio: accept Codex multi-turn shapes and fix cross-task stream close on /v1/responses

Two issues observed driving /v1/responses from OpenAI Codex CLI against a GGUF backend.

1. 422 on every turn after the first

   Codex replays prior assistant turns with `content:[{"type":"output_text","text":...,"annotations":[],"logprobs":[]}]` and carries forward `reasoning` items (o-series / gpt-5) between turns. Our `ResponsesContentPart` union only accepted input_text / input_image, and `ResponsesInputItem` only message / function_call / function_call_output, so Pydantic failed the whole list and FastAPI returned `"Input should be a valid string"` against the `str` branch of the outer union.

   - Add `ResponsesOutputTextPart` for assistant-replay content.
   - Add `ResponsesUnknownContentPart` and `ResponsesUnknownInputItem` as permissive catch-alls (drop during normalisation).
   - Wire an explicit `Discriminator` so dispatch is deterministic and the fallthrough reaches the catch-all instead of misreporting via the outer `Union[str, list[...]]`.
   - `_normalise_responses_input` now accepts output_text parts, flattens single-part assistant text to a plain string (keeps legacy chat templates happy), and silently drops reasoning / unknown items.

2. "async generator ignored GeneratorExit" / cross-task cancel scope

   `_responses_stream` awaited `openai_chat_completions` in the parent route-handler task, which opens the httpx client for the inner passthrough on that task. The outer `StreamingResponse` then iterates in a child task, so the asyncgen GC finalises the inner httpcore byte stream on the child task, tripping anyio's "Attempted to exit cancel scope in a different task". Move the `await` inside `event_generator` so the httpx lifecycle stays within the single streaming child task, and surface any HTTPException as a `response.failed` SSE frame.

Tests: assistant output_text replay, reasoning-item tolerance, unknown content-part tolerance, end-to-end Codex-shape payload (developer + user + reasoning + function_call + function_call_output + assistant output_text + user), and single-part assistant flattening to plain string.

* Studio: call llama-server directly from streaming /v1/responses

The previous fix (running the inner await inside event_generator) was not enough. Wrapping the existing `openai_chat_completions` pass-through still stacks two async generators: when the outer generator is closed, the innermost `HTTP11ConnectionByteStream.__aiter__` in httpcore doesn't receive GeneratorExit before Python's asyncgen GC finalises it in a sibling task, tripping "Attempted to exit cancel scope in a different task" and "async generator ignored GeneratorExit", the same Python 3.13 + httpcore 1.0.x interaction already seen in PRs #4956, #4981, #5099.

Cure both pass-throughs had: a single same-task httpx lifecycle with explicit `aiter_lines().aclose()` BEFORE `resp.aclose()` / `client.aclose()` in the generator's finally block. Apply it at the Responses layer by dropping the wrapper entirely for GGUF: open httpx, consume `resp.aiter_lines()`, parse `chat.completion.chunk`, emit Responses SSE events, close everything in finally, all in the single StreamingResponse child task.

Non-GGUF streaming is rejected with a 400 (wrapping the transformers backend would re-introduce the double-layer pattern and isn't a Codex-compatible path today anyway).

Also surfaces upstream httpx.RequestError / non-200 as a `response.failed` SSE frame rather than a dropped stream now that the request is dispatched after SSE headers have gone out.

* Studio: silence benign httpcore asyncgen GC warnings on Python 3.13

The streaming pass-throughs (/v1/chat/completions, /v1/messages, /v1/responses, /v1/completions) all use the proven #4981 / #5099 pattern: single-task httpx lifecycle with explicit aiter_lines().aclose() ahead of resp.aclose() / client.aclose() in the generator's finally block. That handles our own iterators correctly.

The residual noise ("async generator ignored GeneratorExit" / "Attempted to exit cancel scope in a different task") comes from an innermost HTTP11ConnectionByteStream.__aiter__ that httpcore creates internally inside its pool. We hold no reference to it, so we cannot aclose it ourselves. Python 3.13's asyncgen GC hook finalises it on the finaliser task, its aclose path enters an anyio CancelScope shield, and Python flags the cross-task exit. The response has already been delivered with a 200 by then; it is purely log noise, not a functional failure. Same interaction seen in modelcontextprotocol/python-sdk #831, agno #3556, chainlit #2361, langchain-mcp-adapters #254.

Install a targeted sys.unraisablehook that swallows this specific tuple (RuntimeError mentioning "cancel scope" or "GeneratorExit" plus an object repr referencing HTTP11ConnectionByteStream) and defers to the default hook for every other unraisable. Idempotent; guarded by a sentinel attribute so repeated imports don't stack filters.
..
assets	Add Qwen3.6 inference defaults for Studio (#5065)	2026-04-16 11:42:42 -07:00
auth	Studio: Expose openai and anthropic compatible external API end points (#4956)	2026-04-13 21:08:11 +04:00
core	Studio: forward standard OpenAI tools / tool_choice to llama-server (#5099)	2026-04-18 12:53:23 +04:00
loggers	Final cleanup	2026-03-12 18:28:04 +00:00
models	Studio: forward standard OpenAI tools / tool_choice on /v1/responses (Codex compat) (#5122)	2026-04-21 13:17:20 +04:00
plugins	Bump Data Designer to 0.5.4 (removes litellm dependency) (#4569)	2026-03-25 02:01:43 -07:00
requirements	change torchcodec version to 0.10.0 in extra-no-deps (#5043)	2026-04-16 19:50:57 +04:00
routes	Studio: forward standard OpenAI tools / tool_choice on /v1/responses (Codex compat) (#5122)	2026-04-21 13:17:20 +04:00
state	Final cleanup	2026-03-12 18:28:04 +00:00
storage	feat: custom scan folders for GGUF model discovery (#4723)	2026-03-31 06:40:31 -07:00
tests	Studio: forward standard OpenAI tools / tool_choice on /v1/responses (Codex compat) (#5122)	2026-04-21 13:17:20 +04:00
utils	chore: switch helper and no-cache fallback to Gemma (#5066)	2026-04-16 22:27:30 +04:00
__init__.py	Final cleanup	2026-03-12 18:28:04 +00:00
_platform_compat.py	Fix Studio crash on Anaconda/conda-forge Python (#4484)	2026-03-22 05:36:55 -07:00
colab.py	Fix/studio colab button message: Add fallback message for Colab Studio button when proxy URL fails (#4866)	2026-04-05 21:57:45 -07:00
main.py	Chat first onboarding (#5063)	2026-04-16 09:58:10 -07:00
run.py	Studio: Expose openai and anthropic compatible external API end points (#4956)	2026-04-13 21:08:11 +04:00
startup_banner.py	studio: unify Windows installer/setup logging style, verbosity controls, and startup messaging (#4651)	2026-03-30 00:53:23 -07:00