LocalAI

mirror of https://github.com/mudler/LocalAI synced 2026-05-24 09:28:23 +00:00

Richard Palethorpe c894d9c826 feat(sglang): wire engine_args, add cuda13 build, ship MTP gallery demos (#9686)	2026-05-07 17:27:29 +02:00

Bring the sglang Python backend up to feature parity with vllm by adding the same engine_args:-map plumbing the vLLM backend already has. Any ServerArgs field (~380 in sglang 0.5.11) becomes settable from a model YAML, including the speculative-decoding flags needed for Multi-Token Prediction.

Validation matches the vllm backend's: keys are checked against dataclasses.fields(ServerArgs), unknown keys raise ValueError with a difflib close-match suggestion at LoadModel time, and the typed ModelOptions fields keep their existing meaning with engine_args overriding them.

Backend code:

* backend/python/sglang/backend.py: add _apply_engine_args, import dataclasses/difflib/ServerArgs, call from LoadModel; rename Seed -> sampling_seed (sglang 0.5.11 renamed the SamplingParams field).
* backend/python/sglang/test.py + test.sh + Makefile: six unit tests exercising the helper directly (no engine load required).

Build / CI / backend gallery (cuda13 + l4t13 paths are now first-class):

* backend/python/sglang/install.sh: add --prerelease=allow because sglang 0.5.11 hard-pins flash-attn-4 which only ships beta wheels; add --index-strategy=unsafe-best-match for cublas12 so the cu128 torch index wins over default-PyPI's cu130; new pyproject.toml-driven l4t13 install path so [tool.uv.sources] can pin torch/torchvision/torchaudio/sglang to the jetson-ai-lab index without forcing every transitive PyPI dep through the L4T mirror's flaky proxy (mirrors the equivalent fix in backend/python/vllm/install.sh).
* backend/python/sglang/pyproject.toml (new): L4T project spec with explicit-source jetson-ai-lab index. Replaces requirements-l4t13.txt for the l4t13 BUILD_PROFILE; other profiles still go through the requirements-.txt pipeline via libbackend.sh's installRequirements.
* backend/python/sglang/requirements-l4t13.txt: removed; superseded by pyproject.toml.
* backend/python/sglang/requirements-cublas{12,13}{,-after}.txt: pin sglang>=0.5.11 (Gemma 4 floor); add cu130 torch index for cublas13 (new files) and cu128 torch index for cublas12 (default PyPI now ships cu130 torch wheels by default and breaks cu12 hosts).
* backend/index.yaml: add cuda13-sglang and cuda13-sglang-development capability mappings + image entries pointing at quay.io/.../-gpu-nvidia-cuda-13-sglang.
* .github/workflows/backend.yml: new cublas13 sglang matrix entry, mirroring vllm's cuda13 build.

Model gallery + docs:

* gallery/sglang.yaml: base sglang config template, mirrors vllm.yaml.
* gallery/sglang-gemma-4-{e2b,e4b}-mtp.yaml: Gemma 4 MTP demos transcribed verbatim from the SGLang Gemma 4 cookbook MTP commands.
* gallery/sglang-mimo-7b-mtp.yaml: MiMo-7B-RL with built-in MTP heads + online fp8 weight quantization, verified end-to-end on a 16 GB RTX 5070 Ti at ~88 tok/s. Uses mem_fraction_static: 0.7 because the MTP draft worker's vocab embedding is loaded unquantised and OOMs the static reservation at sglang's 0.85 default.
* gallery/index.yaml: three new entries (gemma-4-e2b-it:sglang-mtp, gemma-4-e4b-it:sglang-mtp, mimo-7b-mtp:sglang).
* docs/content/features/text-generation.md: new SGLang section with setup, engine_args reference, MTP demos, version requirements.
* .agents/sglang-backend.md (new): agent one-pager covering the flat ServerArgs structure, the typed-vs-engine_args precedence, the speculative-decoding cheatsheet, and the mem_fraction_static gotcha documented above.
* AGENTS.md: index entry for the new agent doc.

Known limitation: the two Gemma 4 MTP gallery entries ship a recipe that doesn't yet run on stock libraries. The drafter checkpoints (google/gemma-4-{E2B,E4B}-it-assistant) declare model_type: gemma4_assistant / Gemma4AssistantForCausalLM, which neither transformers (<=5.6.0, including the SGLang cookbook's pinned commit 91b1ab1f... and main HEAD) nor sglang's own model registry (<=0.5.11) registers as of 2026-05-06. They will start working when HF or sglang upstream registers the architecture -- no LocalAI changes needed. The MiMo MTP demo and the non-MTP Gemma 4 paths work today on this build (verified on RTX 5070 Ti, 16 GB).

Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Bash] [WebFetch] [WebSearch]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
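
The engine_args validation described in the commit message above (keys checked against dataclasses.fields(ServerArgs), unknown keys rejected with a difflib close-match suggestion, engine_args overriding the typed options) can be sketched roughly as follows. This is a minimal illustration and not the code in backend/python/sglang/backend.py: ServerArgsStandIn is a hypothetical stand-in for sglang's ServerArgs with only a few illustrative fields, and apply_engine_args, its signature, and the error wording are assumptions made for the example.

```python
import dataclasses
import difflib
from typing import Any, Dict, Optional


@dataclasses.dataclass
class ServerArgsStandIn:
    """Hypothetical stand-in for sglang's ServerArgs (illustrative fields only)."""
    model_path: str = ""
    mem_fraction_static: float = 0.85
    speculative_algorithm: Optional[str] = None
    speculative_num_steps: Optional[int] = None


def apply_engine_args(
    typed_options: Dict[str, Any],
    engine_args: Dict[str, Any],
    args_cls: type = ServerArgsStandIn,
) -> Dict[str, Any]:
    """Merge engine_args over typed options, rejecting unknown keys.

    Assumed behaviour, following the commit message: every key must be a
    dataclass field of args_cls, and engine_args wins on conflicts.
    """
    valid = {f.name for f in dataclasses.fields(args_cls)}
    merged = dict(typed_options)
    for key, value in engine_args.items():
        if key not in valid:
            # Suggest the closest valid field name, if any is similar enough.
            close = difflib.get_close_matches(key, sorted(valid), n=1)
            hint = f" (did you mean {close[0]!r}?)" if close else ""
            raise ValueError(f"unknown engine_args key {key!r}{hint}")
        merged[key] = value  # engine_args overrides the typed option
    return merged


if __name__ == "__main__":
    typed = {"model_path": "some/model", "mem_fraction_static": 0.85}
    print(apply_engine_args(typed, {"mem_fraction_static": 0.7}))
    try:
        apply_engine_args(typed, {"mem_fraction_statc": 0.7})
    except ValueError as err:
        print(err)
```

Failing at load time with a suggestion, rather than silently ignoring a misspelled key, is the design point the commit message highlights: a typo in a model YAML surfaces immediately instead of leaving a setting such as mem_fraction_static at its default.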
alpaca.yaml	feat: refactor build process, drop embedded backends (#5875)	2025-07-22 16:31:04 +02:00
arch-function.yaml	feat: refactor build process, drop embedded backends (#5875)	2025-07-22 16:31:04 +02:00
cerbero.yaml	fix: yamlint warnings and errors (#2131)	2024-04-25 17:25:56 +00:00
chatml-hercules.yaml	feat: refactor build process, drop embedded backends (#5875)	2025-07-22 16:31:04 +02:00
chatml.yaml	feat(gallery): Speed up load times and clean gallery entries (#9211)	2026-05-06 14:51:38 +02:00
codellama.yaml	fix: yamlint warnings and errors (#2131)	2024-04-25 17:25:56 +00:00
command-r.yaml	feat: refactor build process, drop embedded backends (#5875)	2025-07-22 16:31:04 +02:00
deephermes.yaml	feat: refactor build process, drop embedded backends (#5875)	2025-07-22 16:31:04 +02:00
deepseek-r1.yaml	feat(gallery): Speed up load times and clean gallery entries (#9211)	2026-05-06 14:51:38 +02:00
deepseek.yaml	feat: refactor build process, drop embedded backends (#5875)	2025-07-22 16:31:04 +02:00
dreamshaper.yaml	fix: yamlint warnings and errors (#2131)	2024-04-25 17:25:56 +00:00
falcon3.yaml	feat: refactor build process, drop embedded backends (#5875)	2025-07-22 16:31:04 +02:00
flux-ggml.yaml	fix(flux): Set CFG=1 so that prompts are followed (#5378)	2025-05-16 17:53:54 +02:00
flux.yaml	fix(flux): Set CFG=1 so that prompts are followed (#5378)	2025-05-16 17:53:54 +02:00
gemma.yaml	feat(gallery): Speed up load times and clean gallery entries (#9211)	2026-05-06 14:51:38 +02:00
granite.yaml	feat: refactor build process, drop embedded backends (#5875)	2025-07-22 16:31:04 +02:00
granite3-2.yaml	feat: refactor build process, drop embedded backends (#5875)	2025-07-22 16:31:04 +02:00
granite4.yaml	feat(gallery): Speed up load times and clean gallery entries (#9211)	2026-05-06 14:51:38 +02:00
harmony.yaml	feat(gallery): Speed up load times and clean gallery entries (#9211)	2026-05-06 14:51:38 +02:00
hermes-2-pro-mistral.yaml	feat: refactor build process, drop embedded backends (#5875)	2025-07-22 16:31:04 +02:00
hermes-vllm.yaml	chore(model-gallery): add more quants for popular models (#3365)	2024-08-24 00:29:24 +02:00
index.yaml	feat(sglang): wire engine_args, add cuda13 build, ship MTP gallery demos (#9686)	2026-05-07 17:27:29 +02:00
jamba.yaml	chore(model gallery): add ai21labs_ai21-jamba-reasoning-3b (#6417)	2025-10-09 15:00:56 +02:00
kokoros.yaml	feat: Add Kokoros backend (#9212)	2026-04-08 19:23:16 +02:00
lfm.yaml	feat(gallery): Speed up load times and clean gallery entries (#9211)	2026-05-06 14:51:38 +02:00
llama3-instruct.yaml	feat: refactor build process, drop embedded backends (#5875)	2025-07-22 16:31:04 +02:00
llama3.1-instruct-grammar.yaml	feat: refactor build process, drop embedded backends (#5875)	2025-07-22 16:31:04 +02:00
llama3.1-instruct.yaml	feat: refactor build process, drop embedded backends (#5875)	2025-07-22 16:31:04 +02:00
llama3.1-reflective.yaml	feat: refactor build process, drop embedded backends (#5875)	2025-07-22 16:31:04 +02:00
llama3.2-fcall.yaml	feat: refactor build process, drop embedded backends (#5875)	2025-07-22 16:31:04 +02:00
llama3.2-quantized.yaml	feat: refactor build process, drop embedded backends (#5875)	2025-07-22 16:31:04 +02:00
llava.yaml	fix: yamlint warnings and errors (#2131)	2024-04-25 17:25:56 +00:00
mathstral.yaml	feat: refactor build process, drop embedded backends (#5875)	2025-07-22 16:31:04 +02:00
mistral-0.3.yaml	feat: refactor build process, drop embedded backends (#5875)	2025-07-22 16:31:04 +02:00
moondream.yaml	feat(gallery): Speed up load times and clean gallery entries (#9211)	2026-05-06 14:51:38 +02:00
mudler.yaml	feat: refactor build process, drop embedded backends (#5875)	2025-07-22 16:31:04 +02:00
nanbeige4.1.yaml	feat(gallery): Speed up load times and clean gallery entries (#9211)	2026-05-06 14:51:38 +02:00
noromaid.yaml	fix: yamlint warnings and errors (#2131)	2024-04-25 17:25:56 +00:00
openvino.yaml	feat(gallery): Speed up load times and clean gallery entries (#9211)	2026-05-06 14:51:38 +02:00
parler-tts.yaml	fix: yamlint warnings and errors (#2131)	2024-04-25 17:25:56 +00:00
phi-2-chat.yaml	feat: refactor build process, drop embedded backends (#5875)	2025-07-22 16:31:04 +02:00
phi-2-orange.yaml	feat: refactor build process, drop embedded backends (#5875)	2025-07-22 16:31:04 +02:00
phi-3-chat.yaml	feat: refactor build process, drop embedded backends (#5875)	2025-07-22 16:31:04 +02:00
phi-3-vision.yaml	fix(phi3-vision): add multimodal template (#3944)	2024-10-23 15:34:45 +02:00
phi-4-chat-fcall.yaml	feat: refactor build process, drop embedded backends (#5875)	2025-07-22 16:31:04 +02:00
phi-4-chat.yaml	feat: refactor build process, drop embedded backends (#5875)	2025-07-22 16:31:04 +02:00
piper.yaml	fix: yamlint warnings and errors (#2131)	2024-04-25 17:25:56 +00:00
pocket-tts.yaml	feat(tts): add pocket-tts backend (#8018)	2026-01-13 23:35:19 +01:00
qwen-fcall.yaml	feat: refactor build process, drop embedded backends (#5875)	2025-07-22 16:31:04 +02:00
qwen-image.yaml	Update qwen-image.yaml	2025-08-06 10:40:46 +02:00
qwen3-deepresearch.yaml	chore(model gallery): add alibaba-nlp_tongyi-deepresearch-30b-a3b (#6295)	2025-09-17 09:22:19 +02:00
qwen3-openbuddy.yaml	feat: refactor build process, drop embedded backends (#5875)	2025-07-22 16:31:04 +02:00
qwen3.yaml	feat(gallery): Speed up load times and clean gallery entries (#9211)	2026-05-06 14:51:38 +02:00
rerankers.yaml	fix: yamlint warnings and errors (#2131)	2024-04-25 17:25:56 +00:00
rwkv.yaml	feat: refactor build process, drop embedded backends (#5875)	2025-07-22 16:31:04 +02:00
sd-ggml.yaml	chore(model gallery): add sd-3.5-large-ggml (#4647)	2025-01-20 19:04:23 +01:00
sentencetransformers.yaml	fix: yamlint warnings and errors (#2131)	2024-04-25 17:25:56 +00:00
sglang-gemma-4-e2b-mtp.yaml	feat(sglang): wire engine_args, add cuda13 build, ship MTP gallery demos (#9686)	2026-05-07 17:27:29 +02:00
sglang-gemma-4-e4b-mtp.yaml	feat(sglang): wire engine_args, add cuda13 build, ship MTP gallery demos (#9686)	2026-05-07 17:27:29 +02:00
sglang-mimo-7b-mtp.yaml	feat(sglang): wire engine_args, add cuda13 build, ship MTP gallery demos (#9686)	2026-05-07 17:27:29 +02:00
sglang.yaml	feat(sglang): wire engine_args, add cuda13 build, ship MTP gallery demos (#9686)	2026-05-07 17:27:29 +02:00
sherpa-onnx-asr.yaml	feat: Add Sherpa ONNX backend for ASR and TTS (#8523)	2026-04-24 14:40:06 +02:00
sherpa-onnx-tts.yaml	feat: Add Sherpa ONNX backend for ASR and TTS (#8523)	2026-04-24 14:40:06 +02:00
sherpa-onnx-vad.yaml	feat: Add Sherpa ONNX backend for ASR and TTS (#8523)	2026-04-24 14:40:06 +02:00
smolvlm.yaml	feat(gallery): Speed up load times and clean gallery entries (#9211)	2026-05-06 14:51:38 +02:00
stablediffusion3.yaml	feat(sd-3): add stablediffusion 3 support (#2591)	2024-06-18 15:09:39 +02:00
tuluv2.yaml	feat: refactor build process, drop embedded backends (#5875)	2025-07-22 16:31:04 +02:00
vibevoice.yaml	feat(vibevoice): add new backend (#7494)	2025-12-10 21:14:21 +01:00
vicuna-chat.yaml	models(gallery): add apollo2-9b (#3860)	2024-10-17 10:16:52 +02:00
virtual.yaml	fix: yamlint warnings and errors (#2131)	2024-04-25 17:25:56 +00:00
vllm.yaml	feat: refactor build process, drop embedded backends (#5875)	2025-07-22 16:31:04 +02:00
wan-ggml.yaml	chore(gallery): fixup wan	2026-04-19 21:31:22 +00:00
whisper-base.yaml	models(gallery): add all whisper variants (#2462)	2024-06-01 20:04:03 +02:00
wizardlm2.yaml	feat: refactor build process, drop embedded backends (#5875)	2025-07-22 16:31:04 +02:00
z-image-ggml.yaml	Fix load of z-image-turbo (#9264)	2026-04-11 08:42:13 +02:00