LocalAI/.github/workflows
Richard Palethorpe c894d9c826
feat(sglang): wire engine_args, add cuda13 build, ship MTP gallery demos (#9686)
Bring the sglang Python backend up to feature parity with vllm by adding
the same engine_args:-map plumbing the vLLM backend already has. Any
ServerArgs field (~380 in sglang 0.5.11) becomes settable from a model
YAML, including the speculative-decoding flags needed for Multi-Token
Prediction. Validation matches the vllm backend's: keys are checked
against dataclasses.fields(ServerArgs), unknown keys raise ValueError
with a difflib close-match suggestion at LoadModel time, and the typed
ModelOptions fields keep their existing meaning with engine_args
overriding them.

Backend code:
* backend/python/sglang/backend.py: add _apply_engine_args, import
  dataclasses/difflib/ServerArgs, call from LoadModel; rename Seed ->
  sampling_seed (sglang 0.5.11 renamed the SamplingParams field).
* backend/python/sglang/test.py + test.sh + Makefile: six unit tests
  exercising the helper directly (no engine load required).

Build / CI / backend gallery (cuda13 + l4t13 paths are now first-class):
* backend/python/sglang/install.sh: add --prerelease=allow because
  sglang 0.5.11 hard-pins flash-attn-4 which only ships beta wheels;
  add --index-strategy=unsafe-best-match for cublas12 so the cu128
  torch index wins over default-PyPI's cu130; new pyproject.toml-driven
  l4t13 install path so [tool.uv.sources] can pin torch/torchvision/
  torchaudio/sglang to the jetson-ai-lab index without forcing every
  transitive PyPI dep through the L4T mirror's flaky proxy (mirrors the
  equivalent fix in backend/python/vllm/install.sh).
* backend/python/sglang/pyproject.toml (new): L4T project spec with
  explicit-source jetson-ai-lab index. Replaces requirements-l4t13.txt
  for the l4t13 BUILD_PROFILE; other profiles still go through the
  requirements-*.txt pipeline via libbackend.sh's installRequirements.
* backend/python/sglang/requirements-l4t13.txt: removed; superseded
  by pyproject.toml.
* backend/python/sglang/requirements-cublas{12,13}{,-after}.txt: pin
  sglang>=0.5.11 (Gemma 4 floor); add cu130 torch index for cublas13
  (new files) and cu128 torch index for cublas12 (default PyPI now
  ships cu130 torch wheels by default and breaks cu12 hosts).
* backend/index.yaml: add cuda13-sglang and cuda13-sglang-development
  capability mappings + image entries pointing at
  quay.io/.../-gpu-nvidia-cuda-13-sglang.
* .github/workflows/backend.yml: new cublas13 sglang matrix entry,
  mirroring vllm's cuda13 build.

Model gallery + docs:
* gallery/sglang.yaml: base sglang config template, mirrors vllm.yaml.
* gallery/sglang-gemma-4-{e2b,e4b}-mtp.yaml: Gemma 4 MTP demos
  transcribed verbatim from the SGLang Gemma 4 cookbook MTP commands.
* gallery/sglang-mimo-7b-mtp.yaml: MiMo-7B-RL with built-in MTP heads
  + online fp8 weight quantization, verified end-to-end on a 16 GB
  RTX 5070 Ti at ~88 tok/s. Uses mem_fraction_static: 0.7 because the
  MTP draft worker's vocab embedding is loaded unquantised and OOMs
  the static reservation at sglang's 0.85 default.
* gallery/index.yaml: three new entries (gemma-4-e2b-it:sglang-mtp,
  gemma-4-e4b-it:sglang-mtp, mimo-7b-mtp:sglang).
* docs/content/features/text-generation.md: new SGLang section with
  setup, engine_args reference, MTP demos, version requirements.
* .agents/sglang-backend.md (new): agent one-pager covering the flat
  ServerArgs structure, the typed-vs-engine_args precedence, the
  speculative-decoding cheatsheet, and the mem_fraction_static gotcha
  documented above.
* AGENTS.md: index entry for the new agent doc.

Known limitation: the two Gemma 4 MTP gallery entries ship a recipe
that doesn't yet run on stock libraries. The drafter checkpoints
(google/gemma-4-{E2B,E4B}-it-assistant) declare
model_type: gemma4_assistant / Gemma4AssistantForCausalLM, which
neither transformers (<=5.6.0, including the SGLang cookbook's pinned
commit 91b1ab1f... and main HEAD) nor sglang's own model registry
(<=0.5.11) registers as of 2026-05-06. They will start working when
HF or sglang upstream registers the architecture -- no LocalAI
changes needed. The MiMo MTP demo and the non-MTP Gemma 4 paths work
today on this build (verified on RTX 5070 Ti, 16 GB).

Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Bash] [WebFetch] [WebSearch]

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-05-07 17:27:29 +02:00
..
disabled chore(ci): disable CI actions 2026-03-02 14:48:00 +01:00
backend.yml feat(sglang): wire engine_args, add cuda13 build, ship MTP gallery demos (#9686) 2026-05-07 17:27:29 +02:00
backend_build.yml fix(ci): switch apt mirror per runner — azure on github-hosted, kernel.org on self-hosted 2026-05-03 22:59:26 +00:00
backend_build_darwin.yml ci(darwin): add native caches to backend_build_darwin 2026-04-27 20:17:36 +00:00
backend_pr.yml fix(ci): fix AMDGPU_TARGETS empty-string bypass in hipblas builds (#9626) 2026-05-02 15:53:14 +02:00
build-test.yaml feat(ci): allow routing apt traffic through an alternate Ubuntu mirror (#9650) 2026-05-03 23:50:13 +02:00
bump-inference-defaults.yml chore(deps): bump peter-evans/create-pull-request from 7 to 8 (#9114) 2026-03-24 08:50:50 +01:00
bump_deps.yaml fix: unbreak master CI (docs, kokoros, vibevoice-cpp ABI) (#9682) 2026-05-06 10:36:59 +02:00
bump_docs.yaml fix(api)!: Stop model prior to deletion (#8422) 2026-02-06 09:22:10 +01:00
checksum_checker.yaml feat(ci): allow routing apt traffic through an alternate Ubuntu mirror (#9650) 2026-05-03 23:50:13 +02:00
deploy-explorer.yaml fix(api)!: Stop model prior to deletion (#8422) 2026-02-06 09:22:10 +01:00
gallery-agent.yaml Change cron schedule to run every 12 hours 2026-04-25 18:38:28 +02:00
generate_intel_image.yaml [intel GPU support] Use latest oneapi-basekit image for Intel images to support b70 (#9543) 2026-04-24 18:29:10 +02:00
gh-pages.yml chore(deps): bump actions/upload-pages-artifact from 4 to 5 (#9337) 2026-04-13 21:53:47 +02:00
image-pr.yml ci: switch image/backend build cache to a dedicated registry image 2026-04-27 13:13:04 +00:00
image.yml ci: switch image/backend build cache to a dedicated registry image 2026-04-27 13:13:04 +00:00
image_build.yml fix(ci): switch apt mirror per runner — azure on github-hosted, kernel.org on self-hosted 2026-05-03 22:59:26 +00:00
lint.yml chore(deps): bump actions/checkout from 4 to 6 (#9663) 2026-05-05 15:37:05 +02:00
notify-releases.yaml fix(api)!: Stop model prior to deletion (#8422) 2026-02-06 09:22:10 +01:00
release.yaml feat(ci): allow routing apt traffic through an alternate Ubuntu mirror (#9650) 2026-05-03 23:50:13 +02:00
secscan.yaml Revert "chore(deps): bump securego/gosec from 2.22.9 to 2.22.11" (#7789) 2025-12-30 09:58:13 +01:00
stalebot.yml chore(deps): bump actions/stale from 10.1.1 to 10.2.0 (#8633) 2026-02-23 23:27:20 +01:00
test-extra.yml feat: add LocalVQE backend and audio transformations UI (#9640) 2026-05-04 22:07:11 +02:00
test.yml refactor(tests): split app_test.go, move real-backend coverage to e2e-backends 2026-04-27 23:09:20 +00:00
tests-aio.yml refactor(tests): split app_test.go, move real-backend coverage to e2e-backends 2026-04-27 23:09:20 +00:00
tests-e2e.yml feat(ci): allow routing apt traffic through an alternate Ubuntu mirror (#9650) 2026-05-03 23:50:13 +02:00
tests-ui-e2e.yml feat(ci): allow routing apt traffic through an alternate Ubuntu mirror (#9650) 2026-05-03 23:50:13 +02:00
update_swagger.yaml feat(ci): allow routing apt traffic through an alternate Ubuntu mirror (#9650) 2026-05-03 23:50:13 +02:00
yaml-check.yml chore(backend gallery): add description for remaining backends (#5679) 2025-06-17 22:21:44 +02:00