feat(gallery): add Wan 2.1 I2V 14B 720P + pin all wan ggufs by sha256
Adds a new entry for the native-720p image-to-video sibling of the
480p I2V model (wan-2.1-i2v-14b-480p-ggml). The 720p I2V model is
trained purely as image-to-video — no first-last-frame interpolation
path — so motion is freer than repurposing the FLF2V 720P variant as
an i2v. Shares the same VAE, umt5_xxl text encoder, and clip_vision_h
auxiliary files as the existing 480p I2V and 720p FLF2V entries, so
no new aux downloads are introduced.
Also pins the main diffusion gguf by sha256 for the new entry and for
the three existing wan entries that were previously missing a hash
(wan-2.1-t2v-1.3b-ggml, wan-2.1-i2v-14b-480p-ggml,
wan-2.1-flf2v-14b-720p-ggml). Hashes were fetched from HuggingFace's
x-linked-etag header per .agents/adding-gallery-models.md.
Assisted-by: Claude:claude-opus-4-7
The API Traces tab in /app/traces always showed (0) traces despite requests
being recorded.
The /api/traces endpoint was registered in both localai.go and ui_api.go.
The ui_api.go version wrapped the response as {"traces": [...]} instead of
the flat []APIExchange array that both the React UI (Traces.jsx) and the
legacy Alpine.js UI (traces.html) expect. Because Echo matched the ui_api.go
handler, Array.isArray(apiData) always returned false, making the API Traces
tab permanently empty.
Remove the duplicate endpoints from ui_api.go so only the correct flat-array
version in localai.go is served.
Also use mime.ParseMediaType for the Content-Type check in the trace
middleware so requests with parameters (e.g. application/json; charset=utf-8)
are still traced.
Signed-off-by: Pawel Brzozowski <paul@ontux.net>
Co-authored-by: Pawel Brzozowski <paul@ontux.net>
GET /api/settings returns settings.ApiKeys as the merged env+runtime list
via ApplicationConfig.ToRuntimeSettings(). The WebUI displays that list and
round-trips it back on POST /api/settings unchanged.
UpdateSettingsEndpoint was then doing:
appConfig.ApiKeys = append(envKeys, runtimeKeys...)
where runtimeKeys already contained envKeys (because the UI got them from
the merged GET). Every save therefore duplicated the env keys on top of
the previous merge, and also wrote the duplicates to runtime_settings.json
so the duplication survived restarts and compounded with each save. This
is the user-visible behaviour in #9071: the Web UI shows the keys
twice / three times after consecutive saves.
Before we marshal the settings to disk or call ApplyRuntimeSettings, drop
any incoming key that already appears in startupConfig.ApiKeys. The file
on disk now stores only the genuinely runtime-added keys; the subsequent
append(envKeys, runtimeKeys...) produces one copy of each env key, as
intended. Behaviour is unchanged for users who never had env keys set.
Fixes#9071
Co-authored-by: SAY-5 <SAY-5@users.noreply.github.com>
First-last-frame-to-video variant of the 14B Wan family. Accepts a
start and end reference image and — unlike the pure i2v path — runs
both through clip_vision, so the final frame lands on the end image
both in pixel and semantic space. Right pick for seamless loops
(start_image == end_image) and narrative A→B cuts.
Shares the same VAE, umt5_xxl text encoder, and clip_vision_h as the
I2V 14B entry. Options block mirrors i2v's full-list-in-override
style so the template merge doesn't drop fields.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Align LocalAI with the Linux kernel project's policy for AI-assisted
contributions (https://docs.kernel.org/process/coding-assistants.html).
- Add .agents/ai-coding-assistants.md with the full policy adapted to
LocalAI's MIT license: no Signed-off-by or Co-Authored-By from AI,
attribute AI involvement via an Assisted-by: trailer, human submitter
owns the contribution.
- Surface the rules at the entry points: AGENTS.md (and its CLAUDE.md
symlink) and CONTRIBUTING.md.
- Publish a user-facing reference page at
docs/content/reference/ai-coding-assistants.md and link it from the
references index.
Assisted-by: Claude:claude-opus-4-7
gen_video's ffmpeg subprocess was relying on the filename extension to
choose the output container. Distributed LocalAI hands the backend a
staging path (e.g. /staging/localai-output-NNN.tmp) that is renamed to
.mp4 only after the backend returns, so ffmpeg saw a .tmp extension and
bailed with "Unable to choose an output format". Inference had already
completed and the frames were piped in, producing the cryptic
"video inference failed (code 1)" at the API layer.
Pass -f mp4 explicitly so the container is selected by flag instead of
by filename suffix.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two interrelated bugs that combined to make a meta backend impossible
to uninstall once its concrete had been removed from disk (partial
install, earlier crash, manual cleanup).
1. DeleteBackendFromSystem returned "meta backend %q not found" and
bailed out early when the concrete directory didn't exist,
preventing the orphaned meta dir from ever being removed. Treat a
missing concrete as idempotent success — log a warning and continue
to remove the orphan meta.
2. InstallBackendFromGallery's "already installed, skip" short-circuit
only checked that the name was known (`backends.Exists(name)`); an
orphaned meta whose RunFile points at a missing concrete still
satisfies that check, so every reinstall returned nil without doing
anything. Afterwards the worker's findBackend returned empty and we
kept looping with "backend %q not found after install attempt".
Require the entry to be actually runnable (run.sh stat-able, not a
directory) before skipping.
New helper isBackendRunnable centralises the runnability test so both
the install guard and future callers stay in sync. Tests cover the
orphaned-meta delete path and the non-runnable short-circuit case.
pending_backend_ops rows targeting agent-type workers looped forever:
the reconciler fan-out hit a NATS subject the worker doesn't subscribe
to, returned ErrNoResponders, we marked the node unhealthy, and the
health monitor flipped it back to healthy on the next heartbeat. Next
tick, same row, same failure.
Three related fixes:
1. enqueueAndDrainBackendOp skips nodes whose NodeType != backend.
Agent workers handle agent NATS subjects, not backend.install /
delete / list, so enqueueing for them guarantees an infinite retry
loop. Silent skip is correct — they aren't consumers of these ops.
2. Reconciler drain mirrors enqueueAndDrainBackendOp's behavior on
nats.ErrNoResponders: mark the node unhealthy before recording the
failure, so subsequent ListDuePendingBackendOps (filters by
status=healthy) stops picking the row until the node actually
recovers. Matches the synchronous fan-out path.
3. Dead-letter cap at maxPendingBackendOpAttempts (10). After ~1h of
exponential backoff the row is a poison message; further retries
just thrash NATS. Row is deleted and logged at ERROR so it stays
visible without staying infinite.
Plus a one-shot startup cleanup in NewNodeRegistry: drop queue rows
that target agent-type nodes, non-existent nodes, or carry an empty
backend name. Guarded by the same schema-migration advisory lock so
only one instance performs it. The guards above prevent new rows of
this shape; this closes the migration gap for existing ones.
Tests: the prune migration (valid row stays, agent + empty-name rows
drop) on top of existing upsert / backoff coverage.
* fix(turboquant): drop ignore-eos patch, bump fork to b8967-627ebbc
The upstream PR #21203 (server: respect the ignore_eos flag) has been
merged into the TheTom/llama-cpp-turboquant feature/turboquant-kv-cache
branch. With the fix now in-tree, 0001-server-respect-the-ignore-eos-flag.patch
no longer applies (git apply sees its additions already present) and the
nightly turboquant bump fails.
Retire the patch and bump the pin to the first fork revision that carries
the merged fix (tag feature-turboquant-kv-cache-b8967-627ebbc). This matches
the contract in apply-patches.sh: drop patches once the fork catches up.
* fix(turboquant): patch out get_media_marker() call in grpc-server copy
CI turboquant docker build was failing with:
grpc-server.cpp:2825:40: error: use of undeclared identifier
'get_media_marker'
The call was added by 7809c5f5 (PR #9412) to propagate the mtmd random
per-server media marker upstream landed in ggml-org/llama.cpp#21962. The
TheTom/llama-cpp-turboquant fork branched before that PR, so its
server-common.cpp has no such symbol.
Extend patch-grpc-server.sh to substitute get_media_marker() with the
legacy "<__media__>" literal in the build-time grpc-server.cpp copy
under turboquant-<flavor>-build/. The fork's mtmd_default_marker()
returns exactly that string, and the Go layer falls back to the same
sentinel when media_marker is empty, so behavior on the turboquant path
is unchanged. Patched copy only — the shared source under
backend/cpp/llama-cpp/ keeps compiling against vanilla upstream.
Verified by running `make docker-build-turboquant` locally end-to-end:
all five flavors (avx, avx2, avx512, fallback, grpc+rpc-server) now
compile past the previous failure and the image tags successfully.
* fix(distributed): detect backend upgrades across worker nodes
Before this change `DistributedBackendManager.CheckUpgrades` delegated to the
local manager, which read backends from the frontend filesystem. In
distributed deployments the frontend has no backends installed locally —
they live on workers — so the upgrade-detection loop never ran and the UI
silently never surfaced upgrades even when the gallery advertised newer
versions or digests.
Worker-side: NATS backend.list reply now carries Version, URI and Digest
for each installed backend (read from metadata.json).
Frontend-side: DistributedBackendManager.ListBackends aggregates per-node
refs (name, status, version, digest) instead of deduping, and CheckUpgrades
feeds that aggregation into gallery.CheckUpgradesAgainst — a new entrypoint
factored out of CheckBackendUpgrades so both paths share the same core
logic.
Cluster drift policy: when per-node version/digest tuples disagree, the
backend is flagged upgradeable regardless of whether any single node
matches the gallery, and UpgradeInfo.NodeDrift enumerates the outliers so
operators can see *why* it is out of sync. The next upgrade-all realigns
the cluster.
Tests cover: drift detection, unanimous-match (no upgrade), and the
empty-installed-version path that the old distributed code silently
missed.
* feat(ui): surface backend upgrades in the System page
The System page (Manage.jsx) only showed updates as a tiny inline arrow,
so operators routinely missed them. Port the Backend Gallery's upgrade UX
so System speaks the same visual language:
- Yellow banner at the top of the Backends tab when upgrades are pending,
with an "Upgrade all" button (serial fan-out, matches the gallery) and a
"Updates only" filter toggle.
- Warning pill (↑ N) next to the tab label so the count is glanceable even
when the banner is scrolled out of view.
- Per-row labeled "Upgrade to vX.Y" button (replaces the icon-only button
that silently flipped semantics between Reinstall and Upgrade), plus an
"Update available" badge in the new Version column.
- New columns: Version (with upgrade + drift chips), Nodes (per-node
attribution badges for distributed mode, degrading to a compact
"on N nodes · M offline" chip above three nodes), Installed (relative
time).
- System backends render a "Protected" chip instead of a bare "—" so rows
still align and the reason is obvious.
- Delete uses the softer btn-danger-ghost so rows don't scream red; the
ConfirmDialog still owns the "are you sure".
The upgrade checker also needed the same per-worker fix as the previous
commit: NewUpgradeChecker now takes a BackendManager getter so its
periodic runs call the distributed CheckUpgrades (which asks workers)
instead of the empty frontend filesystem. Without this the /api/backends/
upgrades endpoint stayed empty in distributed mode even with the protocol
change in place.
New CSS primitives — .upgrade-banner, .tab-pill, .badge-row, .cell-stack,
.cell-mono, .cell-muted, .row-actions, .btn-danger-ghost — all live in
App.css so other pages can adopt them without duplicating styles.
* feat(ui): polish the Nodes page so it reads like a product
The Nodes page was the biggest visual liability in distributed mode.
Rework the main dashboard surfaces in place without changing behavior:
StatCards: uniform height (96px min), left accent bar colored by the
metric's semantic (success/warning/error/primary), icon lives in a
36x36 soft-tinted chip top-right, value is left-aligned and large.
Grid auto-fills so the row doesn't collapse on narrow viewports. This
replaces the previous thin-bordered boxes with inconsistent heights.
Table rows: expandable rows now show a chevron cue on the left (rotates
on expand) so users know rows open. Status cell became a dedicated chip
with an LED-style halo dot instead of a bare bullet. Action buttons gained
labels — "Approve", "Resume", "Drain" — so the icons aren't doing all
the semantic work; the destructive remove action uses the softer
btn-danger-ghost variant so rows don't scream red, with the ConfirmDialog
still owning the real "are you sure". Applied cell-mono/cell-muted
utility classes so label chips and addresses share one spacing/font
grammar instead of re-declaring inline styles everywhere.
Expanded drawer: empty states for Loaded Models and Installed Backends
now render as a proper drawer-empty card (dashed border, icon, one-line
hint) instead of a plain muted string that read like broken formatting.
Tabs: three inline-styled buttons became the shared .tab class so they
inherit focus ring, hover state, and the rest of the design system —
matches the System page.
"Add more workers" toggle turned into a .nodes-add-worker dashed-border
button labelled "Register a new worker" (action voice) instead of a
chevron + muted link that operators kept mistaking for broken text.
New shared CSS primitives carry over to other pages:
.stat-grid + .stat-card, .row-chevron, .node-status, .drawer-empty,
.nodes-add-worker.
* feat(distributed): durable backend fan-out + state reconciliation
Two connected problems handled together:
1) Backend delete/install/upgrade used to silently skip non-healthy nodes,
so a delete during an outage left a zombie on the offline node once it
returned. The fan-out now records intent in a new pending_backend_ops
table before attempting the NATS round-trip. Currently-healthy nodes
get an immediate attempt; everyone else is queued. Unique index on
(node_id, backend, op) means reissuing the same operation refreshes
next_retry_at instead of stacking duplicates.
2) Loaded-model state could drift from reality: a worker OOM'd, got
killed, or restarted a backend process would leave a node_models row
claiming the model was still loaded, feeding ghost entries into the
/api/nodes/models listing and the router's scheduling decisions.
The existing ReplicaReconciler gains two new passes that run under a
fresh KeyStateReconciler advisory lock (non-blocking, so one wedged
frontend doesn't freeze the cluster):
- drainPendingBackendOps: retries queued ops whose next_retry_at has
passed on currently-healthy nodes. Success deletes the row; failure
bumps attempts and pushes next_retry_at out with exponential backoff
(30s → 15m cap). ErrNoResponders also marks the node unhealthy.
- probeLoadedModels: gRPC-HealthChecks addresses the DB thinks are
loaded but hasn't seen touched in the last probeStaleAfter (2m).
Unreachable addresses are removed from the registry. A pluggable
ModelProber lets tests substitute a fake without standing up gRPC.
DistributedBackendManager exposes DeleteBackendDetailed so the HTTP
handler can surface per-node outcomes ("2 succeeded, 1 queued") to the
UI in a follow-up commit; the existing DeleteBackend still returns
error-only for callers that don't care about node breakdown.
Multi-frontend safety: the state pass uses advisorylock.TryWithLockCtx
on a new key so N frontends coordinate — the same pattern the health
monitor and replica reconciler already rely on. Single-node mode runs
both passes inline (adapter is nil, state drain is a no-op).
Tests cover the upsert semantics, backoff math, the probe removing an
unreachable model but keeping a reachable one, and filtering by
probeStaleAfter.
* feat(ui): show cluster distribution of models in the System page
When a frontend restarted in distributed mode, models that workers had
already loaded weren't visible until the operator clicked into each node
manually — the /api/models/capabilities endpoint only knew about
configs on the frontend's filesystem, not the registry-backed truth.
/api/models/capabilities now joins in ListAllLoadedModels() when the
registry is active, returning loaded_on[] with node id/name/state/status
for each model. Models that live in the registry but lack a local config
(the actual ghosts, not recovered from the frontend's file cache) still
surface with source="registry-only" so operators can see and persist
them; without that emission they'd be invisible to this frontend.
Manage → Models replaces the old Running/Idle pill with a distribution
cell that lists the first three nodes the model is loaded on as chips
colored by state (green loaded, blue loading, amber anything else). On
wider clusters the remaining count collapses into a +N chip with a
title-attribute breakdown. Disabled / single-node behavior unchanged.
Adopted models get an extra "Adopted" ghost-icon chip with hover copy
explaining what it means and how to make it permanent.
Distributed mode also enables a 10s auto-refresh and a "Last synced Xs
ago" indicator next to the Update button so ghost rows drop off within
one reconcile tick after their owning process dies. Non-distributed
mode is untouched — no polling, no cell-stack, same old Running/Idle.
* feat(ui): NodeDistributionChip — shared per-node attribution component
Large clusters were going to break the Manage → Backends Nodes column:
the old inline logic rendered every node as a badge and would shred the
layout at >10 workers, plus the Manage → Models distribution cell had
copy-pasted its own slightly-different version.
NodeDistributionChip handles any cluster size with two render modes:
- small (≤3 nodes): inline chips of node names, colored by health.
- large: a single "on N nodes · M offline · K drift" summary chip;
clicking opens a Popover with a per-node table (name, status,
version, digest for backends; name, status, state for models).
Drift counting mirrors the backend's summarizeNodeDrift so the UI
number matches UpgradeInfo.NodeDrift. Digests are truncated to the
docker-style 12-char form with the full value preserved in the title.
Popover is a new general-purpose primitive: fixed positioning anchored
to the trigger, flips above when there's no room below, closes on
outside-click or Escape, returns focus to the trigger. Uses .card as
its surface so theming is inherited. Also useful for a future
labels-editor popup and the user menu.
Manage.jsx drops its duplicated inline Nodes-column + loaded_on cell
and uses the shared chip with context="backends" / "models"
respectively. Delete code removes ~40 lines of ad-hoc logic.
* feat(ui): shared FilterBar across the System page tabs
The Backends gallery had a nice search + chip + toggle strip; the System
page had nothing, so the two surfaces felt like different apps. Lift the
pattern into a reusable FilterBar and wire both System tabs through it.
New component core/http/react-ui/src/components/FilterBar.jsx renders a
search input, a role="tablist" chip row (aria-selected for a11y), and
optional toggles / right slot. Chips support an optional `count` which
the System page uses to show "User 3", "Updates 1" etc.
System Models tab: search by id or backend; chips for
All/Running/Idle/Disabled/Pinned plus a conditional Distributed chip in
distributed mode. "Last synced" + Update button live in the right slot.
System Backends tab: search by name/alias/meta-backend-for; chips for
All/User/System/Meta plus conditional Updates / Offline-nodes chips
when relevant. The old ad-hoc "Updates only" toggle from the upgrade
banner folded into the Updates chip — one source of truth for that
filter. Offline chip only appears in distributed mode when at least
one backend has an unhealthy node, so the chip row stays quiet on
healthy clusters.
Filter state persists in URL query params (mq/mf/bq/bf) so deep links
and tab switches keep the operator's filter context instead of
resetting every time.
Also adds an "Adopted" distribution path: when a model in
/api/models/capabilities carries source="registry-only" (discovered on
a worker but not configured locally), the Models tab shows a ghost chip
labelled "Adopted" with hover copy explaining how to persist it — this
is what closes the loop on the ghost-model story end-to-end.
The backend.proto was updated to add AudioTranscriptionStream RPC, but
the Rust KokorosService was never updated to match the regenerated
tonic trait, breaking compilation with E0046.
Stubs the new streaming method as unimplemented, matching the pattern
used for the other streaming RPCs Kokoros does not support.
Add gfx1151 (AMD Strix Halo / Ryzen AI MAX) to the default AMDGPU_TARGETS
list in the llama-cpp backend Makefile. ROCm 7.2.1 ships with gfx1151
Tensile libraries, so this architecture should be included in default builds.
Also expose AMDGPU_TARGETS as an ARG/ENV in Dockerfile.llama-cpp so that
users building for non-default GPU architectures can override the target
list via --build-arg AMDGPU_TARGETS=<arch>. Previously, passing
-DAMDGPU_TARGETS=<arch> through CMAKE_ARGS was silently overridden by
the Makefile's own append of the default target list.
Fixes#9374
Signed-off-by: Keith Mattix <keithmattix2@gmail.com>
Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
The shared grpc-server CMakeLists hardcoded `llama-common`, the post-rename
target name in upstream llama.cpp. The turboquant fork branched before that
rename and still exposes the helpers library as `common`, so the name
silently degraded to a plain `-llama-common` link flag, the PUBLIC include
directory was never propagated, and tools/server/server-task.h failed to
find common.h during turboquant-<flavor> builds.
Upstream llama.cpp (PR #21962) switched the server-side mtmd media
marker to a random per-server string and removed the legacy
"<__media__>" backward-compat replacement in mtmd_tokenizer. The
Go layer still emitted the hardcoded "<__media__>", so on the
non-tokenizer-template path the prompt arrived with a marker mtmd
did not recognize and tokenization failed with "number of bitmaps
(1) does not match number of markers (0)".
Report the active media marker via ModelMetadataResponse.media_marker
and substitute the sentinel "<__media__>" with it right before the
gRPC call, after the backend has been loaded and probed. Also skip
the Go-side multimodal templating entirely when UseTokenizerTemplate
is true — llama.cpp's oaicompat_chat_params_parse already injects its
own marker and StringContent is unused in that path. Backends that do
not expose the field keep the legacy "<__media__>" behavior.
Upstream llama.cpp (45cac7ca) renamed the CMake library target
`common` to `llama-common`. Linking the old name caused
`target_include_directories(... PUBLIC .)` from the common/ dir
to not propagate, so `#include "common.h"` failed when building
grpc-server.
The gallery-agent lives under .github/, which Go tooling treats as a
hidden directory and excludes from './...' expansion. That means 'go
mod tidy' (run on every dependabot dependency bump) repeatedly strips
github.com/ghodss/yaml from go.mod/go.sum, breaking 'go run
./.github/gallery-agent' with a missing go.sum entry error.
Switch to sigs.k8s.io/yaml — API-compatible with ghodss/yaml and
already pulled in as a transitive dependency via non-hidden packages,
so tidy can no longer remove it.
Editing a model's YAML and changing the `name:` field previously wrote
the new body to the original `<oldName>.yaml`. On reload the config
loader indexed that file under the new name while the old key
lingered in memory, producing two entries in the system UI that
shared a single underlying file — deleting either removed both.
Detect the rename in EditModelEndpoint and rename the on-disk
`<name>.yaml` and `._gallery_<name>.yaml` to match, drop the stale
in-memory key before the reload, and redirect the editor URL in the
React UI so it tracks the new name. Reject conflicts (409) and names
containing path separators (400).
Fixes#9294
chore: ⬆️ Update TheTom/llama-cpp-turboquant to `45f8a066ed5f5bb38c695cec532f6cef9f4efa9d`
Drop 0002-ggml-rpc-bump-op-count-to-97.patch; the fork now has
GGML_OP_COUNT == 97 and RPC_PROTO_PATCH_VERSION 2 upstream.
Fetch all tags in backend/cpp/llama-cpp/Makefile so tag-only commits
(the new turboquant pin is reachable only through the tag
feature-turboquant-kv-cache-b8821-45f8a06) can be checked out.
Drop the 295-line vendor/llama.py fork in favor of `tinygrad.apps.llm`,
which now provides the Transformer blocks, GGUF loader (incl. Q4/Q6/Q8
quantization), KV-cache and generate loop we were maintaining ourselves.
What changed:
- New vendor/appsllm_adapter.py (~90 LOC) — HF -> GGUF-native state-dict
keymap, Transformer kwargs builder, `_embed_hidden` helper, and a hard
rejection of qkv_bias models (Qwen2 / 2.5 are no longer supported; the
apps.llm Transformer ties `bias=False` on Q/K/V projections).
- backend.py routes both safetensors and GGUF paths through
apps.llm.Transformer. Generation now delegates to its (greedy-only)
`generate()`; Temperature / TopK / TopP / RepetitionPenalty are still
accepted on the wire but ignored — documented in the module docstring.
- Jinja chat render now passes `enable_thinking=False` so Qwen3's
reasoning preamble doesn't eat the tool-call token budget on small
models.
- Embedding path uses `_embed_hidden` (block stack + output_norm) rather
than the custom `embed()` method we were carrying on the vendored
Transformer.
- test.py gains TestAppsLLMAdapter covering the keymap rename, tied
embedding fallback, unknown-key skipping, and qkv_bias rejection.
- Makefile fixtures move from Qwen/Qwen2.5-0.5B-Instruct to Qwen/Qwen3-0.6B
(apps.llm-compatible) and tool_parser from qwen3_xml to hermes (the
HF chat template emits hermes-style JSON tool calls).
Verified with the docker-backed targets:
test-extra-backend-tinygrad 5/5 PASS
test-extra-backend-tinygrad-embeddings 3/3 PASS
test-extra-backend-tinygrad-whisper 4/4 PASS
test-extra-backend-tinygrad-sd 3/3 PASS
* feat(backends): add sglang
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(sglang): force AVX-512 CXXFLAGS and disable CI e2e job
sgl-kernel's shm.cpp uses __m512 AVX-512 intrinsics unconditionally;
-march=native fails on CI runners without AVX-512 in /proc/cpuinfo.
Force -march=sapphirerapids so the build always succeeds, matching
sglang upstream's docker/xeon.Dockerfile recipe.
The resulting binary still requires an AVX-512 capable CPU at runtime,
so disable tests-sglang-grpc in test-extra.yml for the same reason
tests-vllm-grpc is disabled. Local runs with make test-extra-backend-sglang
still work on hosts with the right SIMD baseline.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(sglang): patch CMakeLists.txt instead of CXXFLAGS for AVX-512
CXXFLAGS with -march=sapphirerapids was being overridden by
add_compile_options(-march=native) in sglang's CPU CMakeLists.txt,
since CMake appends those flags after CXXFLAGS. Sed-patch the
CMakeLists.txt directly after cloning to replace -march=native.
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
The gemma-4-26b-a4b-it, gemma-4-e2b-it, and gemma-4-e4b-it gallery
entries pointed at files that do not exist on HuggingFace, so LocalAI
fails with 404 when users try to install them.
Two issues per entry:
- mmproj filename uses the 'f16' quantization suffix, but ggml-org
publishes the mmproj projectors as 'bf16'.
- The e2b and e4b URIs hardcode lowercase 'e2b'/'e4b' in the filename
component. HuggingFace file paths are case-sensitive and the real
files use uppercase 'E2B'/'E4B'.
Updated filename, uri, sha256, and the top-level 'mmproj' and
'parameters.model' references so every entry points at a real file
and the declared hashes match the content.
Verified each URI resolves (HTTP 302) and each sha256 matches the
'x-linked-etag' header returned by HuggingFace.
Signed-off-by: Matt Van Horn <mvanhorn@gmail.com>