LocalAI

mirror of https://github.com/mudler/LocalAI synced 2026-05-24 09:28:23 +00:00

LocalAI [bot] d77a9137d8 feat(llama-cpp): bump to MTP-merge SHA and automatically set MTP defaults (#9852)

* feat(llama-cpp): bump to MTP-merge SHA and document draft-mtp spec type

Update LLAMA_VERSION to 0253fb21 (post ggml-org/llama.cpp#22673 merge, 2026-05-16) to pick up Multi-Token Prediction support. No grpc-server.cpp changes are required: the existing `spec_type` option delegates to upstream's `common_speculative_types_from_names()`, which already accepts the new `draft-mtp` name. The `n_rs_seq` cparam needed by MTP is auto-derived inside `common_context_params_to_llama` from `params.speculative.need_n_rs_seq()`, and when no `draft_model` is set the upstream server builds the MTP context off the target model itself.

Docs: extend the speculative-decoding section of the model-configuration guide with the new type, both load paths (MTP head embedded in the main GGUF vs. separate `mtp-.gguf` sibling), the PR's recommended `spec_n_max:2-3`, and the chained `draft-mtp,ngram-mod` recipe. Also notes that the upstream `-hf` auto-discovery of `mtp-.gguf` siblings is not wired through LocalAI's gRPC layer.

Agent guide: short note explaining that new upstream spec types are picked up automatically and that MTP needs no gRPC plumbing.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(llama-cpp): auto-detect MTP heads and enable draft-mtp on import + load

Detect upstream's `<arch>.nextn_predict_layers` GGUF metadata key (set by `convert_hf_to_gguf.py` for Qwen3.5/3.6 family models and similar) and, when present and the user has not configured a `spec_type` explicitly, auto-append the upstream-recommended speculative-decoding tuple:

- spec_type:draft-mtp
- spec_n_max:6
- spec_p_min:0.75

The 0.75 p_min is pinned defensively because upstream marks the current default with a "change to 0.0f" TODO; locking it here keeps acceptance thresholds stable across future llama.cpp bumps.

Detection runs in two places:

- The model importer (`POST /models/import-uri`, the `/import-model` UI) range-fetches the GGUF header for HuggingFace / direct-URL imports via `gguf.ParseGGUFFileRemote`, with a 30s timeout and non-fatal error handling. OCI/Ollama URIs are skipped because the artifact is not directly streamable; the load-time hook covers them once the file is on disk.
- The llama-cpp load-time hook (`guessGGUFFromFile`) reads the local header on every model start and appends the same options if `spec_type` is not already set.

Both paths share `ApplyMTPDefaults` and respect an explicit user-set `spec_type:` / `speculative_type:` so YAML overrides win. Ginkgo specs cover the append, preserve-user-choice, legacy alias, and nil safety paths.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(importer): resolve huggingface:// URIs before MTP header probe

`gguf.ParseGGUFFileRemote` only speaks HTTP(S), but the importer was handing it the raw `huggingface://...` URI directly (and similarly for any other custom downloader scheme). Live-test against `huggingface://ggml-org/Qwen3.6-27B-MTP-GGUF/Qwen3.6-27B-MTP-Q8_0.gguf` exposed this: the probe failed with `unsupported protocol scheme "huggingface"`, was caught by the non-fatal error path, and the MTP options were silently never applied to the generated YAML.

Route every candidate URI through `downloader.URI.ResolveURL()` and require the resolved form to be HTTP(S). After the fix the probe successfully reads `<arch>.nextn_predict_layers=1` from the real HF GGUF and the emitted ConfigFile carries spec_type:draft-mtp, spec_n_max:6, spec_p_min:0.75 as intended.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>

2026-05-16 22:42:48 +02:00
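
The append-unless-overridden rule described in the commit message can be illustrated with a minimal Go sketch. This is not the code in `mtp.go`: the function name `applyMTPDefaultsSketch`, its signature, the helper `hasExplicitSpecType`, and the representation of options as `key:value` strings are all assumptions made for illustration. Only the three option values and the two key names (`spec_type`, `speculative_type`) come from the commit message.

```go
package main

import (
	"fmt"
	"strings"
)

// mtpDefaults is the tuple quoted in the commit message.
var mtpDefaults = []string{
	"spec_type:draft-mtp",
	"spec_n_max:6",
	"spec_p_min:0.75",
}

// hasExplicitSpecType (hypothetical helper) reports whether the options
// already contain a spec_type or the legacy speculative_type key.
func hasExplicitSpecType(options []string) bool {
	for _, opt := range options {
		key, _, found := strings.Cut(opt, ":")
		if !found {
			continue
		}
		switch strings.TrimSpace(key) {
		case "spec_type", "speculative_type":
			return true
		}
	}
	return false
}

// applyMTPDefaultsSketch (hypothetical name and signature) appends the MTP
// defaults when the GGUF header reported at least one nextn_predict_layers
// entry and the user has not chosen a speculative type. Otherwise it returns
// the options unchanged. A nil slice is handled like an empty one.
func applyMTPDefaultsSketch(options []string, nextnPredictLayers uint32) []string {
	if nextnPredictLayers == 0 || hasExplicitSpecType(options) {
		return options
	}
	out := make([]string, 0, len(options)+len(mtpDefaults))
	out = append(out, options...)
	return append(out, mtpDefaults...)
}

func main() {
	// Model with an MTP head and no user preference: defaults are appended.
	fmt.Println(applyMTPDefaultsSketch([]string{"gpu_layers:99"}, 1))
	// User override through the legacy alias: options are left untouched.
	fmt.Println(applyMTPDefaultsSketch([]string{"speculative_type:ngram-mod"}, 1))
}
```

Copying into a fresh slice rather than appending in place keeps the caller's slice from being modified through shared backing storage, which matters if the same options are reused by both the import path and the load-time hook.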
..
gen_inference_defaults	feat: inferencing default, automatic tool parsing fallback and wire min_p (#9092)	2026-03-22 00:57:15 +01:00
meta	feat(ui): Interactive model config editor with autocomplete (#9149)	2026-04-07 14:42:23 +02:00
application_config.go	feat(branding): admin-configurable instance name, tagline, and assets (#9635)	2026-05-02 15:51:36 +02:00
application_config_test.go	feat: backend versioning, upgrade detection and auto-upgrade (#9315)	2026-04-11 22:31:15 +02:00
backend_capabilities.go	feat(realtime): Add Liquid Audio s2s model and assistant mode on talk page (#9801)	2026-05-13 21:57:27 +02:00
backend_capabilities_test.go	feat(gallery): Speed up load times and clean gallery entries (#9211)	2026-05-06 14:51:38 +02:00
backend_hooks.go	feat(vllm): parity with llama.cpp backend (#9328)	2026-04-13 11:00:29 +02:00
config_suite_test.go	dependencies(grpcio): bump to fix CI issues (#2362)	2024-05-21 14:33:47 +02:00
distributed_config.go	fix(distributed): cascade-clean stale node_models rows + filter routing by healthy status (#9754)	2026-05-13 21:57:50 +02:00
gallery.go	refactor: gallery inconsistencies (#2647)	2024-06-24 17:32:12 +02:00
gguf.go	feat(llama-cpp): bump to MTP-merge SHA and automatically set MTP defaults (#9852)	2026-05-16 22:42:48 +02:00
gguf_reasoning_test.go	Respect explicit reasoning config during GGUF thinking probe (#9463)	2026-04-21 21:53:10 +02:00
hooks_llamacpp.go	feat(vllm): parity with llama.cpp backend (#9328)	2026-04-13 11:00:29 +02:00
hooks_test.go	feat(vllm): expose AsyncEngineArgs via generic engine_args YAML map (#9563)	2026-04-29 00:49:28 +02:00
hooks_vllm.go	feat(vllm): expose AsyncEngineArgs via generic engine_args YAML map (#9563)	2026-04-29 00:49:28 +02:00
inference_defaults.go	feat: inferencing default, automatic tool parsing fallback and wire min_p (#9092)	2026-03-22 00:57:15 +01:00
inference_defaults.json	chore: bump inference defaults from unsloth (#9396)	2026-04-17 09:05:55 +02:00
inference_defaults_test.go	feat: inferencing default, automatic tool parsing fallback and wire min_p (#9092)	2026-03-22 00:57:15 +01:00
model_config.go	feat(realtime): Add Liquid Audio s2s model and assistant mode on talk page (#9801)	2026-05-13 21:57:27 +02:00
model_config_filter.go	feat: add distributed mode (#9124)	2026-03-30 00:47:27 +02:00
model_config_loader.go	feat(concurrency-groups): per-model exclusive groups for backend loading (#9662)	2026-05-05 08:42:50 +02:00
model_config_loader_test.go	feat(concurrency-groups): per-model exclusive groups for backend loading (#9662)	2026-05-05 08:42:50 +02:00
model_config_test.go	feat(concurrency-groups): per-model exclusive groups for backend loading (#9662)	2026-05-05 08:42:50 +02:00
model_test.go	fix(tests): inline model_test fixtures after tests/models_fixtures removal	2026-04-28 12:58:49 +00:00
mtp.go	feat(llama-cpp): bump to MTP-merge SHA and automatically set MTP defaults (#9852)	2026-05-16 22:42:48 +02:00
mtp_test.go	feat(llama-cpp): bump to MTP-merge SHA and automatically set MTP defaults (#9852)	2026-05-16 22:42:48 +02:00
parser_defaults.json	feat(vllm): parity with llama.cpp backend (#9328)	2026-04-13 11:00:29 +02:00
runtime_settings.go	feat(branding): admin-configurable instance name, tagline, and assets (#9635)	2026-05-02 15:51:36 +02:00
runtime_settings_persist.go	feat(branding): admin-configurable instance name, tagline, and assets (#9635)	2026-05-02 15:51:36 +02:00
runtime_settings_persist_test.go	feat(branding): admin-configurable instance name, tagline, and assets (#9635)	2026-05-02 15:51:36 +02:00