LocalAI/.github/workflows
Ettore Di Giacinto 6f0051301b
feat(backend): add tinygrad multimodal backend (experimental) (#9364)
* feat(backend): add tinygrad multimodal backend

Wire tinygrad as a new Python backend covering LLM text generation with
native tool-call extraction, embeddings, Stable Diffusion 1.x image
generation, and Whisper speech-to-text from a single self-contained
container.

Backend (`backend/python/tinygrad/`):
- `backend.py` gRPC servicer with LLM Predict/PredictStream (auto-detects
  Llama / Qwen2 / Mistral architecture from `config.json`, supports
  safetensors and GGUF), Embedding via mean-pooled last hidden state,
  GenerateImage via the vendored SD1.x pipeline, AudioTranscription +
  AudioTranscriptionStream via the vendored Whisper inference loop, plus
  Tokenize / ModelMetadata / Status / Free.
- Vendored upstream model code under `vendor/` (MIT, headers preserved):
  llama.py with an added `qkv_bias` flag for Qwen2-family bias support
  and an `embed()` method that returns the last hidden state, plus
  clip.py, unet.py, stable_diffusion.py (trimmed to drop the MLPerf
  training branch that pulls `mlperf.initializers`), audio_helpers.py
  and whisper.py (trimmed to drop the pyaudio listener).
- Pluggable tool-call parsers under `tool_parsers/`: hermes (Qwen2.5 /
  Hermes), llama3_json (Llama 3.1+), qwen3_xml (Qwen 3), mistral
  (Mistral / Mixtral). Auto-selected from model architecture or `Options`.
- `install.sh` pins Python 3.11.14 (tinygrad >=0.12 needs >=3.11; the
  default portable python is 3.10).
- `package.sh` bundles libLLVM.so.1 + libedit/libtinfo/libgomp/libsndfile
  into the scratch image. `run.sh` sets `CPU_LLVM=1` and `LLVM_PATH` so
  tinygrad's CPU device uses the in-process libLLVM JIT instead of
  shelling out to the missing `clang` binary.
- Local unit tests for Health and the four parsers in `test.py`.
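The hermes-style extraction can be sketched roughly as follows. This is a minimal illustration of the idea, not the vendored parser itself; the `extract_tool_calls` name is hypothetical, and the `<tool_call>`-tag wrapping is an assumption based on the Hermes / Qwen2.5 convention:

```python
import json
import re

# Hermes / Qwen2.5-style models emit tool calls as JSON wrapped in tags, e.g.:
#   <tool_call>{"name": "get_weather", "arguments": {"city": "Rome"}}</tool_call>
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def extract_tool_calls(text: str):
    """Split model output into the plain-text remainder and parsed tool calls."""
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        try:
            calls.append(json.loads(match.group(1)))
        except json.JSONDecodeError:
            continue  # skip malformed JSON payloads
    remainder = TOOL_CALL_RE.sub("", text).strip()
    return remainder, calls
```

The other parsers differ only in the delimiters they match (JSON objects for llama3_json, XML-ish tags for qwen3_xml, `[TOOL_CALLS]` markers for mistral), which is what makes the pluggable-parser layout workable.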

Build wiring:
- Root `Makefile`: `.NOTPARALLEL`, `prepare-test-extra`, `test-extra`,
  `BACKEND_TINYGRAD = tinygrad|python|.|false|true`,
  docker-build-target eval, and `docker-build-backends` aggregator.
- `.github/workflows/backend.yml`: cpu / cuda12 / cuda13 build matrix
  entries (mirrors the transformers backend placement).
- `backend/index.yaml`: `&tinygrad` meta + cpu/cuda12/cuda13 image
  entries (latest + development).

E2E test wiring:
- `tests/e2e-backends/backend_test.go` gains an `image` capability that
  exercises GenerateImage and asserts a non-empty PNG is written to
  `dst`. New `BACKEND_TEST_IMAGE_PROMPT` / `BACKEND_TEST_IMAGE_STEPS`
  knobs.
- Five new make targets next to `test-extra-backend-vllm`:
  - `test-extra-backend-tinygrad` — Qwen2.5-0.5B-Instruct + hermes,
    mirrors the vllm target 1:1 (5/9 specs in ~57s).
  - `test-extra-backend-tinygrad-embeddings` — same model, embeddings
    via LLM hidden state (3/9 in ~10s).
  - `test-extra-backend-tinygrad-sd` — stable-diffusion-v1-5 mirror,
    health/load/image (3/9 in ~10min, 4 diffusion steps on CPU).
  - `test-extra-backend-tinygrad-whisper` — openai/whisper-tiny.en
    against jfk.wav from whisper.cpp samples (4/9 in ~49s).
  - `test-extra-backend-tinygrad-all` aggregate.
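The embeddings target exercises the mean-pooled last-hidden-state path; the pooling itself amounts to the following NumPy sketch (shapes and the attention-mask handling are assumptions for illustration, not the backend's actual code):

```python
import numpy as np

def mean_pool(last_hidden_state: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average the final-layer activations over non-padding tokens.

    last_hidden_state: (seq_len, hidden_dim); attention_mask: (seq_len,)
    with 1 for real tokens and 0 for padding.
    """
    mask = attention_mask[:, None].astype(last_hidden_state.dtype)  # (seq_len, 1)
    summed = (last_hidden_state * mask).sum(axis=0)                 # (hidden_dim,)
    count = mask.sum()                                              # real-token count
    return summed / np.maximum(count, 1e-9)

# Example: 4 tokens (last one padding), hidden size 3.
hidden = np.arange(12, dtype=np.float32).reshape(4, 3)
mask = np.array([1, 1, 1, 0])
embedding = mean_pool(hidden, mask)  # averages rows 0..2 only
```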

All four functional targets land green on the first MVP pass: 15 specs
total, 0 failures across LLM+tools, embeddings, image generation, and
speech transcription.


* refactor(tinygrad): collapse to a single backend image

tinygrad generates its own GPU kernels (PTX renderer for CUDA, the
autogen ctypes wrappers for HIP / Metal / WebGPU) and never links
against cuDNN, cuBLAS, or any toolkit-version-tied library. The only
runtime dependency that varies across hosts is the driver's libcuda.so.1
/ libamdhip64.so, which are injected into the container at run time by
the nvidia-container / rocm runtimes. So unlike torch- or vLLM-based
backends, there is no reason to ship per-CUDA-version images.

- Drop the cuda12-tinygrad and cuda13-tinygrad build-matrix entries
  from .github/workflows/backend.yml. The sole remaining entry is
  renamed to -tinygrad (from -cpu-tinygrad) since it is no longer
  CPU-only.
- Collapse backend/index.yaml to a single meta + development pair.
  The meta anchor carries the latest uri directly; the development
  entry points at the master tag.
- run.sh picks the tinygrad device at launch time by probing
  /usr/lib/... for libcuda.so.1 / libamdhip64.so. When libcuda is
  visible we set CUDA=1 + CUDA_PTX=1 so tinygrad uses its own PTX
  renderer (avoids any nvrtc/toolkit dependency); otherwise we fall
  back to HIP or CLANG. CPU_LLVM=1 + LLVM_PATH keep the in-process
  libLLVM JIT for the CLANG path.
- backend.py's _select_tinygrad_device() is trimmed to a CLANG-only
  fallback since production device selection happens in run.sh.
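The launch-time probing described above can be sketched as follows, rendered here as a Python illustration of the run.sh logic; the recursive library search and the LLVM_PATH value are assumptions beyond the two library names named above:

```python
import glob
import os

def select_device(root: str = "/") -> dict:
    """Pick tinygrad env vars by probing for vendor driver libraries:
    CUDA with the PTX renderer when libcuda.so.1 is visible, HIP when
    libamdhip64.so is, else the CLANG/libLLVM CPU fallback."""
    def found(name: str) -> bool:
        pattern = os.path.join(root, "usr/lib/**", name)
        return bool(glob.glob(pattern, recursive=True))

    if found("libcuda.so.1"):
        # tinygrad's own PTX renderer avoids any nvrtc/toolkit dependency
        return {"CUDA": "1", "CUDA_PTX": "1"}
    if found("libamdhip64.so"):
        return {"HIP": "1"}
    # CPU fallback: in-process libLLVM JIT instead of shelling out to clang
    # (the exact LLVM_PATH value is an assumed example location)
    return {"CPU_LLVM": "1", "LLVM_PATH": "/usr/lib/libLLVM.so.1"}
```

Because the probe only looks for driver libraries injected by the container runtime, the same image works unchanged on CPU, NVIDIA, and AMD hosts.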

Re-ran test-extra-backend-tinygrad after the change:
  Ran 5 of 9 Specs in 56.541 seconds — 5 Passed, 0 Failed
2026-04-15 19:48:23 +02:00