+++
disableToc = false
title = "Text Generation (GPT)"
weight = 10
url = "/features/text-generation/"
+++

LocalAI supports text generation with GPT-style models through `llama.cpp` and other backends (such as `rwkv.cpp`). See also the [Model compatibility]({{%relref "reference/compatibility-table" %}}) table for an up-to-date list of the supported model families.

Note:

- You can also specify the model name as part of the OpenAI token.
- If only one model is available, the API will use it for all the requests.

## API Reference

### Chat completions

https://platform.openai.com/docs/api-reference/chat

For example, to generate a chat completion, you can send a POST request to the `/v1/chat/completions` endpoint with the conversation messages in the request body:

```bash
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "ggml-koala-7b-model-q4_0-r2.bin",
  "messages": [{"role": "user", "content": "Say this is a test!"}],
  "temperature": 0.7
}'
```

Available additional parameters: `top_p`, `top_k`, `max_tokens`
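
The same request can be issued from a script. The following is a minimal sketch using only the Python standard library; it assumes LocalAI is reachable at `http://localhost:8080` as in the curl example, and the helper names `build_chat_request` and `send` are illustrative, not part of LocalAI.

```python
import json
import urllib.request

# Assumption: LocalAI is listening on this address, as in the curl example above.
BASE_URL = "http://localhost:8080"


def build_chat_request(model, user_message, temperature=0.7, **extra):
    """Build (but do not send) a POST request for /v1/chat/completions.

    `extra` carries optional sampling parameters such as top_p, top_k or max_tokens.
    """
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "temperature": temperature,
    }
    payload.update(extra)
    return urllib.request.Request(
        BASE_URL + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request and return the decoded JSON body (needs a running server)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_chat_request(
    "ggml-koala-7b-model-q4_0-r2.bin", "Say this is a test!", top_p=0.9, max_tokens=64
)
# With a server running, the reply text is expected under the OpenAI-style path:
# reply = send(request)
# print(reply["choices"][0]["message"]["content"])
```

Building the request separately from sending it keeps the payload easy to inspect or log before any network traffic happens.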

### Edit completions

https://platform.openai.com/docs/api-reference/edits

To generate an edit completion, you can send a POST request to the `/v1/edits` endpoint with the instruction and the input text in the request body:

```bash
curl http://localhost:8080/v1/edits -H "Content-Type: application/json" -d '{
  "model": "ggml-koala-7b-model-q4_0-r2.bin",
  "instruction": "rephrase",
  "input": "Black cat jumped out of the window",
  "temperature": 0.7
}'
```

Available additional parameters: `top_p`, `top_k`, `max_tokens`.

### Completions

https://platform.openai.com/docs/api-reference/completions

To generate a completion, you can send a POST request to the `/v1/completions` endpoint with the prompt in the request body:

```bash
curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
  "model": "ggml-koala-7b-model-q4_0-r2.bin",
  "prompt": "A long time ago in a galaxy far, far away",
  "temperature": 0.7
}'
```

Available additional parameters: `top_p`, `top_k`, `max_tokens`

### List models

You can list all the models available with:

```bash
curl http://localhost:8080/v1/models
```
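
As a rough illustration of consuming that endpoint, the sketch below extracts the model names from a response body. It assumes the response follows the OpenAI-style list shape (`{"object": "list", "data": [...]}`); the sample body is hypothetical, not captured from a real server.

```python
import json

# Hypothetical response body, assuming the OpenAI-style list shape.
SAMPLE_BODY = """
{
  "object": "list",
  "data": [
    {"id": "ggml-koala-7b-model-q4_0-r2.bin", "object": "model"},
    {"id": "vllm", "object": "model"}
  ]
}
"""


def model_names(body):
    """Return the model identifiers from a /v1/models response body."""
    parsed = json.loads(body)
    return [entry["id"] for entry in parsed.get("data", [])]


names = model_names(SAMPLE_BODY)
print(names)
```

Any of the returned identifiers can then be used as the `model` value in the requests shown above.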

### Anthropic Messages API

LocalAI supports the Anthropic Messages API, which is compatible with Claude clients. This endpoint provides a structured way to send messages and receive responses, with support for tools, streaming, and multimodal content.

**Endpoint:** `POST /v1/messages` or `POST /messages`

**Reference:** https://docs.anthropic.com/claude/reference/messages_post

#### Basic Usage

```bash
curl http://localhost:8080/v1/messages \
  -H "Content-Type: application/json" \
  -H "anthropic-version: 2023-06-01" \
  -d '{
    "model": "ggml-koala-7b-model-q4_0-r2.bin",
    "max_tokens": 1024,
    "messages": [
      {"role": "user", "content": "Say this is a test!"}
    ]
  }'
```

#### Request Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `model` | string | Yes | The model identifier |
| `messages` | array | Yes | Array of message objects with `role` and `content` |
| `max_tokens` | integer | Yes | Maximum number of tokens to generate (must be > 0) |
| `system` | string | No | System message to set the assistant's behavior |
| `temperature` | float | No | Sampling temperature (0.0 to 1.0) |
| `top_p` | float | No | Nucleus sampling parameter |
| `top_k` | integer | No | Top-k sampling parameter |
| `stop_sequences` | array | No | Array of strings that will stop generation |
| `stream` | boolean | No | Enable streaming responses |
| `tools` | array | No | Array of tool definitions for function calling |
| `tool_choice` | string/object | No | Tool choice strategy: "auto", "any", "none", or specific tool |
| `metadata` | object | No | Per-request metadata passed to the backend (e.g., `{"enable_thinking": "true"}`) |

#### Message Format

Messages can contain text or structured content blocks:

```bash
curl http://localhost:8080/v1/messages \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ggml-koala-7b-model-q4_0-r2.bin",
    "max_tokens": 1024,
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "What is in this image?"
          },
          {
            "type": "image",
            "source": {
              "type": "base64",
              "media_type": "image/jpeg",
              "data": "base64_encoded_image_data"
            }
          }
        ]
      }
    ]
  }'
```

#### Tool Calling

The Anthropic API supports function calling through tools:

```bash
curl http://localhost:8080/v1/messages \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ggml-koala-7b-model-q4_0-r2.bin",
    "max_tokens": 1024,
    "tools": [
      {
        "name": "get_weather",
        "description": "Get the current weather",
        "input_schema": {
          "type": "object",
          "properties": {
            "location": {
              "type": "string",
              "description": "The city and state"
            }
          },
          "required": ["location"]
        }
      }
    ],
    "tool_choice": "auto",
    "messages": [
      {"role": "user", "content": "What is the weather in San Francisco?"}
    ]
  }'
```
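
When the model decides to call a tool, the client has to run it and send the result back. The sketch below shows one way to do that, assuming the response uses the Anthropic Messages shape for tool calls (`tool_use` content blocks answered by `tool_result` blocks). The sample response and the local `get_weather` function are hypothetical.

```python
import json

# Hypothetical response, assuming the Anthropic Messages shape for tool calls.
SAMPLE_RESPONSE = {
    "role": "assistant",
    "stop_reason": "tool_use",
    "content": [
        {"type": "text", "text": "Let me check the weather."},
        {
            "type": "tool_use",
            "id": "toolu_01",
            "name": "get_weather",
            "input": {"location": "San Francisco, CA"},
        },
    ],
}


def get_weather(location):
    """Stand-in implementation; a real tool would query a weather service."""
    return f"Sunny in {location}"


# Registry of locally implemented tools, keyed by the name sent in `tools`.
TOOLS = {"get_weather": get_weather}


def run_tool_calls(response):
    """Execute each tool_use block and build the follow-up user message."""
    results = []
    for block in response["content"]:
        if block["type"] != "tool_use":
            continue
        output = TOOLS[block["name"]](**block["input"])
        results.append(
            {"type": "tool_result", "tool_use_id": block["id"], "content": output}
        )
    return {"role": "user", "content": results}


follow_up = run_tool_calls(SAMPLE_RESPONSE)
print(json.dumps(follow_up, indent=2))
```

The follow-up message would be appended to `messages`, after the assistant turn, in the next `/v1/messages` request.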

#### Streaming

Enable streaming responses by setting `stream: true`:

```bash
curl http://localhost:8080/v1/messages \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ggml-koala-7b-model-q4_0-r2.bin",
    "max_tokens": 1024,
    "stream": true,
    "messages": [
      {"role": "user", "content": "Tell me a story"}
    ]
  }'
```

Streaming responses use Server-Sent Events (SSE) format with event types: `message_start`, `content_block_start`, `content_block_delta`, `content_block_stop`, `message_delta`, and `message_stop`.
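
A client can rebuild the generated text by reading these events in order. The following is a small sketch of an SSE reader; it assumes each `content_block_delta` event carries an Anthropic-style `delta` object of type `text_delta`, and the sample stream is hypothetical.

```python
import json


def iter_sse_events(lines):
    """Yield (event_name, data_dict) pairs from an iterable of SSE text lines."""
    event_name, data_lines = None, []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("event:"):
            event_name = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data_lines.append(line[len("data:"):].strip())
        elif line == "" and data_lines:
            # A blank line terminates one event.
            yield event_name, json.loads("\n".join(data_lines))
            event_name, data_lines = None, []
    if data_lines:  # stream ended without a trailing blank line
        yield event_name, json.loads("\n".join(data_lines))


def collect_text(lines):
    """Concatenate the text deltas of a streamed message."""
    parts = []
    for name, data in iter_sse_events(lines):
        delta = data.get("delta", {})
        if name == "content_block_delta" and delta.get("type") == "text_delta":
            parts.append(delta["text"])
        elif name == "message_stop":
            break
    return "".join(parts)


# Hypothetical stream fragment, assuming Anthropic-style event payloads.
SAMPLE_STREAM = [
    "event: message_start",
    'data: {"type": "message_start"}',
    "",
    "event: content_block_delta",
    'data: {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": "Once upon"}}',
    "",
    "event: content_block_delta",
    'data: {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": " a time"}}',
    "",
    "event: message_stop",
    'data: {"type": "message_stop"}',
    "",
]

story = collect_text(SAMPLE_STREAM)
print(story)
```

In practice `lines` would be the decoded lines of the HTTP response body, read incrementally as they arrive.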

#### Response Format

```json
{
  "id": "msg_abc123",
  "type": "message",
  "role": "assistant",
  "content": [
    {
      "type": "text",
      "text": "This is a test!"
    }
  ],
  "model": "ggml-koala-7b-model-q4_0-r2.bin",
  "stop_reason": "end_turn",
  "usage": {
    "input_tokens": 10,
    "output_tokens": 5
  }
}
```

### Open Responses API

LocalAI supports the Open Responses API specification, which provides a standardized interface for AI model interactions with support for background processing, streaming, tool calling, and advanced features like reasoning.

**Endpoint:** `POST /v1/responses` or `POST /responses`

**Reference:** https://www.openresponses.org/specification

#### Basic Usage

```bash
curl http://localhost:8080/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ggml-koala-7b-model-q4_0-r2.bin",
    "input": "Say this is a test!",
    "max_output_tokens": 1024
  }'
```

#### Request Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `model` | string | Yes | The model identifier |
| `input` | string/array | Yes | Input text or array of input items |
| `max_output_tokens` | integer | No | Maximum number of tokens to generate |
| `temperature` | float | No | Sampling temperature |
| `top_p` | float | No | Nucleus sampling parameter |
| `instructions` | string | No | System instructions |
| `tools` | array | No | Array of tool definitions |
| `tool_choice` | string/object | No | Tool choice: "auto", "required", "none", or specific tool |
| `stream` | boolean | No | Enable streaming responses |
| `background` | boolean | No | Run request in background (returns immediately) |
| `store` | boolean | No | Whether to store the response |
| `reasoning` | object | No | Reasoning configuration with `effort` and `summary` |
| `parallel_tool_calls` | boolean | No | Allow parallel tool calls |
| `max_tool_calls` | integer | No | Maximum number of tool calls |
| `presence_penalty` | float | No | Presence penalty (-2.0 to 2.0) |
| `frequency_penalty` | float | No | Frequency penalty (-2.0 to 2.0) |
| `top_logprobs` | integer | No | Number of top logprobs to return |
| `truncation` | string | No | Truncation mode: "auto" or "disabled" |
| `text_format` | object | No | Text format configuration |
| `metadata` | object | No | Custom metadata |

#### Input Format

Input can be a simple string or an array of structured items:

```bash
curl http://localhost:8080/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ggml-koala-7b-model-q4_0-r2.bin",
    "input": [
      {
        "type": "message",
        "role": "user",
        "content": "What is the weather?"
      }
    ],
    "max_output_tokens": 1024
  }'
```

#### Background Processing

Run requests in the background for long-running tasks:

```bash
curl http://localhost:8080/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ggml-koala-7b-model-q4_0-r2.bin",
    "input": "Generate a long story",
    "max_output_tokens": 4096,
    "background": true
  }'
```

The response will include a response ID that can be used to poll for completion:

```json
{
  "id": "resp_abc123",
  "object": "response",
  "status": "in_progress",
  "created_at": 1234567890
}
```

#### Retrieving Background Responses

Use the GET endpoint to retrieve background responses:

```bash
# Get response by ID
curl http://localhost:8080/v1/responses/resp_abc123

# Resume streaming with query parameters
curl "http://localhost:8080/v1/responses/resp_abc123?stream=true&starting_after=10"
```
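
Polling is usually wrapped in a loop with a delay and an upper bound. The sketch below illustrates one such loop. The `fetch` callable stands in for a `GET /v1/responses/{id}` call, and treating `queued` and `in_progress` as the only non-terminal statuses is an assumption, since only `in_progress` and `completed` appear in this document.

```python
import time

# Assumption: statuses other than "queued" and "in_progress" are terminal.
PENDING_STATUSES = {"queued", "in_progress"}


def wait_for_response(fetch, response_id, interval=1.0, max_attempts=60, sleep=time.sleep):
    """Poll `fetch(response_id)` until the response leaves a pending status.

    `fetch` is any callable returning the decoded JSON of
    GET /v1/responses/{id}; it is injected so the loop can be tested offline.
    """
    for _ in range(max_attempts):
        response = fetch(response_id)
        if response.get("status") not in PENDING_STATUSES:
            return response
        sleep(interval)
    raise TimeoutError(f"{response_id} still pending after {max_attempts} attempts")


# Offline demonstration with a fake fetcher that completes on the third poll.
_replies = iter(
    [
        {"id": "resp_abc123", "status": "in_progress"},
        {"id": "resp_abc123", "status": "in_progress"},
        {"id": "resp_abc123", "status": "completed"},
    ]
)
final = wait_for_response(lambda _id: next(_replies), "resp_abc123", sleep=lambda _s: None)
print(final["status"])
```

Bounding the number of attempts avoids waiting forever on a response that never reaches a terminal status.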

#### Canceling Background Responses

Cancel a background response that's still in progress:

```bash
curl -X POST http://localhost:8080/v1/responses/resp_abc123/cancel
```

#### Tool Calling

Open Responses API supports function calling with tools:

```bash
curl http://localhost:8080/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ggml-koala-7b-model-q4_0-r2.bin",
    "input": "What is the weather in San Francisco?",
    "tools": [
      {
        "type": "function",
        "name": "get_weather",
        "description": "Get the current weather",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {
              "type": "string",
              "description": "The city and state"
            }
          },
          "required": ["location"]
        }
      }
    ],
    "tool_choice": "auto",
    "max_output_tokens": 1024
  }'
```

#### Reasoning Configuration

Configure reasoning effort and summary style:

```bash
curl http://localhost:8080/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ggml-koala-7b-model-q4_0-r2.bin",
    "input": "Solve this complex problem step by step",
    "reasoning": {
      "effort": "high",
      "summary": "detailed"
    },
    "max_output_tokens": 2048
  }'
```

#### Response Format

```json
{
  "id": "resp_abc123",
  "object": "response",
  "created_at": 1234567890,
  "completed_at": 1234567895,
  "status": "completed",
  "model": "ggml-koala-7b-model-q4_0-r2.bin",
  "output": [
    {
      "type": "message",
      "id": "msg_001",
      "role": "assistant",
      "content": [
        {
          "type": "output_text",
          "text": "This is a test!",
          "annotations": [],
          "logprobs": []
        }
      ],
      "status": "completed"
    }
  ],
  "error": null,
  "incomplete_details": null,
  "temperature": 0.7,
  "top_p": 1.0,
  "presence_penalty": 0.0,
  "frequency_penalty": 0.0,
  "usage": {
    "input_tokens": 10,
    "output_tokens": 5,
    "total_tokens": 15,
    "input_tokens_details": {
      "cached_tokens": 0
    },
    "output_tokens_details": {
      "reasoning_tokens": 0
    }
  }
}
```
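
Because the generated text is nested inside `output[].content[]`, clients typically flatten it. The helper below is a minimal sketch based only on the response shape shown above; `output_text` is an illustrative name, and the sample is a trimmed copy of that response.

```python
def output_text(response):
    """Join every output_text part found in the response's message items."""
    parts = []
    for item in response.get("output", []):
        if item.get("type") != "message":
            continue
        for part in item.get("content", []):
            if part.get("type") == "output_text":
                parts.append(part["text"])
    return "".join(parts)


# Trimmed copy of the response shown above.
sample = {
    "id": "resp_abc123",
    "status": "completed",
    "output": [
        {
            "type": "message",
            "role": "assistant",
            "content": [{"type": "output_text", "text": "This is a test!"}],
        }
    ],
    "usage": {"input_tokens": 10, "output_tokens": 5, "total_tokens": 15},
}

text = output_text(sample)
total = sample["usage"]["total_tokens"]
print(text, total)
```

Skipping items whose `type` is not `message` keeps the helper tolerant of other output item kinds that a response may contain.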

## Backends

### RWKV

RWKV support is available through llama.cpp (see below).

### llama.cpp

[llama.cpp](https://github.com/ggerganov/llama.cpp) is a popular port of Facebook's LLaMA model in C/C++.

{{% notice note %}}

The `ggml` file format has been deprecated. If you are using `ggml` models and configuring your model with a YAML file, use a LocalAI version older than v2.25.0. For `gguf` models, use the `llama` backend. The go backend is deprecated as well but still available as `go-llama`.

{{% /notice %}}

#### Features

The `llama.cpp` backend supports the following features:
- [📖 Text generation (GPT)]({{%relref "features/text-generation" %}})
- [🧠 Embeddings]({{%relref "features/embeddings" %}})
- [🔥 OpenAI functions]({{%relref "features/openai-functions" %}})
- [✍️ Constrained grammars]({{%relref "features/constrained_grammars" %}})

#### Setup

LocalAI supports `llama.cpp` models out of the box. You can use the `llama.cpp` model in the same way as any other model. 

##### Manual setup

It is sufficient to copy the `ggml` or `gguf` model files into the `models` folder. You can then refer to the model by its file name in the `model` parameter of the API calls.

[You can optionally create an associated YAML]({{%relref "advanced" %}}) model config file to tune the model's parameters or apply a template to the prompt.

Prompt templates are useful for models that are fine-tuned towards a specific prompt. 

##### Automatic setup

LocalAI supports model galleries, which are indexes of models. For instance, the Hugging Face gallery contains a large curated index of `ggml` and `gguf` models from the Hugging Face model hub.

For example, if you have the galleries enabled and LocalAI is already running, you can start chatting with models from Hugging Face by running:

```bash
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "TheBloke/WizardLM-13B-V1.2-GGML/wizardlm-13b-v1.2.ggmlv3.q2_K.bin",
     "messages": [{"role": "user", "content": "Say this is a test!"}],
     "temperature": 0.1
   }'
```

LocalAI will automatically download and configure the model in the `models` directory.

Models can also be preloaded or downloaded on demand. To learn about model galleries, check out the [model gallery documentation]({{%relref "features/model-gallery" %}}).

#### YAML configuration

To use the `llama.cpp` backend, specify `llama-cpp` as the backend in the YAML file:

```yaml
name: llama
backend: llama-cpp
parameters:
  # Relative to the models path
  model: file.gguf
```

#### Backend Options

The `llama.cpp` backend supports additional configuration options that can be specified in the `options` field of your model YAML configuration. These options allow fine-tuning of the backend behavior:

| Option | Type | Description | Example |
|--------|------|-------------|---------|
| `use_jinja` or `jinja` | boolean | Enable Jinja2 template processing for chat templates. When enabled, the backend uses Jinja2-based chat templates from the model for formatting messages. | `use_jinja:true` |
| `context_shift` | boolean | Enable context shifting, which allows the model to dynamically adjust context window usage. | `context_shift:true` |
| `cache_ram` | integer | Set the maximum RAM cache size in MiB for KV cache. Use `-1` for unlimited (default). | `cache_ram:2048` |
| `parallel` or `n_parallel` | integer | Enable parallel request processing. When set to a value greater than 1, enables continuous batching for handling multiple requests concurrently. | `parallel:4` |
| `grpc_servers` or `rpc_servers` | string | Comma-separated list of gRPC server addresses for distributed inference. Allows distributing workload across multiple llama.cpp workers. | `grpc_servers:localhost:50051,localhost:50052` |
| `fit_params` or `fit` | boolean | Enable auto-adjustment of model/context parameters to fit available device memory. Default: `true`. | `fit_params:true` |
| `fit_params_target` or `fit_target` | integer | Target margin per device in MiB when using fit_params. Default: `1024` (1GB). | `fit_target:2048` |
| `fit_params_min_ctx` or `fit_ctx` | integer | Minimum context size that can be set by fit_params. Default: `4096`. | `fit_ctx:2048` |
| `n_cache_reuse` or `cache_reuse` | integer | Minimum chunk size to attempt reusing from the cache via KV shifting. Default: `0` (disabled). | `cache_reuse:256` |
| `slot_prompt_similarity` or `sps` | float | How much the prompt of a request must match the prompt of a slot to use that slot. Default: `0.1`. Set to `0` to disable. | `sps:0.5` |
| `swa_full` | boolean | Use full-size SWA (Sliding Window Attention) cache. Default: `false`. | `swa_full:true` |
| `cont_batching` or `continuous_batching` | boolean | Enable continuous batching for handling multiple sequences. Default: `true`. | `cont_batching:true` |
| `check_tensors` | boolean | Validate tensor data for invalid values during model loading. Default: `false`. | `check_tensors:true` |
| `warmup` | boolean | Enable warmup run after model loading. Default: `true`. | `warmup:false` |
| `no_op_offload` | boolean | Disable offloading host tensor operations to device. Default: `false`. | `no_op_offload:true` |
| `kv_unified` or `unified_kv` | boolean | Enable unified KV cache. Default: `false`. | `kv_unified:true` |
| `n_ctx_checkpoints` or `ctx_checkpoints` | integer | Maximum number of context checkpoints per slot. Default: `8`. | `ctx_checkpoints:4` |
| `split_mode` or `sm` | string | How to split the model across multiple GPUs: `none` (single GPU only), `layer` (default — split layers and KV across GPUs), `row` (split rows across GPUs), `tensor` (experimental tensor parallelism — requires `flash_attention: true`, no KV-cache quantization, manually set `context_size`, and a llama.cpp build that includes [#19378](https://github.com/ggml-org/llama.cpp/pull/19378)). | `split_mode:tensor` |

**Example configuration with options:**

```yaml
name: llama-model
backend: llama
parameters:
  model: model.gguf
options:
  - use_jinja:true
  - context_shift:true
  - cache_ram:4096
  - parallel:2
  - fit_params:true
  - fit_target:1024
  - slot_prompt_similarity:0.5
```

**Note:** The `parallel` option can also be set via the `LLAMACPP_PARALLEL` environment variable, and `grpc_servers` can be set via the `LLAMACPP_GRPC_SERVERS` environment variable. Options specified in the YAML file take precedence over environment variables.
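
To make the `key:value` option syntax and the precedence rule concrete, here is a small model of both. It is a sketch for illustration, not the backend's actual parser: the function names are hypothetical, and the fallback value of `1` when nothing is configured is an assumption.

```python
def parse_options(options):
    """Turn ["key:value", ...] entries into a dict, splitting on the first colon only."""
    parsed = {}
    for entry in options:
        key, _, value = entry.partition(":")
        parsed[key.strip()] = value.strip()
    return parsed


def resolve_parallel(options, environ):
    """Return the effective slot count: YAML option first, then the environment."""
    parsed = parse_options(options)
    for key in ("parallel", "n_parallel"):
        if key in parsed:
            return int(parsed[key])
    if "LLAMACPP_PARALLEL" in environ:
        return int(environ["LLAMACPP_PARALLEL"])
    return 1  # assumed default when nothing is configured


# Values that contain colons themselves (host:port lists) survive intact.
opts = parse_options(["grpc_servers:localhost:50051,localhost:50052", "parallel:2"])
print(opts["grpc_servers"])

# The YAML option wins over the environment variable.
from_yaml = resolve_parallel(["parallel:2"], {"LLAMACPP_PARALLEL": "8"})
from_env = resolve_parallel([], {"LLAMACPP_PARALLEL": "8"})
print(from_yaml, from_env)
```

Splitting on the first colon only is what lets an option such as `grpc_servers` carry `host:port` pairs as its value.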

#### Reference

- [llama](https://github.com/ggerganov/llama.cpp)


### ik_llama.cpp

[ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp) is a hard fork of `llama.cpp` by Iwan Kawrakow that focuses on superior CPU and hybrid GPU/CPU performance. It ships additional quantization types (IQK quants), custom quantization mixes, Multi-head Latent Attention (MLA) for DeepSeek models, and fine-grained tensor offload controls — particularly useful for running very large models on commodity CPU hardware.

{{% notice note %}}

The `ik-llama-cpp` backend requires a CPU with **AVX2** support. The IQK kernels are not compatible with older CPUs.

{{% /notice %}}

#### Features

The `ik-llama-cpp` backend supports the following features:
- [📖 Text generation (GPT)]({{%relref "features/text-generation" %}})
- [🧠 Embeddings]({{%relref "features/embeddings" %}})
- IQK quantization types for better CPU inference performance
- Multimodal models (via clip/llava)

#### Setup

The backend is distributed as a separate container image and can be installed from the LocalAI backend gallery, or specified directly in a model configuration. GGUF models loaded with this backend benefit from ik_llama.cpp's optimized CPU kernels — especially useful for MoE models and large quantized models that would otherwise be GPU-bound.

#### YAML configuration

To use the `ik-llama-cpp` backend, specify it as the backend in the YAML file:

```yaml
name: my-model
backend: ik-llama-cpp
parameters:
  # Relative to the models path
  model: file.gguf
```

The aliases `ik-llama` and `ik_llama` are also accepted.

#### Reference

- [ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp)


### turboquant (llama.cpp fork with TurboQuant KV-cache)

[llama-cpp-turboquant](https://github.com/TheTom/llama-cpp-turboquant) is a `llama.cpp` fork that adds the **TurboQuant KV-cache** quantization scheme. It reuses the upstream `llama.cpp` codebase and ships as a drop-in alternative backend inside LocalAI, sharing the same gRPC server sources as the stock `llama-cpp` backend — so any GGUF model that runs on `llama-cpp` also runs on `turboquant`.

You would pick `turboquant` when you want **smaller KV-cache memory pressure** (longer contexts on the same VRAM) or to experiment with the fork's quantized KV representations on top of the standard `cache_type_k` / `cache_type_v` knobs already supported by upstream `llama.cpp`.

#### Features

- Drop-in GGUF compatibility with upstream `llama.cpp`.
- TurboQuant KV-cache quantization (see fork README for the current set of accepted `cache_type_k` / `cache_type_v` values).
- Same feature surface as the `llama-cpp` backend: text generation, embeddings, tool calls, multimodal via mmproj.
- Available on CPU (AVX/AVX2/AVX512/fallback), NVIDIA CUDA 12/13, AMD ROCm/HIP, Intel SYCL f32/f16, Vulkan, and NVIDIA L4T.

#### Setup

`turboquant` ships as a separate container image in the LocalAI backend gallery. Install it like any other backend:

```bash
local-ai backends install turboquant
```

Or pick a specific flavor for your hardware (example tags: `cpu-turboquant`, `cuda12-turboquant`, `cuda13-turboquant`, `rocm-turboquant`, `intel-sycl-f16-turboquant`, `vulkan-turboquant`).

#### YAML configuration

To run a model with `turboquant`, set the backend in your model YAML and optionally pick quantized KV-cache types:

```yaml
name: my-model
backend: turboquant
parameters:
  # Relative to the models path
  model: file.gguf
# Use TurboQuant's own KV-cache quantization schemes. The fork accepts
# the standard llama.cpp types (f16, f32, q8_0, q4_0, q4_1, q5_0, q5_1)
# and adds three TurboQuant-specific ones: turbo2, turbo3, turbo4.
# turbo3 / turbo4 auto-enable flash_attention (required for turbo K/V)
# and offer progressively more aggressive compression.
cache_type_k: turbo3
cache_type_v: turbo3
context_size: 8192
```

The `cache_type_k` / `cache_type_v` fields map to llama.cpp's `-ctk` / `-ctv` flags. The stock `llama-cpp` backend only accepts the standard llama.cpp types — to use `turbo2` / `turbo3` / `turbo4` you need this `turboquant` backend, which is where the fork's TurboQuant code paths actually take effect. Pick `q8_0` here and you're just running stock llama.cpp KV quantization; pick `turbo*` and you're running TurboQuant.
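
To get a feel for how much the cache type matters, the following back-of-the-envelope sketch estimates KV-cache size for the standard llama.cpp types. The model dimensions are hypothetical, the formula ignores padding and any per-backend overhead, and the `turbo*` types are left out because their storage cost is not stated here.

```python
# Bytes per stored value for some standard llama.cpp cache types.
# The quantized types store blocks of 32 values plus a 2-byte fp16 scale.
BYTES_PER_VALUE = {
    "f32": 4.0,
    "f16": 2.0,
    "q8_0": 34 / 32,  # 32 int8 values + 2-byte scale
    "q4_0": 18 / 32,  # 32 4-bit values + 2-byte scale
}


def kv_cache_mib(n_layers, n_kv_heads, head_dim, context_size, cache_type_k, cache_type_v):
    """Estimate the KV-cache size in MiB for one sequence at full context."""
    values_per_tensor = n_layers * n_kv_heads * head_dim * context_size
    total_bytes = values_per_tensor * (
        BYTES_PER_VALUE[cache_type_k] + BYTES_PER_VALUE[cache_type_v]
    )
    return total_bytes / (1024 * 1024)


# Hypothetical model: 32 layers, 8 KV heads of dimension 128, 8192-token context.
dims = dict(n_layers=32, n_kv_heads=8, head_dim=128, context_size=8192)
f16_mib = kv_cache_mib(cache_type_k="f16", cache_type_v="f16", **dims)
q8_mib = kv_cache_mib(cache_type_k="q8_0", cache_type_v="q8_0", **dims)
q4_mib = kv_cache_mib(cache_type_k="q4_0", cache_type_v="q4_0", **dims)
print(f16_mib, q8_mib, q4_mib)
```

Under these assumptions the cache shrinks from 1024 MiB at `f16` to 544 MiB at `q8_0` and 288 MiB at `q4_0`, which is the kind of headroom that allows a longer `context_size` on the same VRAM.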

#### Reference

- [llama-cpp-turboquant](https://github.com/TheTom/llama-cpp-turboquant)
- [Tracked branch: `feature/turboquant-kv-cache`](https://github.com/TheTom/llama-cpp-turboquant/tree/feature/turboquant-kv-cache)


### vLLM

[vLLM](https://github.com/vllm-project/vllm) is a fast and easy-to-use library for LLM inference.

LocalAI has a built-in integration with vLLM, and it can be used to run models. You can check out `vllm` performance [here](https://github.com/vllm-project/vllm#performance).

#### Setup

Create a YAML file for the model you want to use with `vllm`.

To set up a model, you just need to specify the model name in the YAML config file:
```yaml
name: vllm
backend: vllm
parameters:
    model: "facebook/opt-125m"

```

The backend will automatically download the required files in order to run the model.


#### Usage

Use the `completions` endpoint and pass the name of the model configured with the `vllm` backend (here, `vllm`):
```bash
curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
   "model": "vllm",
   "prompt": "Hello, my name is",
   "temperature": 0.1, "top_p": 0.1
 }'
```

#### Passing arbitrary vLLM options with `engine_args`

A subset of `AsyncEngineArgs` is exposed as typed YAML fields
(`tensor_parallel_size`, `gpu_memory_utilization`, `quantization`,
`max_model_len`, `dtype`, `trust_remote_code`, `enforce_eager`, …).
Anything else can be passed through the generic `engine_args:` map.
Keys are forwarded verbatim to vLLM's engine; unknown keys fail at load
time with the closest valid name as a hint. Nested maps materialise
into vLLM's nested config dataclasses (`SpeculativeConfig`,
`KVTransferConfig`, `CompilationConfig`, …).
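
The "closest valid name" behavior can be pictured with the standard library's `difflib`. This is only an illustration of the idea, not LocalAI's or vLLM's actual validation code; the list of known keys is a small subset taken from this page, and `validate_engine_args` is a hypothetical name.

```python
import difflib

# Small illustrative subset of engine argument names mentioned on this page.
KNOWN_ENGINE_ARGS = [
    "tensor_parallel_size",
    "gpu_memory_utilization",
    "quantization",
    "max_model_len",
    "dtype",
    "trust_remote_code",
    "enforce_eager",
    "speculative_config",
    "attention_backend",
    "data_parallel_size",
]


def validate_engine_args(engine_args, known=KNOWN_ENGINE_ARGS):
    """Raise ValueError for the first unknown key, suggesting the closest known name."""
    for key in engine_args:
        if key in known:
            continue
        matches = difflib.get_close_matches(key, known, n=1)
        hint = f" (did you mean '{matches[0]}'?)" if matches else ""
        raise ValueError(f"unknown engine_args key '{key}'{hint}")
    return engine_args


try:
    validate_engine_args({"tensor_paralel_size": 2})  # note the typo
except ValueError as error:
    message = str(error)
    print(message)
```

Failing at load time with a hint means a misspelled key is caught before the model starts serving, rather than being silently ignored.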

Speculative decoding (DFlash, ngram, eagle, deepseek_mtp, …) is
configured this way:

```yaml
name: qwen3.5-4b-dflash
backend: vllm
parameters:
  model: Qwen/Qwen3.5-4B
context_size: 8192
max_model_len: 8192
trust_remote_code: true
quantization: fp8
template:
  use_tokenizer_template: true
engine_args:
  speculative_config:
    method: dflash
    model: z-lab/Qwen3.5-4B-DFlash
    num_speculative_tokens: 15
```

The shape of `speculative_config` follows vLLM's
[`SpeculativeConfig`](https://docs.vllm.ai/en/latest/api/vllm/config/speculative.html)
— `method` picks the algorithm, the remaining keys are method-specific.
Drafters from [z-lab](https://huggingface.co/z-lab) are paired with
specific target models; pick the one that matches your target. The
drafter loads in its native precision regardless of the target's
`quantization:` setting.

Another example — picking a non-default attention backend (e.g. on
hardware where the default cutlass kernels aren't supported):

```yaml
engine_args:
  attention_backend: TRITON_ATTN
```

#### Multi-node data parallelism

`engine_args.data_parallel_size > 1` combined with the
`local-ai p2p-worker vllm` follower lets a single model span multiple
GPU nodes. See [vLLM Multi-Node (Data-Parallel)]({{% relref
"features/distributed-mode#vllm-multi-node-data-parallel" %}})
for the head/follower configuration and a worked Kimi-K2.6 example.

### SGLang

[SGLang](https://github.com/sgl-project/sglang) is a fast serving
framework for LLMs and VLMs with a focus on prefix caching, speculative
decoding, and multi-modal generation. LocalAI ships a gRPC backend that
wraps SGLang's async `Engine`, including its native function-call and
reasoning parsers.

#### Setup

```yaml
name: sglang
backend: sglang
parameters:
  model: "Qwen/Qwen3-4B"
template:
  use_tokenizer_template: true
```

The backend will pull the model from HuggingFace on first load.

#### Passing arbitrary SGLang options with `engine_args`

The same `engine_args:` map that the vLLM backend accepts is also
honoured by the SGLang backend. Keys are validated against
[`ServerArgs`](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/server_args.py)
— SGLang's central configuration dataclass — and forwarded verbatim to
`Engine(**kwargs)`. Unknown keys fail at load time with the closest
valid name as a hint. Unlike vLLM, `ServerArgs` is flat: speculative
decoding fields are top-level (`speculative_algorithm`,
`speculative_draft_model_path`, etc.) rather than nested under a
`speculative_config:` dict.

The typed YAML fields shared with vLLM are mapped to their SGLang
equivalents (`gpu_memory_utilization` → `mem_fraction_static`,
`enforce_eager` → `disable_cuda_graph`, `tensor_parallel_size` →
`tp_size`, `max_model_len` → `context_length`). Anything else,
including all speculative-decoding flags, goes under `engine_args:`.
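
The field mapping described above can be summarized as a simple rename step. The sketch below is illustrative only and is not the backend's code: `to_sglang_kwargs` is a hypothetical helper, and letting explicit `engine_args` entries override the typed fields is an assumption.

```python
# Mapping stated in the paragraph above: shared YAML field -> SGLang ServerArgs name.
SGLANG_FIELD_NAMES = {
    "gpu_memory_utilization": "mem_fraction_static",
    "enforce_eager": "disable_cuda_graph",
    "tensor_parallel_size": "tp_size",
    "max_model_len": "context_length",
}


def to_sglang_kwargs(typed_fields, engine_args=None):
    """Rename shared typed fields, then overlay the raw engine_args map."""
    kwargs = {
        SGLANG_FIELD_NAMES.get(name, name): value
        for name, value in typed_fields.items()
    }
    # Assumption: explicit engine_args entries win over the typed fields.
    kwargs.update(engine_args or {})
    return kwargs


kwargs = to_sglang_kwargs(
    {"gpu_memory_utilization": 0.85, "tensor_parallel_size": 1, "max_model_len": 4096},
    {"speculative_algorithm": "NEXTN", "speculative_num_steps": 5},
)
print(kwargs)
```

The resulting dictionary uses SGLang's flat argument names throughout, which matches the point that `ServerArgs` has no nested speculative configuration.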

##### Speculative decoding: Gemma 4 with Multi-Token Prediction

Google publishes paired "assistant" drafters for every Gemma 4 size.
The drafters use Multi-Token Prediction (MTP) to propose several
candidate tokens per target step, which SGLang then verifies in
parallel. Flags below are transcribed verbatim from the
[SGLang Gemma 4 cookbook](https://docs.sglang.io/cookbook/autoregressive/Google/Gemma4#speculative-decoding-mtp-server-commands).

For consumer GPUs in the 16–24 GB range, use **E4B** (8 B total /
4 B effective parameters):

```yaml
name: gemma-4-e4b-mtp
backend: sglang
parameters:
  model: google/gemma-4-E4B-it
context_size: 4096
template:
  use_tokenizer_template: true
options:
  - tool_parser:gemma4
  - reasoning_parser:gemma4
engine_args:
  mem_fraction_static: 0.85
  speculative_algorithm: NEXTN
  speculative_draft_model_path: google/gemma-4-E4B-it-assistant
  speculative_num_steps: 5
  speculative_num_draft_tokens: 6
  speculative_eagle_topk: 1
```

For smaller cards (8–12 GB), drop to **E2B** (5 B total / 2 B effective)
by swapping the model paths to `google/gemma-4-E2B-it` and
`google/gemma-4-E2B-it-assistant`; the rest of the flags stay the same.

`NEXTN` is normalised to `EAGLE` inside `ServerArgs.__post_init__`, so
either value works — the cookbook uses `NEXTN`. `mem_fraction_static`
is the share of GPU memory SGLang reserves for the model + KV pool;
0.85 is the cookbook's default and adapts to whatever single GPU the
backend is running on.

The 31 B dense and 26 B-A4B MoE Gemma 4 variants exist in the same
cookbook but require `--tp-size 2`, so they're not in the gallery as
single-GPU recipes.

> **SGLang version requirement.** Gemma 4 support landed in SGLang via
> [PR #21952](https://github.com/sgl-project/sglang/pull/21952). The
> LocalAI sglang backend pins a release that includes it; if you've
> overridden the pin to an older version, this recipe will fail with a
> "model architecture not recognised" error at load time.

##### Other speculative algorithms

`speculative_algorithm:` also accepts `EAGLE`/`EAGLE3` (paired with an
EAGLE-style draft head), `DFLASH` (block-diffusion drafters from
[z-lab](https://huggingface.co/z-lab) for the Qwen3 family), `STANDALONE`
(a smaller draft LLM verifying a larger target), and `NGRAM` (no draft
model — pure prefix-history speculation). See SGLang's
[speculative-decoding docs](https://docs.sglang.io/advanced_features/speculative_decoding.html)
for the full algorithm matrix.

#### Tool calling and reasoning parsers

SGLang's native parsers stream `tool_calls` and `reasoning_content`
inside `ChatDelta` — the LocalAI Python backend wires them up
per-request rather than via `engine_args:`. Pick a parser by name:

```yaml
options:
  - tool_parser:hermes
  - reasoning_parser:deepseek_r1
```

The full list of registered parsers lives in `sglang.srt.function_call`
and `sglang.srt.parser.reasoning_parser`.

### Transformers

[Transformers](https://huggingface.co/docs/transformers/index) is a state-of-the-art machine learning library for PyTorch, TensorFlow, and JAX.

LocalAI has a built-in integration with Transformers, and it can be used to run models.

This is an extra backend. In the container images it is already available (the `extra` images already contain the Python dependencies for Transformers), so no additional setup is required.

#### Setup

Create a YAML file for the model you want to use with `transformers`.

To set up a model, you just need to specify the model name in the YAML config file:
```yaml
name: transformers
backend: transformers
parameters:
    model: "facebook/opt-125m"
type: AutoModelForCausalLM
quantization: bnb_4bit # One of: bnb_8bit, bnb_4bit, xpu_4bit, xpu_8bit (optional)
```

The backend will automatically download the required files in order to run the model.

#### Parameters

##### Type

| Type | Description |
| --- | --- |
| `AutoModelForCausalLM` | `AutoModelForCausalLM` is a model that can be used to generate sequences. Use it for NVIDIA CUDA and for Intel GPUs with Intel Extension for PyTorch acceleration |
| `OVModelForCausalLM` | for Intel CPU/GPU/NPU OpenVINO Text Generation models |
| `OVModelForFeatureExtraction` | for Intel CPU/GPU/NPU OpenVINO Embedding acceleration |
| N/A | Defaults to `AutoModel` |

- `OVModelForCausalLM` requires OpenVINO IR [Text Generation](https://huggingface.co/models?library=openvino&pipeline_tag=text-generation) models from Hugging Face
- `OVModelForFeatureExtraction` works with any Safetensors Transformer [Feature Extraction](https://huggingface.co/models?pipeline_tag=feature-extraction&library=transformers,safetensors) model from Hugging Face (Embedding Model)

Please note that streaming is currently not implemented in `AutoModelForCausalLM` for Intel GPU.
AMD GPU support is not implemented.
Although AMD CPUs are not officially supported by OpenVINO, there are reports that they work: YMMV.

##### Embeddings
Use `embeddings: true` if the model is an embedding model.

##### Inference device selection
The Transformers backend tries to select the best device for inference automatically. You can override its choice with the `main_gpu` parameter.

| Inference Engine | Applicable Values |
| --- | --- |
| CUDA | `cuda`, `cuda.X` where X is the GPU device index as shown in the `nvidia-smi -L` output |
| OpenVINO | Any applicable value from [Inference Modes](https://docs.openvino.ai/2024/openvino-workflow/running-inference/inference-devices-and-modes.html) like `AUTO`,`CPU`,`GPU`,`NPU`,`MULTI`,`HETERO` |

Example for CUDA:
`main_gpu: cuda.0`

Example for OpenVINO:
`main_gpu: AUTO:-CPU`

This parameter applies to both Text Generation and Feature Extraction (i.e. Embeddings) models.
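
For context, here is a sketch of where `main_gpu` sits in a full model file. It reuses the OpenVINO model from the Examples section of this page; the `name` value is illustrative, and the device string should be adapted to your hardware.

```yaml
# Sketch: OpenVINO model that may use any available device except the CPU
name: starling-openvino-gpu   # illustrative name
backend: transformers
type: OVModelForCausalLM
main_gpu: "AUTO:-CPU"         # OpenVINO device string: automatic selection, excluding CPU
parameters:
  model: fakezeta/Starling-LM-7B-beta-openvino-int8
```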

##### Inference Precision
The Transformers backend automatically selects the fastest applicable inference precision according to what the device supports.
With the CUDA backend, you can manually enable *bfloat16*, if your hardware supports it, with the following parameter:

`f16: true`
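
As a sketch, the model from the Setup section could be pinned to the first CUDA device with *bfloat16* enabled. This assumes an NVIDIA GPU with bfloat16 support; the `name` value is illustrative.

```yaml
# Sketch: CUDA device selection combined with bfloat16
name: opt-cuda-bf16      # illustrative name
backend: transformers
type: AutoModelForCausalLM
main_gpu: cuda.0         # first GPU listed by `nvidia-smi -L`
f16: true                # enables bfloat16 on CUDA, as described above
parameters:
  model: "facebook/opt-125m"
```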

##### Quantization

| Quantization | Description |
| --- | --- |
| `bnb_8bit` | 8-bit quantization |
| `bnb_4bit` | 4-bit quantization |
| `xpu_8bit` | 8-bit quantization for Intel XPUs |
| `xpu_4bit` | 4-bit quantization for Intel XPUs |

##### Trust Remote Code
Some models, such as Microsoft Phi-3, require external code beyond what is provided by the Transformers library.
Loading such code is disabled by default for security reasons.
It can be manually enabled with:
`trust_remote_code: true`
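
A possible configuration for such a model is sketched below. The `name` value and the exact model identifier are assumptions for illustration; enable `trust_remote_code` only for repositories you trust, since it allows code shipped with the model to run on your machine.

```yaml
# Sketch: model that ships custom code in its repository
name: phi-3                  # illustrative name
backend: transformers
type: AutoModelForCausalLM
trust_remote_code: true      # opt in to running code from the model repository
parameters:
  model: "microsoft/Phi-3-mini-4k-instruct"   # assumed example model
```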

##### Maximum Context Size
The maximum context size in bytes can be specified with the `context_size` parameter. Do not use values higher than what your model supports.

Usage example:
`context_size: 8192`

##### Auto Prompt Template
The chat template is usually defined by the model author in the `tokenizer_config.json` file.
To enable it, set `use_tokenizer_template: true` in the `template` section.

Usage example:
```yaml
template:
  use_tokenizer_template: true
```

##### Custom Stop Words
Stop words are usually defined in the `tokenizer_config.json` file.
If needed, they can be overridden with the `stopwords` parameter, as is the case for the llama3-Instruct model.

Usage example:
```yaml
stopwords:
- "<|eot_id|>"
- "<|end_of_text|>"
```
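
The context size, tokenizer template, and stop word settings can be combined in one model file. The following is a sketch for a llama3-Instruct style model; the `name` value and the model identifier are assumptions for illustration, and the model may require accepting its license on Hugging Face before it can be downloaded.

```yaml
# Sketch: combining context size, tokenizer chat template, and custom stop words
name: llama3-instruct        # illustrative name
backend: transformers
type: AutoModelForCausalLM
context_size: 8192           # do not exceed what the model supports
parameters:
  model: "meta-llama/Meta-Llama-3-8B-Instruct"   # assumed example model
template:
  use_tokenizer_template: true   # use the chat template from tokenizer_config.json
stopwords:
- "<|eot_id|>"
- "<|end_of_text|>"
```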

#### Usage

Use the `completions` endpoint by specifying the `transformers` model:
```bash
curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
  "model": "transformers",
  "prompt": "Hello, my name is",
  "temperature": 0.1, "top_p": 0.1
}'
```

#### Examples

##### OpenVINO

A model configuration file for OpenVINO and the Starling model:

```yaml
name: starling-openvino
backend: transformers
parameters:
  model: fakezeta/Starling-LM-7B-beta-openvino-int8
context_size: 8192
threads: 6
f16: true
type: OVModelForCausalLM
stopwords:
- <|end_of_turn|>
- <|endoftext|>
prompt_cache_path: "cache"
prompt_cache_all: true
template:
  chat_message: |
    {{if eq .RoleName "system"}}{{.Content}}<|end_of_turn|>{{end}}{{if eq .RoleName "assistant"}}<|end_of_turn|>GPT4 Correct Assistant: {{.Content}}<|end_of_turn|>{{end}}{{if eq .RoleName "user"}}GPT4 Correct User: {{.Content}}{{end}}

  chat: |
    {{.Input}}<|end_of_turn|>GPT4 Correct Assistant:

  completion: |
    {{.Input}}
```
-												docs/examples: enhancements (#1572)

* docs: re-order sections

* fix references

* Add mixtral-instruct, tinyllama-chat, dolphin-2.5-mixtral-8x7b

* Fix link

* Minor corrections

* fix: models is a StringSlice, not a String

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* WIP: switch docs theme

* content

* Fix GH link

* enhancements

* enhancements

* Fixed how to link

Signed-off-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com>

* fixups

* logo fix

* more fixups

* final touches

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Signed-off-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com>
Co-authored-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com>
											
										
										
											2024-01-18 18:41:08 +00:00
 								+++
 								disableToc = false
-												fix(docs): fix broken references to distributed mode

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

											
										
										
											2026-04-03 07:46:06 +00:00
+								title = "Text Generation (GPT)"
-												docs/examples: enhancements (#1572)

* docs: re-order sections

* fix references

* Add mixtral-instruct, tinyllama-chat, dolphin-2.5-mixtral-8x7b

* Fix link

* Minor corrections

* fix: models is a StringSlice, not a String

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* WIP: switch docs theme

* content

* Fix GH link

* enhancements

* enhancements

* Fixed how to link

Signed-off-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com>

* fixups

* logo fix

* more fixups

* final touches

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Signed-off-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com>
Co-authored-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com>
											
										
										
											2024-01-18 18:41:08 +00:00
+								weight = 10
-												docs: re-use original permalinks (#1610)

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
											
										
										
											2024-01-19 18:23:58 +00:00
+								url = "/features/text-generation/"
-												docs/examples: enhancements (#1572)

* docs: re-order sections

* fix references

* Add mixtral-instruct, tinyllama-chat, dolphin-2.5-mixtral-8x7b

* Fix link

* Minor corrections

* fix: models is a StringSlice, not a String

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* WIP: switch docs theme

* content

* Fix GH link

* enhancements

* enhancements

* Fixed how to link

Signed-off-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com>

* fixups

* logo fix

* more fixups

* final touches

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Signed-off-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com>
Co-authored-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com>
											
										
										
											2024-01-18 18:41:08 +00:00
+								+++
-												feat: docs revamp (#7313)

* docs

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Small enhancements

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Enhancements

* Default to zen-dark

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fixups

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
											
										
										
											2025-11-19 21:21:20 +00:00
+								LocalAI supports generating text with GPT with `llama.cpp` and other backends (such as `rwkv.cpp` as ) see also the [Model compatibility]({{%relref "reference/compatibility-table" %}}) for an up-to-date list of the supported model families.
-												docs/examples: enhancements (#1572)

* docs: re-order sections

* fix references

* Add mixtral-instruct, tinyllama-chat, dolphin-2.5-mixtral-8x7b

* Fix link

* Minor corrections

* fix: models is a StringSlice, not a String

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* WIP: switch docs theme

* content

* Fix GH link

* enhancements

* enhancements

* Fixed how to link

Signed-off-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com>

* fixups

* logo fix

* more fixups

* final touches

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Signed-off-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com>
Co-authored-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com>
											
										
										
											2024-01-18 18:41:08 +00:00
 								Note:
 								- You can also specify the model name as part of the OpenAI token.
 								- If only one model is available, the API will use it for all the requests.
 								## API Reference
 								### Chat completions
 								https://platform.openai.com/docs/api-reference/chat
 								For example, to generate a chat completion, you can send a POST request to the `/v1/chat/completions` endpoint with the instruction as the request body:
 								```bash
 								curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
 								  "model": "ggml-koala-7b-model-q4_0-r2.bin",
 								  "messages": [{"role": "user", "content": "Say this is a test!"}],
 								  "temperature": 0.7
 								}'
 								```
 								Available additional parameters: `top_p`, `top_k`, `max_tokens`
 								### Edit completions
 								https://platform.openai.com/docs/api-reference/edits
 								To generate an edit completion you can send a POST request to the `/v1/edits` endpoint with the instruction as the request body:
 								```bash
 								curl http://localhost:8080/v1/edits -H "Content-Type: application/json" -d '{
 								  "model": "ggml-koala-7b-model-q4_0-r2.bin",
 								  "instruction": "rephrase",
 								  "input": "Black cat jumped out of the window",
 								  "temperature": 0.7
 								}'
 								```
 								Available additional parameters: `top_p`, `top_k`, `max_tokens`.
 								### Completions
 								https://platform.openai.com/docs/api-reference/completions
 								To generate a completion, you can send a POST request to the `/v1/completions` endpoint with the instruction as per the request body:
 								```bash
 								curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
 								  "model": "ggml-koala-7b-model-q4_0-r2.bin",
 								  "prompt": "A long time ago in a galaxy far, far away",
 								  "temperature": 0.7
 								}'
 								```
 								Available additional parameters: `top_p`, `top_k`, `max_tokens`
 								### List models
 								You can list all the models available with:
 								```bash
 								curl http://localhost:8080/v1/models
 								```
-												chore(docs): update docs with Anthropic API and openresponses

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

											
										
										
											2026-01-20 08:25:24 +00:00
+								### Anthropic Messages API
 								LocalAI supports the Anthropic Messages API, which is compatible with Claude clients. This endpoint provides a structured way to send messages and receive responses, with support for tools, streaming, and multimodal content.
 								**Endpoint:** `POST /v1/messages` or `POST /messages`
 								**Reference:** https://docs.anthropic.com/claude/reference/messages_post
 								#### Basic Usage
 								```bash
 								curl http://localhost:8080/v1/messages \
 								  -H "Content-Type: application/json" \
 								  -H "anthropic-version: 2023-06-01" \
 								  -d '{
 								    "model": "ggml-koala-7b-model-q4_0-r2.bin",
 								    "max_tokens": 1024,
 								    "messages": [
 								      {"role": "user", "content": "Say this is a test!"}
 								    ]
 								  }'
 								```
 								#### Request Parameters
 								| Parameter | Type | Required | Description |
 								|-----------|------|----------|-------------|
 								| `model` | string | Yes | The model identifier |
 								| `messages` | array | Yes | Array of message objects with `role` and `content` |
 								| `max_tokens` | integer | Yes | Maximum number of tokens to generate (must be > 0) |
 								| `system` | string | No | System message to set the assistant's behavior |
 								| `temperature` | float | No | Sampling temperature (0.0 to 1.0) |
 								| `top_p` | float | No | Nucleus sampling parameter |
 								| `top_k` | integer | No | Top-k sampling parameter |
 								| `stop_sequences` | array | No | Array of strings that will stop generation |
 								| `stream` | boolean | No | Enable streaming responses |
 								| `tools` | array | No | Array of tool definitions for function calling |
 								| `tool_choice` | string/object | No | Tool choice strategy: "auto", "any", "none", or specific tool |
-												feat: pass-by metadata to predict options (#8795)

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
											
										
										
											2026-03-05 21:50:10 +00:00
+								| `metadata` | object | No | Per-request metadata passed to the backend (e.g., `{"enable_thinking": "true"}`) |
-												chore(docs): update docs with Anthropic API and openresponses

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

											
										
										
											2026-01-20 08:25:24 +00:00
 								#### Message Format
 								Messages can contain text or structured content blocks:
 								```bash
 								curl http://localhost:8080/v1/messages \
 								  -H "Content-Type: application/json" \
 								  -d '{
 								    "model": "ggml-koala-7b-model-q4_0-r2.bin",
 								    "max_tokens": 1024,
 								    "messages": [
 								      {
 								        "role": "user",
 								        "content": [
 								          {
 								            "type": "text",
 								            "text": "What is in this image?"
 								          },
 								          {
 								            "type": "image",
 								            "source": {
 								              "type": "base64",
 								              "media_type": "image/jpeg",
 								              "data": "base64_encoded_image_data"
 								            }
 								          }
 								        ]
 								      }
 								    ]
 								  }'
 								```
 								#### Tool Calling
 								The Anthropic API supports function calling through tools:
 								```bash
 								curl http://localhost:8080/v1/messages \
 								  -H "Content-Type: application/json" \
 								  -d '{
 								    "model": "ggml-koala-7b-model-q4_0-r2.bin",
 								    "max_tokens": 1024,
 								    "tools": [
 								      {
 								        "name": "get_weather",
 								        "description": "Get the current weather",
 								        "input_schema": {
 								          "type": "object",
 								          "properties": {
 								            "location": {
 								              "type": "string",
 								              "description": "The city and state"
 								            }
 								          },
 								          "required": ["location"]
 								        }
 								      }
 								    ],
 								    "tool_choice": "auto",
 								    "messages": [
 								      {"role": "user", "content": "What is the weather in San Francisco?"}
 								    ]
 								  }'
 								```
 								#### Streaming
 								Enable streaming responses by setting `stream: true`:
 								```bash
 								curl http://localhost:8080/v1/messages \
 								  -H "Content-Type: application/json" \
 								  -d '{
 								    "model": "ggml-koala-7b-model-q4_0-r2.bin",
 								    "max_tokens": 1024,
 								    "stream": true,
 								    "messages": [
 								      {"role": "user", "content": "Tell me a story"}
 								    ]
 								  }'
 								```
 								Streaming responses use Server-Sent Events (SSE) format with event types: `message_start`, `content_block_start`, `content_block_delta`, `content_block_stop`, `message_delta`, and `message_stop`.
 								#### Response Format
 								```json
 								{
 								  "id": "msg_abc123",
 								  "type": "message",
 								  "role": "assistant",
 								  "content": [
 								    {
 								      "type": "text",
 								      "text": "This is a test!"
 								    }
 								  ],
 								  "model": "ggml-koala-7b-model-q4_0-r2.bin",
 								  "stop_reason": "end_turn",
 								  "usage": {
 								    "input_tokens": 10,
 								    "output_tokens": 5
 								  }
 								}
 								```
 								### Open Responses API
 								LocalAI supports the Open Responses API specification, which provides a standardized interface for AI model interactions with support for background processing, streaming, tool calling, and advanced features like reasoning.
 								**Endpoint:** `POST /v1/responses` or `POST /responses`
 								**Reference:** https://www.openresponses.org/specification
 								#### Basic Usage
 								```bash
 								curl http://localhost:8080/v1/responses \
 								  -H "Content-Type: application/json" \
 								  -d '{
 								    "model": "ggml-koala-7b-model-q4_0-r2.bin",
 								    "input": "Say this is a test!",
 								    "max_output_tokens": 1024
 								  }'
 								```
 								#### Request Parameters
 								| Parameter | Type | Required | Description |
 								|-----------|------|----------|-------------|
 								| `model` | string | Yes | The model identifier |
 								| `input` | string/array | Yes | Input text or array of input items |
 								| `max_output_tokens` | integer | No | Maximum number of tokens to generate |
 								| `temperature` | float | No | Sampling temperature |
 								| `top_p` | float | No | Nucleus sampling parameter |
 								| `instructions` | string | No | System instructions |
 								| `tools` | array | No | Array of tool definitions |
 								| `tool_choice` | string/object | No | Tool choice: "auto", "required", "none", or specific tool |
 								| `stream` | boolean | No | Enable streaming responses |
 								| `background` | boolean | No | Run request in background (returns immediately) |
 								| `store` | boolean | No | Whether to store the response |
 								| `reasoning` | object | No | Reasoning configuration with `effort` and `summary` |
 								| `parallel_tool_calls` | boolean | No | Allow parallel tool calls |
 								| `max_tool_calls` | integer | No | Maximum number of tool calls |
 								| `presence_penalty` | float | No | Presence penalty (-2.0 to 2.0) |
 								| `frequency_penalty` | float | No | Frequency penalty (-2.0 to 2.0) |
 								| `top_logprobs` | integer | No | Number of top logprobs to return |
 								| `truncation` | string | No | Truncation mode: "auto" or "disabled" |
 								| `text_format` | object | No | Text format configuration |
 								| `metadata` | object | No | Custom metadata |
 								#### Input Format
 								Input can be a simple string or an array of structured items:
 								```bash
 								curl http://localhost:8080/v1/responses \
 								  -H "Content-Type: application/json" \
 								  -d '{
 								    "model": "ggml-koala-7b-model-q4_0-r2.bin",
 								    "input": [
 								      {
 								        "type": "message",
 								        "role": "user",
 								        "content": "What is the weather?"
 								      }
 								    ],
 								    "max_output_tokens": 1024
 								  }'
 								```
 								#### Background Processing
 								Run requests in the background for long-running tasks:
 								```bash
 								curl http://localhost:8080/v1/responses \
 								  -H "Content-Type: application/json" \
 								  -d '{
 								    "model": "ggml-koala-7b-model-q4_0-r2.bin",
 								    "input": "Generate a long story",
 								    "max_output_tokens": 4096,
 								    "background": true
 								  }'
 								```
 								The response will include a response ID that can be used to poll for completion:
 								```json
 								{
 								  "id": "resp_abc123",
 								  "object": "response",
 								  "status": "in_progress",
 								  "created_at": 1234567890
 								}
 								```
 								#### Retrieving Background Responses
 								Use the GET endpoint to retrieve background responses:
 								```bash
 								# Get response by ID
 								curl http://localhost:8080/v1/responses/resp_abc123
 								# Resume streaming with query parameters
 								curl "http://localhost:8080/v1/responses/resp_abc123?stream=true&starting_after=10"
 								```
 								#### Canceling Background Responses
 								Cancel a background response that's still in progress:
 								```bash
 								curl -X POST http://localhost:8080/v1/responses/resp_abc123/cancel
 								```
 								#### Tool Calling
 								Open Responses API supports function calling with tools:
 								```bash
 								curl http://localhost:8080/v1/responses \
 								  -H "Content-Type: application/json" \
 								  -d '{
 								    "model": "ggml-koala-7b-model-q4_0-r2.bin",
 								    "input": "What is the weather in San Francisco?",
 								    "tools": [
 								      {
 								        "type": "function",
 								        "name": "get_weather",
 								        "description": "Get the current weather",
 								        "parameters": {
 								          "type": "object",
 								          "properties": {
 								            "location": {
 								              "type": "string",
 								              "description": "The city and state"
 								            }
 								          },
 								          "required": ["location"]
 								        }
 								      }
 								    ],
 								    "tool_choice": "auto",
 								    "max_output_tokens": 1024
 								  }'
 								```
 								#### Reasoning Configuration
 								Configure reasoning effort and summary style:
 								```bash
 								curl http://localhost:8080/v1/responses \
 								  -H "Content-Type: application/json" \
 								  -d '{
 								    "model": "ggml-koala-7b-model-q4_0-r2.bin",
 								    "input": "Solve this complex problem step by step",
 								    "reasoning": {
 								      "effort": "high",
 								      "summary": "detailed"
 								    },
 								    "max_output_tokens": 2048
 								  }'
 								```
 								#### Response Format
 								```json
 								{
 								  "id": "resp_abc123",
 								  "object": "response",
 								  "created_at": 1234567890,
 								  "completed_at": 1234567895,
 								  "status": "completed",
 								  "model": "ggml-koala-7b-model-q4_0-r2.bin",
 								  "output": [
 								    {
 								      "type": "message",
 								      "id": "msg_001",
 								      "role": "assistant",
 								      "content": [
 								        {
 								          "type": "output_text",
 								          "text": "This is a test!",
 								          "annotations": [],
 								          "logprobs": []
 								        }
 								      ],
 								      "status": "completed"
 								    }
 								  ],
 								  "error": null,
 								  "incomplete_details": null,
 								  "temperature": 0.7,
 								  "top_p": 1.0,
 								  "presence_penalty": 0.0,
 								  "frequency_penalty": 0.0,
 								  "usage": {
 								    "input_tokens": 10,
 								    "output_tokens": 5,
 								    "total_tokens": 15,
 								    "input_tokens_details": {
 								      "cached_tokens": 0
 								    },
 								    "output_tokens_details": {
 								      "reasoning_tokens": 0
 								    }
 								  }
 								}
 								```
-												docs/examples: enhancements (#1572)

* docs: re-order sections

* fix references

* Add mixtral-instruct, tinyllama-chat, dolphin-2.5-mixtral-8x7b

* Fix link

* Minor corrections

* fix: models is a StringSlice, not a String

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* WIP: switch docs theme

* content

* Fix GH link

* enhancements

* enhancements

* Fixed how to link

Signed-off-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com>

* fixups

* logo fix

* more fixups

* final touches

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Signed-off-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com>
Co-authored-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com>
											
										
										
											2024-01-18 18:41:08 +00:00
+								## Backends
 								### RWKV
-												chore(autogptq): drop archived backend (#5214)

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
											
										
										
											2025-04-19 13:52:29 +00:00
+								RWKV support is available through llama.cpp (see below)
-												docs/examples: enhancements (#1572)

* docs: re-order sections

* fix references

* Add mixtral-instruct, tinyllama-chat, dolphin-2.5-mixtral-8x7b

* Fix link

* Minor corrections

* fix: models is a StringSlice, not a String

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* WIP: switch docs theme

* content

* Fix GH link

* enhancements

* enhancements

* Fixed how to link

Signed-off-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com>

* fixups

* logo fix

* more fixups

* final touches

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Signed-off-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com>
Co-authored-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com>
											
										
										
											2024-01-18 18:41:08 +00:00
 								### llama.cpp
 								[llama.cpp](https://github.com/ggerganov/llama.cpp) is a popular port of Facebook's LLaMA model in C/C++.
-												feat: docs revamp (#7313)

* docs

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Small enhancements

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Enhancements

* Default to zen-dark

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fixups

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
											
										
										
											2025-11-19 21:21:20 +00:00
+								{{% notice note %}}
-												docs/examples: enhancements (#1572)

* docs: re-order sections

* fix references

* Add mixtral-instruct, tinyllama-chat, dolphin-2.5-mixtral-8x7b

* Fix link

* Minor corrections

* fix: models is a StringSlice, not a String

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* WIP: switch docs theme

* content

* Fix GH link

* enhancements

* enhancements

* Fixed how to link

Signed-off-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com>

* fixups

* logo fix

* more fixups

* final touches

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Signed-off-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com>
Co-authored-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com>
											
										
										
											2024-01-18 18:41:08 +00:00
-												chore(llama-ggml): drop deprecated backend (#4775)

The GGML format is now dead, since in the next version of LocalAI we
already bring many breaking compatibility changes, taking the occasion
also to drop ggml support (pre-gguf).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
											
										
										
											2025-02-06 17:36:23 +00:00
+								The `ggml` file format has been deprecated. If you are using `ggml` models and you are configuring your model with a YAML file, specify, use a LocalAI version older than v2.25.0. For `gguf` models, use the `llama` backend. The go backend is deprecated as well but still available as `go-llama`.
-												docs/examples: enhancements (#1572)

* docs: re-order sections

* fix references

* Add mixtral-instruct, tinyllama-chat, dolphin-2.5-mixtral-8x7b

* Fix link

* Minor corrections

* fix: models is a StringSlice, not a String

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* WIP: switch docs theme

* content

* Fix GH link

* enhancements

* enhancements

* Fixed how to link

Signed-off-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com>

* fixups

* logo fix

* more fixups

* final touches

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Signed-off-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com>
Co-authored-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com>
											
										
										
											2024-01-18 18:41:08 +00:00
-												feat: docs revamp (#7313)

* docs

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Small enhancements

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Enhancements

* Default to zen-dark

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fixups

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
											
										
										
											2025-11-19 21:21:20 +00:00
+								 {{% /notice %}}
-												docs/examples: enhancements (#1572)

* docs: re-order sections

* fix references

* Add mixtral-instruct, tinyllama-chat, dolphin-2.5-mixtral-8x7b

* Fix link

* Minor corrections

* fix: models is a StringSlice, not a String

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* WIP: switch docs theme

* content

* Fix GH link

* enhancements

* enhancements

* Fixed how to link

Signed-off-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com>

* fixups

* logo fix

* more fixups

* final touches

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Signed-off-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com>
Co-authored-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com>
											
										
										
											2024-01-18 18:41:08 +00:00
 								#### Features
 								The `llama.cpp` model supports the following features:
-												feat: docs revamp (#7313)

* docs

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Small enhancements

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Enhancements

* Default to zen-dark

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fixups

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
											
										
										
											2025-11-19 21:21:20 +00:00
+								- [📖 Text generation (GPT)]({{%relref "features/text-generation" %}})
 								- [🧠 Embeddings]({{%relref "features/embeddings" %}})
 								- [🔥 OpenAI functions]({{%relref "features/openai-functions" %}})
 								- [✍️ Constrained grammars]({{%relref "features/constrained_grammars" %}})
-												docs/examples: enhancements (#1572)

* docs: re-order sections

* fix references

* Add mixtral-instruct, tinyllama-chat, dolphin-2.5-mixtral-8x7b

* Fix link

* Minor corrections

* fix: models is a StringSlice, not a String

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* WIP: switch docs theme

* content

* Fix GH link

* enhancements

* enhancements

* Fixed how to link

Signed-off-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com>

* fixups

* logo fix

* more fixups

* final touches

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Signed-off-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com>
Co-authored-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com>
											
										
										
											2024-01-18 18:41:08 +00:00
 								#### Setup
 								LocalAI supports `llama.cpp` models out of the box. You can use the `llama.cpp` model in the same way as any other model.
 								##### Manual setup
 								It is sufficient to copy the `ggml` or `gguf` model files in the `models` folder. You can refer to the model in the `model` parameter in the API calls.
-												feat: docs revamp (#7313)

* docs

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Small enhancements

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Enhancements

* Default to zen-dark

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fixups

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
											
										
										
											2025-11-19 21:21:20 +00:00
+								[You can optionally create an associated YAML]({{%relref "advanced" %}}) model config file to tune the model's parameters or apply a template to the prompt.
Prompt templates are useful for models that are fine-tuned towards a specific prompt format.

##### Automatic setup

LocalAI supports model galleries, which are indexes of models. For instance, the huggingface gallery contains a large curated index of models from the Hugging Face model hub for `ggml` or `gguf` models.

If you have the galleries enabled and LocalAI is already running, you can start chatting with models from Hugging Face by running:

```bash
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "TheBloke/WizardLM-13B-V1.2-GGML/wizardlm-13b-v1.2.ggmlv3.q2_K.bin",
  "messages": [{"role": "user", "content": "Say this is a test!"}],
  "temperature": 0.1
}'
```

LocalAI will automatically download and configure the model in the `models` directory.
Models can also be preloaded or downloaded on demand. To learn about model galleries, check out the [model gallery documentation]({{%relref "features/model-gallery" %}}).
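For clients that prefer a script over `curl`, the chat request above can be sketched in Python with only the standard library. This is a minimal sketch, assuming LocalAI listens on `http://localhost:8080` as in the examples on this page; the helper names `build_chat_request` and `send` are hypothetical and not part of LocalAI.

```python
import json
import urllib.request

# Assumption: same address as the curl examples on this page.
BASE_URL = "http://localhost:8080"


def build_chat_request(model, prompt, temperature=0.1, base_url=BASE_URL):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=600):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `send(build_chat_request("<model name>", "Say this is a test!"))` against a running instance should return the decoded JSON response. The generous timeout is a guess that allows for a first request triggering a model download.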
#### YAML configuration

To use the `llama.cpp` backend, specify `llama-cpp` as the backend in the YAML file:

```yaml
name: llama
backend: llama-cpp
parameters:
  # Relative to the models path
  model: file.gguf
```
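If you generate model configs from a script, the YAML above can be rendered with plain string formatting. The sketch below is illustrative only: the helper names are hypothetical, and saving the file as `<name>.yaml` inside the models folder is an assumption about your layout rather than a documented requirement.

```python
from pathlib import Path


def render_model_config(name, model_file, backend="llama-cpp"):
    """Return the minimal model YAML shown above as a string."""
    lines = [
        f"name: {name}",
        f"backend: {backend}",
        "parameters:",
        "  # Relative to the models path",
        f"  model: {model_file}",
    ]
    return "\n".join(lines) + "\n"


def write_model_config(models_dir, name, model_file):
    """Write the config next to the model files (assumed layout)."""
    path = Path(models_dir) / f"{name}.yaml"
    path.write_text(render_model_config(name, model_file), encoding="utf-8")
    return path
```

For example, `write_model_config("models", "llama", "file.gguf")` would create `models/llama.yaml` with the same content as the block above.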
#### Backend Options

The `llama.cpp` backend supports additional configuration options that can be specified in the `options` field of your model YAML configuration. These options tune the backend's behavior:

| Option | Type | Description | Example |
|--------|------|-------------|---------|
| `use_jinja` or `jinja` | boolean | Enable Jinja2 template processing for chat templates. When enabled, the backend formats messages with the Jinja2-based chat template shipped with the model. | `use_jinja:true` |
| `context_shift` | boolean | Enable context shifting, which allows the model to dynamically adjust context window usage. | `context_shift:true` |
| `cache_ram` | integer | Maximum RAM cache size in MiB for the KV cache. Use `-1` for unlimited (default). | `cache_ram:2048` |
| `parallel` or `n_parallel` | integer | Number of requests to process in parallel. A value greater than 1 enables continuous batching so that multiple requests are handled concurrently. | `parallel:4` |
| `grpc_servers` or `rpc_servers` | string | Comma-separated list of gRPC server addresses for distributed inference. Distributes the workload across multiple llama.cpp workers. | `grpc_servers:localhost:50051,localhost:50052` |
| `fit_params` or `fit` | boolean | Enable auto-adjustment of model/context parameters to fit available device memory. Default: `true`. | `fit_params:true` |
| `fit_params_target` or `fit_target` | integer | Target margin per device in MiB when using `fit_params`. Default: `1024` (1 GiB). | `fit_target:2048` |
| `fit_params_min_ctx` or `fit_ctx` | integer | Minimum context size that `fit_params` is allowed to set. Default: `4096`. | `fit_ctx:2048` |
| `n_cache_reuse` or `cache_reuse` | integer | Minimum chunk size to attempt reusing from the cache via KV shifting. Default: `0` (disabled). | `cache_reuse:256` |
| `slot_prompt_similarity` or `sps` | float | How closely the prompt of a request must match the prompt of a slot in order to use that slot. Default: `0.1`. Set to `0` to disable. | `sps:0.5` |
| `swa_full` | boolean | Use full-size SWA (Sliding Window Attention) cache. Default: `false`. | `swa_full:true` |
| `cont_batching` or `continuous_batching` | boolean | Enable continuous batching for handling multiple sequences. Default: `true`. | `cont_batching:true` |
| `check_tensors` | boolean | Validate tensor data for invalid values during model loading. Default: `false`. | `check_tensors:true` |
| `warmup` | boolean | Enable a warmup run after model loading. Default: `true`. | `warmup:false` |
| `no_op_offload` | boolean | Disable offloading host tensor operations to the device. Default: `false`. | `no_op_offload:true` |
| `kv_unified` or `unified_kv` | boolean | Enable unified KV cache. Default: `false`. | `kv_unified:true` |
| `n_ctx_checkpoints` or `ctx_checkpoints` | integer | Maximum number of context checkpoints per slot. Default: `8`. | `ctx_checkpoints:4` |
| `split_mode` or `sm` | string | How to split the model across multiple GPUs: `none` (single GPU only), `layer` (default; splits layers and KV across GPUs), `row` (splits rows across GPUs), or `tensor` (experimental tensor parallelism; requires `flash_attention: true`, no KV-cache quantization, a manually set `context_size`, and a llama.cpp build that includes [#19378](https://github.com/ggml-org/llama.cpp/pull/19378)). | `split_mode:tensor` |
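The prerequisites for `split_mode:tensor` are easy to miss, so a pre-flight check on a parsed model config can help. The following is a hedged sketch, not LocalAI code: `check_tensor_split` is a hypothetical helper, `config` is a plain dict mirroring the YAML keys named on this page, and treating only `f16` and `f32` as "not quantized" is an assumption.

```python
VALID_SPLIT_MODES = {"none", "layer", "row", "tensor"}

# Assumption: an unset cache type, f16, or f32 counts as "no KV-cache quantization".
UNQUANTIZED_CACHE_TYPES = {None, "f16", "f32"}


def check_tensor_split(split_mode, config):
    """Return a list of problems with the requested split mode (empty if fine)."""
    problems = []
    if split_mode not in VALID_SPLIT_MODES:
        problems.append(f"unknown split_mode {split_mode!r}")
    if split_mode == "tensor":
        if not config.get("flash_attention"):
            problems.append("tensor split requires flash_attention: true")
        if (config.get("cache_type_k") not in UNQUANTIZED_CACHE_TYPES
                or config.get("cache_type_v") not in UNQUANTIZED_CACHE_TYPES):
            problems.append("tensor split requires no KV-cache quantization")
        if not config.get("context_size"):
            problems.append("tensor split requires a manually set context_size")
    return problems
```

The check cannot verify that the llama.cpp build includes the required pull request; that still has to be confirmed separately.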
**Example configuration with options:**

```yaml
name: llama-model
backend: llama-cpp
parameters:
  model: model.gguf
options:
  - use_jinja:true
  - context_shift:true
  - cache_ram:4096
  - parallel:2
  - fit_params:true
  - fit_target:1024
  - slot_prompt_similarity:0.5
```

**Note:** The `parallel` option can also be set via the `LLAMACPP_PARALLEL` environment variable, and `grpc_servers` can be set via the `LLAMACPP_GRPC_SERVERS` environment variable. Options specified in the YAML file take precedence over environment variables.
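To make the `key:value` format and the precedence rule concrete, here is a small Python sketch of how such entries could be interpreted. It is not the backend's actual parser: the function names are hypothetical, and the choice of which spelling is treated as canonical for each alias pair is an assumption made for illustration.

```python
import os

# Alias -> canonical name, limited to three alias pairs from the table above.
ALIASES = {
    "jinja": "use_jinja",
    "n_parallel": "parallel",
    "rpc_servers": "grpc_servers",
}

# Environment fallbacks named in the note above.
ENV_FALLBACKS = {
    "parallel": "LLAMACPP_PARALLEL",
    "grpc_servers": "LLAMACPP_GRPC_SERVERS",
}


def parse_options(entries):
    """Turn ["key:value", ...] into a dict keyed by canonical option name."""
    parsed = {}
    for entry in entries:
        # Split on the first colon only: values such as
        # "localhost:50051,localhost:50052" contain colons themselves.
        key, _, value = entry.partition(":")
        key = key.strip()
        parsed[ALIASES.get(key, key)] = value.strip()
    return parsed


def resolve(name, options, environ=None):
    """Return the YAML value if present, otherwise the environment fallback."""
    environ = os.environ if environ is None else environ
    if name in options:
        return options[name]
    env_name = ENV_FALLBACKS.get(name)
    return environ.get(env_name) if env_name else None
```

With `options = parse_options(["parallel:2"])`, `resolve("parallel", options)` yields the YAML value even if `LLAMACPP_PARALLEL` is set, which mirrors the precedence described in the note.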
#### Reference

- [llama.cpp](https://github.com/ggerganov/llama.cpp)
### ik_llama.cpp

[ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp) is a hard fork of `llama.cpp` by Iwan Kawrakow that focuses on CPU and hybrid GPU/CPU performance. It ships additional quantization types (IQK quants), custom quantization mixes, Multi-head Latent Attention (MLA) for DeepSeek models, and fine-grained tensor offload controls, which makes it particularly useful for running very large models on commodity CPU hardware.

{{% notice note %}}
The `ik-llama-cpp` backend requires a CPU with **AVX2** support. The IQK kernels are not compatible with older CPUs.
{{% /notice %}}

#### Features

The `ik-llama-cpp` backend supports the following features:

- [📖 Text generation (GPT)]({{%relref "features/text-generation" %}})
- [🧠 Embeddings]({{%relref "features/embeddings" %}})
- IQK quantization types for better CPU inference performance
- Multimodal models (via clip/llava)

#### Setup

The backend is distributed as a separate container image and can be installed from the LocalAI backend gallery, or specified directly in a model configuration. GGUF models loaded with this backend benefit from ik_llama.cpp's optimized CPU kernels, which is especially useful for MoE models and large quantized models that would otherwise be GPU-bound.

#### YAML configuration

To use the `ik-llama-cpp` backend, specify it as the backend in the YAML file:

```yaml
name: my-model
backend: ik-llama-cpp
parameters:
  # Relative to the models path
  model: file.gguf
```

The aliases `ik-llama` and `ik_llama` are also accepted.

#### Reference

- [ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp)
### turboquant (llama.cpp fork with TurboQuant KV-cache)

[llama-cpp-turboquant](https://github.com/TheTom/llama-cpp-turboquant) is a `llama.cpp` fork that adds the **TurboQuant KV-cache** quantization scheme. It reuses the upstream `llama.cpp` codebase and ships as a drop-in alternative backend inside LocalAI, sharing the same gRPC server sources as the stock `llama-cpp` backend, so any GGUF model that runs on `llama-cpp` also runs on `turboquant`.

Pick `turboquant` when you want **lower KV-cache memory pressure** (longer contexts on the same VRAM) or want to experiment with the fork's quantized KV representations on top of the standard `cache_type_k` / `cache_type_v` knobs already supported by upstream `llama.cpp`.

#### Features

- Drop-in GGUF compatibility with upstream `llama.cpp`.
- TurboQuant KV-cache quantization (see the fork README for the current set of accepted `cache_type_k` / `cache_type_v` values).
- Same feature surface as the `llama-cpp` backend: text generation, embeddings, tool calls, multimodal via mmproj.
- Available on CPU (AVX/AVX2/AVX512/fallback), NVIDIA CUDA 12/13, AMD ROCm/HIP, Intel SYCL f32/f16, Vulkan, and NVIDIA L4T.

#### Setup

`turboquant` ships as a separate container image in the LocalAI backend gallery. Install it like any other backend:

```bash
local-ai backends install turboquant
```

Or pick a specific flavor for your hardware (example tags: `cpu-turboquant`, `cuda12-turboquant`, `cuda13-turboquant`, `rocm-turboquant`, `intel-sycl-f16-turboquant`, `vulkan-turboquant`).

#### YAML configuration

To run a model with `turboquant`, set the backend in your model YAML and optionally pick quantized KV-cache types:

```yaml
name: my-model
backend: turboquant
parameters:
  # Relative to the models path
  model: file.gguf
# Use TurboQuant's own KV-cache quantization schemes. The fork accepts
# the standard llama.cpp types (f16, f32, q8_0, q4_0, q4_1, q5_0, q5_1)
# and adds three TurboQuant-specific ones: turbo2, turbo3, turbo4.
# turbo3 / turbo4 auto-enable flash_attention (required for turbo K/V)
# and offer progressively more aggressive compression.
cache_type_k: turbo3
cache_type_v: turbo3
context_size: 8192
```

The `cache_type_k` / `cache_type_v` fields map to llama.cpp's `-ctk` / `-ctv` flags. The stock `llama-cpp` backend only accepts the standard llama.cpp types. To use `turbo2` / `turbo3` / `turbo4` you need this `turboquant` backend, which is where the fork's TurboQuant code paths actually take effect. Pick `q8_0` here and you're just running stock llama.cpp KV quantization; pick `turbo*` and you're running TurboQuant.
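The rule in the paragraph above can be captured in a few lines of Python. This is an illustrative sketch only: the type sets are copied from the YAML comment on this page and may not match what a given build accepts, and the helper names are hypothetical.

```python
# Standard llama.cpp KV-cache types, as listed in the YAML comment above.
STANDARD_CACHE_TYPES = {"f16", "f32", "q8_0", "q4_0", "q4_1", "q5_0", "q5_1"}
# Types that, per this page, only the turboquant backend accepts.
TURBO_CACHE_TYPES = {"turbo2", "turbo3", "turbo4"}


def accepted_cache_types(backend):
    """Return the cache types this page documents for the given backend."""
    if backend == "turboquant":
        return STANDARD_CACHE_TYPES | TURBO_CACHE_TYPES
    return set(STANDARD_CACHE_TYPES)


def uses_turboquant_path(backend, cache_type_k, cache_type_v):
    """True only when the fork's TurboQuant KV-cache path would be exercised."""
    allowed = accepted_cache_types(backend)
    for value in (cache_type_k, cache_type_v):
        if value not in allowed:
            raise ValueError(f"{backend} does not accept cache type {value!r}")
    return backend == "turboquant" and (
        cache_type_k in TURBO_CACHE_TYPES or cache_type_v in TURBO_CACHE_TYPES
    )
```

For instance, `uses_turboquant_path("turboquant", "q8_0", "q8_0")` is `False`: the configuration is valid, but it only runs the standard llama.cpp KV quantization.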
#### Reference

- [llama-cpp-turboquant](https://github.com/TheTom/llama-cpp-turboquant)
- [Tracked branch: `feature/turboquant-kv-cache`](https://github.com/TheTom/llama-cpp-turboquant/tree/feature/turboquant-kv-cache)
### vLLM

[vLLM](https://github.com/vllm-project/vllm) is a fast and easy-to-use library for LLM inference.

LocalAI has a built-in integration with vLLM and can use it to run models. For `vllm` performance figures, see the upstream [performance section](https://github.com/vllm-project/vllm#performance).

#### Setup

Create a YAML file for the model you want to use with `vllm`. To set up a model, you only need to specify the model name in the config file:

```yaml
name: vllm
backend: vllm
parameters:
    model: "facebook/opt-125m"
```

The backend automatically downloads the files required to run the model.

#### Usage

Use the `completions` endpoint, passing the `name` from the YAML config (`vllm` here) as the `model`:

```bash
curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
   "model": "vllm",
   "prompt": "Hello, my name is",
   "temperature": 0.1, "top_p": 0.1
 }'
```
#### Passing arbitrary vLLM options with `engine_args`

A subset of `AsyncEngineArgs` is exposed as typed YAML fields
(`tensor_parallel_size`, `gpu_memory_utilization`, `quantization`,
`max_model_len`, `dtype`, `trust_remote_code`, `enforce_eager`, …).
Anything else can be passed through the generic `engine_args:` map.
Keys are forwarded verbatim to vLLM's engine; unknown keys fail at load
time with the closest valid name as a hint. Nested maps materialise
into vLLM's nested config dataclasses (`SpeculativeConfig`,
`KVTransferConfig`, `CompilationConfig`, …).

Speculative decoding (DFlash, ngram, eagle, deepseek_mtp, …) is
configured this way:

```yaml
name: qwen3.5-4b-dflash
backend: vllm
parameters:
  model: Qwen/Qwen3.5-4B
context_size: 8192
max_model_len: 8192
trust_remote_code: true
quantization: fp8
template:
  use_tokenizer_template: true
engine_args:
  speculative_config:
    method: dflash
    model: z-lab/Qwen3.5-4B-DFlash
    num_speculative_tokens: 15
```

The shape of `speculative_config` follows vLLM's
[`SpeculativeConfig`](https://docs.vllm.ai/en/latest/api/vllm/config/speculative.html):
`method` picks the algorithm, and the remaining keys are method-specific.
Drafters from [z-lab](https://huggingface.co/z-lab) are paired with
specific target models; pick the one that matches your target. The
drafter loads in its native precision regardless of the target's
`quantization:` setting.
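
The same map can carry other speculative-decoding methods. As a sketch, an `ngram` setup (prompt lookup, which needs no separate drafter model) might look like the following. The model name `qwen3.5-4b-ngram` and the numeric values are illustrative assumptions, and `prompt_lookup_max` is taken from vLLM's `SpeculativeConfig`; check the reference linked above for the keys your vLLM version accepts.

```yaml
# Hypothetical model name, for illustration only
name: qwen3.5-4b-ngram
backend: vllm
parameters:
  model: Qwen/Qwen3.5-4B
context_size: 8192
max_model_len: 8192
template:
  use_tokenizer_template: true
engine_args:
  speculative_config:
    method: ngram
    # Illustrative values, not tuned recommendations
    num_speculative_tokens: 5
    prompt_lookup_max: 4
```
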
As another example, you can pick a non-default attention backend (e.g. on
hardware where the default cutlass kernels aren't supported):

```yaml
engine_args:
  attention_backend: TRITON_ATTN
```
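
Typed fields and `engine_args` can also live side by side in one model file. The sketch below is illustrative only: the model name and numeric values are placeholders, and `max_num_seqs` and `enable_prefix_caching` are vLLM engine arguments passed through the generic map. According to the change that introduced `engine_args`, the map takes precedence when a setting is given both ways, and LocalAI defaults `enable_prefix_caching` and `enable_chunked_prefill` to true unless they are set explicitly.

```yaml
# Hypothetical model name, for illustration only
name: my-vllm-model
backend: vllm
parameters:
  model: "facebook/opt-125m"
# Typed fields (illustrative values)
gpu_memory_utilization: 0.8
max_model_len: 2048
# Generic pass-through to vLLM's AsyncEngineArgs
engine_args:
  max_num_seqs: 64
  # Opt out of the prefix-caching default
  enable_prefix_caching: false
```
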
#### Multi-node data parallelism

`engine_args.data_parallel_size > 1` combined with the
`local-ai p2p-worker vllm` follower lets a single model span multiple
GPU nodes. See [vLLM Multi-Node (Data-Parallel)]({{% relref
"features/distributed-mode#vllm-multi-node-data-parallel" %}})
 								for the head/follower configuration and a worked Kimi-K2.6 example.
 								### SGLang
 								[SGLang](https://github.com/sgl-project/sglang) is a fast serving
 								framework for LLMs and VLMs with a focus on prefix caching, speculative
 								decoding, and multi-modal generation. LocalAI ships a gRPC backend that
 								wraps SGLang's async `Engine`, including its native function-call and
 								reasoning parsers.
 								#### Setup
 								```yaml
 								name: sglang
 								backend: sglang
 								parameters:
 								  model: "Qwen/Qwen3-4B"
 								template:
 								  use_tokenizer_template: true
 								```
 								The backend will pull the model from HuggingFace on first load.
 								#### Passing arbitrary SGLang options with `engine_args`
 								The same `engine_args:` map that the vLLM backend accepts is also
 								honoured by the SGLang backend. Keys are validated against
 								[`ServerArgs`](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/server_args.py)
 								— SGLang's central configuration dataclass — and forwarded verbatim to
 								`Engine(**kwargs)`. Unknown keys fail at load time with the closest
 								valid name as a hint. Unlike vLLM, `ServerArgs` is flat: speculative
 								decoding fields are top-level (`speculative_algorithm`,
 								`speculative_draft_model_path`, etc.) rather than nested under a
 								`speculative_config:` dict.
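The load-time check described above can be sketched in a few lines of Python. This is only an illustration, not the backend's actual code: `FakeServerArgs` is a hypothetical stand-in for SGLang's `ServerArgs` with three fields, and `validate_engine_args` is an assumed helper name.

```python
import dataclasses
import difflib
from typing import Optional


@dataclasses.dataclass
class FakeServerArgs:
    # Hypothetical stand-in for sglang's ServerArgs; the real dataclass has far more fields.
    model_path: str = ""
    mem_fraction_static: float = 0.85
    speculative_algorithm: Optional[str] = None


def validate_engine_args(engine_args: dict) -> dict:
    """Reject keys that are not dataclass fields, hinting at the closest valid name."""
    valid = [f.name for f in dataclasses.fields(FakeServerArgs)]
    for key in engine_args:
        if key not in valid:
            hint = difflib.get_close_matches(key, valid, n=1)
            suffix = f" (did you mean {hint[0]!r}?)" if hint else ""
            raise ValueError(f"unknown engine_args key {key!r}{suffix}")
    return dict(engine_args)
```

With this sketch, a misspelt key such as `speculative_algoritm` raises a `ValueError` whose message suggests `speculative_algorithm`, while valid keys pass through unchanged.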
 								The typed YAML fields shared with vLLM are mapped to their SGLang
 								equivalents (`gpu_memory_utilization` → `mem_fraction_static`,
 								`enforce_eager` → `disable_cuda_graph`, `tensor_parallel_size` →
 								`tp_size`, `max_model_len` → `context_length`). Anything else,
 								including all speculative-decoding flags, goes under `engine_args:`.
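The field mapping above could be expressed roughly as follows. This is a sketch under assumptions: `build_engine_kwargs` is a hypothetical helper, values are copied through unchanged, and entries in `engine_args:` are assumed to take precedence over the typed fields when both set the same SGLang option.

```python
# Typed YAML fields and their SGLang names, as listed in the paragraph above.
TYPED_TO_SGLANG = {
    "gpu_memory_utilization": "mem_fraction_static",
    "enforce_eager": "disable_cuda_graph",
    "tensor_parallel_size": "tp_size",
    "max_model_len": "context_length",
}


def build_engine_kwargs(typed_fields: dict, engine_args: dict) -> dict:
    """Rename typed fields to SGLang names, then apply engine_args on top."""
    kwargs = {
        TYPED_TO_SGLANG[name]: value
        for name, value in typed_fields.items()
        if name in TYPED_TO_SGLANG
    }
    # Assumed precedence: an explicit engine_args entry overrides the typed field.
    kwargs.update(engine_args)
    return kwargs
```

For example, a typed `gpu_memory_utilization: 0.9` combined with `engine_args: {mem_fraction_static: 0.7}` would yield `mem_fraction_static=0.7` under this assumed precedence.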
 								##### Speculative decoding: Gemma 4 with Multi-Token Prediction
 								Google publishes paired "assistant" drafters for every Gemma 4 size.
 								The drafters use Multi-Token Prediction (MTP) to propose several
 								candidate tokens per target step, which SGLang then verifies in
 								parallel. Flags below are transcribed verbatim from the
 								[SGLang Gemma 4 cookbook](https://docs.sglang.io/cookbook/autoregressive/Google/Gemma4#speculative-decoding-mtp-server-commands).
 								For consumer GPUs in the 16–24 GB range, use **E4B** (8 B total /
 								4 B effective parameters):
 								```yaml
 								name: gemma-4-e4b-mtp
 								backend: sglang
 								parameters:
 								  model: google/gemma-4-E4B-it
 								context_size: 4096
 								template:
 								  use_tokenizer_template: true
 								options:
 								  - tool_parser:gemma4
 								  - reasoning_parser:gemma4
 								engine_args:
 								  mem_fraction_static: 0.85
 								  speculative_algorithm: NEXTN
 								  speculative_draft_model_path: google/gemma-4-E4B-it-assistant
 								  speculative_num_steps: 5
 								  speculative_num_draft_tokens: 6
 								  speculative_eagle_topk: 1
 								```
 								For smaller cards (8–12 GB), drop to **E2B** (5 B total / 2 B effective)
 								by swapping the model paths to `google/gemma-4-E2B-it` and
 								`google/gemma-4-E2B-it-assistant`; the rest of the flags stay the same.
 								`NEXTN` is normalised to `EAGLE` inside `ServerArgs.__post_init__`, so
 								either value works — the cookbook uses `NEXTN`. `mem_fraction_static`
 								is the share of GPU memory SGLang reserves for the model + KV pool;
 								0.85 is the cookbook's default and adapts to whatever single GPU the
 								backend is running on.
 								The 31 B dense and 26 B-A4B MoE Gemma 4 variants exist in the same
 								cookbook but require `--tp-size 2`, so they're not in the gallery as
 								single-GPU recipes.
 								> **SGLang version requirement.** Gemma 4 support landed in SGLang via
 								> [PR #21952](https://github.com/sgl-project/sglang/pull/21952). The
 								> LocalAI sglang backend pins a release that includes it; if you've
 								> overridden the pin to an older version, this recipe will fail with a
 								> "model architecture not recognised" error at load time.
 								##### Other speculative algorithms
 								`speculative_algorithm:` also accepts `EAGLE`/`EAGLE3` (paired with an
 								EAGLE-style draft head), `DFLASH` (block-diffusion drafters from
 								[z-lab](https://huggingface.co/z-lab) for the Qwen3 family), `STANDALONE`
 								(a smaller draft LLM proposing tokens that a larger target verifies), and `NGRAM` (no draft
 								model, pure prefix-history speculation). See SGLang's
 								[speculative-decoding docs](https://docs.sglang.io/advanced_features/speculative_decoding.html)
 								for the full algorithm matrix.
 								#### Tool calling and reasoning parsers
 								SGLang's native parsers stream `tool_calls` and `reasoning_content`
 								inside `ChatDelta` — the LocalAI Python backend wires them up
 								per-request rather than via `engine_args:`. Pick a parser by name:
 								```yaml
 								options:
 								  - tool_parser:hermes
 								  - reasoning_parser:deepseek_r1
 								```
 								The full list of registered parsers lives in `sglang.srt.function_call`
 								and `sglang.srt.parser.reasoning_parser`.
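Each `options:` entry is a `key:value` string. A minimal sketch of how such entries might be split into a lookup table is shown here; `parse_options` is a hypothetical helper for illustration, not the backend's actual function.

```python
def parse_options(options: list[str]) -> dict[str, str]:
    """Split 'key:value' entries on the first colon; entries without a colon are skipped."""
    parsed = {}
    for entry in options:
        key, sep, value = entry.partition(":")
        if sep:
            parsed[key.strip()] = value.strip()
    return parsed
```

Splitting on the first colon only keeps any later colons inside the value intact.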
 								### Transformers
 								[Transformers](https://huggingface.co/docs/transformers/index) is a state-of-the-art machine learning library for PyTorch, TensorFlow, and JAX.
 								LocalAI has a built-in integration with Transformers and can use it to run models.
 								This is an extra backend. It is already available in the container images (the `extra` images already contain the Python dependencies for Transformers), so no setup is required.
 								#### Setup
 								Create a YAML file for the model you want to use with `transformers`.
 								To set up a model, specify the model name in the YAML config file:
 								```yaml
 								name: transformers
 								backend: transformers
 								parameters:
 								    model: "facebook/opt-125m"
 								type: AutoModelForCausalLM
 								quantization: bnb_4bit # One of: bnb_8bit, bnb_4bit, xpu_4bit, xpu_8bit (optional)
 								```
 								The backend will automatically download the required files in order to run the model.
 								#### Parameters
 								##### Type
 								| Type | Description |
 								| --- | --- |
 								| `AutoModelForCausalLM` | `AutoModelForCausalLM` is a model that can be used to generate sequences. Use it for NVIDIA CUDA and for Intel GPU with Intel Extension for PyTorch acceleration |
 								| `OVModelForCausalLM` | for Intel CPU/GPU/NPU OpenVINO Text Generation models |
 								| `OVModelForFeatureExtraction` | for Intel CPU/GPU/NPU OpenVINO Embedding acceleration |
 								| N/A | Defaults to `AutoModel` |
 								- `OVModelForCausalLM` requires OpenVINO IR [Text Generation](https://huggingface.co/models?library=openvino&pipeline_tag=text-generation) models from Hugging Face
 								- `OVModelForFeatureExtraction` works with any Safetensors Transformer [Feature Extraction](https://huggingface.co/models?pipeline_tag=feature-extraction&library=transformers,safetensors) model from Hugging Face (Embedding Model)
 								Please note that streaming is currently not implemented in `AutoModelForCausalLM` for Intel GPU.
 								AMD GPU support is not implemented.
 								Although AMD CPUs are not officially supported by OpenVINO, there are reports that they work: YMMV.
 								##### Embeddings
 								Use `embeddings: true` if the model is an embedding model.
 								##### Inference device selection
 								The Transformers backend tries to select the best device for inference automatically; you can override its choice with the `main_gpu` parameter.
 								| Inference Engine | Applicable Values |
 								| --- | --- |
 								| CUDA | `cuda`, `cuda.X` where X is the GPU device like in `nvidia-smi -L` output |
 								| OpenVINO | Any applicable value from [Inference Modes](https://docs.openvino.ai/2024/openvino-workflow/running-inference/inference-devices-and-modes.html) like `AUTO`,`CPU`,`GPU`,`NPU`,`MULTI`,`HETERO` |
 								Example for CUDA:
 								`main_gpu: cuda.0`
 								Example for OpenVINO:
 								`main_gpu: AUTO:-CPU`
 								This parameter applies to both Text Generation and Feature Extraction (i.e. Embeddings) models.
 								##### Inference Precision
 								The Transformers backend automatically selects the fastest applicable inference precision that the device supports.
 								On CUDA, you can manually enable *bfloat16*, if your hardware supports it, with the following parameter:
 								`f16: true`
 								##### Quantization
 								| Quantization | Description |
 								| --- | --- |
 								| `bnb_8bit` | 8-bit quantization |
 								| `bnb_4bit` | 4-bit quantization |
 								| `xpu_8bit` | 8-bit quantization for Intel XPUs |
 								| `xpu_4bit` | 4-bit quantization for Intel XPUs |
 								##### Trust Remote Code
 								Some models, such as Microsoft Phi-3, require external code beyond what the Transformers library provides.
 								Loading such code is disabled by default for security.
 								It can be manually enabled with:
 								`trust_remote_code: true`
 								##### Maximum Context Size
 								The maximum context size, in tokens, can be set with the `context_size` parameter. Do not use a value higher than your model supports.
 								Usage example:
 								`context_size: 8192`
 								##### Auto Prompt Template
 								The chat template is usually defined by the model author in the `tokenizer_config.json` file.
 								To enable it, set `use_tokenizer_template: true` in the `template` section.
 								Usage example:
 								```
 								template:
 								  use_tokenizer_template: true
 								```
 								##### Custom Stop Words
 								Stop words are usually defined in the `tokenizer_config.json` file.
 								They can be overridden with the `stopwords` parameter when needed, as with the llama3-Instruct model.
 								Usage example:
 								```
 								stopwords:
 								- "<|eot_id|>"
 								- "<|end_of_text|>"
 								```
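To illustrate what a stop word does to generated text, the sketch below cuts a string at the earliest stop word it contains. It is a conceptual illustration only: `truncate_at_stopword` is a hypothetical helper, and the backend's real stopping logic may operate on tokens during generation rather than on the finished string.

```python
def truncate_at_stopword(text: str, stopwords: list[str]) -> str:
    """Return text up to (not including) the earliest occurrence of any stop word."""
    cut = len(text)
    for word in stopwords:
        idx = text.find(word)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]
```

With the two stop words from the example above, everything from `<|eot_id|>` onwards would be dropped from the output.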
 								#### Usage
 								Use the `completions` endpoint by specifying the `transformers` model:
 								```
 								curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
 								   "model": "transformers",
 								   "prompt": "Hello, my name is",
 								   "temperature": 0.1, "top_p": 0.1
 								 }'
 								```
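The same request can be issued from Python with only the standard library. This is a minimal sketch that assumes LocalAI is listening on `http://localhost:8080` and that the response follows the OpenAI completions shape (`choices[0].text`); the function names are illustrative.

```python
import json
import urllib.request


def build_completion_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build the POST request for /v1/completions without sending it."""
    payload = {"model": model, "prompt": prompt, "temperature": 0.1, "top_p": 0.1}
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(base_url: str, model: str, prompt: str) -> str:
    """Send the request and return the generated text (assumes an OpenAI-style response)."""
    req = build_completion_request(base_url, model, prompt)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["text"]
```

Calling `complete("http://localhost:8080", "transformers", "Hello, my name is")` against a running instance would return the generated continuation.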
 								#### Examples
 								##### OpenVINO
 								A model configuration file for OpenVINO and the Starling model:
 								```yaml
 								name: starling-openvino
 								backend: transformers
 								parameters:
 								  model: fakezeta/Starling-LM-7B-beta-openvino-int8
 								context_size: 8192
 								threads: 6
 								f16: true
 								type: OVModelForCausalLM
 								stopwords:
 								- <|end_of_turn|>
 								- <|endoftext|>
 								prompt_cache_path: "cache"
 								prompt_cache_all: true
 								template:
 								  chat_message: |
 								    {{if eq .RoleName "system"}}{{.Content}}<|end_of_turn|>{{end}}{{if eq .RoleName "assistant"}}<|end_of_turn|>GPT4 Correct Assistant: {{.Content}}<|end_of_turn|>{{end}}{{if eq .RoleName "user"}}GPT4 Correct User: {{.Content}}{{end}}
 								  chat: |
 								    {{.Input}}<|end_of_turn|>GPT4 Correct Assistant:
 								  completion: |
 								    {{.Input}}
 								```