LocalAI/core/services
LocalAI [bot] ab01ed1a3e
fix(agentpool): close truncate-then-read race in agent_jobs.json persistence (#9811)
* fix(agentpool): close truncate-then-read race in agent_jobs.json persistence

Three call sites wrote and read agent_jobs.json (and agent_tasks.json)
through three independent mutexes:

  - AgentJobService.ExecuteJob spawns go saveJobs(job) -> fileJobPersister
    holding p.mu
  - AgentJobService.SaveJobsToFile holding service.fileMutex
  - AgentJobService.LoadJobsFromFile on a separate service instance holding
    a different service.fileMutex

Nothing serialized those three paths against each other, and both writers
used os.WriteFile, which opens the file with O_TRUNC. A reader landing between
the truncate and the write saw a zero-byte file, which surfaced as
`unexpected end of JSON input` at offset 0.
The macOS tests-apple job started hitting this consistently once the path
filter was removed from .github/workflows/test.yml and the file-mode race
test ran on every push (run 25823124797 was the first observed failure).

Two changes close the window:

1. fileJobPersister.saveTasksToFile / saveJobsToFile now write to a
   same-directory temp file and os.Rename to the final path. rename(2) is
   atomic on POSIX, so concurrent readers see either the prior contents or
   the new contents and never a zero-byte window. The helper calls Sync
   before Close, so a crash mid-write leaves the old file intact, with at
   most a stray temp file left behind (cleaned up on next save).

2. AgentJobService.{Load,Save}{Tasks,Jobs}{FromFile,ToFile} are collapsed
   to thin wrappers around fileJobPersister, removing the duplicate write
   path and the redundant service.fileMutex / service.tasksFile /
   service.jobsFile fields. Within a single service all task/job I/O now
   serializes on the persister's mutex; the atomic rename handles the
   cross-instance case the tests exercise.

Adds a regression test that hammers SaveJobsToFile and LoadJobsFromFile
concurrently for 500ms across two service instances on the same paths.
On master this reproduces `unexpected end of JSON input` on Linux within
~500ms; with the fix the suite was run with -until-it-fails for 30s
(54 attempts, all green).

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(agentpool): route service flush/load through JobPersister interface

The first cut of the race fix made AgentJobService.{Save,Load}{Tasks,Jobs}*
type-assert s.persister to *fileJobPersister so they could reach the
unexported saveTasksToFile / saveJobsToFile helpers. That defeats the
JobPersister interface: the service is back to reasoning about a concrete
implementation instead of an abstraction.

Promote the bulk-flush operations to the interface as FlushTasks / FlushJobs:

  - fileJobPersister.FlushTasks/FlushJobs call the existing private helpers
    (atomic temp+rename writes from the prior commit).
  - dbJobPersister.FlushTasks/FlushJobs are no-ops because SaveTask/SaveJob
    are already write-through to the database.

The service's four file-named methods now talk only to the interface:
LoadTasks/LoadJobs read through s.persister.LoadTasks/LoadJobs, and the
Save side calls FlushTasks/FlushJobs. The "FromFile"/"ToFile" suffixes
stay for backward compat with user_services.go and the existing tests,
but they no longer claim a file-only contract.
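
The resulting layering might look roughly like the sketch below. Only the
method names LoadJobs, SaveJob and FlushJobs come from this commit message;
the signatures, the Job type, both implementations and the service are
assumptions made for illustration, and the task-side methods are omitted.

```go
package main

import "fmt"

// Job is a placeholder type, not LocalAI's schema.
type Job struct{ ID string }

// JobPersister mirrors the method names in the commit message; the
// signatures here are assumed.
type JobPersister interface {
	SaveJob(j Job) error
	LoadJobs() ([]Job, error)
	FlushJobs(jobs []Job) error
}

// snapshotPersister stands in for a file-backed persister: FlushJobs
// replaces the whole stored snapshot in one step.
type snapshotPersister struct{ stored []Job }

func (p *snapshotPersister) SaveJob(j Job) error {
	p.stored = append(p.stored, j)
	return nil
}
func (p *snapshotPersister) LoadJobs() ([]Job, error) {
	return append([]Job(nil), p.stored...), nil
}
func (p *snapshotPersister) FlushJobs(jobs []Job) error {
	p.stored = append([]Job(nil), jobs...)
	return nil
}

// writeThroughPersister stands in for a database-backed persister: SaveJob
// already persists each row, so FlushJobs has nothing left to do.
type writeThroughPersister struct{ rows []Job }

func (p *writeThroughPersister) SaveJob(j Job) error {
	p.rows = append(p.rows, j)
	return nil
}
func (p *writeThroughPersister) LoadJobs() ([]Job, error) {
	return append([]Job(nil), p.rows...), nil
}
func (p *writeThroughPersister) FlushJobs([]Job) error { return nil }

// service depends only on the interface, never on a concrete persister.
type service struct {
	persister JobPersister
	jobs      []Job
}

func (s *service) AddJob(j Job) error {
	s.jobs = append(s.jobs, j)
	return s.persister.SaveJob(j)
}

// SaveJobsToFile keeps its historical name but only calls the interface.
func (s *service) SaveJobsToFile() error { return s.persister.FlushJobs(s.jobs) }

func (s *service) LoadJobsFromFile() error {
	jobs, err := s.persister.LoadJobs()
	if err != nil {
		return err
	}
	s.jobs = jobs
	return nil
}

func main() {
	for _, p := range []JobPersister{&snapshotPersister{}, &writeThroughPersister{}} {
		s := &service{persister: p}
		_ = s.AddJob(Job{ID: "job-1"})
		_ = s.SaveJobsToFile()
		_ = s.LoadJobsFromFile()
		fmt.Println(len(s.jobs))
	}
}
```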

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-13 23:58:43 +02:00
advisorylock feat(distributed): sync state with frontends, better backend management reporting (#9426) 2026-04-19 17:55:53 +02:00
agentpool fix(agentpool): close truncate-then-read race in agent_jobs.json persistence (#9811) 2026-05-13 23:58:43 +02:00
agents feat: add distributed mode (#9124) 2026-03-30 00:47:27 +02:00
dbutil feat: add distributed mode (#9124) 2026-03-30 00:47:27 +02:00
distributed feat: add distributed mode (#9124) 2026-03-30 00:47:27 +02:00
facerecognition feat(face-recognition): add insightface/onnx backend for 1:1 verify, 1:N identify, embedding, detection, analysis (#9480) 2026-04-22 21:55:41 +02:00
finetune chore: Security hardening (#9719) 2026-05-08 16:25:45 +02:00
galleryop feat(distributed): per-node backend installation from the gallery 2026-04-26 22:05:18 +00:00
jobs feat: add distributed mode (#9124) 2026-03-30 00:47:27 +02:00
mcp feat: add distributed mode (#9124) 2026-03-30 00:47:27 +02:00
messaging fix(distributed): split NATS backend.upgrade off install + dedup loads (#9717) 2026-05-08 16:24:54 +02:00
modeladmin feat(gallery): Speed up load times and clean gallery entries (#9211) 2026-05-06 14:51:38 +02:00
monitoring feat: add distributed mode (#9124) 2026-03-30 00:47:27 +02:00
nodes fix(distributed): cascade-clean stale node_models rows + filter routing by healthy status (#9754) 2026-05-13 21:57:50 +02:00
quantization feat: add distributed mode (#9124) 2026-03-30 00:47:27 +02:00
skills feat: add distributed mode (#9124) 2026-03-30 00:47:27 +02:00
storage feat: track files being staged (#9275) 2026-04-08 14:33:58 +02:00
testutil feat: add distributed mode (#9124) 2026-03-30 00:47:27 +02:00
voicerecognition feat: voice recognition (#9500) 2026-04-23 12:07:14 +02:00
worker fix(gallery): keep auto-upgrade off non-dev backends when -development is installed (#9736) 2026-05-09 18:20:00 +02:00