docs: add crash recovery and robustness to spec and plan

Atomic model downloads (.downloading suffix + rename), file-based
install lock (survives container restart), atomic JSON writes,
startup recovery sequence, frontend double-click prevention,
SSE fallback polling, disk space pre-checks.
ashim-hq 2026-04-17 17:43:33 +08:00
parent 08a7ffe403
commit 31424d4356
2 changed files with 1236 additions and 0 deletions


@@ -0,0 +1,712 @@
# On-Demand AI Feature Downloads Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Reduce Docker image from ~30 GB to ~5-6 GB by making AI features downloadable post-install via a UI-driven bundle system.
**Architecture:** Six feature bundles (Background Removal, Face Detection, Object Eraser & Colorize, Upscale & Enhance, Photo Restoration, OCR) are defined in a JSON manifest baked into the image. A Python install script handles pip + model downloads to a persistent volume. The backend exposes install/uninstall APIs with SSE progress. The frontend shows download badges on uninstalled tools and an install prompt on tool pages.
**Tech Stack:** Fastify (API), Zustand (frontend state), Python (install script), Docker (image restructuring), SSE (progress streaming)
**Spec:** `docs/superpowers/specs/2026-04-17-on-demand-ai-features-design.md`
---
## File Map
```
NEW FILES:
packages/shared/src/features.ts # Bundle definitions, tool-to-bundle map, types
docker/feature-manifest.json # Authoritative manifest baked into image
apps/api/src/lib/feature-status.ts # Reads manifest + installed.json, provides status
apps/api/src/routes/features.ts # GET /features, POST install/uninstall, GET disk-usage
packages/ai/python/install_feature.py # Python install script (pip + model downloads)
apps/web/src/stores/features-store.ts # Zustand store for bundle statuses
apps/web/src/components/features/feature-install-prompt.tsx # Install prompt card for tool pages
apps/web/src/components/settings/ai-features-section.tsx # Settings panel section
tests/unit/features.test.ts # Unit tests for feature logic
MODIFIED FILES:
packages/ai/src/bridge.ts # restartDispatcher(), FEATURE_NOT_INSTALLED handling
packages/ai/src/index.ts # Export restartDispatcher
packages/ai/python/dispatcher.py # Read installed.json, gate scripts by feature
packages/ai/python/colorize.py # Hard imports to lazy imports
packages/ai/python/restore.py # Hard imports to lazy imports
apps/api/src/index.ts # Register feature routes, startup venv check
apps/api/src/routes/tool-factory.ts # Feature-installed guard before process()
apps/api/src/routes/batch.ts # Feature-installed check at gating point
apps/api/src/routes/pipeline.ts # Feature-installed check in pre-validation
apps/api/src/routes/tools/restore-photo.ts # Feature-installed guard
apps/web/src/lib/api.ts # Extend parseApiError for FEATURE_NOT_INSTALLED
apps/web/src/components/common/tool-card.tsx # Download badge on uninstalled AI tools
apps/web/src/pages/tool-page.tsx # Feature check then install prompt or "not enabled"
apps/web/src/components/layout/tool-panel.tsx # Fetch features on mount
apps/web/src/pages/fullscreen-grid-page.tsx # Fetch features on mount
apps/web/src/components/settings/settings-dialog.tsx # Add AI Features nav item + section
docker/Dockerfile # Remove ML packages/models, keep base
docker/entrypoint.sh # Venv bootstrap, /data/ai/ setup
```
---
### Task 1: Shared Feature Types and Bundle Definitions
**Files:**
- Create: `packages/shared/src/features.ts`
- Modify: `packages/shared/src/index.ts`
- Test: `tests/unit/features.test.ts`
- [ ] **Step 1: Write the failing test for bundle definitions**
Create `tests/unit/features.test.ts`:
```ts
import { describe, expect, it } from "vitest";
import {
  FEATURE_BUNDLES,
  getBundleForTool,
  getToolsForBundle,
  TOOL_BUNDLE_MAP,
} from "@ashim/shared/features";
import { PYTHON_SIDECAR_TOOLS } from "@ashim/shared";

describe("Feature bundles", () => {
  it("every PYTHON_SIDECAR_TOOL maps to exactly one bundle", () => {
    for (const toolId of PYTHON_SIDECAR_TOOLS) {
      const bundle = getBundleForTool(toolId);
      expect(bundle, `${toolId} has no bundle`).toBeDefined();
    }
  });

  it("getBundleForTool returns null for non-AI tools", () => {
    expect(getBundleForTool("resize")).toBeNull();
    expect(getBundleForTool("crop")).toBeNull();
  });

  it("getToolsForBundle returns correct tools", () => {
    const tools = getToolsForBundle("background-removal");
    expect(tools).toContain("remove-background");
    expect(tools).toContain("passport-photo");
    expect(tools).not.toContain("upscale");
  });

  it("all 6 bundles are defined", () => {
    expect(Object.keys(FEATURE_BUNDLES)).toHaveLength(6);
    expect(FEATURE_BUNDLES["background-removal"]).toBeDefined();
    expect(FEATURE_BUNDLES["face-detection"]).toBeDefined();
    expect(FEATURE_BUNDLES["object-eraser-colorize"]).toBeDefined();
    expect(FEATURE_BUNDLES["upscale-enhance"]).toBeDefined();
    expect(FEATURE_BUNDLES["photo-restoration"]).toBeDefined();
    expect(FEATURE_BUNDLES["ocr"]).toBeDefined();
  });

  it("TOOL_BUNDLE_MAP covers all sidecar tools", () => {
    const mappedTools = Object.keys(TOOL_BUNDLE_MAP);
    for (const toolId of PYTHON_SIDECAR_TOOLS) {
      expect(mappedTools, `${toolId} missing from TOOL_BUNDLE_MAP`).toContain(toolId);
    }
  });
});
```
- [ ] **Step 2: Run test to verify it fails**
Run: `pnpm test:unit -- tests/unit/features.test.ts`
Expected: FAIL with module not found error.
- [ ] **Step 3: Create the feature definitions module**
Create `packages/shared/src/features.ts`:
```ts
export interface FeatureBundleInfo {
  id: string;
  name: string;
  description: string;
  estimatedSize: string;
  enablesTools: string[];
}

export type FeatureStatus = "not_installed" | "installing" | "installed" | "error";

export interface FeatureBundleState {
  id: string;
  name: string;
  description: string;
  status: FeatureStatus;
  installedVersion: string | null;
  estimatedSize: string;
  enablesTools: string[];
  progress: { percent: number; stage: string } | null;
  error: string | null;
}

export const FEATURE_BUNDLES: Record<string, FeatureBundleInfo> = {
  "background-removal": {
    id: "background-removal",
    name: "Background Removal",
    description: "Remove image backgrounds with AI",
    estimatedSize: "700 MB - 1 GB",
    enablesTools: ["remove-background", "passport-photo"],
  },
  "face-detection": {
    id: "face-detection",
    name: "Face Detection",
    description: "Detect and blur faces, fix red-eye, smart crop",
    estimatedSize: "200-300 MB",
    enablesTools: ["blur-faces", "red-eye-removal", "smart-crop"],
  },
  "object-eraser-colorize": {
    id: "object-eraser-colorize",
    name: "Object Eraser & Colorize",
    description: "Erase objects from photos and colorize B&W images",
    estimatedSize: "600-800 MB",
    enablesTools: ["erase-object", "colorize"],
  },
  "upscale-enhance": {
    id: "upscale-enhance",
    name: "Upscale & Enhance",
    description: "AI upscaling, face enhancement, and noise removal",
    estimatedSize: "4-5 GB",
    enablesTools: ["upscale", "enhance-faces", "noise-removal"],
  },
  "photo-restoration": {
    id: "photo-restoration",
    name: "Photo Restoration",
    description: "Restore old or damaged photos",
    estimatedSize: "800 MB - 1 GB",
    enablesTools: ["restore-photo"],
  },
  ocr: {
    id: "ocr",
    name: "OCR",
    description: "Extract text from images",
    estimatedSize: "3-4 GB",
    enablesTools: ["ocr"],
  },
};

export const TOOL_BUNDLE_MAP: Record<string, string> = {};
for (const [bundleId, bundle] of Object.entries(FEATURE_BUNDLES)) {
  for (const toolId of bundle.enablesTools) {
    TOOL_BUNDLE_MAP[toolId] = bundleId;
  }
}

export function getBundleForTool(toolId: string): FeatureBundleInfo | null {
  const bundleId = TOOL_BUNDLE_MAP[toolId];
  return bundleId ? FEATURE_BUNDLES[bundleId] : null;
}

export function getToolsForBundle(bundleId: string): string[] {
  return FEATURE_BUNDLES[bundleId]?.enablesTools ?? [];
}
```
- [ ] **Step 4: Export from shared package**
Add to the end of `packages/shared/src/index.ts`:
```ts
export * from "./features.js";
```
- [ ] **Step 5: Run test to verify it passes**
Run: `pnpm test:unit -- tests/unit/features.test.ts`
Expected: PASS, all 5 tests green.
- [ ] **Step 6: Commit**
```bash
git add packages/shared/src/features.ts packages/shared/src/index.ts tests/unit/features.test.ts
git commit -m "feat: add shared feature bundle definitions and tool-to-bundle mapping"
```
---
### Task 2: Feature Manifest File
**Files:**
- Create: `docker/feature-manifest.json`
- [ ] **Step 1: Create the feature manifest**
Create `docker/feature-manifest.json` containing the full bundle definitions with exact package versions, pip flags, platform-specific packages, and model download URLs. Source exact versions from the current Dockerfile (lines 167-206) and model URLs from `docker/download_models.py`.
Key details: amd64 uses `--extra-index-url https://download.pytorch.org/whl/cu126` for torch/realesrgan; amd64 uses `paddlepaddle-gpu>=3.2.1` from `https://www.paddlepaddle.org.cn/packages/stable/cu126/`; arm64 uses `mediapipe==0.10.18`; `codeformer-pip==0.0.4` needs `--no-deps`; `postInstall` re-pins `numpy==1.26.4`.
The file should contain a top-level `manifestVersion`, `imageVersion`, `pythonVersion`, `basePackages` array, and `bundles` object with all 6 bundles. Each bundle has `name`, `description`, `estimatedSize`, `packages` (with `common`/`amd64`/`arm64` arrays), `pipFlags`, `postInstall`, `models` array, and `enablesTools` array.
Model entries use either: `{ "id", "url", "path", "minSize" }` for direct downloads, `{ "id", "downloadFn": "rembg_session", "args": [...] }` for rembg models, or `{ "id", "downloadFn": "hf_snapshot", "args": [repo_id, local_subpath] }` for HuggingFace snapshots.
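To make the shape concrete, here is a skeleton of the top-level layout with one bundle partially filled in — every version, URL, and flag in angle brackets is a placeholder to be replaced with the exact values sourced from the Dockerfile and `docker/download_models.py`:

```json
{
  "manifestVersion": 1,
  "imageVersion": "<image-version>",
  "pythonVersion": "<python-version>",
  "basePackages": ["<pinned-base-package>"],
  "bundles": {
    "background-removal": {
      "name": "Background Removal",
      "description": "Remove image backgrounds with AI",
      "estimatedSize": "700 MB - 1 GB",
      "packages": {
        "common": ["<package==version>"],
        "amd64": ["<amd64-only-package==version>"],
        "arm64": ["<arm64-only-package==version>"]
      },
      "pipFlags": ["<e.g. --extra-index-url ...>"],
      "postInstall": ["<e.g. numpy re-pin>"],
      "models": [
        { "id": "<model-id>", "downloadFn": "rembg_session", "args": ["<model-name>"] },
        { "id": "<model-id>", "url": "<model-url>", "path": "<relative-path>", "minSize": 0 }
      ],
      "enablesTools": ["remove-background", "passport-photo"]
    }
  }
}
```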
- [ ] **Step 2: Commit**
```bash
git add docker/feature-manifest.json
git commit -m "feat: add feature manifest with all 6 bundle definitions"
```
---
### Task 3: Backend Feature Status Service
**Files:**
- Create: `apps/api/src/lib/feature-status.ts`
- [ ] **Step 1: Create the feature status service**
Create `apps/api/src/lib/feature-status.ts`. This module reads/writes `/data/ai/installed.json`, provides `isFeatureInstalled(bundleId)`, `isToolInstalled(toolId)`, `getFeatureStates()`, `markInstalled()`, `markUninstalled()`, `setInstallProgress()`, and `ensureAiDirs()`.
Uses `FEATURE_BUNDLES` and `TOOL_BUNDLE_MAP` from `@ashim/shared`. Caches `installed.json` in memory with `invalidateCache()` for refresh after install/uninstall. Detects Docker environment via `existsSync("/.dockerenv")`.
See spec section "Persistent Storage" for directory structure: `/data/ai/venv/`, `/data/ai/models/`, `/data/ai/pip-cache/`, `/data/ai/installed.json`.
**Robustness requirements for this module:**
- **Atomic JSON writes:** `markInstalled()` and `markUninstalled()` must write to `installed.json.tmp` first, then `renameSync()` to `installed.json`. Never write directly to `installed.json`.
- **Corrupt JSON recovery:** `readInstalled()` wraps `JSON.parse` in try/catch. If the file is corrupt, treat as empty `{ bundles: {} }` and log a warning.
- **File-based install lock:** Instead of just in-memory `installInProgress`, use `/data/ai/install.lock` file containing `{ bundleId, startedAt, pid }`. Create lock before install, delete on completion/failure. `getInstallingBundle()` reads from the lock file, not memory.
- **`recoverInterruptedInstalls()`** function called on startup:
1. Delete any `*.downloading` files in `/data/ai/models/` (recursive glob)
2. Delete `installed.json.tmp` if it exists
3. Delete `/data/ai/venv.bootstrapping/` if it exists
4. If `install.lock` exists: check if PID is alive (via `process.kill(pid, 0)` in try/catch). If dead, delete the lock and log a warning. If alive, leave it (install is still running from a previous container lifecycle — unlikely but possible with shared volumes).
5. For each bundle in `installed.json`, verify model files exist and meet `minSize` from the feature manifest. If any model is missing/undersized, set the bundle's error field to "Some model files are missing. Reinstall this feature." but do NOT remove from installed.json.
- **`acquireInstallLock(bundleId)`** and **`releaseInstallLock()`** functions that create/delete the lock file atomically.
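A minimal sketch of these rules, with paths passed in as parameters so it stays self-contained (the real module hard-codes the `/data/ai/` paths above; function names here are illustrative):

```ts
import { existsSync, readFileSync, renameSync, unlinkSync, writeFileSync } from "node:fs";

// Atomic write: the .tmp file lands fully, then rename() swaps it in.
// A crash mid-write leaves only a stale .tmp, never a truncated installed.json.
function writeJsonAtomic(path: string, data: unknown): void {
  const tmp = `${path}.tmp`;
  writeFileSync(tmp, JSON.stringify(data, null, 2));
  renameSync(tmp, path); // atomic on the same filesystem
}

// Corrupt-file recovery: a bad installed.json degrades to "nothing installed".
function readInstalled(path: string): { bundles: Record<string, unknown> } {
  if (!existsSync(path)) return { bundles: {} };
  try {
    return JSON.parse(readFileSync(path, "utf8"));
  } catch {
    console.warn(`Corrupt ${path}; treating as empty`);
    return { bundles: {} };
  }
}

// Stale-lock recovery: PID liveness probed via signal 0, per step 4 above.
// An unparseable lock file or a dead PID both clear the lock; a live PID leaves it.
function clearStaleLock(lockPath: string): void {
  if (!existsSync(lockPath)) return;
  try {
    const { pid } = JSON.parse(readFileSync(lockPath, "utf8"));
    process.kill(pid, 0); // throws if the PID is not alive
  } catch {
    console.warn("Removing stale install lock");
    unlinkSync(lockPath);
  }
}
```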
- [ ] **Step 2: Commit**
```bash
git add apps/api/src/lib/feature-status.ts
git commit -m "feat: add backend feature status service for tracking installed bundles"
```
---
### Task 4: Feature API Routes
**Files:**
- Create: `apps/api/src/routes/features.ts`
- Modify: `apps/api/src/index.ts`
- [ ] **Step 1: Create the features route file**
Create `apps/api/src/routes/features.ts` with 4 endpoints:
1. `GET /api/v1/features` (any authenticated user) — returns `{ bundles: FeatureBundleState[] }`. In non-Docker environments, returns all features as installed.
2. `POST /api/v1/admin/features/:bundleId/install` (admin only) — validates bundle exists, checks not already installed, checks no other install in progress (409). Spawns `install_feature.py` as child process via `spawn()`. Parses stderr JSON progress lines, updates progress via `updateSingleFileProgress()` from `progress.ts`. On success, calls `invalidateCache()` and `shutdownDispatcher()` (from `@ashim/ai`). Returns `{ jobId }`.
3. `POST /api/v1/admin/features/:bundleId/uninstall` (admin only) — removes model files listed in the manifest, calls `markUninstalled()`, calls `shutdownDispatcher()`. Returns `{ ok: true }`.
4. `GET /api/v1/admin/features/disk-usage` (admin only) — returns `{ totalBytes }` by recursively sizing `/data/ai/`.
Note: Use `spawn()` from `node:child_process` (not `exec()`) for the install script to avoid shell injection. Pass arguments as array elements.
**Robustness requirements for install endpoint:**
- Call `acquireInstallLock(bundleId)` before spawning the child process. If lock acquisition fails (lock file already exists with a live PID), return 409.
- Check available disk space before starting: `import { statfsSync } from "node:fs"; const stats = statfsSync("/data"); const freeBytes = stats.bfree * stats.bsize;`. Compare against a rough estimate for the bundle. If insufficient, return 400 with disk space info.
- On child process `close` event with code 0: call `releaseInstallLock()`, `invalidateCache()`, `shutdownDispatcher()`.
- On child process `close` event with non-zero code: call `releaseInstallLock()`, set error state. Do NOT leave the lock file behind.
- On child process `error` event (spawn failure): call `releaseInstallLock()`, return error.
- The install endpoint returns `{ jobId }` immediately. The child process runs asynchronously. The HTTP response does not block on completion.
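The disk-space gate and the lock-release wiring can be sketched as follows — illustrative helper names, with the lock-release callback injected rather than imported:

```ts
import { statfsSync } from "node:fs";
import { spawn } from "node:child_process";

// Coarse free-space gate before spawning the installer. `requiredBytes` is a
// rough estimate derived from the bundle's estimatedSize, not an exact figure.
function hasEnoughDisk(path: string, requiredBytes: number): boolean {
  const stats = statfsSync(path);
  return stats.bfree * stats.bsize >= requiredBytes;
}

// Lock hygiene: release on clean exit, on failure exit, and on spawn error,
// so the lock file is never left behind.
function runInstall(cmd: string, args: string[], releaseLock: () => void): void {
  const child = spawn(cmd, args); // argv array: no shell, no injection
  child.on("close", (code) => {
    releaseLock();
    if (code !== 0) {
      // non-zero exit: record the error state for the bundle here
    }
  });
  child.on("error", () => releaseLock()); // spawn failure (e.g. ENOENT)
}
```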
- [ ] **Step 2: Register feature routes in index.ts**
In `apps/api/src/index.ts`: import `registerFeatureRoutes`, call it after the settings routes registration. Also import and call `ensureAiDirs()` and `recoverInterruptedInstalls()` near the top of the startup sequence after `runMigrations()`.
- [ ] **Step 3: Commit**
```bash
git add apps/api/src/routes/features.ts apps/api/src/index.ts
git commit -m "feat: add feature install/uninstall API routes with SSE progress"
```
---
### Task 5: Python Install Script
**Files:**
- Create: `packages/ai/python/install_feature.py`
- [ ] **Step 1: Create the install script**
Create `packages/ai/python/install_feature.py`. Takes 3 CLI args: `bundleId`, `manifestPath`, `modelsDir`. Reads manifest JSON, detects architecture via `platform.machine()`, runs pip install for each package using `subprocess.run([sys.executable, "-m", "pip", "install", ...])`, downloads models with retry logic (exponential backoff, 3 retries, file size assertions).
Progress reported via stderr JSON lines: `{"progress": N, "stage": "..."}`. Result written to stdout JSON: `{"success": true, "bundleId": "...", "version": "...", "models": [...]}`.
Port the retry pattern from `docker/download_models.py` `_urlretrieve()` (lines 18-35). Handle rembg models via `rembg.new_session()` and HuggingFace models via `huggingface_hub.snapshot_download()`. Must be idempotent.
Writes to `/data/ai/installed.json` on success (matching the structure read by `feature-status.ts`).
**Robustness requirements for the install script:**
- **Atomic model downloads:** For each URL-based model:
1. Check if final path already exists and meets `minSize` — skip if so (idempotent)
2. Delete any existing `<path>.downloading` file (orphan from a previous failed attempt)
3. Download to `<path>.downloading`
4. Verify file size against `minSize`. If too small, delete and raise error.
5. `os.rename(<path>.downloading, <path>)` — atomic on same filesystem
6. Never leave a `.downloading` file behind on success
- **Atomic JSON writes:** When writing `installed.json`:
1. Write to `installed.json.tmp`
2. `os.rename()` to `installed.json`
- **Disk space pre-check:** Before starting, check available disk space via `shutil.disk_usage()`. If free space is less than estimated bundle size, exit with a clear error message.
- **pip failure recovery:** If `pip install` fails for one package, emit the error and exit. The packages that were already installed remain (pip is idempotent — re-running skips them). The admin can retry.
- **Model failure isolation:** If one model fails to download after retries, continue downloading other models. At the end, report which models failed. Exit with non-zero code so the bundle is NOT marked as installed. On retry, only the failed models need downloading (others pass the exists+size check).
- [ ] **Step 2: Commit**
```bash
git add packages/ai/python/install_feature.py
git commit -m "feat: add Python install script for feature bundles"
```
---
### Task 6: Tool Route Guards
**Files:**
- Modify: `apps/api/src/routes/tool-factory.ts`
- Modify: `apps/api/src/routes/batch.ts`
- Modify: `apps/api/src/routes/pipeline.ts`
- Modify: `apps/api/src/routes/tools/restore-photo.ts`
- [ ] **Step 1: Add feature guard to tool-factory.ts**
Import `isToolInstalled` from `../lib/feature-status.js` and `TOOL_BUNDLE_MAP`, `getBundleForTool` from `@ashim/shared`. Inside `createToolRoute`, after settings validation and before `config.process()`, add:
```ts
const bundleId = TOOL_BUNDLE_MAP[config.toolId];
if (bundleId && !isToolInstalled(config.toolId)) {
  const bundle = getBundleForTool(config.toolId);
  return reply.status(501).send({
    error: "Feature not installed",
    code: "FEATURE_NOT_INSTALLED",
    feature: bundleId,
    featureName: bundle?.name ?? bundleId,
    estimatedSize: bundle?.estimatedSize ?? "unknown",
  });
}
```
- [ ] **Step 2: Add feature guard to batch.ts**
Same imports. After `getToolConfig(toolId)` returns (around line 35-37), add the same guard returning 501 with `FEATURE_NOT_INSTALLED` code.
- [ ] **Step 3: Add feature guard to pipeline.ts**
Same imports. In both pre-validation loops (execute at lines 143-172, batch at lines 441-462), after successful `getToolConfig(resolvedToolId)`, add the guard. Return 501 with step number in the error message.
- [ ] **Step 4: Add feature guard to restore-photo.ts**
This tool uses its own route handler, not the factory. Import `isToolInstalled` and add the guard before `restorePhoto()` is called.
- [ ] **Step 5: Commit**
```bash
git add apps/api/src/routes/tool-factory.ts apps/api/src/routes/batch.ts apps/api/src/routes/pipeline.ts apps/api/src/routes/tools/restore-photo.ts
git commit -m "feat: add feature-installed guards to tool routes, batch, and pipeline"
```
---
### Task 7: Bridge and Python Sidecar Changes
**Files:**
- Modify: `packages/ai/python/dispatcher.py`
- Modify: `packages/ai/python/colorize.py`
- Modify: `packages/ai/python/restore.py`
- [ ] **Step 1: Add feature gating to dispatcher.py**
Add a `TOOL_BUNDLE_MAP` dict mapping Python script names (without `.py`) to bundle IDs:
- `remove_bg` -> `background-removal`
- `detect_faces`, `face_landmarks`, `red_eye_removal` -> `face-detection`
- `inpaint`, `colorize` -> `object-eraser-colorize`
- `upscale`, `enhance_faces`, `noise_removal` -> `upscale-enhance`
- `restore` -> `photo-restoration`
- `ocr` -> `ocr`
Add `_get_installed_bundles()` that reads `/data/ai/installed.json` and returns a set of installed bundle IDs.
In `_run_script_main()`, before the `exec()` call, check if the script's bundle is installed. If not, return a JSON error: `{"success": false, "error": "feature_not_installed", "feature": bundle_id, "message": "..."}`.
Also set `U2NET_HOME` to `/data/ai/models/rembg` on startup if `/data/ai/models` exists.
- [ ] **Step 2: Convert hard imports in colorize.py**
Move module-level `import numpy as np`, `import cv2`, `from PIL import Image` (lines 10-12) inside each function that uses them (`colorize_ddcolor`, `colorize_opencv`, `main`).
- [ ] **Step 3: Convert hard imports in restore.py**
Move module-level `import numpy as np`, `import cv2`, `from PIL import Image` (lines 13-15) inside each function that uses them.
- [ ] **Step 4: Commit**
```bash
git add packages/ai/python/dispatcher.py packages/ai/python/colorize.py packages/ai/python/restore.py
git commit -m "feat: add feature gating to Python dispatcher, convert hard imports to lazy"
```
---
### Task 8: Frontend Features Store and API Error Extension
**Files:**
- Create: `apps/web/src/stores/features-store.ts`
- Modify: `apps/web/src/lib/api.ts`
- Modify: `apps/web/src/hooks/use-tool-processor.ts`
- Modify: `apps/web/src/hooks/use-pipeline-processor.ts`
- [ ] **Step 1: Create the features store**
Create `apps/web/src/stores/features-store.ts` following the `settings-store.ts` pattern. Zustand store with `bundles: FeatureBundleState[]`, `loaded: boolean`, `fetch()` (one-shot), `refresh()` (force re-fetch), `isToolInstalled(toolId)`, `getBundleForTool(toolId)`. Fetches from `GET /api/v1/features`.
- [ ] **Step 2: Extend parseApiError for FEATURE_NOT_INSTALLED**
In `apps/web/src/lib/api.ts`, add a `FeatureNotInstalledError` interface export: `{ type: "feature_not_installed"; feature: string; featureName: string; estimatedSize: string }`.
Modify `parseApiError` return type to `string | FeatureNotInstalledError`. Add early return when `body.code === "FEATURE_NOT_INSTALLED"`.
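A sketch of the extended shape — the structured-error early return is per the spec above, while the plain-string fallback shown here is illustrative, not the app's real error parsing:

```ts
interface FeatureNotInstalledError {
  type: "feature_not_installed";
  feature: string;
  featureName: string;
  estimatedSize: string;
}

function parseApiError(body: any, status: number): string | FeatureNotInstalledError {
  // Early return: preserve the structured payload so callers can render
  // an install prompt instead of a generic error string.
  if (body?.code === "FEATURE_NOT_INSTALLED") {
    return {
      type: "feature_not_installed",
      feature: body.feature,
      featureName: body.featureName,
      estimatedSize: body.estimatedSize,
    };
  }
  return body?.error ?? `Request failed with status ${status}`;
}
```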
- [ ] **Step 3: Update use-tool-processor.ts and use-pipeline-processor.ts**
In both hooks, where `parseApiError` is called and passed to `setError()`, add a type check:
```ts
const parsed = parseApiError(body, xhr.status);
if (typeof parsed === "object" && parsed.type === "feature_not_installed") {
  setError(`Feature "${parsed.featureName}" is not installed. Enable it in Settings.`);
} else {
  setError(parsed);
}
```
- [ ] **Step 4: Commit**
```bash
git add apps/web/src/stores/features-store.ts apps/web/src/lib/api.ts apps/web/src/hooks/use-tool-processor.ts apps/web/src/hooks/use-pipeline-processor.ts
git commit -m "feat: add frontend features store and FEATURE_NOT_INSTALLED error handling"
```
---
### Task 9: Frontend Tool Grid Badge
**Files:**
- Modify: `apps/web/src/components/common/tool-card.tsx`
- Modify: `apps/web/src/components/layout/tool-panel.tsx`
- Modify: `apps/web/src/pages/fullscreen-grid-page.tsx`
- [ ] **Step 1: Add download badge to ToolCard**
Import `useFeaturesStore`, `PYTHON_SIDECAR_TOOLS`, and `Download` icon from lucide-react. Compute `showDownloadBadge` when the tool is an AI tool and not installed. Render a `<Download className="h-3.5 w-3.5 text-muted-foreground" />` icon after the experimental badge.
- [ ] **Step 2: Fetch features on app load**
In `tool-panel.tsx`, add `useFeaturesStore().fetch()` in a useEffect alongside the existing settings fetch. Do the same in `fullscreen-grid-page.tsx`.
- [ ] **Step 3: Commit**
```bash
git add apps/web/src/components/common/tool-card.tsx apps/web/src/components/layout/tool-panel.tsx apps/web/src/pages/fullscreen-grid-page.tsx
git commit -m "feat: add download badge to uninstalled AI tools in tool grid"
```
---
### Task 10: Frontend Tool Page Install Prompt
**Files:**
- Create: `apps/web/src/components/features/feature-install-prompt.tsx`
- Modify: `apps/web/src/pages/tool-page.tsx`
- [ ] **Step 1: Create the FeatureInstallPrompt component**
Props: `{ bundle: FeatureBundleState; isAdmin: boolean }`.
For non-admins: show centered Download icon + "Feature Not Enabled" heading + "Ask your administrator" text.
For admins: show Download icon + bundle name/description + "requires additional download (~{estimatedSize})" + [Enable Feature] button. On click: POST to install endpoint, open EventSource for SSE progress, show progress bar with stage text and percent. On completion: call `useFeaturesStore().refresh()` to trigger re-render. On error: show error message with retry option.
Use same Tailwind patterns as existing components: `bg-primary text-primary-foreground` for buttons, `Loader2 animate-spin` for loading, `text-destructive` for errors.
**Robustness requirements for the frontend:**
- **Double-click prevention:** Set `installing = true` immediately on first click (before the API call). The button must be `disabled={installing || bundle.status === "installing"}`. This prevents any re-click.
- **Browser close / navigate away:** The server-side install continues regardless. On component mount, check `bundle.status` from the features store. If it's `"installing"`, immediately show the progress bar and open EventSource for the in-progress job (fetch `jobId` from the features endpoint or use the bundle's progress data).
- **SSE connection loss fallback:** If EventSource fires `onerror`, close it and fall back to polling `GET /api/v1/features` every 3 seconds via `setInterval`. When status changes from `"installing"` to `"installed"` or `"error"`, stop polling and update UI.
- **Page refresh during install:** The features store's `fetch()` returns current status. If a bundle is `"installing"`, the component renders progress state immediately — no need for the user to click anything.
- **Multiple admin sessions:** All sessions see the same `"installing"` status from the shared `GET /api/v1/features` endpoint. The server's install lock prevents concurrent installs. Any session trying to install gets a 409.
- **Retry after error:** Show a "Retry" button when status is `"error"`. On retry, call the install endpoint again (the lock is released on failure, so this works). pip cache means previously-downloaded wheels aren't re-downloaded. Idempotent model downloads skip already-complete files.
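The SSE-loss fallback reduces to "poll until the bundle leaves `installing`". A minimal sketch, with `getStatus` standing in for a fetch of `GET /api/v1/features` (the real component uses `setInterval` and also clears it on unmount):

```ts
type FeatureStatus = "not_installed" | "installing" | "installed" | "error";

// Resolves with the settled status once the bundle is no longer "installing".
async function pollUntilSettled(
  getStatus: () => Promise<FeatureStatus>,
  intervalMs = 3000,
): Promise<FeatureStatus> {
  for (;;) {
    const status = await getStatus();
    if (status !== "installing") return status;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```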
- [ ] **Step 2: Integrate into ToolPage**
In `tool-page.tsx`: import `useFeaturesStore`, `PYTHON_SIDECAR_TOOLS`, `useAuth`, and `FeatureInstallPrompt`. After the tool/registryEntry lookup, compute `isAiTool`, `toolInstalled`, `featureBundle`, `isAdmin`. After the "Tool not found" guard, add a guard that renders `<FeatureInstallPrompt>` wrapped in `<AppLayout>` when the tool is AI and not installed.
- [ ] **Step 3: Commit**
```bash
git add apps/web/src/components/features/feature-install-prompt.tsx apps/web/src/pages/tool-page.tsx
git commit -m "feat: add feature install prompt on uninstalled AI tool pages"
```
---
### Task 11: Settings AI Features Section
**Files:**
- Create: `apps/web/src/components/settings/ai-features-section.tsx`
- Modify: `apps/web/src/components/settings/settings-dialog.tsx`
- [ ] **Step 1: Create AiFeaturesSection component**
Follow the card-based layout of existing sections in `settings-dialog.tsx`. Use `useFeaturesStore()`. Render each bundle as a bordered card (`rounded-lg border border-border`) with: name, description, status indicator (green dot = installed, gray = not installed, spinning = installing), estimated size, Install/Uninstall button. Add "Install All" button at top. Show total disk usage at bottom (fetch from `GET /api/v1/admin/features/disk-usage`). Reuse the toggle/button patterns from `ToolsSection`.
- [ ] **Step 2: Add section to settings-dialog.tsx**
Add `"ai-features"` to the `Section` type union. Add to `NAV_ITEMS` between `"api-keys"` and `"tools"`: `{ id: "ai-features", label: "AI Features", icon: Sparkles, requiredPermission: "settings:write" }`. Import `Sparkles` from lucide-react. Add `{section === "ai-features" && <AiFeaturesSection />}` to the conditional render block. Lazy-import `AiFeaturesSection` from `"./ai-features-section"`.
- [ ] **Step 3: Commit**
```bash
git add apps/web/src/components/settings/ai-features-section.tsx apps/web/src/components/settings/settings-dialog.tsx
git commit -m "feat: add AI Features settings panel for managing feature bundles"
```
---
### Task 12: Dockerfile Restructuring
**Files:**
- Modify: `docker/Dockerfile`
- Modify: `docker/entrypoint.sh`
- [ ] **Step 1: Modify the Dockerfile**
In `docker/Dockerfile` production stage:
1. **Keep**: base image selection, Node.js install, pnpm setup, system packages, Python venv creation with base packages (numpy, Pillow, opencv)
2. **Remove**: all ML pip install commands (lines 175-206: onnxruntime, rembg, realesrgan, paddlepaddle, mediapipe, codeformer)
3. **Remove**: download_models.py COPY and RUN (lines 219-231)
4. **Remove**: the `apt-get purge build-essential python3-dev` line (line 251) so build-essential stays for runtime pip installs
5. **Add**: `COPY docker/feature-manifest.json /app/docker/feature-manifest.json`
6. **Add**: `COPY packages/ai/python/install_feature.py /app/packages/ai/python/install_feature.py`
7. **Update** env vars: `PYTHON_VENV_PATH=/data/ai/venv`, add `MODELS_PATH=/data/ai/models`, add `DATA_DIR=/data`
- [ ] **Step 2: Update entrypoint.sh for venv bootstrap**
Add venv bootstrap after auth defaults and before volume permission fix. Use atomic directory rename to prevent corrupt venv from partial copy:
```sh
AI_VENV="/data/ai/venv"
AI_VENV_TMP="/data/ai/venv.bootstrapping"

# Clean up any interrupted bootstrap from a previous start
if [ -d "$AI_VENV_TMP" ]; then
  echo "Cleaning up interrupted venv bootstrap..."
  rm -rf "$AI_VENV_TMP"
fi

# Bootstrap AI venv from base image on first run
if [ ! -d "$AI_VENV" ] && [ -d "/opt/venv" ]; then
  echo "Bootstrapping AI venv from base image..."
  mkdir -p /data/ai/models /data/ai/pip-cache
  cp -r /opt/venv "$AI_VENV_TMP"
  mv "$AI_VENV_TMP" "$AI_VENV"
  echo "AI venv ready at $AI_VENV"
fi
```
The `cp -r` + `mv` pattern ensures `/data/ai/venv` is either fully present or absent — never half-copied. If the container is killed during `cp -r`, the `.bootstrapping` directory is cleaned up on next start.
- [ ] **Step 3: Build and verify**
```bash
docker build -f docker/Dockerfile -t ashim:dev .
docker images ashim:dev --format "{{.Size}}"
```
Expected: Image size ~5-6 GB (amd64) instead of ~30 GB.
- [ ] **Step 4: Commit**
```bash
git add docker/Dockerfile docker/entrypoint.sh
git commit -m "feat: restructure Dockerfile to remove ML packages and models
Base image now includes only Node.js + Sharp + Python with base deps.
AI features are downloaded on-demand via the feature install system.
Image reduced from ~30GB to ~5-6GB (amd64) / ~2-3GB (arm64)."
```
---
### Task 13: Integration Testing
**Files:**
- Create: `tests/e2e-docker/features.spec.ts`
- [ ] **Step 1: Create Docker e2e tests for feature system**
Create `tests/e2e-docker/features.spec.ts` using the existing `playwright.docker.config.ts` infrastructure:
```ts
import { expect, test } from "@playwright/test";
test.describe("On-demand AI features", () => {
  test("GET /api/v1/features returns all 6 bundles", async ({ request }) => {
    const response = await request.get("/api/v1/features");
    expect(response.ok()).toBeTruthy();
    const data = await response.json();
    expect(data.bundles).toHaveLength(6);
    for (const bundle of data.bundles) {
      expect(bundle).toHaveProperty("id");
      expect(bundle).toHaveProperty("name");
      expect(bundle).toHaveProperty("status");
      expect(bundle).toHaveProperty("enablesTools");
    }
  });

  test("AI tool returns 501 FEATURE_NOT_INSTALLED when bundle not installed", async ({ request }) => {
    const pngBuffer = Buffer.from(
      "iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAADUlEQVR42mP8/5+hHgAHggJ/PchI7wAAAABJRU5ErkJggg==",
      "base64",
    );
    const response = await request.post("/api/v1/tools/remove-background", {
      multipart: {
        file: { name: "test.png", mimeType: "image/png", buffer: pngBuffer },
        settings: JSON.stringify({}),
      },
    });
    expect(response.status()).toBe(501);
    const body = await response.json();
    expect(body.code).toBe("FEATURE_NOT_INSTALLED");
    expect(body.feature).toBe("background-removal");
  });

  test("uninstalled AI tool page shows install prompt for admin", async ({ page }) => {
    await page.goto("/remove-background");
    await expect(page.getByText("Enable")).toBeVisible({ timeout: 10000 });
    await expect(page.getByText("additional download")).toBeVisible();
  });
});
```
- [ ] **Step 2: Commit**
```bash
git add tests/e2e-docker/features.spec.ts
git commit -m "test: add e2e tests for on-demand AI feature system"
```
---
### Task Summary
| Task | Description | Key Files |
|------|------------|-----------|
| 1 | Shared types and bundle definitions | `packages/shared/src/features.ts` |
| 2 | Feature manifest JSON | `docker/feature-manifest.json` |
| 3 | Backend feature status service | `apps/api/src/lib/feature-status.ts` |
| 4 | Feature API routes | `apps/api/src/routes/features.ts` |
| 5 | Python install script | `packages/ai/python/install_feature.py` |
| 6 | Tool route guards | `tool-factory.ts`, `batch.ts`, `pipeline.ts` |
| 7 | Bridge + Python sidecar changes | `dispatcher.py`, `colorize.py`, `restore.py` |
| 8 | Frontend features store + error handling | `features-store.ts`, `api.ts` |
| 9 | Frontend tool grid badge | `tool-card.tsx`, `tool-panel.tsx` |
| 10 | Frontend tool page install prompt | `feature-install-prompt.tsx`, `tool-page.tsx` |
| 11 | Settings AI Features section | `ai-features-section.tsx`, `settings-dialog.tsx` |
| 12 | Dockerfile restructuring | `Dockerfile`, `entrypoint.sh` |
| 13 | Integration testing | `tests/e2e-docker/features.spec.ts` |

---
# On-Demand AI Feature Downloads
**Date:** 2026-04-17
**Status:** Approved
**Goal:** Reduce Docker image from ~30 GB to ~5-6 GB (amd64) / ~2-3 GB (arm64) by making AI features downloadable post-install.
## Problem
The Docker image bundles all Python ML packages (~8-10 GB) and model weights (~5-8 GB) regardless of whether users need AI features. Users who only want basic image tools (resize, crop, convert) must pull ~30 GB.
## Design Decisions
- **Single Docker image** — no lite/full variants
- **Individual feature bundles** — users cherry-pick by feature name, not model name
- **Admin-only downloads** — only admins can enable/disable AI features
- **AI tools visible with badge** — uninstalled tools appear in grid with a download indicator
- **Both tool-page and settings UI** — admins can download from the tool page or from a central management panel in settings
## Architecture
### Base Image Contents
The base image includes everything needed for non-AI tools plus the prerequisites for AI feature installation:
| Component | Rationale |
|-----------|-----------|
| Node.js 22 + pnpm + app source + frontend dist | Core application |
| Sharp, imagemagick, tesseract-ocr, potrace, libheif, exiftool | Non-AI image processing |
| caire binary | Content-aware resize |
| Python 3 + pip + build-essential | Required for pip install at runtime |
| numpy==1.26.4, Pillow, opencv-python-headless | Shared by all AI features, small (~300 MB) |
| CUDA runtime (amd64 only, from nvidia/cuda base) | Required for GPU-accelerated AI |
**Estimated size:** ~5-6 GB (amd64), ~2-3 GB (arm64)
### Feature Bundles
Six user-facing bundles, named by what they enable (not by model names). **Each tool belongs to exactly one bundle — no partial functionality.** When a bundle is installed, all its tools work fully. When it's not installed, those tools are locked entirely.
| Feature Name | Python Packages | Models | Tools Fully Enabled | Est. Size |
|---|---|---|---|---|
| **Background Removal** | rembg, onnxruntime(-gpu), mediapipe | birefnet-general-lite, blaze_face, face_landmarker | remove-background, passport-photo | ~700 MB - 1 GB |
| **Face Detection** | mediapipe | blaze_face, face_landmarker | blur-faces, red-eye-removal, smart-crop | ~200-300 MB |
| **Object Eraser & Colorize** | onnxruntime(-gpu) | LaMa ONNX, DDColor ONNX, OpenCV colorize | erase-object, colorize | ~600-800 MB |
| **Upscale & Enhance** | torch, torchvision, realesrgan, codeformer-pip (--no-deps), gfpgan, basicsr, lpips | RealESRGAN x4plus, GFPGANv1.3, CodeFormer (.pth), facexlib, SCUNet, NAFNet | upscale, enhance-faces, noise-removal | ~4-5 GB |
| **Photo Restoration** | onnxruntime(-gpu), mediapipe | LaMa ONNX, DDColor ONNX, CodeFormer ONNX, blaze_face, face_landmarker, OpenCV colorize | restore-photo | ~800 MB - 1 GB |
| **OCR** | paddlepaddle(-gpu), paddleocr | PP-OCRv5 (7 models), PaddleOCR-VL 1.5 | ocr | ~3-4 GB |
Notes:
- `passport-photo` is in the Background Removal bundle because it primarily needs rembg; mediapipe (for face landmarks) is included in the same bundle so the tool works fully
- `noise-removal` is in the Upscale & Enhance bundle because its quality/maximum tiers need PyTorch; all 4 tiers (including OpenCV-based quick/balanced) are locked until the bundle is installed
- `ocr` is fully locked until the OCR bundle is installed, including the Tesseract-based fast tier — this keeps the UX clean even though Tesseract is pre-installed in the base image
- `restore-photo` is its own bundle because it needs models from multiple domains (inpainting, face enhancement, colorization); all stages work when installed
- Some packages appear in multiple bundles (e.g., mediapipe in Background Removal, Face Detection, and Photo Restoration; onnxruntime in Background Removal, Object Eraser, and Photo Restoration). The install script skips already-installed packages — pip handles this naturally
- Some models appear in multiple bundles (e.g., blaze_face in both Background Removal and Face Detection). The install script skips already-downloaded model files
### Bundle Dependencies
```
Background Removal ───── standalone
Face Detection ────────── standalone
Object Eraser & Colorize ── standalone
Upscale & Enhance ─────── standalone
Photo Restoration ─────── standalone
OCR ───────────────────── standalone
```
All bundles are independently installable. Shared packages (mediapipe, onnxruntime) and shared models (blaze_face, LaMa, etc.) are silently skipped if already present from another bundle.
### Single Venv Strategy
The current architecture uses a single venv at `/opt/venv` (set via `PYTHON_VENV_PATH`). The bridge (`bridge.ts`) constructs `${venvPath}/bin/python3` — it can only point to one interpreter. Having two venvs (base at `/opt/venv`, features at `/data/ai/venv/`) is fragile: C extensions and entry points reference their venv prefix, and `PYTHONPATH` hacks break in practice.
**Solution:** Use a single venv on the persistent volume at `/data/ai/venv/`.
- The Dockerfile creates `/opt/venv` with base packages (numpy, Pillow, opencv) as before
- The entrypoint script bootstraps `/data/ai/venv/` on first run by copying `/opt/venv` into it (fast file copy, ~300 MB)
- `PYTHON_VENV_PATH` is set to `/data/ai/venv/` so the bridge uses it
- Feature installs add packages to this same venv
- On container update, the entrypoint checks if base package versions changed and updates the venv accordingly (pip install from wheel cache)
This gives us one venv with all packages, living on a persistent volume, bootstrapped from the image's base packages.
### Persistent Storage
All AI data lives under `/data/ai/` on the existing Docker volume (no docker-compose changes):
```
/data/ai/
venv/ # Single Python virtual environment (bootstrapped from /opt/venv, extended by feature installs)
models/ # Downloaded model weight files (same structure as /opt/models/)
pip-cache/ # Wheel cache for fast re-installs after updates
installed.json # Tracks installed bundles, versions, timestamps
```
### Feature Manifest
A `feature-manifest.json` file is baked into each Docker image at build time. It is the single source of truth for what each bundle installs:
```json
{
"manifestVersion": 1,
"imageVersion": "1.16.0",
"pythonVersion": "3.12",
"basePackages": ["numpy==1.26.4", "Pillow==11.1.0", "opencv-python-headless==4.10.0.84"],
"bundles": {
"background-removal": {
"name": "Background Removal",
"description": "Remove image backgrounds with AI",
"packages": {
"common": ["rembg==2.0.62"],
"amd64": ["onnxruntime-gpu==1.20.1", "mediapipe==0.10.21"],
"arm64": ["onnxruntime==1.20.1", "rembg[cpu]==2.0.62", "mediapipe==0.10.18"]
},
"pipFlags": {},
"models": [
{
"id": "birefnet-general-lite",
"downloadFn": "rembg_session",
"args": ["birefnet-general-lite"]
},
{
"id": "blaze-face-short-range",
"url": "https://storage.googleapis.com/mediapipe-models/face_detector/blaze_face_short_range/float16/latest/blaze_face_short_range.tflite",
"path": "mediapipe/blaze_face_short_range.tflite",
"minSize": 100000
},
{
"id": "face-landmarker",
"url": "https://storage.googleapis.com/mediapipe-models/face_landmarker/face_landmarker/float16/latest/face_landmarker.task",
"path": "mediapipe/face_landmarker.task",
"minSize": 5000000
}
],
"enablesTools": ["remove-background", "passport-photo"]
},
"upscale-enhance": {
"name": "Upscale & Enhance",
"description": "AI upscaling, face enhancement, and noise removal",
"packages": {
"common": ["codeformer-pip==0.0.4", "lpips"],
"amd64": [
"torch torchvision --extra-index-url https://download.pytorch.org/whl/cu126",
"realesrgan==0.3.0 --extra-index-url https://download.pytorch.org/whl/cu126"
],
"arm64": ["torch", "torchvision", "realesrgan==0.3.0"]
},
"pipFlags": {
"codeformer-pip==0.0.4": "--no-deps"
},
"postInstall": ["pip install numpy==1.26.4"],
"models": [
{ "id": "realesrgan-x4plus", "url": "https://github.com/xinntao/Real-ESRGAN/releases/download/v0.1.0/RealESRGAN_x4plus.pth", "path": "realesrgan/RealESRGAN_x4plus.pth", "minSize": 67000000 },
{ "id": "gfpgan-v1.3", "url": "https://github.com/TencentARC/GFPGAN/releases/download/v1.3.0/GFPGANv1.3.pth", "path": "gfpgan/GFPGANv1.3.pth", "minSize": 332000000 },
{ "id": "codeformer-pth", "url": "https://github.com/sczhou/CodeFormer/releases/download/v0.1.0/codeformer.pth", "path": "codeformer/codeformer.pth", "minSize": 375000000 },
{ "id": "codeformer-onnx", "url": "hf://facefusion/models-3.0.0/codeformer.onnx", "path": "codeformer/codeformer.onnx", "minSize": 377000000 },
{ "id": "facexlib-detection", "url": "https://github.com/xinntao/facexlib/releases/download/v0.1.0/detection_Resnet50_Final.pth", "path": "gfpgan/facelib/detection_Resnet50_Final.pth", "minSize": 104000000 },
{ "id": "facexlib-parsing", "url": "https://github.com/xinntao/facexlib/releases/download/v0.2.2/parsing_parsenet.pth", "path": "gfpgan/facelib/parsing_parsenet.pth", "minSize": 85000000 },
{ "id": "scunet", "url": "https://github.com/cszn/KAIR/releases/download/v1.0/scunet_color_real_psnr.pth", "path": "scunet/scunet_color_real_psnr.pth", "minSize": 4000000 },
{ "id": "nafnet", "url": "hf://mikestealth/nafnet-models/NAFNet-SIDD-width64.pth", "path": "nafnet/NAFNet-SIDD-width64.pth", "minSize": 67000000 }
],
"enablesTools": ["upscale", "enhance-faces", "noise-removal"]
}
}
}
```
### Install Script
A Python script (`packages/ai/python/install_feature.py`) handles feature installation:
1. Reads the feature manifest from the image
2. Detects architecture (amd64/arm64) and GPU availability
3. Creates or reuses the venv at `/data/ai/venv/`
4. Runs pip install with the correct packages, flags, and index URLs per platform
5. Handles the numpy version conflict (--no-deps for codeformer, re-pin numpy)
6. Downloads model weights with retry logic (ported from `download_models.py`)
7. Updates `/data/ai/installed.json` with bundle status
8. Reports progress to stdout as JSON lines (consumed by the Node bridge)
The script must be idempotent — running it twice for the same bundle is a no-op.
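Idempotency for the model step reduces to one rule: skip a model only if its file exists and meets the manifest's `minSize`. A hedged sketch of that rule (TypeScript for illustration; the real script is Python, and `needsDownload` is a hypothetical name):

```typescript
interface ModelEntry {
  id: string;
  path: string;     // relative path under /data/ai/models/
  minSize: number;  // minimum valid size in bytes, from the manifest
}

// statSize returns the file's size in bytes, or null if it does not exist.
// Injected so the rule is easy to test; the real script would wrap os.stat.
function needsDownload(
  model: ModelEntry,
  statSize: (path: string) => number | null,
): boolean {
  const size = statSize(model.path);
  // Absent or undersized (truncated/corrupt) files are re-downloaded.
  return size === null || size < model.minSize;
}
```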
### Uninstall and Shared Package Strategy
Bundles share Python packages (e.g., onnxruntime in Background Removal, Object Eraser, and Photo Restoration). Naively pip-uninstalling a bundle's packages could break other installed bundles.
**v1 approach (simple):** Uninstall removes model files and updates `installed.json`. Orphaned pip packages stay in the venv — they use disk but don't cause issues. A "Clean up" button in the AI Features settings panel rebuilds the venv from scratch: creates a fresh venv, installs only packages needed by currently-installed bundles, removes the old venv.
**Future improvement:** Reference counting — track which bundles need which packages, only remove packages exclusively owned by the target bundle.
### Tool Route Registration for Uninstalled Features
Currently `registerToolRoutes()` either registers a route or doesn't (disabled tools get 404). For uninstalled AI features, we need routes that return a structured error instead of 404.
**Solution: Register ALL tool routes always, add a pre-processing guard.**
In `tool-factory.ts`, after settings validation (around line 198) and before calling `config.process()`, check feature installation status. The response follows the existing `{ error, details }` shape used by `formatZodErrors` (`apps/api/src/lib/errors.ts`) and consumed by `parseApiError` (`apps/web/src/lib/api.ts`):
```typescript
if (isAiTool(config.toolId) && !isFeatureInstalled(config.toolId)) {
const bundle = getBundleForTool(config.toolId);
return reply.status(501).send({
error: "Feature not installed",
code: "FEATURE_NOT_INSTALLED",
feature: bundle.id,
featureName: bundle.name,
estimatedSize: bundle.estimatedSize,
});
}
```
This also applies to:
- `restore-photo.ts` (uses its own route handler, not the factory)
- `batch.ts` — the `getToolConfig(toolId)` call at line 35 is the gating point. Add a feature-installed check alongside the existing 404 check.
- `pipeline.ts` — the pre-validation loops (lines 143-172 for execute, lines 441-462 for batch) already validate all tool IDs before processing starts. Extend to also check feature installation.
**Frontend error detection:** Extend `parseApiError` in `apps/web/src/lib/api.ts` to detect the `FEATURE_NOT_INSTALLED` code and return structured data (bundle id, name, size) instead of a plain error string. This enables `use-tool-processor.ts` and `use-pipeline-processor.ts` (both already use `parseApiError`) to trigger the install prompt rather than showing a generic error.
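A hedged sketch of what the extended error detection might return (the type and function names here are illustrative, not the existing `api.ts` API):

```typescript
interface FeatureNotInstalled {
  kind: "feature_not_installed";
  feature: string;        // bundle id, e.g. "background-removal"
  featureName: string;    // display name, e.g. "Background Removal"
  estimatedSize?: string;
}

// Returns structured data for the FEATURE_NOT_INSTALLED code, or null so the
// caller falls through to the existing plain-string error handling.
function detectFeatureNotInstalled(body: unknown): FeatureNotInstalled | null {
  if (typeof body !== "object" || body === null) return null;
  const b = body as Record<string, unknown>;
  if (b.code !== "FEATURE_NOT_INSTALLED") return null;
  return {
    kind: "feature_not_installed",
    feature: String(b.feature),
    featureName: String(b.featureName),
    estimatedSize: typeof b.estimatedSize === "string" ? b.estimatedSize : undefined,
  };
}
```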
The global Fastify error handler in `apps/api/src/index.ts` (lines 41-51) provides a safety net — any unhandled Python import errors will produce structured JSON rather than crashing.
### API Endpoints
New routes — read endpoint is public (no `/admin/` prefix), mutation endpoints are admin-only:
```
GET /api/v1/features
Returns: list of all bundles with install status, sizes, enabled tools
Auth: any authenticated user (read-only, needed by frontend for badges/tool page state)
Response: {
bundles: [{
id: "background-removal",
name: "Background Removal",
description: "Remove image backgrounds with AI",
status: "not_installed" | "installing" | "installed" | "error",
installedVersion: "1.15.3" | null,
estimatedSize: "700 MB - 1 GB",
enablesTools: ["remove-background"],
progress: { percent: 45, stage: "Downloading models..." } | null,
error: "pip install failed: ..." | null,
dependencies: [] | ["upscale-enhance"]
}]
}
POST /api/v1/admin/features/:bundleId/install
Starts background installation of a feature bundle.
Auth: admin only
Response: { jobId: "uuid" }
SSE progress at: GET /api/v1/jobs/:jobId/progress
POST /api/v1/admin/features/:bundleId/uninstall
Removes a feature bundle (pip packages + models).
Auth: admin only
Response: { ok: true, freedSpace: "500 MB" }
GET /api/v1/admin/features/disk-usage
Returns total disk usage of /data/ai/.
Auth: admin only
Response: { totalBytes: 5368709120, byBundle: { "background-removal": 734003200, ... } }
```
### Background Job Mechanism
Feature installation runs as a background child process (not inline with the HTTP request):
1. `POST /admin/features/:bundleId/install` spawns the install script as a child process
2. Progress is streamed via stdout JSON lines → captured by the Node process → pushed to SSE listeners
3. The existing SSE infrastructure (`/api/v1/jobs/:jobId/progress`) is reused
4. Job status is persisted to the `jobs` table for recovery on restart
5. Only one install can run at a time (mutex). Concurrent install requests return 409 Conflict.
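The stdout-to-SSE plumbing hinges on line-buffered JSON parsing, since a child process's output chunks can split mid-line. A hedged sketch of just the buffering step (`drainJsonLines` is an illustrative name):

```typescript
// Feed each stdout chunk in; complete JSON lines come out as parsed events.
// state.buf carries any partial trailing line over to the next chunk.
function drainJsonLines(state: { buf: string }, chunk: string): unknown[] {
  state.buf += chunk;
  const events: unknown[] = [];
  let nl: number;
  while ((nl = state.buf.indexOf("\n")) >= 0) {
    const line = state.buf.slice(0, nl).trim();
    state.buf = state.buf.slice(nl + 1);
    if (!line) continue;
    try {
      events.push(JSON.parse(line));
    } catch {
      // Skip malformed lines rather than failing the whole install job.
    }
  }
  return events;
}
```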
### Python Sidecar Changes
**dispatcher.py:**
- On startup, read `/data/ai/installed.json` to know which features are available
- Populate `available_modules` based on what's actually installed
- When a script is requested for an uninstalled feature, return a structured error: `{"error": "feature_not_installed", "feature": "background-removal", "message": "Background Removal is not installed"}`
- After a feature is installed, the dispatcher must be restarted (or sent a reload signal) to pick up new packages. The bridge handles this by killing and re-spawning the dispatcher.
**Python scripts:**
- Convert hard module-level imports in `colorize.py` and `restore.py` to lazy imports inside functions
- All scripts should check for their feature's models and return a clear "not installed" error if missing
- The `sys.path` must include `/data/ai/venv/lib/python3.X/site-packages/` (set by the dispatcher on startup based on installed.json)
**Bridge (bridge.ts):**
- Update `PYTHON_VENV_PATH` logic to prefer `/data/ai/venv/` when it exists
- Add a `restartDispatcher()` function called after feature install completes
- Handle the new `feature_not_installed` error type from the dispatcher
### Model Path Resolution
Currently models are at `/opt/models/`. With on-demand downloads, they'll be at `/data/ai/models/`. The resolution order:
1. `/opt/models/<model>` (Docker-baked, for backwards compatibility if someone builds a full image)
2. `/data/ai/models/<model>` (on-demand download location)
3. `~/.cache/ashim/<model>` (local dev fallback)
Environment variables (`U2NET_HOME`, etc.) are updated by the install script to point to `/data/ai/models/`.
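The resolution order can be a simple first-hit scan over candidate roots. A hedged sketch (roots are passed in so the order stays explicit and testable; `resolveModelPath` is a hypothetical name):

```typescript
import { existsSync } from "node:fs";
import { join } from "node:path";
import { homedir } from "node:os";

const MODEL_ROOTS = [
  "/opt/models",                      // Docker-baked (backwards compatibility)
  "/data/ai/models",                  // on-demand download location
  join(homedir(), ".cache", "ashim"), // local dev fallback
];

// First root containing the model wins; null means "not installed anywhere".
function resolveModelPath(model: string, roots: string[] = MODEL_ROOTS): string | null {
  for (const root of roots) {
    const candidate = join(root, model);
    if (existsSync(candidate)) return candidate;
  }
  return null;
}
```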
### Dockerfile Changes
1. Remove all `pip install` commands for ML packages (lines 175-206)
2. Remove `download_models.py` COPY and RUN (lines 219-231)
3. Keep: Python 3 + pip + build-essential (do NOT purge build-essential)
4. Keep: numpy, Pillow, opencv-python-headless install (lightweight shared deps)
5. Add: COPY `feature-manifest.json` into the image
6. Add: COPY `install_feature.py` into the image
7. Update entrypoint to set up `/data/ai/` directory structure on first run
8. Update env vars: `MODELS_PATH=/data/ai/models` as default, fallback to `/opt/models`
### Frontend: Tool Page (Uninstalled State)
When a user navigates to an AI tool that isn't installed:
**For admins:**
- Show a card replacing the normal upload area:
- Feature icon + name (e.g., "Background Removal")
- "This feature requires an additional download (~700 MB - 1 GB)"
- [Enable Feature] button
- After clicking: progress bar with stage text, estimated time
- On completion: page automatically transitions to the normal tool UI
**For non-admins:**
- Show: "This feature is not enabled. Ask your administrator to enable it in Settings."
### Frontend: Tool Grid (Badge)
AI tools in the grid show a small download icon overlay when not installed. When installed, the icon disappears and the tool looks like any other tool.
Each tool maps to exactly one bundle, so the badge simply reflects that bundle's status. For example, passport-photo shows the badge until the Background Removal bundle (which includes its mediapipe dependency) is installed.
### Frontend: Settings Panel
New "AI Features" section in the settings dialog (admin only):
- List of all 6 feature bundles as cards
- Each card shows: name, description, status (installed/not installed/installing), disk usage
- Install/Uninstall buttons per bundle
- "Install All" button at the top
- Total AI disk usage summary at the bottom
- Progress bar during installation
- Dependency warnings, driven by the `dependencies` field in the features API (all six v1 bundles are standalone, so none appear initially)
### Container Update Flow
When a user does `docker pull` + restart:
1. **Pull:** Only app code layers changed → ~50-100 MB download
2. **Startup:** Backend reads feature manifest from new image + installed.json from volume
3. **Comparison:**
- If bundle package versions unchanged → no action, instant startup
- If a package version bumped → `pip install --upgrade` from wheel cache (seconds)
- If a model URL/version changed → re-download that model only
- If Python major version changed → rebuild venv from cached wheels (rare, ~2-5 min)
4. **Dispatcher restart** if any packages changed
This check runs at startup, not blocking the HTTP server. AI features show "Updating..." status until the check completes.
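The version-comparison step can be a set difference between the new manifest's pinned specs and what `installed.json` recorded. A hedged sketch (the real check would also cover model URLs and the Python version; the function name is illustrative):

```typescript
// Specs are exact pins like "numpy==1.26.4", so string equality is enough:
// any spec in the new manifest not recorded verbatim needs a pip run.
function packagesNeedingUpdate(
  manifestSpecs: string[],
  recordedSpecs: string[],
): string[] {
  const have = new Set(recordedSpecs);
  return manifestSpecs.filter((spec) => !have.has(spec));
}
```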
### Robustness and Crash Recovery
The install system must handle every interruption gracefully: double-clicks, browser closes, container restarts mid-install, network failures, disk-full, corrupt downloads, and power loss.
#### Atomic Operations
**Model downloads** — never write directly to the final path:
1. Download to `<model_path>.downloading`
2. Verify file size against `minSize` from manifest
3. `os.rename()` to final path (atomic on same filesystem)
4. If the process dies mid-download, the `.downloading` file is an obvious orphan
**installed.json writes** — never write in-place:
1. Write to `installed.json.tmp`
2. `os.rename()` to `installed.json`
3. If the process dies mid-write, `installed.json` is intact (either old version or doesn't exist)
**Venv bootstrap** (entrypoint) — same pattern:
1. Copy `/opt/venv` to `/data/ai/venv.bootstrapping/`
2. Rename to `/data/ai/venv/` on completion
3. If interrupted, the `.bootstrapping/` directory is cleaned up on next start
#### File-Based Install Lock
In-memory `installInProgress` state is lost on container restart. Use a persistent lock file instead:
**`/data/ai/install.lock`** contains:
```json
{ "bundleId": "background-removal", "startedAt": "2026-04-17T12:00:00Z", "pid": 12345 }
```
- Created before install starts, deleted on success or acknowledged failure
- On server startup, if lock exists: check if PID is alive. If dead → the install was interrupted mid-flight
- If lock is stale (PID dead), mark the bundle as needing cleanup, delete lock
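The liveness probe can use signal 0, which checks process existence without sending anything. A hedged sketch of the startup-side check (`checkStaleLock` is an illustrative name):

```typescript
import { existsSync, readFileSync, unlinkSync } from "node:fs";

interface InstallLock { bundleId: string; startedAt: string; pid: number; }

// kill(pid, 0) delivers no signal; it only probes whether the process exists.
// EPERM means the process exists but belongs to another user, so it is alive.
function isPidAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err: any) {
    return err?.code === "EPERM";
  }
}

// Returns the stale lock (and removes it) if the install was interrupted,
// or null if there is no lock / the install is still running.
function checkStaleLock(lockPath: string): InstallLock | null {
  if (!existsSync(lockPath)) return null;
  const lock = JSON.parse(readFileSync(lockPath, "utf8")) as InstallLock;
  if (isPidAlive(lock.pid)) return null;
  unlinkSync(lockPath);
  return lock;
}
```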
#### Startup Recovery Sequence
On server startup (in `apps/api/src/index.ts`, after `runMigrations()`), run a recovery check:
1. **Clean orphan temp files:** Delete any `*.downloading` files in `/data/ai/models/` (recursive)
2. **Clean orphan JSON:** If `installed.json.tmp` exists, delete it
3. **Clean orphan venv bootstrap:** If `/data/ai/venv.bootstrapping/` exists, delete it
4. **Check install lock:** If `/data/ai/install.lock` exists:
- Read PID from lock file
- If PID is not running → the install was interrupted
- Delete the lock file
- Log: "Previous installation of {bundleId} was interrupted, cleaned up"
5. **Verify installed bundles:** For each bundle in `installed.json`, check that all model files exist and meet minimum sizes from the manifest. If any model is missing or undersized:
- Mark the bundle status as `"error"` with message "Some model files are missing or corrupt. Please reinstall."
- Do NOT automatically remove from installed.json — let the admin decide to reinstall or uninstall
#### Frontend Button Hardening
**Double-click prevention:**
- Disable the button on first click (set `installing = true` immediately, before the API call)
- The button should be `disabled={installing || bundle.status === "installing"}`
- Even if the component re-renders, the disabled state persists from the store
**Browser close / navigate away:**
- The install runs as a server-side child process — it completes regardless of browser state
- When the user returns to the page, `useFeaturesStore.fetch()` picks up current status
- If an install is in progress, the UI shows the progress bar (driven by polling, not just SSE)
**SSE connection loss fallback:**
- The `FeatureInstallPrompt` component uses `EventSource` for real-time progress
- If the `EventSource` connection drops (`onerror`), fall back to polling `GET /api/v1/features` every 3 seconds
- When the install completes (status changes from "installing" to "installed" or "error"), stop polling
**Page refresh during install:**
- On mount, the features store calls `fetch()` which returns current bundle states including install progress
- If a bundle has `status: "installing"`, the component immediately shows the progress bar and opens an EventSource for the in-progress job
**Multiple admins:**
- Server mutex: only one install at a time (409 Conflict)
- The features store status reflects the global state — ALL admin sessions see "installing"
- The install lock file prevents even a container restart from allowing a concurrent install
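The SSE-loss fallback reduces to a polling loop that stops once the bundle leaves "installing". A hedged sketch with the fetch and sleep steps injected so the loop is testable (names are illustrative; in the component, `fetchStatus` would wrap `GET /api/v1/features` and `sleep` would be a 3-second timer):

```typescript
type BundleStatus = "not_installed" | "installing" | "installed" | "error";

// Poll until the install settles, then return the terminal status.
async function pollUntilSettled(
  fetchStatus: () => Promise<BundleStatus>,
  sleep: () => Promise<void> = () => new Promise((r) => setTimeout(r, 3000)),
): Promise<BundleStatus> {
  for (;;) {
    const status = await fetchStatus();
    if (status !== "installing") return status; // stop polling on completion
    await sleep();
  }
}
```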
#### Error Handling
| Scenario | Behavior |
|---|---|
| Double-click on Enable | Button disabled on first click. Second click is no-op. |
| Browser closed mid-install | Server-side install continues. Status visible on next page load. |
| Container restart mid-install | Startup recovery detects stale lock, cleans up `.downloading` files, marks as error. Admin can retry. |
| Network failure mid-pip-install | pip returns non-zero. Install script emits error. Bundle marked as "error" with pip output. Admin can retry (pip cache means previously-downloaded wheels aren't re-downloaded). |
| Network failure mid-model-download | `.downloading` file left behind. Retry 3 times with exponential backoff. On final failure, bundle marked as "error". On retry, `.downloading` file is deleted and re-downloaded. |
| Disk full | Check available disk space at the START of install (before any pip/download). Return clear error: "Not enough disk space. Need ~{estimatedSize}, only {available} available." If disk fills mid-install, pip/download fails, bundle marked as error. |
| pip succeeds, models fail | Bundle is NOT marked as installed. Status is "error" with message about which models failed. Packages remain in venv (harmless). Admin can retry — pip install is idempotent (skip already-installed), only failed models are re-downloaded. |
| Model file corrupt (downloads completely but data is bad) | Verify file size against `minSize` after download. If too small, delete and retry. For rembg/HuggingFace models, the library's own integrity checks apply. |
| installed.json corrupted | Atomic writes prevent this. If somehow corrupted (manual edit, etc.), `JSON.parse` fails, treat as empty (no bundles installed). Log a warning. |
| Power loss | Atomic operations ensure no file is in a half-written state. Startup recovery cleans up orphans. |
| No internet during install | pip fails immediately with a clear network error. Model downloads fail after retries. Bundle marked as "error". |
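The disk-full row relies on a pre-flight check before any pip install or model download starts. A hedged sketch using Node's `statfsSync` (the real check could equally live in the Python script via `shutil.disk_usage`; names are illustrative):

```typescript
import { statfsSync } from "node:fs";

// Available bytes = blocks available to unprivileged users * block size.
function availableBytes(path: string): number {
  const s = statfsSync(path);
  return s.bavail * s.bsize;
}

// Throws with the user-facing message from the error table above the install
// ever touches pip or the network.
function assertEnoughDisk(path: string, requiredBytes: number): void {
  const avail = availableBytes(path);
  if (avail < requiredBytes) {
    throw new Error(
      `Not enough disk space. Need ~${requiredBytes} bytes, only ${avail} available.`,
    );
  }
}
```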
### Testing Strategy
All testing runs against Docker containers using the existing `playwright.docker.config.ts` and `tests/e2e-docker/` infrastructure:
- **Unit tests:** Feature manifest parsing, version comparison logic, bundle dependency resolution (Vitest, excluded from e2e via `vitest.config.ts`)
- **Integration tests:** Install/uninstall API endpoints, status reporting, SSE progress (Vitest integration suite)
- **E2e-docker tests:** Add to `tests/e2e-docker/` alongside existing `fixes-verification.spec.ts`:
- Verify uninstalled AI tool returns 501 with `FEATURE_NOT_INSTALLED` code
- Admin enables a feature from settings, tool page transitions from "not installed" to working
- Non-admin sees "not enabled" message on uninstalled tool page
- Feature install/uninstall round-trip
- **Docker build test:** Verify base image builds without ML packages, verify feature-manifest.json is present (CI, `SKIP_MODEL_DOWNLOADS` already exists)
### Migration Path
Since the new image is fundamentally different (no ML packages baked in), existing users upgrading from the full image will need to re-download their AI features. The Python ML packages are no longer in the system venv, so even if old model weights exist at `/opt/models/`, the features won't work without packages.
The first-run experience for upgrading users:
1. Detect this is an upgrade: no `/data/ai/installed.json` exists, but user data exists in `/data`
2. Show a one-time banner in the UI: "We've reduced the image size from 30 GB to 5 GB! AI features are now downloaded on-demand. Visit Settings → AI Features to enable the ones you need."
3. No automatic downloads — let the admin choose what to install
4. Old model weights at `/opt/models/` are ignored (they won't exist in the new image anyway since that layer is removed)
### Frontend: Feature Status Propagation
The frontend needs to know which tools are installed for three purposes: tool grid badges, tool page state, and settings panel.
**Features store** (`apps/web/src/stores/features-store.ts`):
- Zustand store fetched on app load (like `settings-store.ts`)
- Calls `GET /api/v1/features` to get bundle statuses
- Provides a derived mapping: `toolInstallStatus: Record<string, "installed" | "not_installed" | "installing">` (each tool maps to exactly one bundle, no partial states)
- Provides `isToolInstalled(toolId): boolean` and `getBundleForTool(toolId): BundleInfo | null` helpers
- Refreshes on install/uninstall completion
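The derived mapping is a flat fold over the bundle list. A hedged sketch (surfacing an errored bundle as `not_installed` for badge purposes is an assumption; names are illustrative):

```typescript
type ToolInstallStatus = "installed" | "not_installed" | "installing";

interface BundleStatusInfo {
  id: string;
  status: ToolInstallStatus | "error";
  enablesTools: string[];
}

// Each tool belongs to exactly one bundle, so a single pass suffices.
function deriveToolStatus(
  bundles: BundleStatusInfo[],
): Record<string, ToolInstallStatus> {
  const map: Record<string, ToolInstallStatus> = {};
  for (const bundle of bundles) {
    const status: ToolInstallStatus =
      bundle.status === "error" ? "not_installed" : bundle.status;
    for (const tool of bundle.enablesTools) map[tool] = status;
  }
  return map;
}
```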
**Tool grid integration:**
- `ToolCard` checks `isToolInstalled(tool.id)` from the features store
- If not installed: show a download icon badge (similar to existing "Experimental" badge)
- The tool remains clickable (not disabled) — clicking navigates to the tool page where the install prompt appears
- `PYTHON_SIDECAR_TOOLS` constant is used to determine which tools are AI tools (only AI tools can be "not installed")
**Tool page integration:**
- `ToolPage` component checks feature status after the tool lookup
- If the user is admin and feature not installed: render `FeatureInstallPrompt` component instead of the normal tool UI
- If the user is non-admin and feature not installed: render "This feature is not enabled. Contact your administrator."
- The install prompt shows feature name, description, estimated size, and an "Enable" button
- After clicking "Enable": show progress bar with SSE-streamed progress, auto-transition to normal tool UI on completion
### Development and Testing
All development and testing is done via Docker containers — the same environment users run. Build the image locally and run it with:
```bash
docker run -d --name ashim -p 1349:1349 -v ashim-data:/data ghcr.io/ashim-hq/ashim:latest
```
Auth can be disabled for development by passing `-e AUTH_ENABLED=false`.
### Scope Boundaries
**In scope:**
- Dockerfile restructuring to remove ML packages and models
- Feature manifest system
- Install/uninstall API + background job
- Python sidecar changes for dynamic feature detection
- Frontend: tool page download prompt, grid badge, settings panel
- Container update handling with version manifest
**Out of scope (future work):**
- Additional rembg model variants as sub-downloads within Background Removal
- Automatic feature recommendations based on usage
- Download from private/custom model registries
- Bandwidth throttling for downloads
- Multiple venv support (e.g., different Python versions)