docs: add crash recovery and robustness to spec and plan

Atomic model downloads (.downloading suffix + rename), file-based
install lock (survives container restart), atomic JSON writes,
startup recovery sequence, frontend double-click prevention,
SSE fallback polling, disk space pre-checks.
ashim-hq 2026-04-17 17:43:33 +08:00
parent 08a7ffe403
commit 31424d4356
2 changed files with 1236 additions and 0 deletions


@@ -0,0 +1,712 @@
# On-Demand AI Feature Downloads Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Reduce Docker image from ~30 GB to ~5-6 GB by making AI features downloadable post-install via a UI-driven bundle system.
**Architecture:** Six feature bundles (Background Removal, Face Detection, Object Eraser & Colorize, Upscale & Enhance, Photo Restoration, OCR) are defined in a JSON manifest baked into the image. A Python install script handles pip + model downloads to a persistent volume. The backend exposes install/uninstall APIs with SSE progress. The frontend shows download badges on uninstalled tools and an install prompt on tool pages.
**Tech Stack:** Fastify (API), Zustand (frontend state), Python (install script), Docker (image restructuring), SSE (progress streaming)
**Spec:** `docs/superpowers/specs/2026-04-17-on-demand-ai-features-design.md`
---
## File Map
```
NEW FILES:
packages/shared/src/features.ts # Bundle definitions, tool-to-bundle map, types
docker/feature-manifest.json # Authoritative manifest baked into image
apps/api/src/lib/feature-status.ts # Reads manifest + installed.json, provides status
apps/api/src/routes/features.ts # GET /features, POST install/uninstall, GET disk-usage
packages/ai/python/install_feature.py # Python install script (pip + model downloads)
apps/web/src/stores/features-store.ts # Zustand store for bundle statuses
apps/web/src/components/features/feature-install-prompt.tsx # Install prompt card for tool pages
apps/web/src/components/settings/ai-features-section.tsx # Settings panel section
tests/unit/features.test.ts # Unit tests for feature logic
MODIFIED FILES:
packages/ai/src/bridge.ts # restartDispatcher(), FEATURE_NOT_INSTALLED handling
packages/ai/src/index.ts # Export restartDispatcher
packages/ai/python/dispatcher.py # Read installed.json, gate scripts by feature
packages/ai/python/colorize.py # Hard imports to lazy imports
packages/ai/python/restore.py # Hard imports to lazy imports
apps/api/src/index.ts # Register feature routes, startup venv check
apps/api/src/routes/tool-factory.ts # Feature-installed guard before process()
apps/api/src/routes/batch.ts # Feature-installed check at gating point
apps/api/src/routes/pipeline.ts # Feature-installed check in pre-validation
apps/api/src/routes/tools/restore-photo.ts # Feature-installed guard
apps/web/src/lib/api.ts # Extend parseApiError for FEATURE_NOT_INSTALLED
apps/web/src/components/common/tool-card.tsx # Download badge on uninstalled AI tools
apps/web/src/pages/tool-page.tsx # Feature check then install prompt or "not enabled"
apps/web/src/components/layout/tool-panel.tsx # Fetch features on mount
apps/web/src/pages/fullscreen-grid-page.tsx # Fetch features on mount
apps/web/src/components/settings/settings-dialog.tsx # Add AI Features nav item + section
docker/Dockerfile # Remove ML packages/models, keep base
docker/entrypoint.sh # Venv bootstrap, /data/ai/ setup
```
---
### Task 1: Shared Feature Types and Bundle Definitions
**Files:**
- Create: `packages/shared/src/features.ts`
- Modify: `packages/shared/src/index.ts`
- Test: `tests/unit/features.test.ts`
- [ ] **Step 1: Write the failing test for bundle definitions**
Create `tests/unit/features.test.ts`:
```ts
import { describe, expect, it } from "vitest";
import {
  FEATURE_BUNDLES,
  getBundleForTool,
  getToolsForBundle,
  TOOL_BUNDLE_MAP,
} from "@ashim/shared/features";
import { PYTHON_SIDECAR_TOOLS } from "@ashim/shared";

describe("Feature bundles", () => {
  it("every PYTHON_SIDECAR_TOOL maps to exactly one bundle", () => {
    for (const toolId of PYTHON_SIDECAR_TOOLS) {
      const bundle = getBundleForTool(toolId);
      expect(bundle, `${toolId} has no bundle`).toBeDefined();
    }
  });

  it("getBundleForTool returns null for non-AI tools", () => {
    expect(getBundleForTool("resize")).toBeNull();
    expect(getBundleForTool("crop")).toBeNull();
  });

  it("getToolsForBundle returns correct tools", () => {
    const tools = getToolsForBundle("background-removal");
    expect(tools).toContain("remove-background");
    expect(tools).toContain("passport-photo");
    expect(tools).not.toContain("upscale");
  });

  it("all 6 bundles are defined", () => {
    expect(Object.keys(FEATURE_BUNDLES)).toHaveLength(6);
    expect(FEATURE_BUNDLES["background-removal"]).toBeDefined();
    expect(FEATURE_BUNDLES["face-detection"]).toBeDefined();
    expect(FEATURE_BUNDLES["object-eraser-colorize"]).toBeDefined();
    expect(FEATURE_BUNDLES["upscale-enhance"]).toBeDefined();
    expect(FEATURE_BUNDLES["photo-restoration"]).toBeDefined();
    expect(FEATURE_BUNDLES["ocr"]).toBeDefined();
  });

  it("TOOL_BUNDLE_MAP covers all sidecar tools", () => {
    const mappedTools = Object.keys(TOOL_BUNDLE_MAP);
    for (const toolId of PYTHON_SIDECAR_TOOLS) {
      expect(mappedTools, `${toolId} missing from TOOL_BUNDLE_MAP`).toContain(toolId);
    }
  });
});
```
- [ ] **Step 2: Run test to verify it fails**
Run: `pnpm test:unit -- tests/unit/features.test.ts`
Expected: FAIL with module not found error.
- [ ] **Step 3: Create the feature definitions module**
Create `packages/shared/src/features.ts`:
```ts
export interface FeatureBundleInfo {
  id: string;
  name: string;
  description: string;
  estimatedSize: string;
  enablesTools: string[];
}

export type FeatureStatus = "not_installed" | "installing" | "installed" | "error";

export interface FeatureBundleState {
  id: string;
  name: string;
  description: string;
  status: FeatureStatus;
  installedVersion: string | null;
  estimatedSize: string;
  enablesTools: string[];
  progress: { percent: number; stage: string } | null;
  error: string | null;
}

export const FEATURE_BUNDLES: Record<string, FeatureBundleInfo> = {
  "background-removal": {
    id: "background-removal",
    name: "Background Removal",
    description: "Remove image backgrounds with AI",
    estimatedSize: "700 MB - 1 GB",
    enablesTools: ["remove-background", "passport-photo"],
  },
  "face-detection": {
    id: "face-detection",
    name: "Face Detection",
    description: "Detect and blur faces, fix red-eye, smart crop",
    estimatedSize: "200-300 MB",
    enablesTools: ["blur-faces", "red-eye-removal", "smart-crop"],
  },
  "object-eraser-colorize": {
    id: "object-eraser-colorize",
    name: "Object Eraser & Colorize",
    description: "Erase objects from photos and colorize B&W images",
    estimatedSize: "600-800 MB",
    enablesTools: ["erase-object", "colorize"],
  },
  "upscale-enhance": {
    id: "upscale-enhance",
    name: "Upscale & Enhance",
    description: "AI upscaling, face enhancement, and noise removal",
    estimatedSize: "4-5 GB",
    enablesTools: ["upscale", "enhance-faces", "noise-removal"],
  },
  "photo-restoration": {
    id: "photo-restoration",
    name: "Photo Restoration",
    description: "Restore old or damaged photos",
    estimatedSize: "800 MB - 1 GB",
    enablesTools: ["restore-photo"],
  },
  ocr: {
    id: "ocr",
    name: "OCR",
    description: "Extract text from images",
    estimatedSize: "3-4 GB",
    enablesTools: ["ocr"],
  },
};

export const TOOL_BUNDLE_MAP: Record<string, string> = {};
for (const [bundleId, bundle] of Object.entries(FEATURE_BUNDLES)) {
  for (const toolId of bundle.enablesTools) {
    TOOL_BUNDLE_MAP[toolId] = bundleId;
  }
}

export function getBundleForTool(toolId: string): FeatureBundleInfo | null {
  const bundleId = TOOL_BUNDLE_MAP[toolId];
  return bundleId ? FEATURE_BUNDLES[bundleId] : null;
}

export function getToolsForBundle(bundleId: string): string[] {
  return FEATURE_BUNDLES[bundleId]?.enablesTools ?? [];
}
```
- [ ] **Step 4: Export from shared package**
Add to the end of `packages/shared/src/index.ts`:
```ts
export * from "./features.js";
```
- [ ] **Step 5: Run test to verify it passes**
Run: `pnpm test:unit -- tests/unit/features.test.ts`
Expected: PASS, all 5 tests green.
- [ ] **Step 6: Commit**
```bash
git add packages/shared/src/features.ts packages/shared/src/index.ts tests/unit/features.test.ts
git commit -m "feat: add shared feature bundle definitions and tool-to-bundle mapping"
```
---
### Task 2: Feature Manifest File
**Files:**
- Create: `docker/feature-manifest.json`
- [ ] **Step 1: Create the feature manifest**
Create `docker/feature-manifest.json` containing the full bundle definitions with exact package versions, pip flags, platform-specific packages, and model download URLs. Source exact versions from the current Dockerfile (lines 167-206) and model URLs from `docker/download_models.py`.
Key details: amd64 uses `--extra-index-url https://download.pytorch.org/whl/cu126` for torch/realesrgan; amd64 uses `paddlepaddle-gpu>=3.2.1` from `https://www.paddlepaddle.org.cn/packages/stable/cu126/`; arm64 uses `mediapipe==0.10.18`; `codeformer-pip==0.0.4` needs `--no-deps`; `postInstall` re-pins `numpy==1.26.4`.
The file should contain a top-level `manifestVersion`, `imageVersion`, `pythonVersion`, `basePackages` array, and `bundles` object with all 6 bundles. Each bundle has `name`, `description`, `estimatedSize`, `packages` (with `common`/`amd64`/`arm64` arrays), `pipFlags`, `postInstall`, `models` array, and `enablesTools` array.
Model entries use either: `{ "id", "url", "path", "minSize" }` for direct downloads, `{ "id", "downloadFn": "rembg_session", "args": [...] }` for rembg models, or `{ "id", "downloadFn": "hf_snapshot", "args": [repo_id, local_subpath] }` for HuggingFace snapshots.
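To make the shape concrete, here is a skeleton of the top-level layout with one bundle partially filled in — every version, URL, and flag in angle brackets is a placeholder to be replaced with the exact values sourced from the Dockerfile and `docker/download_models.py`:

```json
{
  "manifestVersion": 1,
  "imageVersion": "<image-version>",
  "pythonVersion": "<python-version>",
  "basePackages": ["<pinned-base-package>"],
  "bundles": {
    "background-removal": {
      "name": "Background Removal",
      "description": "Remove image backgrounds with AI",
      "estimatedSize": "700 MB - 1 GB",
      "packages": {
        "common": ["<package==version>"],
        "amd64": ["<amd64-only-package==version>"],
        "arm64": ["<arm64-only-package==version>"]
      },
      "pipFlags": ["<e.g. --extra-index-url ...>"],
      "postInstall": ["<e.g. numpy re-pin>"],
      "models": [
        { "id": "<model-id>", "downloadFn": "rembg_session", "args": ["<model-name>"] },
        { "id": "<model-id>", "url": "<model-url>", "path": "<relative-path>", "minSize": 0 }
      ],
      "enablesTools": ["remove-background", "passport-photo"]
    }
  }
}
```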
- [ ] **Step 2: Commit**
```bash
git add docker/feature-manifest.json
git commit -m "feat: add feature manifest with all 6 bundle definitions"
```
---
### Task 3: Backend Feature Status Service
**Files:**
- Create: `apps/api/src/lib/feature-status.ts`
- [ ] **Step 1: Create the feature status service**
Create `apps/api/src/lib/feature-status.ts`. This module reads/writes `/data/ai/installed.json`, provides `isFeatureInstalled(bundleId)`, `isToolInstalled(toolId)`, `getFeatureStates()`, `markInstalled()`, `markUninstalled()`, `setInstallProgress()`, and `ensureAiDirs()`.
Uses `FEATURE_BUNDLES` and `TOOL_BUNDLE_MAP` from `@ashim/shared`. Caches `installed.json` in memory with `invalidateCache()` for refresh after install/uninstall. Detects Docker environment via `existsSync("/.dockerenv")`.
See spec section "Persistent Storage" for directory structure: `/data/ai/venv/`, `/data/ai/models/`, `/data/ai/pip-cache/`, `/data/ai/installed.json`.
**Robustness requirements for this module:**
- **Atomic JSON writes:** `markInstalled()` and `markUninstalled()` must write to `installed.json.tmp` first, then `renameSync()` to `installed.json`. Never write directly to `installed.json`.
- **Corrupt JSON recovery:** `readInstalled()` wraps `JSON.parse` in try/catch. If the file is corrupt, treat as empty `{ bundles: {} }` and log a warning.
- **File-based install lock:** Instead of just in-memory `installInProgress`, use `/data/ai/install.lock` file containing `{ bundleId, startedAt, pid }`. Create lock before install, delete on completion/failure. `getInstallingBundle()` reads from the lock file, not memory.
- **`recoverInterruptedInstalls()`** function called on startup:
1. Delete any `*.downloading` files in `/data/ai/models/` (recursive glob)
2. Delete `installed.json.tmp` if it exists
3. Delete `/data/ai/venv.bootstrapping/` if it exists
4. If `install.lock` exists: check if PID is alive (via `process.kill(pid, 0)` in try/catch). If dead, delete the lock and log a warning. If alive, leave it (install is still running from a previous container lifecycle — unlikely but possible with shared volumes).
5. For each bundle in `installed.json`, verify model files exist and meet `minSize` from the feature manifest. If any model is missing/undersized, set the bundle's error field to "Some model files are missing. Reinstall this feature." but do NOT remove from installed.json.
- **`acquireInstallLock(bundleId)`** and **`releaseInstallLock()`** functions that create/delete the lock file atomically.
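A minimal sketch of these rules, with paths passed in as parameters so it stays self-contained (the real module hard-codes the `/data/ai/` paths above; function names here are illustrative):

```ts
import { existsSync, readFileSync, renameSync, unlinkSync, writeFileSync } from "node:fs";

// Atomic write: the .tmp file lands fully, then rename() swaps it in.
// A crash mid-write leaves only a stale .tmp, never a truncated installed.json.
function writeJsonAtomic(path: string, data: unknown): void {
  const tmp = `${path}.tmp`;
  writeFileSync(tmp, JSON.stringify(data, null, 2));
  renameSync(tmp, path); // atomic on the same filesystem
}

// Corrupt-file recovery: a bad installed.json degrades to "nothing installed".
function readInstalled(path: string): { bundles: Record<string, unknown> } {
  if (!existsSync(path)) return { bundles: {} };
  try {
    return JSON.parse(readFileSync(path, "utf8"));
  } catch {
    console.warn(`Corrupt ${path}; treating as empty`);
    return { bundles: {} };
  }
}

// Stale-lock recovery: PID liveness probed via signal 0, per step 4 above.
// An unparseable lock file or a dead PID both clear the lock; a live PID leaves it.
function clearStaleLock(lockPath: string): void {
  if (!existsSync(lockPath)) return;
  try {
    const { pid } = JSON.parse(readFileSync(lockPath, "utf8"));
    process.kill(pid, 0); // throws if the PID is not alive
  } catch {
    console.warn("Removing stale install lock");
    unlinkSync(lockPath);
  }
}
```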
- [ ] **Step 2: Commit**
```bash
git add apps/api/src/lib/feature-status.ts
git commit -m "feat: add backend feature status service for tracking installed bundles"
```
---
### Task 4: Feature API Routes
**Files:**
- Create: `apps/api/src/routes/features.ts`
- Modify: `apps/api/src/index.ts`
- [ ] **Step 1: Create the features route file**
Create `apps/api/src/routes/features.ts` with 4 endpoints:
1. `GET /api/v1/features` (any authenticated user) — returns `{ bundles: FeatureBundleState[] }`. In non-Docker environments, returns all features as installed.
2. `POST /api/v1/admin/features/:bundleId/install` (admin only) — validates bundle exists, checks not already installed, checks no other install in progress (409). Spawns `install_feature.py` as child process via `spawn()`. Parses stderr JSON progress lines, updates progress via `updateSingleFileProgress()` from `progress.ts`. On success, calls `invalidateCache()` and `shutdownDispatcher()` (from `@ashim/ai`). Returns `{ jobId }`.
3. `POST /api/v1/admin/features/:bundleId/uninstall` (admin only) — removes model files listed in the manifest, calls `markUninstalled()`, calls `shutdownDispatcher()`. Returns `{ ok: true }`.
4. `GET /api/v1/admin/features/disk-usage` (admin only) — returns `{ totalBytes }` by recursively sizing `/data/ai/`.
Note: Use `spawn()` from `node:child_process` (not `exec()`) for the install script to avoid shell injection. Pass arguments as array elements.
**Robustness requirements for install endpoint:**
- Call `acquireInstallLock(bundleId)` before spawning the child process. If lock acquisition fails (lock file already exists with a live PID), return 409.
- Check available disk space before starting: `import { statfsSync } from "node:fs"; const stats = statfsSync("/data"); const freeBytes = stats.bfree * stats.bsize;`. Compare against a rough estimate for the bundle. If insufficient, return 400 with disk space info.
- On child process `close` event with code 0: call `releaseInstallLock()`, `invalidateCache()`, `shutdownDispatcher()`.
- On child process `close` event with non-zero code: call `releaseInstallLock()`, set error state. Do NOT leave the lock file behind.
- On child process `error` event (spawn failure): call `releaseInstallLock()`, return error.
- The install endpoint returns `{ jobId }` immediately. The child process runs asynchronously. The HTTP response does not block on completion.
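The disk-space gate and the lock-release wiring can be sketched as follows — illustrative helper names, with the lock-release callback injected rather than imported:

```ts
import { statfsSync } from "node:fs";
import { spawn } from "node:child_process";

// Coarse free-space gate before spawning the installer. `requiredBytes` is a
// rough estimate derived from the bundle's estimatedSize, not an exact figure.
function hasEnoughDisk(path: string, requiredBytes: number): boolean {
  const stats = statfsSync(path);
  return stats.bfree * stats.bsize >= requiredBytes;
}

// Lock hygiene: release on clean exit, on failure exit, and on spawn error,
// so the lock file is never left behind.
function runInstall(cmd: string, args: string[], releaseLock: () => void): void {
  const child = spawn(cmd, args); // argv array: no shell, no injection
  child.on("close", (code) => {
    releaseLock();
    if (code !== 0) {
      // non-zero exit: record the error state for the bundle here
    }
  });
  child.on("error", () => releaseLock()); // spawn failure (e.g. ENOENT)
}
```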
- [ ] **Step 2: Register feature routes in index.ts**
In `apps/api/src/index.ts`: import `registerFeatureRoutes`, call it after the settings routes registration. Also import and call `ensureAiDirs()` and `recoverInterruptedInstalls()` near the top of the startup sequence after `runMigrations()`.
- [ ] **Step 3: Commit**
```bash
git add apps/api/src/routes/features.ts apps/api/src/index.ts
git commit -m "feat: add feature install/uninstall API routes with SSE progress"
```
---
### Task 5: Python Install Script
**Files:**
- Create: `packages/ai/python/install_feature.py`
- [ ] **Step 1: Create the install script**
Create `packages/ai/python/install_feature.py`. Takes 3 CLI args: `bundleId`, `manifestPath`, `modelsDir`. Reads manifest JSON, detects architecture via `platform.machine()`, runs pip install for each package using `subprocess.run([sys.executable, "-m", "pip", "install", ...])`, downloads models with retry logic (exponential backoff, 3 retries, file size assertions).
Progress reported via stderr JSON lines: `{"progress": N, "stage": "..."}`. Result written to stdout JSON: `{"success": true, "bundleId": "...", "version": "...", "models": [...]}`.
Port the retry pattern from `docker/download_models.py` `_urlretrieve()` (lines 18-35). Handle rembg models via `rembg.new_session()` and HuggingFace models via `huggingface_hub.snapshot_download()`. Must be idempotent.
Writes to `/data/ai/installed.json` on success (matching the structure read by `feature-status.ts`).
**Robustness requirements for the install script:**
- **Atomic model downloads:** For each URL-based model:
1. Check if final path already exists and meets `minSize` — skip if so (idempotent)
2. Delete any existing `<path>.downloading` file (orphan from a previous failed attempt)
3. Download to `<path>.downloading`
4. Verify file size against `minSize`. If too small, delete and raise error.
5. `os.rename(<path>.downloading, <path>)` — atomic on same filesystem
6. Never leave a `.downloading` file behind on success
- **Atomic JSON writes:** When writing `installed.json`:
1. Write to `installed.json.tmp`
2. `os.rename()` to `installed.json`
- **Disk space pre-check:** Before starting, check available disk space via `shutil.disk_usage()`. If free space is less than estimated bundle size, exit with a clear error message.
- **pip failure recovery:** If `pip install` fails for one package, emit the error and exit. The packages that were already installed remain (pip is idempotent — re-running skips them). The admin can retry.
- **Model failure isolation:** If one model fails to download after retries, continue downloading other models. At the end, report which models failed. Exit with non-zero code so the bundle is NOT marked as installed. On retry, only the failed models need downloading (others pass the exists+size check).
- [ ] **Step 2: Commit**
```bash
git add packages/ai/python/install_feature.py
git commit -m "feat: add Python install script for feature bundles"
```
---
### Task 6: Tool Route Guards
**Files:**
- Modify: `apps/api/src/routes/tool-factory.ts`
- Modify: `apps/api/src/routes/batch.ts`
- Modify: `apps/api/src/routes/pipeline.ts`
- Modify: `apps/api/src/routes/tools/restore-photo.ts`
- [ ] **Step 1: Add feature guard to tool-factory.ts**
Import `isToolInstalled` from `../lib/feature-status.js` and `TOOL_BUNDLE_MAP`, `getBundleForTool` from `@ashim/shared`. Inside `createToolRoute`, after settings validation and before `config.process()`, add:
```ts
const bundleId = TOOL_BUNDLE_MAP[config.toolId];
if (bundleId && !isToolInstalled(config.toolId)) {
  const bundle = getBundleForTool(config.toolId);
  return reply.status(501).send({
    error: "Feature not installed",
    code: "FEATURE_NOT_INSTALLED",
    feature: bundleId,
    featureName: bundle?.name ?? bundleId,
    estimatedSize: bundle?.estimatedSize ?? "unknown",
  });
}
```
- [ ] **Step 2: Add feature guard to batch.ts**
Same imports. After `getToolConfig(toolId)` returns (around line 35-37), add the same guard returning 501 with `FEATURE_NOT_INSTALLED` code.
- [ ] **Step 3: Add feature guard to pipeline.ts**
Same imports. In both pre-validation loops (execute at lines 143-172, batch at lines 441-462), after successful `getToolConfig(resolvedToolId)`, add the guard. Return 501 with step number in the error message.
- [ ] **Step 4: Add feature guard to restore-photo.ts**
This tool uses its own route handler, not the factory. Import `isToolInstalled` and add the guard before `restorePhoto()` is called.
- [ ] **Step 5: Commit**
```bash
git add apps/api/src/routes/tool-factory.ts apps/api/src/routes/batch.ts apps/api/src/routes/pipeline.ts apps/api/src/routes/tools/restore-photo.ts
git commit -m "feat: add feature-installed guards to tool routes, batch, and pipeline"
```
---
### Task 7: Bridge and Python Sidecar Changes
**Files:**
- Modify: `packages/ai/python/dispatcher.py`
- Modify: `packages/ai/python/colorize.py`
- Modify: `packages/ai/python/restore.py`
- [ ] **Step 1: Add feature gating to dispatcher.py**
Add a `TOOL_BUNDLE_MAP` dict mapping Python script names (without `.py`) to bundle IDs:
- `remove_bg` -> `background-removal`
- `detect_faces`, `face_landmarks`, `red_eye_removal` -> `face-detection`
- `inpaint`, `colorize` -> `object-eraser-colorize`
- `upscale`, `enhance_faces`, `noise_removal` -> `upscale-enhance`
- `restore` -> `photo-restoration`
- `ocr` -> `ocr`
Add `_get_installed_bundles()` that reads `/data/ai/installed.json` and returns a set of installed bundle IDs.
In `_run_script_main()`, before the `exec()` call, check if the script's bundle is installed. If not, return a JSON error: `{"success": false, "error": "feature_not_installed", "feature": bundle_id, "message": "..."}`.
Also set `U2NET_HOME` to `/data/ai/models/rembg` on startup if `/data/ai/models` exists.
- [ ] **Step 2: Convert hard imports in colorize.py**
Move module-level `import numpy as np`, `import cv2`, `from PIL import Image` (lines 10-12) inside each function that uses them (`colorize_ddcolor`, `colorize_opencv`, `main`).
- [ ] **Step 3: Convert hard imports in restore.py**
Move module-level `import numpy as np`, `import cv2`, `from PIL import Image` (lines 13-15) inside each function that uses them.
- [ ] **Step 4: Commit**
```bash
git add packages/ai/python/dispatcher.py packages/ai/python/colorize.py packages/ai/python/restore.py
git commit -m "feat: add feature gating to Python dispatcher, convert hard imports to lazy"
```
---
### Task 8: Frontend Features Store and API Error Extension
**Files:**
- Create: `apps/web/src/stores/features-store.ts`
- Modify: `apps/web/src/lib/api.ts`
- Modify: `apps/web/src/hooks/use-tool-processor.ts`
- Modify: `apps/web/src/hooks/use-pipeline-processor.ts`
- [ ] **Step 1: Create the features store**
Create `apps/web/src/stores/features-store.ts` following the `settings-store.ts` pattern. Zustand store with `bundles: FeatureBundleState[]`, `loaded: boolean`, `fetch()` (one-shot), `refresh()` (force re-fetch), `isToolInstalled(toolId)`, `getBundleForTool(toolId)`. Fetches from `GET /api/v1/features`.
- [ ] **Step 2: Extend parseApiError for FEATURE_NOT_INSTALLED**
In `apps/web/src/lib/api.ts`, add a `FeatureNotInstalledError` interface export: `{ type: "feature_not_installed"; feature: string; featureName: string; estimatedSize: string }`.
Modify `parseApiError` return type to `string | FeatureNotInstalledError`. Add early return when `body.code === "FEATURE_NOT_INSTALLED"`.
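A sketch of the extended shape — the structured-error early return is per the spec above, while the plain-string fallback shown here is illustrative, not the app's real error parsing:

```ts
interface FeatureNotInstalledError {
  type: "feature_not_installed";
  feature: string;
  featureName: string;
  estimatedSize: string;
}

function parseApiError(body: any, status: number): string | FeatureNotInstalledError {
  // Early return: preserve the structured payload so callers can render
  // an install prompt instead of a generic error string.
  if (body?.code === "FEATURE_NOT_INSTALLED") {
    return {
      type: "feature_not_installed",
      feature: body.feature,
      featureName: body.featureName,
      estimatedSize: body.estimatedSize,
    };
  }
  return body?.error ?? `Request failed with status ${status}`;
}
```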
- [ ] **Step 3: Update use-tool-processor.ts and use-pipeline-processor.ts**
In both hooks, where `parseApiError` is called and passed to `setError()`, add a type check:
```ts
const parsed = parseApiError(body, xhr.status);
if (typeof parsed === "object" && parsed.type === "feature_not_installed") {
  setError(`Feature "${parsed.featureName}" is not installed. Enable it in Settings.`);
} else {
  setError(parsed);
}
```
- [ ] **Step 4: Commit**
```bash
git add apps/web/src/stores/features-store.ts apps/web/src/lib/api.ts apps/web/src/hooks/use-tool-processor.ts apps/web/src/hooks/use-pipeline-processor.ts
git commit -m "feat: add frontend features store and FEATURE_NOT_INSTALLED error handling"
```
---
### Task 9: Frontend Tool Grid Badge
**Files:**
- Modify: `apps/web/src/components/common/tool-card.tsx`
- Modify: `apps/web/src/components/layout/tool-panel.tsx`
- Modify: `apps/web/src/pages/fullscreen-grid-page.tsx`
- [ ] **Step 1: Add download badge to ToolCard**
Import `useFeaturesStore`, `PYTHON_SIDECAR_TOOLS`, and `Download` icon from lucide-react. Compute `showDownloadBadge` when the tool is an AI tool and not installed. Render a `<Download className="h-3.5 w-3.5 text-muted-foreground" />` icon after the experimental badge.
- [ ] **Step 2: Fetch features on app load**
In `tool-panel.tsx`, add `useFeaturesStore().fetch()` in a useEffect alongside the existing settings fetch. Do the same in `fullscreen-grid-page.tsx`.
- [ ] **Step 3: Commit**
```bash
git add apps/web/src/components/common/tool-card.tsx apps/web/src/components/layout/tool-panel.tsx apps/web/src/pages/fullscreen-grid-page.tsx
git commit -m "feat: add download badge to uninstalled AI tools in tool grid"
```
---
### Task 10: Frontend Tool Page Install Prompt
**Files:**
- Create: `apps/web/src/components/features/feature-install-prompt.tsx`
- Modify: `apps/web/src/pages/tool-page.tsx`
- [ ] **Step 1: Create the FeatureInstallPrompt component**
Props: `{ bundle: FeatureBundleState; isAdmin: boolean }`.
For non-admins: show centered Download icon + "Feature Not Enabled" heading + "Ask your administrator" text.
For admins: show Download icon + bundle name/description + "requires additional download (~{estimatedSize})" + [Enable Feature] button. On click: POST to install endpoint, open EventSource for SSE progress, show progress bar with stage text and percent. On completion: call `useFeaturesStore().refresh()` to trigger re-render. On error: show error message with retry option.
Use same Tailwind patterns as existing components: `bg-primary text-primary-foreground` for buttons, `Loader2 animate-spin` for loading, `text-destructive` for errors.
**Robustness requirements for the frontend:**
- **Double-click prevention:** Set `installing = true` immediately on first click (before the API call). The button must be `disabled={installing || bundle.status === "installing"}`. This prevents any re-click.
- **Browser close / navigate away:** The server-side install continues regardless. On component mount, check `bundle.status` from the features store. If it's `"installing"`, immediately show the progress bar and open EventSource for the in-progress job (fetch `jobId` from the features endpoint or use the bundle's progress data).
- **SSE connection loss fallback:** If EventSource fires `onerror`, close it and fall back to polling `GET /api/v1/features` every 3 seconds via `setInterval`. When status changes from `"installing"` to `"installed"` or `"error"`, stop polling and update UI.
- **Page refresh during install:** The features store's `fetch()` returns current status. If a bundle is `"installing"`, the component renders progress state immediately — no need for the user to click anything.
- **Multiple admin sessions:** All sessions see the same `"installing"` status from the shared `GET /api/v1/features` endpoint. The server's install lock prevents concurrent installs. Any session trying to install gets a 409.
- **Retry after error:** Show a "Retry" button when status is `"error"`. On retry, call the install endpoint again (the lock is released on failure, so this works). pip cache means previously-downloaded wheels aren't re-downloaded. Idempotent model downloads skip already-complete files.
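The SSE-loss fallback reduces to "poll until the bundle leaves `installing`". A minimal sketch, with `getStatus` standing in for a fetch of `GET /api/v1/features` (the real component uses `setInterval` and also clears it on unmount):

```ts
type FeatureStatus = "not_installed" | "installing" | "installed" | "error";

// Resolves with the settled status once the bundle is no longer "installing".
async function pollUntilSettled(
  getStatus: () => Promise<FeatureStatus>,
  intervalMs = 3000,
): Promise<FeatureStatus> {
  for (;;) {
    const status = await getStatus();
    if (status !== "installing") return status;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```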
- [ ] **Step 2: Integrate into ToolPage**
In `tool-page.tsx`: import `useFeaturesStore`, `PYTHON_SIDECAR_TOOLS`, `useAuth`, and `FeatureInstallPrompt`. After the tool/registryEntry lookup, compute `isAiTool`, `toolInstalled`, `featureBundle`, `isAdmin`. After the "Tool not found" guard, add a guard that renders `<FeatureInstallPrompt>` wrapped in `<AppLayout>` when the tool is AI and not installed.
- [ ] **Step 3: Commit**
```bash
git add apps/web/src/components/features/feature-install-prompt.tsx apps/web/src/pages/tool-page.tsx
git commit -m "feat: add feature install prompt on uninstalled AI tool pages"
```
---
### Task 11: Settings AI Features Section
**Files:**
- Create: `apps/web/src/components/settings/ai-features-section.tsx`
- Modify: `apps/web/src/components/settings/settings-dialog.tsx`
- [ ] **Step 1: Create AiFeaturesSection component**
Follow the card-based layout of existing sections in `settings-dialog.tsx`. Use `useFeaturesStore()`. Render each bundle as a bordered card (`rounded-lg border border-border`) with: name, description, status indicator (green dot = installed, gray = not installed, spinning = installing), estimated size, Install/Uninstall button. Add "Install All" button at top. Show total disk usage at bottom (fetch from `GET /api/v1/admin/features/disk-usage`). Reuse the toggle/button patterns from `ToolsSection`.
- [ ] **Step 2: Add section to settings-dialog.tsx**
Add `"ai-features"` to the `Section` type union. Add to `NAV_ITEMS` between `"api-keys"` and `"tools"`: `{ id: "ai-features", label: "AI Features", icon: Sparkles, requiredPermission: "settings:write" }`. Import `Sparkles` from lucide-react. Add `{section === "ai-features" && <AiFeaturesSection />}` to the conditional render block. Lazy-import `AiFeaturesSection` from `"./ai-features-section"`.
- [ ] **Step 3: Commit**
```bash
git add apps/web/src/components/settings/ai-features-section.tsx apps/web/src/components/settings/settings-dialog.tsx
git commit -m "feat: add AI Features settings panel for managing feature bundles"
```
---
### Task 12: Dockerfile Restructuring
**Files:**
- Modify: `docker/Dockerfile`
- Modify: `docker/entrypoint.sh`
- [ ] **Step 1: Modify the Dockerfile**
In `docker/Dockerfile` production stage:
1. **Keep**: base image selection, Node.js install, pnpm setup, system packages, Python venv creation with base packages (numpy, Pillow, opencv)
2. **Remove**: all ML pip install commands (lines 175-206: onnxruntime, rembg, realesrgan, paddlepaddle, mediapipe, codeformer)
3. **Remove**: download_models.py COPY and RUN (lines 219-231)
4. **Remove**: the `apt-get purge build-essential python3-dev` line (line 251) so build-essential stays for runtime pip installs
5. **Add**: `COPY docker/feature-manifest.json /app/docker/feature-manifest.json`
6. **Add**: `COPY packages/ai/python/install_feature.py /app/packages/ai/python/install_feature.py`
7. **Update** env vars: `PYTHON_VENV_PATH=/data/ai/venv`, add `MODELS_PATH=/data/ai/models`, add `DATA_DIR=/data`
- [ ] **Step 2: Update entrypoint.sh for venv bootstrap**
Add venv bootstrap after auth defaults and before volume permission fix. Use atomic directory rename to prevent corrupt venv from partial copy:
```sh
AI_VENV="/data/ai/venv"
AI_VENV_TMP="/data/ai/venv.bootstrapping"

# Clean up any interrupted bootstrap from a previous start
if [ -d "$AI_VENV_TMP" ]; then
  echo "Cleaning up interrupted venv bootstrap..."
  rm -rf "$AI_VENV_TMP"
fi

# Bootstrap AI venv from base image on first run
if [ ! -d "$AI_VENV" ] && [ -d "/opt/venv" ]; then
  echo "Bootstrapping AI venv from base image..."
  mkdir -p /data/ai/models /data/ai/pip-cache
  cp -r /opt/venv "$AI_VENV_TMP"
  mv "$AI_VENV_TMP" "$AI_VENV"
  echo "AI venv ready at $AI_VENV"
fi
```
The `cp -r` + `mv` pattern ensures `/data/ai/venv` is either fully present or absent — never half-copied. If the container is killed during `cp -r`, the `.bootstrapping` directory is cleaned up on next start.
- [ ] **Step 3: Build and verify**
```bash
docker build -f docker/Dockerfile -t ashim:dev .
docker images ashim:dev --format "{{.Size}}"
```
Expected: Image size ~5-6 GB (amd64) instead of ~30 GB.
- [ ] **Step 4: Commit**
```bash
git add docker/Dockerfile docker/entrypoint.sh
git commit -m "feat: restructure Dockerfile to remove ML packages and models
Base image now includes only Node.js + Sharp + Python with base deps.
AI features are downloaded on-demand via the feature install system.
Image reduced from ~30GB to ~5-6GB (amd64) / ~2-3GB (arm64)."
```
---
### Task 13: Integration Testing
**Files:**
- Create: `tests/e2e-docker/features.spec.ts`
- [ ] **Step 1: Create Docker e2e tests for feature system**
Create `tests/e2e-docker/features.spec.ts` using the existing `playwright.docker.config.ts` infrastructure:
```ts
import { expect, test } from "@playwright/test";
test.describe("On-demand AI features", () => {
  test("GET /api/v1/features returns all 6 bundles", async ({ request }) => {
    const response = await request.get("/api/v1/features");
    expect(response.ok()).toBeTruthy();
    const data = await response.json();
    expect(data.bundles).toHaveLength(6);
    for (const bundle of data.bundles) {
      expect(bundle).toHaveProperty("id");
      expect(bundle).toHaveProperty("name");
      expect(bundle).toHaveProperty("status");
      expect(bundle).toHaveProperty("enablesTools");
    }
  });

  test("AI tool returns 501 FEATURE_NOT_INSTALLED when bundle not installed", async ({ request }) => {
    const pngBuffer = Buffer.from(
      "iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAADUlEQVR42mP8/5+hHgAHggJ/PchI7wAAAABJRU5ErkJggg==",
      "base64",
    );
    const response = await request.post("/api/v1/tools/remove-background", {
      multipart: {
        file: { name: "test.png", mimeType: "image/png", buffer: pngBuffer },
        settings: JSON.stringify({}),
      },
    });
    expect(response.status()).toBe(501);
    const body = await response.json();
    expect(body.code).toBe("FEATURE_NOT_INSTALLED");
    expect(body.feature).toBe("background-removal");
  });

  test("uninstalled AI tool page shows install prompt for admin", async ({ page }) => {
    await page.goto("/remove-background");
    await expect(page.getByText("Enable")).toBeVisible({ timeout: 10000 });
    await expect(page.getByText("additional download")).toBeVisible();
  });
});
```
- [ ] **Step 2: Commit**
```bash
git add tests/e2e-docker/features.spec.ts
git commit -m "test: add e2e tests for on-demand AI feature system"
```
---
### Task Summary
| Task | Description | Key Files |
|------|------------|-----------|
| 1 | Shared types and bundle definitions | `packages/shared/src/features.ts` |
| 2 | Feature manifest JSON | `docker/feature-manifest.json` |
| 3 | Backend feature status service | `apps/api/src/lib/feature-status.ts` |
| 4 | Feature API routes | `apps/api/src/routes/features.ts` |
| 5 | Python install script | `packages/ai/python/install_feature.py` |
| 6 | Tool route guards | `tool-factory.ts`, `batch.ts`, `pipeline.ts` |
| 7 | Bridge + Python sidecar changes | `dispatcher.py`, `colorize.py`, `restore.py` |
| 8 | Frontend features store + error handling | `features-store.ts`, `api.ts` |
| 9 | Frontend tool grid badge | `tool-card.tsx`, `tool-panel.tsx` |
| 10 | Frontend tool page install prompt | `feature-install-prompt.tsx`, `tool-page.tsx` |
| 11 | Settings AI Features section | `ai-features-section.tsx`, `settings-dialog.tsx` |
| 12 | Dockerfile restructuring | `Dockerfile`, `entrypoint.sh` |
| 13 | Integration testing | `tests/e2e-docker/features.spec.ts` |

---
# On-Demand AI Feature Downloads
**Date:** 2026-04-17
**Status:** Approved
**Goal:** Reduce Docker image from ~30 GB to ~5-6 GB (amd64) / ~2-3 GB (arm64) by making AI features downloadable post-install.
## Problem
The Docker image bundles all Python ML packages (~8-10 GB) and model weights (~5-8 GB) regardless of whether users need AI features. Users who only want basic image tools (resize, crop, convert) must pull ~30 GB.
## Design Decisions
- **Single Docker image** — no lite/full variants
- **Individual feature bundles** — users cherry-pick by feature name, not model name
- **Admin-only downloads** — only admins can enable/disable AI features
- **AI tools visible with badge** — uninstalled tools appear in grid with a download indicator
- **Both tool-page and settings UI** — admins can download from the tool page or from a central management panel in settings
## Architecture
### Base Image Contents
The base image includes everything needed for non-AI tools plus the prerequisites for AI feature installation:
| Component | Rationale |
|-----------|-----------|
| Node.js 22 + pnpm + app source + frontend dist | Core application |
| Sharp, imagemagick, tesseract-ocr, potrace, libheif, exiftool | Non-AI image processing |
| caire binary | Content-aware resize |
| Python 3 + pip + build-essential | Required for pip install at runtime |
| numpy==1.26.4, Pillow, opencv-python-headless | Shared by all AI features, small (~300 MB) |
| CUDA runtime (amd64 only, from nvidia/cuda base) | Required for GPU-accelerated AI |
**Estimated size:** ~5-6 GB (amd64), ~2-3 GB (arm64)
### Feature Bundles
Six user-facing bundles, named by what they enable (not by model names). **Each tool belongs to exactly one bundle — no partial functionality.** When a bundle is installed, all its tools work fully. When it's not installed, those tools are locked entirely.
| Feature Name | Python Packages | Models | Tools Fully Enabled | Est. Size |
|---|---|---|---|---|
| **Background Removal** | rembg, onnxruntime(-gpu), mediapipe | birefnet-general-lite, blaze_face, face_landmarker | remove-background, passport-photo | ~700 MB - 1 GB |
| **Face Detection** | mediapipe | blaze_face, face_landmarker | blur-faces, red-eye-removal, smart-crop | ~200-300 MB |
| **Object Eraser & Colorize** | onnxruntime(-gpu) | LaMa ONNX, DDColor ONNX, OpenCV colorize | erase-object, colorize | ~600-800 MB |
| **Upscale & Enhance** | torch, torchvision, realesrgan, codeformer-pip (--no-deps), gfpgan, basicsr, lpips | RealESRGAN x4plus, GFPGANv1.3, CodeFormer (.pth), facexlib, SCUNet, NAFNet | upscale, enhance-faces, noise-removal | ~4-5 GB |
| **Photo Restoration** | onnxruntime(-gpu), mediapipe | LaMa ONNX, DDColor ONNX, CodeFormer ONNX, blaze_face, face_landmarker, OpenCV colorize | restore-photo | ~800 MB - 1 GB |
| **OCR** | paddlepaddle(-gpu), paddleocr | PP-OCRv5 (7 models), PaddleOCR-VL 1.5 | ocr | ~3-4 GB |
Notes:
- `passport-photo` is in the Background Removal bundle because it primarily needs rembg; mediapipe (for face landmarks) is included in the same bundle so the tool works fully
- `noise-removal` is in the Upscale & Enhance bundle because its quality/maximum tiers need PyTorch; all 4 tiers (including OpenCV-based quick/balanced) are locked until the bundle is installed
- `ocr` is fully locked until the OCR bundle is installed, including the Tesseract-based fast tier — this keeps the UX clean even though Tesseract is pre-installed in the base image
- `restore-photo` is its own bundle because it needs models from multiple domains (inpainting, face enhancement, colorization); all stages work when installed
- Some packages appear in multiple bundles (e.g., mediapipe in Background Removal, Face Detection, and Photo Restoration; onnxruntime in Background Removal, Object Eraser, and Photo Restoration). The install script skips already-installed packages — pip handles this naturally
- Some models appear in multiple bundles (e.g., blaze_face in both Background Removal and Face Detection). The install script skips already-downloaded model files
### Bundle Dependencies
```
Background Removal ───── standalone
Face Detection ────────── standalone
Object Eraser & Colorize ── standalone
Upscale & Enhance ─────── standalone
Photo Restoration ─────── standalone
OCR ───────────────────── standalone
```
All bundles are independently installable. Shared packages (mediapipe, onnxruntime) and shared models (blaze_face, LaMa, etc.) are silently skipped if already present from another bundle.
### Single Venv Strategy
The current architecture uses a single venv at `/opt/venv` (set via `PYTHON_VENV_PATH`). The bridge (`bridge.ts`) constructs `${venvPath}/bin/python3` — it can only point to one interpreter. Having two venvs (base at `/opt/venv`, features at `/data/ai/venv/`) is fragile: C extensions and entry points reference their venv prefix, and `PYTHONPATH` hacks break in practice.
**Solution:** Use a single venv on the persistent volume at `/data/ai/venv/`.
- The Dockerfile creates `/opt/venv` with base packages (numpy, Pillow, opencv) as before
- The entrypoint script bootstraps `/data/ai/venv/` on first run by copying `/opt/venv` into it (fast file copy, ~300 MB)
- `PYTHON_VENV_PATH` is set to `/data/ai/venv/` so the bridge uses it
- Feature installs add packages to this same venv
- On container update, the entrypoint checks if base package versions changed and updates the venv accordingly (pip install from wheel cache)
This gives us one venv with all packages, living on a persistent volume, bootstrapped from the image's base packages.
### Persistent Storage
All AI data lives under `/data/ai/` on the existing Docker volume (no docker-compose changes):
```
/data/ai/
venv/ # Single Python virtual environment (bootstrapped from /opt/venv, extended by feature installs)
models/ # Downloaded model weight files (same structure as /opt/models/)
pip-cache/ # Wheel cache for fast re-installs after updates
installed.json # Tracks installed bundles, versions, timestamps
```
### Feature Manifest
A `feature-manifest.json` file is baked into each Docker image at build time. It is the single source of truth for what each bundle installs:
```json
{
"manifestVersion": 1,
"imageVersion": "1.16.0",
"pythonVersion": "3.12",
"basePackages": ["numpy==1.26.4", "Pillow==11.1.0", "opencv-python-headless==4.10.0.84"],
"bundles": {
"background-removal": {
"name": "Background Removal",
"description": "Remove image backgrounds with AI",
"packages": {
"common": ["rembg==2.0.62"],
"amd64": ["onnxruntime-gpu==1.20.1", "mediapipe==0.10.21"],
"arm64": ["onnxruntime==1.20.1", "rembg[cpu]==2.0.62", "mediapipe==0.10.18"]
},
"pipFlags": {},
"models": [
{
"id": "birefnet-general-lite",
"downloadFn": "rembg_session",
"args": ["birefnet-general-lite"]
},
{
"id": "blaze-face-short-range",
"url": "https://storage.googleapis.com/mediapipe-models/face_detector/blaze_face_short_range/float16/latest/blaze_face_short_range.tflite",
"path": "mediapipe/blaze_face_short_range.tflite",
"minSize": 100000
},
{
"id": "face-landmarker",
"url": "https://storage.googleapis.com/mediapipe-models/face_landmarker/face_landmarker/float16/latest/face_landmarker.task",
"path": "mediapipe/face_landmarker.task",
"minSize": 5000000
}
],
"enablesTools": ["remove-background", "passport-photo"]
},
"upscale-enhance": {
"name": "Upscale & Enhance",
"description": "AI upscaling, face enhancement, and noise removal",
"packages": {
"common": ["codeformer-pip==0.0.4", "lpips"],
"amd64": [
"torch torchvision --extra-index-url https://download.pytorch.org/whl/cu126",
"realesrgan==0.3.0 --extra-index-url https://download.pytorch.org/whl/cu126"
],
"arm64": ["torch", "torchvision", "realesrgan==0.3.0"]
},
"pipFlags": {
"codeformer-pip==0.0.4": "--no-deps"
},
"postInstall": ["pip install numpy==1.26.4"],
"models": [
{ "id": "realesrgan-x4plus", "url": "https://github.com/xinntao/Real-ESRGAN/releases/download/v0.1.0/RealESRGAN_x4plus.pth", "path": "realesrgan/RealESRGAN_x4plus.pth", "minSize": 67000000 },
{ "id": "gfpgan-v1.3", "url": "https://github.com/TencentARC/GFPGAN/releases/download/v1.3.0/GFPGANv1.3.pth", "path": "gfpgan/GFPGANv1.3.pth", "minSize": 332000000 },
{ "id": "codeformer-pth", "url": "https://github.com/sczhou/CodeFormer/releases/download/v0.1.0/codeformer.pth", "path": "codeformer/codeformer.pth", "minSize": 375000000 },
{ "id": "codeformer-onnx", "url": "hf://facefusion/models-3.0.0/codeformer.onnx", "path": "codeformer/codeformer.onnx", "minSize": 377000000 },
{ "id": "facexlib-detection", "url": "https://github.com/xinntao/facexlib/releases/download/v0.1.0/detection_Resnet50_Final.pth", "path": "gfpgan/facelib/detection_Resnet50_Final.pth", "minSize": 104000000 },
{ "id": "facexlib-parsing", "url": "https://github.com/xinntao/facexlib/releases/download/v0.2.2/parsing_parsenet.pth", "path": "gfpgan/facelib/parsing_parsenet.pth", "minSize": 85000000 },
{ "id": "scunet", "url": "https://github.com/cszn/KAIR/releases/download/v1.0/scunet_color_real_psnr.pth", "path": "scunet/scunet_color_real_psnr.pth", "minSize": 4000000 },
{ "id": "nafnet", "url": "hf://mikestealth/nafnet-models/NAFNet-SIDD-width64.pth", "path": "nafnet/NAFNet-SIDD-width64.pth", "minSize": 67000000 }
],
"enablesTools": ["upscale", "enhance-faces", "noise-removal"]
}
}
}
```
### Install Script
A Python script (`packages/ai/python/install_feature.py`) handles feature installation:
1. Reads the feature manifest from the image
2. Detects architecture (amd64/arm64) and GPU availability
3. Creates or reuses the venv at `/data/ai/venv/`
4. Runs pip install with the correct packages, flags, and index URLs per platform
5. Handles the numpy version conflict (--no-deps for codeformer, re-pin numpy)
6. Downloads model weights with retry logic (ported from `download_models.py`)
7. Updates `/data/ai/installed.json` with bundle status
8. Reports progress to stdout as JSON lines (consumed by the Node bridge)
The script must be idempotent — running it twice for the same bundle is a no-op.
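Idempotency for the model step reduces to one rule: skip a model only if its file exists and meets the manifest's `minSize`. A hedged sketch of that rule (TypeScript for illustration; the real script is Python, and `needsDownload` is a hypothetical name):

```typescript
interface ModelEntry {
  id: string;
  path: string;     // relative path under /data/ai/models/
  minSize: number;  // minimum valid size in bytes, from the manifest
}

// statSize returns the file's size in bytes, or null if it does not exist.
// Injected so the rule is easy to test; the real script would wrap os.stat.
function needsDownload(
  model: ModelEntry,
  statSize: (path: string) => number | null,
): boolean {
  const size = statSize(model.path);
  // Absent or undersized (truncated/corrupt) files are re-downloaded.
  return size === null || size < model.minSize;
}
```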
### Uninstall and Shared Package Strategy
Bundles share Python packages (e.g., onnxruntime in Background Removal, Object Eraser, and Photo Restoration). Naively pip-uninstalling a bundle's packages could break other installed bundles.
**v1 approach (simple):** Uninstall removes model files and updates `installed.json`. Orphaned pip packages stay in the venv — they use disk but don't cause issues. A "Clean up" button in the AI Features settings panel rebuilds the venv from scratch: creates a fresh venv, installs only packages needed by currently-installed bundles, removes the old venv.
**Future improvement:** Reference counting — track which bundles need which packages, only remove packages exclusively owned by the target bundle.
### Tool Route Registration for Uninstalled Features
Currently `registerToolRoutes()` either registers a route or doesn't (disabled tools get 404). For uninstalled AI features, we need routes that return a structured error instead of 404.
**Solution: Register ALL tool routes always, add a pre-processing guard.**
In `tool-factory.ts`, after settings validation (around line 198) and before calling `config.process()`, check feature installation status. The response follows the existing `{ error, details }` shape used by `formatZodErrors` (`apps/api/src/lib/errors.ts`) and consumed by `parseApiError` (`apps/web/src/lib/api.ts`):
```typescript
if (isAiTool(config.toolId) && !isFeatureInstalled(config.toolId)) {
const bundle = getBundleForTool(config.toolId);
return reply.status(501).send({
error: "Feature not installed",
code: "FEATURE_NOT_INSTALLED",
feature: bundle.id,
featureName: bundle.name,
estimatedSize: bundle.estimatedSize,
});
}
```
This also applies to:
- `restore-photo.ts` (uses its own route handler, not the factory)
- `batch.ts` — the `getToolConfig(toolId)` call at line 35 is the gating point. Add a feature-installed check alongside the existing 404 check.
- `pipeline.ts` — the pre-validation loops (lines 143-172 for execute, lines 441-462 for batch) already validate all tool IDs before processing starts. Extend to also check feature installation.
**Frontend error detection:** Extend `parseApiError` in `apps/web/src/lib/api.ts` to detect the `FEATURE_NOT_INSTALLED` code and return structured data (bundle id, name, size) instead of a plain error string. This enables `use-tool-processor.ts` and `use-pipeline-processor.ts` (both already use `parseApiError`) to trigger the install prompt rather than showing a generic error.
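A hedged sketch of what the extended error detection might return (the type and function names here are illustrative, not the existing `api.ts` API):

```typescript
interface FeatureNotInstalled {
  kind: "feature_not_installed";
  feature: string;        // bundle id, e.g. "background-removal"
  featureName: string;    // display name, e.g. "Background Removal"
  estimatedSize?: string;
}

// Returns structured data for the FEATURE_NOT_INSTALLED code, or null so the
// caller falls through to the existing plain-string error handling.
function detectFeatureNotInstalled(body: unknown): FeatureNotInstalled | null {
  if (typeof body !== "object" || body === null) return null;
  const b = body as Record<string, unknown>;
  if (b.code !== "FEATURE_NOT_INSTALLED") return null;
  return {
    kind: "feature_not_installed",
    feature: String(b.feature),
    featureName: String(b.featureName),
    estimatedSize: typeof b.estimatedSize === "string" ? b.estimatedSize : undefined,
  };
}
```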
The global Fastify error handler in `apps/api/src/index.ts` (lines 41-51) provides a safety net — any unhandled Python import errors will produce structured JSON rather than crashing.
### API Endpoints
New routes — read endpoint is public (no `/admin/` prefix), mutation endpoints are admin-only:
```
GET /api/v1/features
Returns: list of all bundles with install status, sizes, enabled tools
Auth: any authenticated user (read-only, needed by frontend for badges/tool page state)
Response: {
bundles: [{
id: "background-removal",
name: "Background Removal",
description: "Remove image backgrounds with AI",
status: "not_installed" | "installing" | "installed" | "error",
installedVersion: "1.15.3" | null,
estimatedSize: "700 MB - 1 GB",
enablesTools: ["remove-background"],
progress: { percent: 45, stage: "Downloading models..." } | null,
error: "pip install failed: ..." | null,
dependencies: [] | ["upscale-enhance"]
}]
}
POST /api/v1/admin/features/:bundleId/install
Starts background installation of a feature bundle.
Auth: admin only
Response: { jobId: "uuid" }
SSE progress at: GET /api/v1/jobs/:jobId/progress
POST /api/v1/admin/features/:bundleId/uninstall
Removes a feature bundle (pip packages + models).
Auth: admin only
Response: { ok: true, freedSpace: "500 MB" }
GET /api/v1/admin/features/disk-usage
Returns total disk usage of /data/ai/.
Auth: admin only
Response: { totalBytes: 5368709120, byBundle: { "background-removal": 734003200, ... } }
```
### Background Job Mechanism
Feature installation runs as a background child process (not inline with the HTTP request):
1. `POST /admin/features/:bundleId/install` spawns the install script as a child process
2. Progress is streamed via stdout JSON lines → captured by the Node process → pushed to SSE listeners
3. The existing SSE infrastructure (`/api/v1/jobs/:jobId/progress`) is reused
4. Job status is persisted to the `jobs` table for recovery on restart
5. Only one install can run at a time (mutex). Concurrent install requests return 409 Conflict.
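The stdout-to-SSE plumbing hinges on line-buffered JSON parsing, since a child process's output chunks can split mid-line. A hedged sketch of just the buffering step (`drainJsonLines` is an illustrative name):

```typescript
// Feed each stdout chunk in; complete JSON lines come out as parsed events.
// state.buf carries any partial trailing line over to the next chunk.
function drainJsonLines(state: { buf: string }, chunk: string): unknown[] {
  state.buf += chunk;
  const events: unknown[] = [];
  let nl: number;
  while ((nl = state.buf.indexOf("\n")) >= 0) {
    const line = state.buf.slice(0, nl).trim();
    state.buf = state.buf.slice(nl + 1);
    if (!line) continue;
    try {
      events.push(JSON.parse(line));
    } catch {
      // Skip malformed lines rather than failing the whole install job.
    }
  }
  return events;
}
```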
### Python Sidecar Changes
**dispatcher.py:**
- On startup, read `/data/ai/installed.json` to know which features are available
- Populate `available_modules` based on what's actually installed
- When a script is requested for an uninstalled feature, return a structured error: `{"error": "feature_not_installed", "feature": "background-removal", "message": "Background Removal is not installed"}`
- After a feature is installed, the dispatcher must be restarted (or sent a reload signal) to pick up new packages. The bridge handles this by killing and re-spawning the dispatcher.
**Python scripts:**
- Convert hard module-level imports in `colorize.py` and `restore.py` to lazy imports inside functions
- All scripts should check for their feature's models and return a clear "not installed" error if missing
- The `sys.path` must include `/data/ai/venv/lib/python3.X/site-packages/` (set by the dispatcher on startup based on installed.json)
**Bridge (bridge.ts):**
- Update `PYTHON_VENV_PATH` logic to prefer `/data/ai/venv/` when it exists
- Add a `restartDispatcher()` function called after feature install completes
- Handle the new `feature_not_installed` error type from the dispatcher
### Model Path Resolution
Currently models are at `/opt/models/`. With on-demand downloads, they'll be at `/data/ai/models/`. The resolution order:
1. `/opt/models/<model>` (Docker-baked, for backwards compatibility if someone builds a full image)
2. `/data/ai/models/<model>` (on-demand download location)
3. `~/.cache/ashim/<model>` (local dev fallback)
Environment variables (`U2NET_HOME`, etc.) are updated by the install script to point to `/data/ai/models/`.
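The resolution order can be a simple first-hit scan over candidate roots. A hedged sketch (roots are passed in so the order stays explicit and testable; `resolveModelPath` is a hypothetical name):

```typescript
import { existsSync } from "node:fs";
import { join } from "node:path";
import { homedir } from "node:os";

const MODEL_ROOTS = [
  "/opt/models",                      // Docker-baked (backwards compatibility)
  "/data/ai/models",                  // on-demand download location
  join(homedir(), ".cache", "ashim"), // local dev fallback
];

// First root containing the model wins; null means "not installed anywhere".
function resolveModelPath(model: string, roots: string[] = MODEL_ROOTS): string | null {
  for (const root of roots) {
    const candidate = join(root, model);
    if (existsSync(candidate)) return candidate;
  }
  return null;
}
```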
### Dockerfile Changes
1. Remove all `pip install` commands for ML packages (lines 175-206)
2. Remove `download_models.py` COPY and RUN (lines 219-231)
3. Keep: Python 3 + pip + build-essential (do NOT purge build-essential)
4. Keep: numpy, Pillow, opencv-python-headless install (lightweight shared deps)
5. Add: COPY `feature-manifest.json` into the image
6. Add: COPY `install_feature.py` into the image
7. Update entrypoint to set up `/data/ai/` directory structure on first run
8. Update env vars: `MODELS_PATH=/data/ai/models` as default, fallback to `/opt/models`
### Frontend: Tool Page (Uninstalled State)
When a user navigates to an AI tool that isn't installed:
**For admins:**
- Show a card replacing the normal upload area:
- Feature icon + name (e.g., "Background Removal")
- "This feature requires an additional download (~700 MB - 1 GB)"
- [Enable Feature] button
- After clicking: progress bar with stage text, estimated time
- On completion: page automatically transitions to the normal tool UI
**For non-admins:**
- Show: "This feature is not enabled. Ask your administrator to enable it in Settings."
### Frontend: Tool Grid (Badge)
AI tools in the grid show a small download icon overlay when not installed. When installed, the icon disappears and the tool looks like any other tool.
Each tool maps to exactly one bundle, so the badge simply reflects that bundle's status. For example, passport-photo shows the badge until the Background Removal bundle (which includes its mediapipe dependency) is installed.
### Frontend: Settings Panel
New "AI Features" section in the settings dialog (admin only):
- List of all 6 feature bundles as cards
- Each card shows: name, description, status (installed/not installed/installing), disk usage
- Install/Uninstall buttons per bundle
- "Install All" button at the top
- Total AI disk usage summary at the bottom
- Progress bar during installation
- Dependency warnings, driven by the `dependencies` field in the features API (all six v1 bundles are standalone, so none appear initially)
### Container Update Flow
When a user does `docker pull` + restart:
1. **Pull:** Only app code layers changed → ~50-100 MB download
2. **Startup:** Backend reads feature manifest from new image + installed.json from volume
3. **Comparison:**
- If bundle package versions unchanged → no action, instant startup
- If a package version bumped → `pip install --upgrade` from wheel cache (seconds)
- If a model URL/version changed → re-download that model only
- If Python major version changed → rebuild venv from cached wheels (rare, ~2-5 min)
4. **Dispatcher restart** if any packages changed
This check runs at startup, not blocking the HTTP server. AI features show "Updating..." status until the check completes.
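The version-comparison step can be a set difference between the new manifest's pinned specs and what `installed.json` recorded. A hedged sketch (the real check would also cover model URLs and the Python version; the function name is illustrative):

```typescript
// Specs are exact pins like "numpy==1.26.4", so string equality is enough:
// any spec in the new manifest not recorded verbatim needs a pip run.
function packagesNeedingUpdate(
  manifestSpecs: string[],
  recordedSpecs: string[],
): string[] {
  const have = new Set(recordedSpecs);
  return manifestSpecs.filter((spec) => !have.has(spec));
}
```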
### Robustness and Crash Recovery
The install system must handle every interruption gracefully: double-clicks, browser closes, container restarts mid-install, network failures, disk-full, corrupt downloads, and power loss.
#### Atomic Operations
**Model downloads** — never write directly to the final path:
1. Download to `<model_path>.downloading`
2. Verify file size against `minSize` from manifest
3. `os.rename()` to final path (atomic on same filesystem)
4. If the process dies mid-download, the `.downloading` file is an obvious orphan
**installed.json writes** — never write in-place:
1. Write to `installed.json.tmp`
2. `os.rename()` to `installed.json`
3. If the process dies mid-write, `installed.json` is intact (either old version or doesn't exist)
**Venv bootstrap** (entrypoint) — same pattern:
1. Copy `/opt/venv` to `/data/ai/venv.bootstrapping/`
2. Rename to `/data/ai/venv/` on completion
3. If interrupted, the `.bootstrapping/` directory is cleaned up on next start
#### File-Based Install Lock
In-memory `installInProgress` state is lost on container restart. Use a persistent lock file instead:
**`/data/ai/install.lock`** contains:
```json
{ "bundleId": "background-removal", "startedAt": "2026-04-17T12:00:00Z", "pid": 12345 }
```
- Created before install starts, deleted on success or acknowledged failure
- On server startup, if lock exists: check if PID is alive. If dead → the install was interrupted mid-flight
- If lock is stale (PID dead), mark the bundle as needing cleanup, delete lock
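The liveness probe can use signal 0, which checks process existence without sending anything. A hedged sketch of the startup-side check (`checkStaleLock` is an illustrative name):

```typescript
import { existsSync, readFileSync, unlinkSync } from "node:fs";

interface InstallLock { bundleId: string; startedAt: string; pid: number; }

// kill(pid, 0) delivers no signal; it only probes whether the process exists.
// EPERM means the process exists but belongs to another user, so it is alive.
function isPidAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err: any) {
    return err?.code === "EPERM";
  }
}

// Returns the stale lock (and removes it) if the install was interrupted,
// or null if there is no lock / the install is still running.
function checkStaleLock(lockPath: string): InstallLock | null {
  if (!existsSync(lockPath)) return null;
  const lock = JSON.parse(readFileSync(lockPath, "utf8")) as InstallLock;
  if (isPidAlive(lock.pid)) return null;
  unlinkSync(lockPath);
  return lock;
}
```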
#### Startup Recovery Sequence
On server startup (in `apps/api/src/index.ts`, after `runMigrations()`), run a recovery check:
1. **Clean orphan temp files:** Delete any `*.downloading` files in `/data/ai/models/` (recursive)
2. **Clean orphan JSON:** If `installed.json.tmp` exists, delete it
3. **Clean orphan venv bootstrap:** If `/data/ai/venv.bootstrapping/` exists, delete it
4. **Check install lock:** If `/data/ai/install.lock` exists:
- Read PID from lock file
- If PID is not running → the install was interrupted
- Delete the lock file
- Log: "Previous installation of {bundleId} was interrupted, cleaned up"
5. **Verify installed bundles:** For each bundle in `installed.json`, check that all model files exist and meet minimum sizes from the manifest. If any model is missing or undersized:
- Mark the bundle status as `"error"` with message "Some model files are missing or corrupt. Please reinstall."
- Do NOT automatically remove from installed.json — let the admin decide to reinstall or uninstall
#### Frontend Button Hardening
**Double-click prevention:**
- Disable the button on first click (set `installing = true` immediately, before the API call)
- The button should be `disabled={installing || bundle.status === "installing"}`
- Even if the component re-renders, the disabled state persists from the store
**Browser close / navigate away:**
- The install runs as a server-side child process — it completes regardless of browser state
- When the user returns to the page, `useFeaturesStore.fetch()` picks up current status
- If an install is in progress, the UI shows the progress bar (driven by polling, not just SSE)
**SSE connection loss fallback:**
- The `FeatureInstallPrompt` component uses `EventSource` for real-time progress
- If the `EventSource` connection drops (`onerror`), fall back to polling `GET /api/v1/features` every 3 seconds
- When the install completes (status changes from "installing" to "installed" or "error"), stop polling
**Page refresh during install:**
- On mount, the features store calls `fetch()` which returns current bundle states including install progress
- If a bundle has `status: "installing"`, the component immediately shows the progress bar and opens an EventSource for the in-progress job
**Multiple admins:**
- Server mutex: only one install at a time (409 Conflict)
- The features store status reflects the global state — ALL admin sessions see "installing"
- The install lock file prevents even a container restart from allowing a concurrent install
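The SSE-loss fallback reduces to a polling loop that stops once the bundle leaves "installing". A hedged sketch with the fetch and sleep steps injected so the loop is testable (names are illustrative; in the component, `fetchStatus` would wrap `GET /api/v1/features` and `sleep` would be a 3-second timer):

```typescript
type BundleStatus = "not_installed" | "installing" | "installed" | "error";

// Poll until the install settles, then return the terminal status.
async function pollUntilSettled(
  fetchStatus: () => Promise<BundleStatus>,
  sleep: () => Promise<void> = () => new Promise((r) => setTimeout(r, 3000)),
): Promise<BundleStatus> {
  for (;;) {
    const status = await fetchStatus();
    if (status !== "installing") return status; // stop polling on completion
    await sleep();
  }
}
```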
#### Error Handling
| Scenario | Behavior |
|---|---|
| Double-click on Enable | Button disabled on first click. Second click is no-op. |
| Browser closed mid-install | Server-side install continues. Status visible on next page load. |
| Container restart mid-install | Startup recovery detects stale lock, cleans up `.downloading` files, marks as error. Admin can retry. |
| Network failure mid-pip-install | pip returns non-zero. Install script emits error. Bundle marked as "error" with pip output. Admin can retry (pip cache means previously-downloaded wheels aren't re-downloaded). |
| Network failure mid-model-download | `.downloading` file left behind. Retry 3 times with exponential backoff. On final failure, bundle marked as "error". On retry, `.downloading` file is deleted and re-downloaded. |
| Disk full | Check available disk space at the START of install (before any pip/download). Return clear error: "Not enough disk space. Need ~{estimatedSize}, only {available} available." If disk fills mid-install, pip/download fails, bundle marked as error. |
| pip succeeds, models fail | Bundle is NOT marked as installed. Status is "error" with message about which models failed. Packages remain in venv (harmless). Admin can retry — pip install is idempotent (skip already-installed), only failed models are re-downloaded. |
| Model file corrupt (downloads completely but data is bad) | Verify file size against `minSize` after download. If too small, delete and retry. For rembg/HuggingFace models, the library's own integrity checks apply. |
| installed.json corrupted | Atomic writes prevent this. If somehow corrupted (manual edit, etc.), `JSON.parse` fails, treat as empty (no bundles installed). Log a warning. |
| Power loss | Atomic operations ensure no file is in a half-written state. Startup recovery cleans up orphans. |
| No internet during install | pip fails immediately with a clear network error. Model downloads fail after retries. Bundle marked as "error". |
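The disk-full row relies on a pre-flight check before any pip install or model download starts. A hedged sketch using Node's `statfsSync` (the real check could equally live in the Python script via `shutil.disk_usage`; names are illustrative):

```typescript
import { statfsSync } from "node:fs";

// Available bytes = blocks available to unprivileged users * block size.
function availableBytes(path: string): number {
  const s = statfsSync(path);
  return s.bavail * s.bsize;
}

// Throws with the user-facing message from the error table above the install
// ever touches pip or the network.
function assertEnoughDisk(path: string, requiredBytes: number): void {
  const avail = availableBytes(path);
  if (avail < requiredBytes) {
    throw new Error(
      `Not enough disk space. Need ~${requiredBytes} bytes, only ${avail} available.`,
    );
  }
}
```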
### Testing Strategy
All testing runs against Docker containers using the existing `playwright.docker.config.ts` and `tests/e2e-docker/` infrastructure:
- **Unit tests:** Feature manifest parsing, version comparison logic, bundle dependency resolution (Vitest, excluded from e2e via `vitest.config.ts`)
- **Integration tests:** Install/uninstall API endpoints, status reporting, SSE progress (Vitest integration suite)
- **E2e-docker tests:** Add to `tests/e2e-docker/` alongside existing `fixes-verification.spec.ts`:
- Verify uninstalled AI tool returns 501 with `FEATURE_NOT_INSTALLED` code
- Admin enables a feature from settings, tool page transitions from "not installed" to working
- Non-admin sees "not enabled" message on uninstalled tool page
- Feature install/uninstall round-trip
- **Docker build test:** Verify base image builds without ML packages, verify feature-manifest.json is present (CI, `SKIP_MODEL_DOWNLOADS` already exists)
### Migration Path
Since the new image is fundamentally different (no ML packages baked in), existing users upgrading from the full image will need to re-download their AI features. The Python ML packages are no longer in the system venv, so even if old model weights exist at `/opt/models/`, the features won't work without packages.
The first-run experience for upgrading users:
1. Detect this is an upgrade: no `/data/ai/installed.json` exists, but user data exists in `/data`
2. Show a one-time banner in the UI: "We've reduced the image size from 30 GB to 5 GB! AI features are now downloaded on-demand. Visit Settings → AI Features to enable the ones you need."
3. No automatic downloads — let the admin choose what to install
4. Old model weights at `/opt/models/` are ignored (they won't exist in the new image anyway since that layer is removed)
### Frontend: Feature Status Propagation
The frontend needs to know which tools are installed for three purposes: tool grid badges, tool page state, and settings panel.
**Features store** (`apps/web/src/stores/features-store.ts`):
- Zustand store fetched on app load (like `settings-store.ts`)
- Calls `GET /api/v1/features` to get bundle statuses
- Provides a derived mapping: `toolInstallStatus: Record<string, "installed" | "not_installed" | "installing">` (each tool maps to exactly one bundle, no partial states)
- Provides `isToolInstalled(toolId): boolean` and `getBundleForTool(toolId): BundleInfo | null` helpers
- Refreshes on install/uninstall completion
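The derived mapping is a flat fold over the bundle list. A hedged sketch (surfacing an errored bundle as `not_installed` for badge purposes is an assumption; names are illustrative):

```typescript
type ToolInstallStatus = "installed" | "not_installed" | "installing";

interface BundleStatusInfo {
  id: string;
  status: ToolInstallStatus | "error";
  enablesTools: string[];
}

// Each tool belongs to exactly one bundle, so a single pass suffices.
function deriveToolStatus(
  bundles: BundleStatusInfo[],
): Record<string, ToolInstallStatus> {
  const map: Record<string, ToolInstallStatus> = {};
  for (const bundle of bundles) {
    const status: ToolInstallStatus =
      bundle.status === "error" ? "not_installed" : bundle.status;
    for (const tool of bundle.enablesTools) map[tool] = status;
  }
  return map;
}
```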
**Tool grid integration:**
- `ToolCard` checks `isToolInstalled(tool.id)` from the features store
- If not installed: show a download icon badge (similar to existing "Experimental" badge)
- The tool remains clickable (not disabled) — clicking navigates to the tool page where the install prompt appears
- `PYTHON_SIDECAR_TOOLS` constant is used to determine which tools are AI tools (only AI tools can be "not installed")
**Tool page integration:**
- `ToolPage` component checks feature status after the tool lookup
- If the user is admin and feature not installed: render `FeatureInstallPrompt` component instead of the normal tool UI
- If the user is non-admin and feature not installed: render "This feature is not enabled. Contact your administrator."
- The install prompt shows feature name, description, estimated size, and an "Enable" button
- After clicking "Enable": show progress bar with SSE-streamed progress, auto-transition to normal tool UI on completion
### Development and Testing
All development and testing is done via Docker containers — the same environment users run. Build the image locally and run it with:
```bash
docker run -d --name ashim -p 1349:1349 -v ashim-data:/data ghcr.io/ashim-hq/ashim:latest
```
Auth can be disabled for development by passing `-e AUTH_ENABLED=false`.
### Scope Boundaries
**In scope:**
- Dockerfile restructuring to remove ML packages and models
- Feature manifest system
- Install/uninstall API + background job
- Python sidecar changes for dynamic feature detection
- Frontend: tool page download prompt, grid badge, settings panel
- Container update handling with version manifest
**Out of scope (future work):**
- Additional rembg model variants as sub-downloads within Background Removal
- Automatic feature recommendations based on usage
- Download from private/custom model registries
- Bandwidth throttling for downloads
- Multiple venv support (e.g., different Python versions)