AGENTS.md
This file is for agents developing DataDesigner — the codebase you are working in.
If you are an agent helping a user build a dataset, use the data-designer skill and the product documentation instead.
DataDesigner is an NVIDIA NeMo framework for creating synthetic datasets from scratch. Users declare what their data should look like (columns, types, relationships, validation rules); the engine figures out how to generate it. Every change you make should preserve this "declare, don't orchestrate" contract.
The Layering Is Structural
The data_designer namespace is split across three installable packages that merge at runtime via PEP 420 implicit namespace packages (no top-level __init__.py).
| Package | Path | Owns |
|---|---|---|
| data-designer-config | packages/data-designer-config/ | data_designer.config — column configs, model configs, sampler params, builder API, plugin system, lazy imports |
| data-designer-engine | packages/data-designer-engine/ | data_designer.engine — column generators, dataset builders, DAG execution, model facade, validators, sampling |
| data-designer | packages/data-designer/ | data_designer.interface — public DataDesigner class, results, errors; data_designer.cli — CLI entry point; data_designer.integrations |
Dependency direction (left depends on right): interface → engine → config. Never import against this flow.
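The dependency rule above can be checked mechanically. The following is a minimal sketch, not tooling that ships with the repository: `LAYER_ORDER`, `layer_of`, and `find_reverse_imports` are hypothetical names, and the module paths in the usage note are made up for illustration. It parses a source string and reports absolute imports that reach into a higher layer.

```python
from __future__ import annotations

import ast

# Layers ordered from lowest to highest. A layer may import itself or anything lower.
LAYER_ORDER: list[str] = ["config", "engine", "interface"]


def layer_of(module: str) -> str | None:
    """Return the layer of a data_designer.<layer> module, or None if it has none."""
    parts = module.split(".")
    if len(parts) >= 2 and parts[0] == "data_designer" and parts[1] in LAYER_ORDER:
        return parts[1]
    return None


def find_reverse_imports(source: str, own_module: str) -> list[str]:
    """List imported modules that sit in a higher layer than own_module."""
    own_layer = layer_of(own_module)
    if own_layer is None:
        return []
    own_rank = LAYER_ORDER.index(own_layer)
    violations: list[str] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            imported = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            imported = [node.module]
        else:
            continue
        for name in imported:
            layer = layer_of(name)
            if layer is not None and LAYER_ORDER.index(layer) > own_rank:
                violations.append(name)
    return violations
```

For example, `find_reverse_imports("import data_designer.interface", "data_designer.config.columns")` reports the interface import, while an engine module importing from `data_designer.config` produces an empty list.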
Core Concepts
- Column — a named field in the output dataset, defined by a column config
- Sampler — a built-in statistical generator (UUID, Category, Uniform, Gaussian, Person, DateTime, etc.)
- Seed dataset — an existing dataset used as input for generation
- Processor — a post-generation transformation applied to column values
- Model — an LLM endpoint configured via ModelConfig and accessed through the model facade
- Plugin — a user-supplied extension registered via entry points (custom column generators, validators, profilers)
Core Design Principles
- Declarative config, imperative engine. Users build configs; the engine compiles them into an execution plan. Config objects are data; they never call the engine directly.
- Registries connect types to behavior. Column generators, validators, and profilers are discovered through registries. Adding a new type means registering it, not modifying orchestration code.
- Errors normalize at boundaries. Third-party exceptions are wrapped into canonical project error types at module boundaries. Callers depend on data_designer errors, not leaked internals.
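As an illustration of how these three principles fit together, here is a small self-contained sketch. It is not DataDesigner's actual API: `ColumnConfig`, `register_generator`, `generate`, and `GenerationError` are hypothetical stand-ins chosen for this example. The config is plain data, behavior is looked up through a registry, and low-level failures are wrapped into one project-level error at the boundary.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


class GenerationError(Exception):
    """Hypothetical canonical error type (not the project's real class)."""


@dataclass(frozen=True)
class ColumnConfig:
    """Declarative side: plain data with no reference to the engine."""

    name: str
    column_type: str
    params: dict[str, str]


Generator = Callable[[ColumnConfig], str]
_GENERATORS: dict[str, Generator] = {}


def register_generator(column_type: str) -> Callable[[Generator], Generator]:
    """Register a generator function under a column type name."""

    def decorator(func: Generator) -> Generator:
        _GENERATORS[column_type] = func
        return func

    return decorator


@register_generator("constant")
def generate_constant(config: ColumnConfig) -> str:
    return config.params["value"]


def generate(config: ColumnConfig) -> str:
    """Imperative side: dispatch by type and normalize failures at the boundary."""
    try:
        generator = _GENERATORS[config.column_type]
        return generator(config)
    except KeyError as exc:
        # Covers both an unregistered column type and a missing parameter.
        raise GenerationError(f"Cannot generate column {config.name!r}: missing {exc}") from exc
```

Supporting a new column type in this sketch means adding one decorated function; `generate` itself does not change, and callers only ever need to catch `GenerationError`.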
Structural Invariants
- Import direction — interface → engine → config (left depends on right). No reverse imports.
- Fast imports — heavy third-party libraries are lazy-loaded via data_designer.lazy_heavy_imports. See STYLEGUIDE.md for the pattern.
- No relative imports — absolute imports only, enforced by ruff rule TID.
- Typed code — all functions, methods, and class attributes require type annotations. Modern syntax: list[str], str | None.
- from __future__ import annotations — required in every Python source file.
- Follow established patterns — match the conventions of the module you're editing. When in doubt, read the neighboring code.
- No untested code paths — new logic requires tests. See DEVELOPMENT.md for testing guidance.
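To make a few of these invariants concrete, the sketch below shows a module that starts with `from __future__ import annotations`, uses modern type syntax, and defers an import until first use. It illustrates the general lazy-loading idea only. It does not reproduce the API of data_designer.lazy_heavy_imports (STYLEGUIDE.md is the reference for the real pattern), `lazy_module` is a hypothetical helper, and the standard-library `statistics` module stands in for a heavy third-party dependency.

```python
from __future__ import annotations

import importlib
from functools import cache
from types import ModuleType


@cache
def lazy_module(name: str) -> ModuleType:
    """Import a module on first use and reuse the same object afterwards."""
    return importlib.import_module(name)


def mean(values: list[float]) -> float | None:
    """Return the arithmetic mean, or None for an empty list."""
    if not values:
        return None
    # The import happens here, at call time, rather than at module import time.
    statistics = lazy_module("statistics")
    return statistics.mean(values)
```

Importing a module written this way does not pay for the dependency until `mean` is first called with data, which is the property that `make perf-import` is meant to protect.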
Development
make check-all-fix # format + lint (ruff)
make test # run all test suites
make update-license-headers # add SPDX headers to new files
make perf-import CLEAN=1 # profile import time (run after adding heavy deps)
For full setup, testing, and workflow details see DEVELOPMENT.md.
For code style, naming, and import conventions see STYLEGUIDE.md.
For deeper dives into specific subsystems see architecture/.