DataDesigner/AGENTS.md
Nabin Mulepati 5b9492e6cf
docs: restructure agent and contributor documentation (plan 427, PR 1) (#454)
* docs: restructure agent and contributor documentation

Restructure AGENTS.md from ~627 lines to ~55 lines of high-signal
architectural invariants. Extract code style into STYLEGUIDE.md and
development workflow into DEVELOPMENT.md. Overhaul CONTRIBUTING.md
to reflect agent-assisted development as the primary workflow.

Move skills and sub-agents from .claude/ to .agents/ as the
tool-agnostic home, with symlinks back for Claude Code compatibility.
Add architecture/ skeleton with 10 stub files for incremental
population.

Implements PR 1 of #427.

Made-with: Cursor

* remove obsolete new-sdg skill

The new-sdg skill is superseded by skills/data-designer/, which is the
proper usage skill for building datasets. Update .agents/README.md to
reference the usage skill's actual location.

Made-with: Cursor

* docs: expand style guide and refine development docs

Add docstring conventions (Google style), Pydantic/dataclass guidance,
error handling patterns, and f-string preference to STYLEGUIDE.md.

Clarify per-package test targets, flat test style, e2e API key
requirement, notebook regeneration commands, and import perf threshold
in DEVELOPMENT.md.

Point dataset-building agents to the data-designer skill in AGENTS.md
and clarify dependency direction arrows.

Made-with: Cursor

* docs: link AGENTS.md to architecture/ directory

Made-with: Cursor

* docs: refine CONTRIBUTING.md contribution workflow

Add plan document step, self-review with multi-model passes,
automated CI review expectations, and comment resolution protocol.

Made-with: Cursor

* docs: add architecture/ to PR 2 scope and link from AGENTS.md

Move architecture doc population from deferred/incremental to PR 2
since the subsystems already exist. Update plan delivery strategy,
execution order, and out-of-scope sections accordingly.

Made-with: Cursor

* docs: address PR review comments on style guide, dev guide, and contributing

Replace pd.DataFrame with list[dict[str, str]] in naming example to
avoid contradicting lazy-import guidance in the same file. Soften
"enforced by SIM" to note SIM rules are not yet enabled in CI. Fix
upstream sync instructions for fork-based contributors. Update
copyright year in CONTRIBUTING.md from 2025 to 2026 to match
STYLEGUIDE.md.
2026-03-25 12:38:42 -06:00

AGENTS.md

This file is for agents developing DataDesigner — the codebase you are working in. If you are an agent helping a user build a dataset, use the data-designer skill and the product documentation instead.

DataDesigner is an NVIDIA NeMo framework for creating synthetic datasets from scratch. Users declare what their data should look like (columns, types, relationships, validation rules); the engine figures out how to generate it. Every change you make should preserve this "declare, don't orchestrate" contract.

The Layering Is Structural

The data_designer namespace is split across three installable packages that merge at runtime via PEP 420 implicit namespace packages (no top-level __init__.py).

| Package | Path | Owns |
| --- | --- | --- |
| data-designer-config | packages/data-designer-config/ | data_designer.config: column configs, model configs, sampler params, builder API, plugin system, lazy imports |
| data-designer-engine | packages/data-designer-engine/ | data_designer.engine: column generators, dataset builders, DAG execution, model facade, validators, sampling |
| data-designer | packages/data-designer/ | data_designer.interface: public DataDesigner class, results, errors; data_designer.cli: CLI entry point; data_designer.integrations |
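
As a rough illustration of how PEP 420 lets separate source roots share one namespace, the sketch below builds two throwaway "distributions" in a temporary directory and imports from both. The names demo_ns, dist_config, and dist_engine are hypothetical stand-ins, not the real package layout; the only point is that neither root contains a demo_ns/__init__.py.

```python
from __future__ import annotations

import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical layout: (source root, subpackage, marker value).
DEMO_PORTIONS = [
    ("dist_config", "config", "config layer"),
    ("dist_engine", "engine", "engine layer"),
]


def build_demo_distributions(root: Path) -> list[Path]:
    """Create two source roots that both contribute to the demo_ns namespace."""
    roots: list[Path] = []
    for dist, subpackage, value in DEMO_PORTIONS:
        src = root / dist
        pkg = src / "demo_ns" / subpackage
        pkg.mkdir(parents=True)
        # No demo_ns/__init__.py is written. That absence is what makes
        # demo_ns an implicit namespace package.
        (pkg / "__init__.py").write_text(f"NAME = {value!r}\n")
        roots.append(src)
    return roots


def import_from_both(root: Path) -> tuple[str, str, int]:
    """Import both subpackages and report how many roots back the namespace."""
    roots = build_demo_distributions(root)
    sys.path[:0] = [str(p) for p in roots]
    importlib.invalidate_caches()
    try:
        config = importlib.import_module("demo_ns.config")
        engine = importlib.import_module("demo_ns.engine")
        namespace = sys.modules["demo_ns"]
        return config.NAME, engine.NAME, len(list(namespace.__path__))
    finally:
        for p in roots:
            sys.path.remove(str(p))


with tempfile.TemporaryDirectory() as tmp:
    result = import_from_both(Path(tmp))
print(result)
```

Adding a demo_ns/__init__.py to either root would turn that root into a regular package and hide the other portion, which is why the real packages must not ship a top-level __init__.py.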

Dependency direction (left depends on right): interface → engine → config. Never import against this flow.
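
One way to check the direction rule mechanically is to parse a file's imports and compare layer ranks. The following is a minimal sketch, not the project's actual tooling: the function names and the module names in the usage lines are hypothetical, and it only recognises absolute imports under the three layers named above (data_designer.cli and data_designer.integrations are not ranked here).

```python
from __future__ import annotations

import ast

# A module may import from its own layer or from a lower-ranked layer only.
LAYER_RANK = {"config": 0, "engine": 1, "interface": 2}


def layer_of(module_name: str) -> str | None:
    """Return the layer for data_designer.<layer>..., or None if not layered."""
    parts = module_name.split(".")
    if len(parts) >= 2 and parts[0] == "data_designer" and parts[1] in LAYER_RANK:
        return parts[1]
    return None


def find_reverse_imports(source: str, importer: str) -> list[str]:
    """List imported modules that sit in a higher layer than the importer."""
    importer_layer = layer_of(importer)
    if importer_layer is None:
        return []
    violations: list[str] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            targets = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            targets = [node.module]
        else:
            continue
        for target in targets:
            target_layer = layer_of(target)
            if target_layer is None:
                continue
            if LAYER_RANK[target_layer] > LAYER_RANK[importer_layer]:
                violations.append(target)
    return violations


# Hypothetical module names, used only to exercise the checker.
sample = "from data_designer.engine.x import y\nimport data_designer.config.a\n"
print(find_reverse_imports(sample, "data_designer.config.columns"))
```

The sketch misses forms such as `from data_designer import engine`, so treat it as a starting point for a lint check rather than a complete guard.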

Core Concepts

  • Column — a named field in the output dataset, defined by a column config
  • Sampler — a built-in statistical generator (UUID, Category, Uniform, Gaussian, Person, DateTime, etc.)
  • Seed dataset — an existing dataset used as input for generation
  • Processor — a post-generation transformation applied to column values
  • Model — an LLM endpoint configured via ModelConfig and accessed through the model facade
  • Plugin — a user-supplied extension registered via entry points (custom column generators, validators, profilers)

Core Design Principles

  1. Declarative config, imperative engine. Users build configs; the engine compiles them into an execution plan. Config objects are data; they never call the engine directly.
  2. Registries connect types to behavior. Column generators, validators, and profilers are discovered through registries. Adding a new type means registering it, not modifying orchestration code.
  3. Errors normalize at boundaries. Third-party exceptions are wrapped into canonical project error types at module boundaries. Callers depend on data_designer errors, not leaked internals.
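
The three principles can be sketched together in a few lines. Everything below is illustrative and assumed for the example: GenerationError, the two config classes, the GENERATORS registry, and run are hypothetical names, not DataDesigner's API. Configs are plain frozen data, a registry maps each config type to its generator, and a third-party-style exception (here from json) is wrapped before it leaves the generator.

```python
from __future__ import annotations

import json
from dataclasses import dataclass
from typing import Any, Callable


class GenerationError(Exception):
    """Hypothetical canonical error type; the real project defines its own."""


@dataclass(frozen=True)
class ConstantColumnConfig:
    name: str
    value: str


@dataclass(frozen=True)
class JsonFieldColumnConfig:
    name: str
    payload: str
    key: str


Generator = Callable[[Any, int], list[str]]
GENERATORS: dict[type, Generator] = {}


def register(config_type: type) -> Callable[[Generator], Generator]:
    """Associate a config type with the function that generates its values."""

    def decorator(fn: Generator) -> Generator:
        GENERATORS[config_type] = fn
        return fn

    return decorator


@register(ConstantColumnConfig)
def generate_constant(config: ConstantColumnConfig, num_records: int) -> list[str]:
    return [config.value] * num_records


@register(JsonFieldColumnConfig)
def generate_json_field(config: JsonFieldColumnConfig, num_records: int) -> list[str]:
    try:
        value = json.loads(config.payload)[config.key]
    except (json.JSONDecodeError, KeyError, TypeError) as exc:
        # Boundary: callers see GenerationError, never json's own exceptions.
        raise GenerationError(f"column {config.name!r}: cannot read key {config.key!r}") from exc
    return [str(value)] * num_records


def run(configs: list[Any], num_records: int) -> dict[str, list[str]]:
    """Orchestration looks generators up by type and never branches on them."""
    return {c.name: GENERATORS[type(c)](c, num_records) for c in configs}


print(run([ConstantColumnConfig("greeting", "hi")], 2))
```

Supporting a new column type in this sketch means adding a config class and a registered function; run stays untouched, and the config classes never reference run or GENERATORS.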

Structural Invariants

  • Import direction — interface → engine → config (left depends on right). No reverse imports.
  • Fast imports — heavy third-party libraries are lazy-loaded via data_designer.lazy_heavy_imports. See STYLEGUIDE.md for the pattern.
  • No relative imports — absolute imports only, enforced by ruff rule TID.
  • Typed code — all functions, methods, and class attributes require type annotations. Modern syntax: list[str], str | None.
  • from __future__ import annotations — required in every Python source file.
  • Follow established patterns — match the conventions of the module you're editing. When in doubt, read the neighboring code.
  • No untested code paths — new logic requires tests. See DEVELOPMENT.md for testing guidance.
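
A small module that follows several of these invariants at once might look like the sketch below. It is an assumption-laden illustration: LazyModule is a hypothetical helper, not the contents of data_designer.lazy_heavy_imports (STYLEGUIDE.md documents the real pattern), and the stdlib colorsys module stands in for a heavy third-party dependency so the example runs anywhere.

```python
from __future__ import annotations

import importlib
import sys
from types import ModuleType


class LazyModule:
    """Proxy that imports the wrapped module on first attribute access."""

    def __init__(self, module_name: str) -> None:
        self._module_name = module_name
        self._module: ModuleType | None = None

    def __getattr__(self, attribute: str) -> object:
        # Reached only for names not set in __init__, i.e. the module's own.
        if self._module is None:
            self._module = importlib.import_module(self._module_name)
        return getattr(self._module, attribute)


# Demo only: drop any cached copy so the first-use import is observable.
sys.modules.pop("colorsys", None)
colorsys = LazyModule("colorsys")  # stand-in for a heavy dependency


def hue_of(rgb: tuple[float, float, float] | None) -> float | None:
    """Return the hue of an RGB triple, or None when no colour is given."""
    if rgb is None:
        return None
    return colorsys.rgb_to_hsv(*rgb)[0]


loaded_before = "colorsys" in sys.modules
hue = hue_of((1.0, 0.0, 0.0))
loaded_after = "colorsys" in sys.modules
print(loaded_before, hue, loaded_after)
```

Creating the proxy costs nothing at import time; the wrapped module is loaded only when hue_of first touches it, which is the property the import-time profiling target is meant to protect.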

Development

make check-all-fix        # format + lint (ruff)
make test                 # run all test suites
make update-license-headers  # add SPDX headers to new files
make perf-import CLEAN=1  # profile import time (run after adding heavy deps)

For full setup, testing, and workflow details see DEVELOPMENT.md. For code style, naming, and import conventions see STYLEGUIDE.md. For deeper dives into specific subsystems see architecture/.