AGENTS.md
This file provides guidance to agents when working with code in this repository.
Project Overview
DataDesigner is an NVIDIA NeMo project for creating synthetic datasets from scratch. It's a comprehensive framework that generates structured data using multiple generation strategies:
- Sampled data: Built-in generators (UUID, DateTime, etc.) and Faker integration
- LLM-generated content: Text, code, and structured data via LiteLLM
- Expression-based columns: Derived columns using Jinja2 templates
- Validation & scoring: Python, SQL, and remote validators; LLM-based judge scoring
- Seed dataset-based generation: Generate from existing datasets
Architecture
The project follows a layered architecture:
- Config Layer (packages/data-designer-config/src/data_designer/config/): User-facing configuration API
  - config_builder.py: Main builder API for constructing configurations
  - column_configs.py: Column configuration types (Sampler, LLMText, LLMCode, LLMStructured, LLMJudge, Expression, Validation, SeedDataset)
  - models.py: Model configurations and inference parameters
  - sampler_params.py: Parametrized samplers (Uniform, Category, Person, DateTime, etc.)
- Engine Layer (packages/data-designer-engine/src/data_designer/engine/): Internal generation and processing
  - column_generators/: Generates individual columns from configs
  - dataset_builders/: Orchestrates full dataset generation with DAG-based dependency management
  - models/: LLM integration via LiteLLM with response parsing
  - validators/: Column validation (Python, SQL, Code, Remote)
  - sampling_gen/: Sophisticated person/entity sampling
- Interface Layer (packages/data-designer/src/data_designer/interface/): Public API
  - data_designer.py: Main DataDesigner class (primary entry point)
  - results.py: Result containers
  - errors.py: Public error types
Recommended Import Pattern
import data_designer.config as dd
from data_designer.interface import DataDesigner

# Usage:
data_designer = DataDesigner()
config_builder = dd.DataDesignerConfigBuilder()
config_builder.add_column(
    dd.SamplerColumnConfig(
        name="category",
        sampler_type=dd.SamplerType.CATEGORY,
        params=dd.CategorySamplerParams(values=["A", "B"]),
    )
)
Key Design Patterns
- Builder pattern: Configuration construction via DataDesignerConfigBuilder
- Registry pattern: Plugin system for column generators, validators, and profilers
- Strategy pattern: Multiple generation approaches (sampled, LLM, expression, seed)
- DAG-based execution: Column dependencies managed as directed acyclic graph
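To make the DAG-based execution idea concrete, here is a minimal sketch using the standard library's graphlib. The column names and the dependency mapping are hypothetical, and this is not the engine's actual implementation; it only illustrates how a dependency graph yields a valid generation order.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical mapping: each column -> the columns it references.
column_dependencies: dict[str, set[str]] = {
    "category": set(),
    "product_name": {"category"},
    "review": {"product_name", "category"},
    "review_score": {"review"},
}


def resolve_generation_order(dependencies: dict[str, set[str]]) -> list[str]:
    """Return an order in which every column comes after the columns it depends on."""
    return list(TopologicalSorter(dependencies).static_order())


def has_cycle(dependencies: dict[str, set[str]]) -> bool:
    """Report whether the dependency graph is cyclic (and so cannot be generated)."""
    try:
        resolve_generation_order(dependencies)
    except CycleError:
        return True
    return False


generation_order = resolve_generation_order(column_dependencies)
```

A topological order guarantees that, for example, a column whose prompt references another column is generated only after that column exists. A cyclic reference has no valid order, which is why the sketch surfaces it as an error condition.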
Development Workflow
This project uses uv for dependency management and make for common tasks:
# Install dependencies
uv sync
# Install with dev dependencies
uv sync --all-extras
# Run the main module (if applicable)
uv run python -m data_designer
Code Quality
# Using Make (recommended)
make lint # Run ruff linter
make lint-fix # Fix linting issues automatically
make format # Format code with ruff
make format-check # Check code formatting without changes
make check-all # Run all checks (format-check + lint)
make check-all-fix # Run all checks with autofix (format + lint-fix)
# Direct commands
uv run ruff check # Lint all files
uv run ruff check --fix # Lint with autofix
uv run ruff format # Format all files
uv run ruff format --check # Check formatting
Running Tests
# Run all tests
uv run pytest
# Run tests with verbose output
uv run pytest -v
# Run a specific test file
uv run pytest tests/config/test_sampler_constraints.py
# Run tests with coverage
uv run pytest --cov=data_designer --cov-report=term-missing --cov-report=html
# Using Make
make test # Run all tests
make coverage # Run tests with coverage report
Key Files
- packages/data-designer/src/data_designer/interface/data_designer.py - Main entry point (DataDesigner class)
- packages/data-designer-config/src/data_designer/config/config_builder.py - Configuration API (DataDesignerConfigBuilder)
- packages/data-designer-config/src/data_designer/config/__init__.py - User-facing config API exports
- packages/data-designer-engine/src/data_designer/engine/dataset_builders/column_wise_builder.py - Generation orchestrator
- pyproject.toml - Project dependencies and tool configurations
- Makefile - Common development commands
Working Guidelines
- Comments: Only insert comments when code is especially important to understand. For basic code blocks, comments aren't necessary. We want readable code without vacuous comments.
- License headers: All Python files must include the NVIDIA SPDX license header:
  # SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
  # SPDX-License-Identifier: Apache-2.0
  Use make update-license-headers to add headers to all files automatically.
- Imports: Avoid importing Python modules inside method definitions. Prefer module-level imports for better performance and clarity.
- Type annotations: ALWAYS add type annotations to all functions, methods, and class attributes (including tests).
Code Style
This project uses ruff (v0.12.3) for linting and formatting. Follow these guidelines to avoid linter errors:
General Formatting
- Line length: Maximum 120 characters per line
- Quote style: Always use double quotes (") for strings
- Indentation: Use 4 spaces (never tabs)
- Target version: Python 3.11+
Type Annotations
Type annotations are REQUIRED for all code in this project. This is strictly enforced for code quality and maintainability.
- ALWAYS add type annotations to all functions, methods, and class attributes (including tests)
- Use primitive types when possible: list not List, dict not Dict, set not Set, tuple not Tuple
- Use modern union syntax with | for optional and union types (Python 3.10+):
  - str | None not Optional[str]
  - int | str not Union[int, str]
- Only import from typing when absolutely necessary for complex generic types
- For Pydantic models, use field-level type annotations
# Good
def process_items(items: list[str], max_count: int | None = None) -> dict[str, int]:
    return {item: len(item) for item in items}

# Avoid - missing type annotations
def process_items(items, max_count=None):
    return {item: len(item) for item in items}

# Avoid - old-style typing
from typing import List, Dict, Optional

def process_items(items: List[str], max_count: Optional[int] = None) -> Dict[str, int]:
    return {item: len(item) for item in items}
Import Style
- ALWAYS use absolute imports, never relative imports
- Place imports at module level, not inside functions (exception: when a local import is unavoidable for performance reasons)
- Import sorting is handled by ruff's isort - imports should be grouped and sorted:
  - Standard library imports
  - Third-party imports (use lazy_heavy_imports for heavy libraries)
  - First-party imports (data_designer)
- Use standard import conventions (enforced by ICN)
- See Lazy Loading and TYPE_CHECKING section for optimization guidelines
# Good
from data_designer.config.config_builder import DataDesignerConfigBuilder

# Bad - relative import (will cause linter errors)
from .config_builder import DataDesignerConfigBuilder

# Good - imports at module level
from pathlib import Path

def process_file(filename: str) -> None:
    path = Path(filename)

# Bad - import inside function
def process_file(filename: str) -> None:
    from pathlib import Path

    path = Path(filename)
Lazy Loading and TYPE_CHECKING
This project uses lazy loading for heavy third-party dependencies to optimize import performance.
When to Use Lazy Loading
Heavy third-party libraries (>100ms import cost) should be lazy-loaded via lazy_heavy_imports.py:
# ❌ Don't import directly
import pandas as pd
import numpy as np

# ✅ Use lazy loading with IDE support
from typing import TYPE_CHECKING

from data_designer.lazy_heavy_imports import pd, np

if TYPE_CHECKING:
    import pandas as pd  # For IDE autocomplete and type hints
    import numpy as np
This pattern provides:
- Runtime lazy loading (fast startup)
- Full IDE support (autocomplete, type hints)
- Type checker validation
See lazy_heavy_imports.py for the current list of lazy-loaded libraries.
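For intuition about how such a module can defer imports, the sketch below uses a module-level `__getattr__` hook (PEP 562) driven by a name-to-module mapping. It is an assumption about the general technique, not a copy of lazy_heavy_imports.py: the helper `make_lazy_module`, the module name `lazy_demo`, and the stand-in entries (`json`, `decimal` in place of genuinely heavy libraries) are all hypothetical.

```python
import importlib
import types

# Hypothetical mapping: exported attribute name -> real module to import on demand.
# json and decimal stand in for heavy libraries such as pandas or numpy.
_LAZY_IMPORTS: dict[str, str] = {
    "js": "json",
    "dec": "decimal",
}


def make_lazy_module(name: str, mapping: dict[str, str]) -> types.ModuleType:
    """Build a module whose attributes are imported on first access."""
    module = types.ModuleType(name)

    def _load_on_first_access(attr: str) -> types.ModuleType:
        # Called only when normal attribute lookup on the module fails.
        if attr not in mapping:
            raise AttributeError(f"module {name!r} has no attribute {attr!r}")
        loaded = importlib.import_module(mapping[attr])
        # Cache the result so later lookups bypass this hook entirely.
        setattr(module, attr, loaded)
        return loaded

    # PEP 562: a callable stored as __getattr__ in the module namespace
    # handles lookups for names that are not yet defined there.
    module.__getattr__ = _load_on_first_access
    return module


lazy = make_lazy_module("lazy_demo", _LAZY_IMPORTS)
```

Nothing is imported when `lazy` is created; the first `lazy.js` access triggers `importlib.import_module("json")` and caches the result. This is the property that keeps startup fast when the real targets are expensive to import.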
Adding New Heavy Dependencies
If you add a new dependency with significant import cost (>100ms):
- Add to lazy_heavy_imports.py:
  _LAZY_IMPORTS = {
      # ... existing entries ...
      "your_lib": "your_library_name",
  }
- Update imports across codebase:
  from typing import TYPE_CHECKING

  from data_designer.lazy_heavy_imports import your_lib

  if TYPE_CHECKING:
      import your_library_name as your_lib  # For IDE support
- Verify with performance test:
  make perf-import CLEAN=1
Using TYPE_CHECKING Blocks
TYPE_CHECKING blocks defer imports that are only needed for type hints, preventing circular dependencies and reducing import time.
For internal data_designer imports:
from __future__ import annotations  # Always include at top

from typing import TYPE_CHECKING

# Runtime imports
from pathlib import Path

from data_designer.config.base import ConfigBase

if TYPE_CHECKING:
    # Type-only imports - only visible to type checkers
    from data_designer.engine.models.facade import ModelFacade

def get_model(model: ModelFacade) -> str:
    return model.name
For lazy-loaded libraries (see pattern in "When to Use Lazy Loading" above):
- Import from lazy_heavy_imports for runtime
- Add full import in TYPE_CHECKING block for IDE support
Rules for TYPE_CHECKING:
✅ DO put in TYPE_CHECKING:
- Internal data_designer imports used only in type hints
- Imports that would cause circular dependencies
- Full imports of lazy-loaded libraries for IDE support (e.g., import pandas as pd in addition to the runtime import from data_designer.lazy_heavy_imports import pd)
❌ DON'T put in TYPE_CHECKING:
- Standard library imports (Path, Any, Callable, Literal, TypeAlias, etc.)
- Pydantic model types used in field definitions (needed at runtime for validation)
- Types used in discriminated unions (Pydantic needs them at runtime)
- Any import used at runtime (instantiation, method calls, base classes, etc.)
Examples:
# ✅ CORRECT - Lazy-loaded library with IDE support
from typing import TYPE_CHECKING

from data_designer.lazy_heavy_imports import pd

if TYPE_CHECKING:
    import pandas as pd  # IDE gets full type hints

def load_data(path: str) -> pd.DataFrame:  # IDE understands pd.DataFrame
    return pd.read_csv(path)

# ✅ CORRECT - Standard library NOT in TYPE_CHECKING
from pathlib import Path
from typing import Any

def process_file(path: Path) -> Any:
    return path.read_text()

# ✅ CORRECT - Internal type-only import
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from data_designer.engine.models.facade import ModelFacade

def get_model(model: ModelFacade) -> str:  # Only used in type hint
    return model.name

# ❌ INCORRECT - Pydantic field type in TYPE_CHECKING
from typing import TYPE_CHECKING

from pydantic import BaseModel

if TYPE_CHECKING:
    from data_designer.config.models import ModelConfig  # Wrong!

class MyConfig(BaseModel):
    model: ModelConfig  # Pydantic needs this at runtime!

# ✅ CORRECT - Pydantic field type at runtime
from pydantic import BaseModel

from data_designer.config.models import ModelConfig

class MyConfig(BaseModel):
    model: ModelConfig
Naming Conventions (PEP 8)
Follow PEP 8 naming conventions:
- Functions and variables: snake_case
- Classes: PascalCase
- Constants: UPPER_SNAKE_CASE
- Private attributes: prefix with single underscore _private_var
# Good
class DatasetGenerator:
    MAX_RETRIES = 3

    def __init__(self) -> None:
        self._cache: dict[str, str] = {}

    def generate_dataset(self, config: dict[str, str]) -> pd.DataFrame:
        pass

# Bad
class dataset_generator:  # Should be PascalCase
    maxRetries = 3  # Should be UPPER_SNAKE_CASE

    def GenerateDataset(self, Config):  # Should be snake_case
        pass
Common Pitfalls to Avoid
- Mutable default arguments:
  # Bad - mutable default argument
  def add_item(item: str, items: list[str] = []) -> list[str]:
      items.append(item)
      return items

  # Good
  def add_item(item: str, items: list[str] | None = None) -> list[str]:
      if items is None:
          items = []
      items.append(item)
      return items
- Unused imports and variables:
  # Bad - unused import
  from pathlib import Path
  from typing import Any  # Not used

  def process() -> None:
      pass

  # Good - only import what you use
  from pathlib import Path

  def process() -> None:
      pass
- Simplify code where possible (covered by ruff's SIM rules):
  # Bad
  if condition:
      return True
  else:
      return False

  # Good
  return condition

  # Bad
  if key in my_dict:
      value = my_dict[key]
  else:
      value = default

  # Good
  value = my_dict.get(key, default)
- Use comprehensions properly:
  # Bad
  list([x for x in items])  # Unnecessary list() call

  # Good
  [x for x in items]

  # Bad
  dict([(k, v) for k, v in items])

  # Good
  {k: v for k, v in items}
- Proper return statements:
  # Bad - unnecessary else after return
  def get_value(condition: bool) -> str:
      if condition:
          return "yes"
      else:
          return "no"

  # Good
  def get_value(condition: bool) -> str:
      if condition:
          return "yes"
      return "no"
Active Linter Rules
The following ruff linter rules are currently enabled (see pyproject.toml):
- W: pycodestyle warnings
- F: pyflakes (unused imports, undefined names)
- I: isort (import sorting)
- ICN: flake8-import-conventions (standard import names)
- PIE: flake8-pie (miscellaneous lints)
Note: Additional rules (E, N, UP, ANN, B, C4, DTZ, RET, SIM, PTH) are commented out but may be enabled in the future. Write code that would pass these checks for future-proofing.
Testing Patterns
The project uses pytest with the following patterns:
- Fixtures: Shared test data and configurations in tests/conftest.py
- Stub configs: YAML-based configuration stubs for testing (see stub_data_designer_config_str fixture)
- Mocking: Use unittest.mock.patch for external services and dependencies
- Async support: pytest-asyncio for async tests (asyncio_default_fixture_loop_scope = "session")
- HTTP mocking: pytest-httpx for mocking HTTP requests
- Coverage: Track test coverage with pytest-cov
Example test structure:
import pytest

from data_designer.config.config_builder import DataDesignerConfigBuilder

def test_something(stub_model_configs):
    """Test description."""
    builder = DataDesignerConfigBuilder(model_configs=stub_model_configs)
    # ... test implementation
    assert expected == actual
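As an illustration of the mocking pattern listed above, the following sketch patches an outbound HTTP call with unittest.mock.patch so the test never touches the network. The helper `fetch_model_names`, the URL, and the payload shape are hypothetical and exist only for this example; they are not part of the DataDesigner API.

```python
import json
import urllib.request
from unittest.mock import MagicMock, patch


def fetch_model_names(url: str) -> list[str]:
    """Hypothetical helper that lists model IDs served by a remote endpoint."""
    with urllib.request.urlopen(url) as response:
        payload = json.loads(response.read())
    return [entry["id"] for entry in payload["data"]]


def test_fetch_model_names_parses_response() -> None:
    # Fake response object that also works as a context manager.
    fake_response = MagicMock()
    fake_response.read.return_value = b'{"data": [{"id": "model-a"}, {"id": "model-b"}]}'
    fake_response.__enter__.return_value = fake_response

    # Replace urlopen for the duration of the block; no request is sent.
    with patch("urllib.request.urlopen", return_value=fake_response) as mock_urlopen:
        names = fetch_model_names("https://example.invalid/v1/models")

    mock_urlopen.assert_called_once_with("https://example.invalid/v1/models")
    assert names == ["model-a", "model-b"]
```

Patching the name where it is looked up (here `urllib.request.urlopen`, because the helper calls it through the module) is what makes the replacement take effect.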
Column Configuration Types
When working with column configurations, understand these key types:
- SamplerColumnConfig: Built-in samplers (UUID, Category, Uniform, Gaussian, Person, DateTime, etc.)
- LLMTextColumnConfig: LLM text generation with Jinja2 templating
- LLMCodeColumnConfig: Code generation with language specification
- LLMStructuredColumnConfig: Structured JSON generation with schema
- LLMJudgeColumnConfig: Judge/scoring columns for quality assessment
- ExpressionColumnConfig: Expression-based derived columns (Python eval or Jinja2)
- ValidationColumnConfig: Validation results (Python, SQL, Code, Remote validators)
- SeedDatasetColumnConfig: Data from seed datasets
- EmbeddingColumnConfig: Embedding generation for text columns using a specified model
- CustomColumnConfig: Custom user-defined column generators via @custom_column_generator decorator
See packages/data-designer-config/src/data_designer/config/column_configs.py for detailed schemas.
Model Configuration
Models are configured via ModelConfig with:
- alias: User-defined alias for the model
- model: Model ID (e.g., from build.nvidia.com)
- inference_parameters: Temperature, top_p, max_tokens (can be distribution-based)
- system_prompt: Optional system prompt
- image_modality: Support for image inputs
See packages/data-designer-config/src/data_designer/config/models.py for details.
Registry System
The project uses a registry pattern for extensibility. Key registries:
- Column generators: packages/data-designer-engine/src/data_designer/engine/column_generators/registry.py
- Validators: packages/data-designer-engine/src/data_designer/engine/validators/
- Column profilers: packages/data-designer-engine/src/data_designer/engine/analysis/column_profilers/registry.py
- Models: packages/data-designer-engine/src/data_designer/engine/models/registry.py
When adding new generators or validators, register them appropriately.
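The general shape of a registry can be sketched as follows. This is a generic illustration of the pattern under assumed names: `ColumnGeneratorRegistry`, the `ColumnGenerator` signature, and the `"constant"` key are hypothetical and do not mirror the classes in the registry modules listed above, so consult those files for the real registration API.

```python
from collections.abc import Callable

# Hypothetical generator signature: takes a row count, returns that many values.
ColumnGenerator = Callable[[int], list[str]]


class ColumnGeneratorRegistry:
    """Illustrative registry mapping a column type name to a generator."""

    def __init__(self) -> None:
        self._generators: dict[str, ColumnGenerator] = {}

    def register(self, column_type: str) -> Callable[[ColumnGenerator], ColumnGenerator]:
        """Return a decorator that records the generator under column_type."""

        def decorator(generator: ColumnGenerator) -> ColumnGenerator:
            if column_type in self._generators:
                raise ValueError(f"Generator already registered for {column_type!r}")
            self._generators[column_type] = generator
            return generator

        return decorator

    def get(self, column_type: str) -> ColumnGenerator:
        """Look up a generator, failing loudly for unknown column types."""
        try:
            return self._generators[column_type]
        except KeyError:
            raise KeyError(f"No generator registered for {column_type!r}") from None


registry = ColumnGeneratorRegistry()


@registry.register("constant")
def generate_constant(num_rows: int) -> list[str]:
    return ["placeholder"] * num_rows


values = registry.get("constant")(3)
```

Keeping lookup behind a registry means callers depend only on a type name, so new generators or validators can be added without touching the code that dispatches to them.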
Pre-commit Hooks
The project uses pre-commit hooks to enforce code quality. Install them with:
uv run pre-commit install
Hooks include:
- Trailing whitespace removal
- End-of-file fixer
- YAML/JSON/TOML validation
- Merge conflict detection
- Debug statement detection
- Ruff linting and formatting
Common Development Tasks
# Clean up generated files
make clean
# Update license headers
make update-license-headers
# Run all checks before committing
make check-all-fix
make test
# Generate coverage report
make coverage
# View htmlcov/index.html in browser
Additional Resources
- README.md: Installation and basic usage examples
- packages/data-designer-config/src/data_designer/config/: Configuration API documentation
- tests/: Comprehensive test suite with usage examples