Engine Layer
The engine layer (data_designer.engine) compiles declarative configs into executable generation plans and runs them. It owns column generators, dataset builders, model access, MCP integration, sampling, validation, and profiling.
Source: packages/data-designer-engine/src/data_designer/engine/
Overview
The engine is the largest package, organized into focused subsystems:
| Subsystem | Path | Role |
|---|---|---|
| Column generators | column_generators/ | Registry + concrete generators for each column type |
| Dataset builders | dataset_builders/ | Sync/async orchestration, DAG, batching |
| Models | models/ | Facade, registry, clients, parsers, recipes, usage |
| MCP | mcp/ | Tool registry, facade, I/O service |
| Sampling | sampling_gen/ | Schema, DAG, data sources, person/entity helpers |
| Processing | processing/ | Processors, Ginja (Jinja for generation), Gsonschema |
| Validators | validators/ | Runtime row/batch validation |
| Analysis | analysis/ | Dataset/column profiling |
| Registry | registry/ | Generic TaskRegistry base + DataDesignerRegistry aggregator |
| Resources | resources/ | Seed/person readers, managed datasets |
| Storage | storage/ | Artifact and media storage |
Top-level modules handle cross-cutting concerns: compiler.py (config compilation), validation.py (static config validation), context.py (execution context), configurable_task.py (base for all tasks), secret_resolver.py, model_provider.py.
Key Components
Compilation Pipeline
compiler.py transforms a DataDesignerConfig into an execution-ready form:
- Enriches the config with seed columns and an internal UUID column
- Runs static validation (validation.py): checks Jinja references, code columns, processor targets, constraint consistency
- Produces Violation objects with typed ViolationType for structured error reporting
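To make the static validation step concrete, here is a minimal sketch of a missing-reference check that reports typed violations. The field names on Violation, the MISSING_REFERENCE member, and the find_missing_references helper are assumptions for illustration, not the actual definitions in validation.py. The regex is a simplification; a real check would more likely parse templates with Jinja itself.

```python
import re
from dataclasses import dataclass
from enum import Enum


class ViolationType(Enum):
    # Hypothetical member; the real enum may use different names.
    MISSING_REFERENCE = "missing_reference"


@dataclass(frozen=True)
class Violation:
    # Hypothetical fields chosen for this sketch.
    column: str
    type: ViolationType
    message: str


# Matches the first identifier inside a "{{ ... }}" expression.
_REFERENCE = re.compile(r"\{\{\s*([A-Za-z_]\w*)")


def find_missing_references(templates: dict[str, str]) -> list[Violation]:
    """Report every template reference that names an unknown column."""
    known = set(templates)
    violations = []
    for column, template in templates.items():
        for name in _REFERENCE.findall(template):
            if name not in known:
                violations.append(
                    Violation(
                        column=column,
                        type=ViolationType.MISSING_REFERENCE,
                        message=f"column {column!r} references unknown column {name!r}",
                    )
                )
    return violations


violations = find_missing_references(
    {"name": "", "bio": "Describe {{ name }} from {{ city }}"}
)
```

Because the check only inspects config text, it can run before any model is contacted, which is what lets bad references fail cheaply.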
Registry System
TaskRegistry (in registry/base.py) is the generic base: maps an enum value to a task class + config class. Uses __new__-based singleton per subclass to prevent duplicate instances.
DataDesignerRegistry bundles the three registries used by DatasetBuilder:
- ColumnGeneratorRegistry: column type → generator class
- ColumnProfilerRegistry: column type → profiler class
- Processor registry
create_default_column_generator_registry() registers all built-in types and merges plugin entry points.
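The per-subclass singleton described above can be sketched as follows. This is a simplified stand-in, not the code in registry/base.py: the register and get_task_type method names, the _entries attribute, and the example subclasses are all assumptions made for illustration.

```python
from enum import Enum


class TaskRegistry:
    """Hypothetical stand-in: maps an enum value to (task class, config class)."""

    _instances: dict[type, "TaskRegistry"] = {}

    def __new__(cls) -> "TaskRegistry":
        # Key the cache by the concrete class so each subclass gets
        # exactly one instance, separate from its siblings.
        if cls not in TaskRegistry._instances:
            instance = super().__new__(cls)
            instance._entries = {}
            TaskRegistry._instances[cls] = instance
        return TaskRegistry._instances[cls]

    def register(self, key: Enum, task_cls: type, config_cls: type) -> None:
        if key in self._entries:
            raise ValueError(f"{key!r} is already registered")
        self._entries[key] = (task_cls, config_cls)

    def get_task_type(self, key: Enum) -> type:
        return self._entries[key][0]


class ColumnType(Enum):
    SAMPLER = "sampler"


class GeneratorRegistry(TaskRegistry):
    pass


class ProfilerRegistry(TaskRegistry):
    pass


class SamplerGenerator:  # placeholder task class
    pass


class SamplerConfig:  # placeholder config class
    pass


GeneratorRegistry().register(ColumnType.SAMPLER, SamplerGenerator, SamplerConfig)
```

Constructing GeneratorRegistry() again returns the same object with the same entries, while ProfilerRegistry() is a separate, empty registry.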
Column Generator Hierarchy
ConfigurableTask
└── ColumnGenerator (abstract: get_generation_strategy, generate/agenerate)
├── FromScratchColumnGenerator (can_generate_from_scratch)
├── ColumnGeneratorWithModelRegistry
│ └── ColumnGeneratorWithModel (cached model, inference params, MCP)
├── ColumnGeneratorCellByCell (strategy: CELL_BY_CELL, generate(dict))
└── ColumnGeneratorFullColumn (strategy: FULL_COLUMN, generate(DataFrame))
Each concrete generator (e.g., SamplerColumnGenerator, LLMTextColumnGenerator) combines the appropriate base classes. The GenerationStrategy enum (CELL_BY_CELL or FULL_COLUMN) determines how the dataset builder dispatches work.
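A small sketch of how a builder might dispatch on the strategy. It uses a list of dicts in place of a DataFrame to stay dependency-free, and the two generators and the run_generator helper are hypothetical examples rather than classes from the engine.

```python
from abc import ABC, abstractmethod
from enum import Enum

Row = dict[str, object]


class GenerationStrategy(Enum):
    CELL_BY_CELL = "cell_by_cell"
    FULL_COLUMN = "full_column"


class ColumnGenerator(ABC):
    @abstractmethod
    def get_generation_strategy(self) -> GenerationStrategy: ...

    @abstractmethod
    def generate(self, data): ...


class UppercaseName(ColumnGenerator):
    """Cell-by-cell: receives one row and returns it with a new field."""

    def get_generation_strategy(self) -> GenerationStrategy:
        return GenerationStrategy.CELL_BY_CELL

    def generate(self, data: Row) -> Row:
        return {**data, "name_upper": str(data["name"]).upper()}


class RowIndex(ColumnGenerator):
    """Full-column: receives the whole batch and returns it with a new field."""

    def get_generation_strategy(self) -> GenerationStrategy:
        return GenerationStrategy.FULL_COLUMN

    def generate(self, data: list[Row]) -> list[Row]:
        return [{**row, "idx": i} for i, row in enumerate(data)]


def run_generator(generator: ColumnGenerator, rows: list[Row]) -> list[Row]:
    # The caller only inspects the strategy; it never needs to know
    # which concrete generator it is running.
    if generator.get_generation_strategy() is GenerationStrategy.CELL_BY_CELL:
        return [generator.generate(row) for row in rows]
    return generator.generate(rows)


rows = [{"name": "ada"}, {"name": "bob"}]
result = run_generator(RowIndex(), run_generator(UppercaseName(), rows))
```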
ResourceProvider
Bundles everything a generator needs at runtime: ModelRegistry, MCPRegistry, ArtifactStorage, seed readers, person readers, secret resolver. Passed to generators during initialization.
Data Flow
- DatasetBuilder receives a DataDesignerConfig and a DataDesignerRegistry
- Compilation produces a topologically sorted list of column configs
- Generators are instantiated from the registry for each column config
- The builder executes generators in dependency order (see Dataset Builders)
- Post-generation processors and profilers run on the completed dataset
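The topological ordering in the flow above can be illustrated with the standard library's graphlib. The column names and the dependency mapping are made up for this sketch, and the engine's own DAG code may work differently.

```python
from graphlib import TopologicalSorter

# Hypothetical columns: each key maps to the set of columns it depends on.
dependencies = {
    "name": set(),
    "age": set(),
    "bio": {"name", "age"},
    "summary": {"bio"},
}

# static_order() yields every column after all of its dependencies
# and raises graphlib.CycleError if the references form a cycle.
order = list(TopologicalSorter(dependencies).static_order())
```

Running generators in this order guarantees that every column a generator reads has already been produced.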
Design Decisions
- Registry + strategy pattern decouples column type definitions (config) from generation behavior (engine). Adding a new column type means registering a config class and a generator class — no changes to orchestration code.
- ConfigurableTask as the universal base ensures all tasks (generators, profilers, processors) share config validation and resource access patterns.
- Static validation before execution catches config errors (missing references, invalid templates) before any LLM calls are made, failing fast and cheaply.
- Sync/async bridge on ColumnGenerator allows generators to be written as async and called from sync contexts via _run_coroutine_sync / asyncio.to_thread.
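One way such a sync/async bridge can work is sketched below. The run_coroutine_sync helper here is an assumed approximation of the idea, not the engine's _run_coroutine_sync, and both generator classes are hypothetical.

```python
import asyncio
import concurrent.futures


def run_coroutine_sync(coro):
    """Run a coroutine to completion from synchronous code."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running in this thread, so we can start one directly.
        return asyncio.run(coro)
    # A loop is already running in this thread; run the coroutine on a
    # fresh loop in a worker thread and block until it finishes.
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(asyncio.run, coro).result()


class AsyncFirstGenerator:
    """Written as async; the sync entry point delegates to it."""

    async def agenerate(self, data: dict) -> dict:
        await asyncio.sleep(0)  # stands in for an awaitable model call
        return {**data, "echo": data["text"]}

    def generate(self, data: dict) -> dict:
        return run_coroutine_sync(self.agenerate(data))


class SyncFirstGenerator:
    """Written as sync; the async entry point offloads to a thread."""

    def generate(self, data: dict) -> dict:
        return {**data, "length": len(data["text"])}

    async def agenerate(self, data: dict) -> dict:
        return await asyncio.to_thread(self.generate, data)


sync_result = AsyncFirstGenerator().generate({"text": "hi"})
async_result = asyncio.run(SyncFirstGenerator().agenerate({"text": "hi"}))
```

With both directions covered, a generator author implements whichever method is natural and the other entry point comes for free.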
Cross-References
- System Architecture — package relationships
- Config Layer — column configs and builder API
- Dataset Builders — sync/async execution, DAG
- Models — model facade and client adapters
- MCP — tool execution integration
- Sampling — sampler generators
- Plugins — how plugins register generators