mirror of https://github.com/NVIDIA-NeMo/DataDesigner synced 2026-05-24 09:48:29 +00:00

2026-05-06 18:12:44 -04:00

Validators

Validators are quality assurance mechanisms in Data Designer that check generated content against rules and return structured pass/fail results. They enable automated verification of data for correctness, code quality, and adherence to specifications.

!!! note "Quality Gates for Generated Data"
    Validators act as quality gates in your generation pipeline. Use them to filter invalid records, score code quality, verify format compliance, or integrate with external validation services.

Overview

Validation columns execute validation logic against target columns and produce structured results indicating:

is_valid: Boolean pass/fail status
Additional metadata: Error messages, scores, severity levels, and custom fields

Validators currently support three execution strategies:

Code validation: Lint Python code with Ruff or SQL code with SQLFluff
Local callable validation: Execute custom Python functions for flexible validation logic
Remote validation: Send data to HTTP endpoints for external validation services

Validator Types

🐍 Python Code Validator

The Python code validator runs generated Python code through Ruff, a fast Python linter that checks for syntax errors, undefined variables, and code quality issues.

Configuration:

import data_designer.config as dd

validator_params = dd.CodeValidatorParams(code_lang=dd.CodeLang.PYTHON)

Validation Output:

Each validated record returns:

is_valid: True if no fatal or error-level issues found
python_linter_score: Quality score from 0 to 10 (based on the Pylint scoring formula)
python_linter_severity: Highest severity level found ("none", "convention", "refactor", "warning", "error", "fatal")
python_linter_messages: List of linter messages with line numbers, columns, and descriptions

Severity Levels:

Fatal: Syntax errors preventing code execution
Error: Undefined names, invalid syntax
Warning: Code smells and potential issues
Refactor: Simplification opportunities
Convention: Style guide violations

A record is marked valid if it has no messages or only messages at warning/convention/refactor levels.
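
The validity rule above can be sketched in a few lines. This is an illustration only, not Data Designer's implementation: `summarize` is a hypothetical helper, and it assumes each linter message carries its severity in a `type` field, as in the example result in this section.

```python
# Severity levels in ascending order, as listed in this section.
SEVERITY_ORDER = ["none", "convention", "refactor", "warning", "error", "fatal"]
# Levels that make a record invalid, per the rule stated above.
BLOCKING = {"error", "fatal"}


def summarize(messages: list[dict]) -> dict:
    """Derive is_valid and the highest severity from linter messages.

    Hypothetical helper for illustration; not part of Data Designer.
    """
    severities = [message["type"] for message in messages]
    # With no messages, the highest severity falls back to "none".
    highest = max(severities, key=SEVERITY_ORDER.index, default="none")
    return {
        "is_valid": not any(s in BLOCKING for s in severities),
        "python_linter_severity": highest,
    }


print(summarize([]))
print(summarize([{"type": "warning"}, {"type": "convention"}]))
print(summarize([{"type": "warning"}, {"type": "error"}]))
```

The first two calls yield valid records (no messages, or nothing above warning level); the third is invalid because one message is at error level.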

Example Validation Result:

{
    "is_valid": False,
    "python_linter_score": 0,
    "python_linter_severity": "error",
    "python_linter_messages": [
        {
            "type": "error",
            "symbol": "F821",
            "line": 1,
            "column": 7,
            "message": "Undefined name `it`"
        }
    ]
}

🗄️ SQL Code Validator

The SQL code validator uses SQLFluff, a dialect-aware SQL linter that checks query syntax and structure.

Configuration:

import data_designer.config as dd

validator_params = dd.CodeValidatorParams(code_lang=dd.CodeLang.SQL_POSTGRES)

!!! tip "Multiple Dialects"
    The SQL code validator supports multiple dialects: SQL_POSTGRES, SQL_ANSI, SQL_MYSQL, SQL_SQLITE, SQL_TSQL, and SQL_BIGQUERY.

Validation Output:

Each validated record returns:

is_valid: True if no parsing errors found
error_messages: Concatenated error descriptions (empty string if valid)

The validator focuses on parsing errors (PRS codes) that indicate malformed SQL. It also checks for common pitfalls like DECIMAL definitions without scale parameters.

Example Validation Result:

# Valid SQL
{
    "is_valid": True,
    "error_messages": ""
}

# Invalid SQL
{
    "is_valid": False,
    "error_messages": "PRS: Line 1, Position 1: Found unparsable section: 'NOT SQL'"
}
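
Results shaped like these can drive a simple quality gate downstream. The sketch below assumes each record pairs the generated query with its validation result under hypothetical `query` and `validation` keys; how results are actually stored in a generated dataset is not specified here.

```python
# Hypothetical records: each pairs a generated query with a validation
# result shaped like the examples above.
records = [
    {"query": "SELECT 1", "validation": {"is_valid": True, "error_messages": ""}},
    {
        "query": "NOT SQL",
        "validation": {
            "is_valid": False,
            "error_messages": "PRS: Line 1, Position 1: Found unparsable section: 'NOT SQL'",
        },
    },
]

# Keep the queries that parsed cleanly.
valid = [r["query"] for r in records if r["validation"]["is_valid"]]

# Map each rejected query to its error text for later inspection.
rejected = {
    r["query"]: r["validation"]["error_messages"]
    for r in records
    if not r["validation"]["is_valid"]
}

print(valid)
print(rejected)
```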

🔧 Local Callable Validator

The local callable validator executes custom Python functions for flexible validation logic.

Configuration:

import pandas as pd

import data_designer.config as dd

def my_validation_function(df: pd.DataFrame) -> pd.DataFrame:
    """Validate that values are positive.

    Args:
        df: DataFrame with target columns

    Returns:
        DataFrame with is_valid column and optional metadata
    """
    result = pd.DataFrame()
    result["is_valid"] = df["price"] > 0
    result["error_message"] = result["is_valid"].apply(
        lambda valid: "" if valid else "Price must be positive"
    )
    return result

validator_params = dd.LocalCallableValidatorParams(
    validation_function=my_validation_function,
    output_schema={  # Optional: enforce output schema
        "type": "object",
        "properties": {
            "data": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "is_valid": {"type": ["boolean", "null"]},
                        "error_message": {"type": "string"}
                    },
                    "required": ["is_valid"]
                }
            }
        }
    }
)

Function Requirements:

Input: DataFrame with target columns
Output: DataFrame with is_valid column (boolean or null)
Extra fields: Any additional columns become validation metadata

The output_schema parameter is optional but recommended—it validates the function's output against a JSON schema, catching unexpected return formats.
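
To make concrete what the example schema accepts, here is a hand-rolled check of the same constraints. It is a sketch, not the library's schema enforcement: `check_validator_output` is a hypothetical helper, and it assumes the function's output is presented to the schema as an object with a `data` array of per-record objects, which is what the schema's shape suggests.

```python
def check_validator_output(payload: dict) -> list[str]:
    """Return the problems found in a validator output payload.

    Hand-rolled stand-in for JSON Schema validation (hypothetical helper).
    Slightly stricter than the example schema: it insists that the
    'data' key is present.
    """
    rows = payload.get("data")
    if not isinstance(rows, list):
        return ["'data' must be an array"]

    problems = []
    for i, row in enumerate(rows):
        if not isinstance(row, dict):
            problems.append(f"row {i}: must be an object")
            continue
        if "is_valid" not in row:
            problems.append(f"row {i}: missing required 'is_valid'")
        elif row["is_valid"] is not None and not isinstance(row["is_valid"], bool):
            problems.append(f"row {i}: 'is_valid' must be boolean or null")
        if "error_message" in row and not isinstance(row["error_message"], str):
            problems.append(f"row {i}: 'error_message' must be a string")
    return problems


good = {"data": [{"is_valid": True, "error_message": ""}, {"is_valid": None}]}
bad = {"data": [{"error_message": "Price must be positive"}, {"is_valid": "yes"}]}

print(check_validator_output(good))
print(check_validator_output(bad))
```

A null `is_valid` passes because the schema allows `["boolean", "null"]`, while a missing `is_valid` or a string such as `"yes"` is reported.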

🌐 Remote Validator

The remote validator sends data to HTTP endpoints for validation-as-a-service. This is useful when your validation software must run on external compute and can be exposed as a service. Examples include:

External linting services
Security scanners
Domain-specific validators
Proprietary validation systems

!!! note "Authentication"
    Currently, the remote validator can only make unauthenticated API calls. When implementing your own service, you can rely on network isolation for security. If you need to reach a service that requires authentication, implement a local proxy.

Configuration:

import data_designer.config as dd

validator_params = dd.RemoteValidatorParams(
    endpoint_url="https://api.example.com/validate",
    timeout=30.0,  # Request timeout in seconds
    max_retries=3,  # Retry attempts on failure
    retry_backoff=2.0,  # Exponential backoff factor
    max_parallel_requests=4,  # Concurrent request limit
    output_schema={  # Optional: enforce response schema
        "type": "object",
        "properties": {
            "data": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "is_valid": {"type": ["boolean", "null"]},
                        "confidence": {"type": "string"}
                    }
                }
            }
        }
    }
)

Request Format:

The validator sends POST requests with this structure:

{
    "data": [
        {"column1": "value1", "column2": "value2"},
        {"column1": "value3", "column2": "value4"}
    ]
}

Expected Response Format:

The endpoint must return:

{
    "data": [
        {
            "is_valid": true,
            "custom_field": "any additional metadata"
        },
        {
            "is_valid": false,
            "custom_field": "more metadata"
        }
    ]
}
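
On the service side, the request and response formats above amount to mapping each incoming record to one result object. The handler below is a framework-agnostic sketch: `handle_validation_request` and the `empty_columns` field are hypothetical, the rule it applies (no empty values) is invented for illustration, and it assumes results are returned in the same order as the records in the request.

```python
import json


def handle_validation_request(body: str) -> str:
    """Validate each record in a request body and build the response body.

    Hypothetical service-side handler; the rule (no empty values) is
    made up for illustration.
    """
    records = json.loads(body)["data"]
    results = []
    for record in records:
        # Collect the names of columns whose value is empty or null.
        empty = [name for name, value in record.items() if value in ("", None)]
        results.append({"is_valid": not empty, "empty_columns": empty})
    # One result per record, in request order.
    return json.dumps({"data": results})


request_body = json.dumps(
    {
        "data": [
            {"column1": "value1", "column2": "value2"},
            {"column1": "", "column2": "value4"},
        ]
    }
)
response = json.loads(handle_validation_request(request_body))
print(response)
```

Wiring this function into an actual HTTP server is left to whichever web framework the service uses.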

Retry Behavior:

The validator automatically retries on:

Network errors
HTTP status codes: 429 (rate limit), 500, 502, 503, 504

Failed requests use exponential backoff: delay = retry_backoff^attempt.
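
As a worked example of the backoff formula, the sketch below lists the delay before each retry. It assumes attempts are numbered from 1 and that delays are in seconds; neither detail is stated above, and `backoff_delays` is a hypothetical helper.

```python
def backoff_delays(retry_backoff: float, max_retries: int) -> list[float]:
    """Delay before each retry, using delay = retry_backoff ** attempt.

    Assumes attempts are numbered from 1 (an assumption, not confirmed
    by the documentation above).
    """
    return [retry_backoff**attempt for attempt in range(1, max_retries + 1)]


delays = backoff_delays(retry_backoff=2.0, max_retries=3)
print(delays)  # [2.0, 4.0, 8.0]
```

With the configuration shown earlier (`retry_backoff=2.0`, `max_retries=3`), the delays under this assumption double on each retry.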

Parallelization:

Set max_parallel_requests to control concurrency. Higher values improve throughput but increase server load. The validator batches requests according to the batch_size parameter in the validation column configuration.
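
The interaction between `batch_size` and `max_parallel_requests` can be pictured as follows. This is a sketch of the general pattern only, not the validator's actual implementation: `make_batches`, `fake_send`, and `validate_all` are hypothetical, and `fake_send` stands in for the HTTP POST.

```python
from concurrent.futures import ThreadPoolExecutor


def make_batches(records: list[dict], batch_size: int) -> list[list[dict]]:
    """Split records into consecutive batches of at most batch_size."""
    return [records[i : i + batch_size] for i in range(0, len(records), batch_size)]


def fake_send(batch: list[dict]) -> list[dict]:
    """Stand-in for one HTTP POST; marks records with empty text invalid."""
    return [{"is_valid": bool(record["text"])} for record in batch]


def validate_all(
    records: list[dict], batch_size: int, max_parallel_requests: int
) -> list[dict]:
    batches = make_batches(records, batch_size)
    # At most max_parallel_requests batches are in flight at once.
    with ThreadPoolExecutor(max_workers=max_parallel_requests) as pool:
        # Executor.map returns results in the order the batches were submitted.
        responses = list(pool.map(fake_send, batches))
    # Flatten per-batch responses back into one result per record.
    return [result for response in responses for result in response]


records = [{"text": t} for t in ["a", "", "c", "d", ""]]
results = validate_all(records, batch_size=2, max_parallel_requests=4)
print(results)
```

Five records with a batch size of 2 produce three requests, so raising `max_parallel_requests` above 3 would not add concurrency in this case.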

Using Validators in Columns

Add validation columns to your configuration using the builder's add_column method:

import data_designer.config as dd

builder = dd.DataDesignerConfigBuilder()

# Generate Python code
builder.add_column(
    dd.LLMCodeColumnConfig(
        name="sorting_algorithm",
        prompt="Write a Python function to sort a list using bubble sort.",
        code_lang="python",
        model_alias="my-model"
    )
)

# Validate the generated code
builder.add_column(
    dd.ValidationColumnConfig(
        name="code_validation",
        target_columns=["sorting_algorithm"],
        validator_type="code",
        validator_params=dd.CodeValidatorParams(code_lang=dd.CodeLang.PYTHON),
        batch_size=10,
        drop=False,
    )
)

The target_columns parameter specifies which columns to validate. All target columns are passed to the validator together (except for code validators, which process each column separately).

Configuration Parameters

For the full list of parameters accepted by ValidationColumnConfig, see the code reference.

Batch Size Considerations

Larger batch sizes improve efficiency but consume more memory. Suggested ranges by validator type:

Code validators: 5-20 records (file I/O overhead)
Local callable: 10-50 records (depends on function complexity)
Remote validators: 1-10 records (network latency, server capacity)

Adjust based on:

Validator computational cost
Available memory
Network bandwidth (for remote validators)
Server rate limits

Validators see one batch at a time. If the validation logic compares a record against other records (for example, to detect duplicates), only records in the same batch are considered.
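
The effect of batch boundaries on cross-record checks is easiest to see with a small experiment. The sketch below uses a hypothetical duplicate check, `flag_duplicates`, and plain lists rather than the library's batching.

```python
def flag_duplicates(batch: list[str]) -> list[bool]:
    """Return is_valid flags: False for a value already seen in this batch."""
    seen = set()
    flags = []
    for value in batch:
        flags.append(value not in seen)
        seen.add(value)
    return flags


records = ["a", "b", "a", "c", "a", "b"]
batch_size = 3

# Validate batch by batch, as a validation column would.
batched_flags = []
for start in range(0, len(records), batch_size):
    batched_flags.extend(flag_duplicates(records[start : start + batch_size]))

# Validate everything at once, for comparison.
global_flags = flag_duplicates(records)

print(batched_flags)  # [True, True, False, True, True, True]
print(global_flags)   # [True, True, False, True, False, False]
```

The repeated `"a"` and `"b"` in the second batch go undetected because their first occurrences sit in the first batch.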

Multiple Column Validation

Validate multiple columns simultaneously:

import data_designer.config as dd

# Reuses the `builder` created in the earlier example.
builder.add_column(
    dd.ValidationColumnConfig(
        name="multi_column_validation",
        target_columns=["column_a", "column_b", "column_c"],
        validator_type="remote",
        validator_params=dd.RemoteValidatorParams(
            endpoint_url="https://api.example.com/validate"
        )
    )
)

Note: Code validators always process each target column separately, even when multiple columns are specified. Local callable and remote validators receive all target columns together.
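
The difference can be stated as which columns each validator call receives. The helper below, `inputs_per_call`, is hypothetical and only restates the rule above in code; it does not reflect how Data Designer dispatches internally.

```python
def inputs_per_call(validator_type: str, target_columns: list[str]) -> list[list[str]]:
    """Columns passed to each validator call, per the note above.

    Hypothetical helper for illustration; not part of Data Designer.
    """
    if validator_type == "code":
        # Code validators: one call per target column.
        return [[column] for column in target_columns]
    # Other validators: all target columns together in one call.
    return [list(target_columns)]


columns = ["column_a", "column_b", "column_c"]
print(inputs_per_call("code", columns))
print(inputs_per_call("remote", columns))
```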
