mirror of https://github.com/NVIDIA-NeMo/DataDesigner synced 2026-05-24 09:48:29 +00:00

Security

Data Designer can run in two very different trust models:

- Trusted / monolithic: The same user or team writes the config and runs the engine.
- Untrusted / shared execution: One user submits a config and a different process, service, or team executes it.

That distinction matters for features that evaluate user-supplied configuration at runtime, such as Jinja template rendering. In a trusted local workflow, broader template flexibility may be acceptable. In a shared-service deployment, user-supplied Jinja becomes part of the engine's remote code execution surface. A template sandbox escape would execute inside the process running Data Designer.

See Deployment Options for the architectures where that trust boundary changes.

Jinja Rendering Modes

Data Designer exposes the renderer choice through RunConfig:

import data_designer.config as dd

run_config = dd.RunConfig(
    jinja_rendering_engine=dd.JinjaRenderingEngine.SECURE,
)

SECURE is the default. Opt into NATIVE only when you are comfortable treating the config author and the engine operator as the same trust domain.

| Mode | What it uses | Best fit |
| --- | --- | --- |
| `SECURE` | Data Designer's hardened renderer built on top of Jinja2's sandbox | Shared services, microservices, internal platforms, or any deployment where config submission is separated from execution |
| `NATIVE` | Jinja2's built-in sandbox with Data Designer's variable whitelist | Local library usage and other trusted, monolithic workflows that want broader Jinja behavior |

!!! warning "Treat untrusted Jinja as a security boundary"
    If many users can submit configs to one engine, or if configs are accepted over an API and executed elsewhere, keep JinjaRenderingEngine.SECURE. In that model, Jinja templates are no longer just prompt-formatting helpers. They are untrusted user programs being evaluated by your engine.

Compatibility Matrix

NATIVE is not an unrestricted Python template engine. The matrix below shows what each mode permits, restricts, or adds on top of Jinja2's standard sandbox behavior.

| Capability | `NATIVE` | `SECURE` |
| --- | --- | --- |
| Jinja2 `ImmutableSandboxedEnvironment` baseline | Yes | Yes |
| References to explicitly provided dataset variables only | Yes | Yes |
| Standard Jinja built-in filter set | Yes | Subset only |
| Data Designer `jsonpath` filter | Yes | Yes |
| `import`, `macro`, `set`, `extends`, `block` support | Yes | No |
| Nested or recursive `for` loops | Yes | No |
| Unbounded AST complexity | Yes | No |
| Template context sanitized to JSON-compatible types before render | No | Yes |
| Empty, oversized, or built-in-like rendered output is permitted | Yes | No |

What `SECURE` Adds on Top of Standard Jinja Sandbox

The SECURE renderer uses a hardened environment; its implementation lives in the renderer source file on GitHub. Compared with the standard Jinja sandbox, it adds the following controls.

Record Sanitization Before Render

Before rendering, SECURE forces template context through a JSON-compatible serialization step. That means remote templates operate on plain data, not arbitrary Python objects.

# Intended shape for remote template context
record = {
    "user": {
        "name": "alice",
        "roles": ["admin", "reviewer"],
    }
}

# Not the kind of server-side object SECURE wants to expose directly
record = {
    "user": SomePythonObject(...),
}

In a remote execution setting, exposing rich Python objects increases the risk of attribute- and method-based sandbox escapes. Jinja's sandbox security considerations note that the sandbox is not a complete security boundary, and past escapes have included str.format (CVE-2016-10745), str.format_map (CVE-2019-10906), indirect str.format references (CVE-2024-56326), and |attr-based access to format (CVE-2025-27516); PortSwigger's server-side template injection research covers the broader object-traversal pattern.
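The sanitization idea can be sketched as a JSON round trip. This is a minimal illustration, not Data Designer's actual code: the function name `sanitize_record` is hypothetical, and rejecting non-serializable values (rather than coercing them) is an assumption about the behavior.

```python
import json
from typing import Any


def sanitize_record(record: dict[str, Any]) -> dict[str, Any]:
    """Round-trip a record through JSON so only plain data reaches the template.

    Hypothetical helper: the real renderer may coerce or handle failures differently.
    """
    try:
        # dumps() fails on arbitrary Python objects; loads() rebuilds plain
        # dicts, lists, strings, numbers, booleans, and None.
        return json.loads(json.dumps(record))
    except (TypeError, ValueError) as exc:
        raise ValueError("record is not JSON-compatible") from exc


plain = sanitize_record({"user": {"name": "alice", "roles": ("admin", "reviewer")}})
# Tuples come back as lists; nothing with methods beyond plain containers survives.
```

After the round trip, a template can only traverse dictionaries, lists, and scalars, which removes most of the attribute and method surface that the escapes listed above relied on.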

Filter Allowlist

SECURE keeps only a small approved subset of Jinja filters plus the Data Designer jsonpath filter. If a filter is not on that allowlist, the template is rejected. Common excluded filters are:

| Disallowed filters | Why they are excluded in `SECURE` |
| --- | --- |
| `attr`, `xmlattr` | These add dynamic attribute lookup or attribute-name construction, which widens the object-traversal surface in untrusted templates. |
| `map`, `select`, `reject`, `selectattr`, `rejectattr`, `groupby`, `batch`, `slice`, `sum` | These make templates behave more like a data-processing language and can multiply compute across large inputs. |
| `join`, `format`, `indent`, `wordwrap`, `center`, `filesizeformat` | These expand presentation and composition logic inside the template. `SECURE` keeps formatting logic narrow so templates stay close to interpolation. |
| `default`, `d`, `dictsort`, `count`, `wordcount`, `pprint`, `tojson` | These encourage fallback logic, secondary data shaping, or debug-style output inside the template rather than in the engine or config layer. |
| `safe`, `striptags`, `urlize` | These are primarily HTML-oriented output transforms and are unnecessary for server-side dataset rendering. |

Some omitted convenience filters, such as the e alias for escape, are excluded because SECURE uses a small explicit allowlist. The current implementation does not assign each omitted filter its own separate security rationale.

Use NATIVE when full Jinja filter compatibility matters more than the additional restrictions used for untrusted template execution.
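An allowlist check of this kind can be done statically on the parsed template. The sketch below uses Jinja2's public `Environment.parse` and `nodes.Filter`; the function name and the allowlist contents are illustrative assumptions, not the list SECURE actually ships (only `jsonpath` is confirmed by the text above).

```python
from jinja2 import nodes
from jinja2.sandbox import ImmutableSandboxedEnvironment

# Illustrative only: not the real SECURE allowlist.
ILLUSTRATIVE_ALLOWLIST = frozenset({"upper", "lower", "trim", "jsonpath"})


def disallowed_filters(source: str, allowlist: frozenset = ILLUSTRATIVE_ALLOWLIST) -> list:
    """Return the filter names used in a template that are not on the allowlist."""
    # parse() builds the AST without compiling, so unknown filter names do not raise here.
    tree = ImmutableSandboxedEnvironment().parse(source)
    used = {node.name for node in tree.find_all(nodes.Filter)}
    return sorted(used - allowlist)


rejected = disallowed_filters("{{ items | map('upper') | join(', ') }}")
# rejected == ["join", "map"]
```

Checking the AST rather than the rendered result means a template can be rejected before any user data is touched.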

Template Features Removed

SECURE rejects import, macro, set, extends, and block.

{% macro render_name(name) %}{{ name }}{% endmacro %}
{{ render_name(customer_name) }}

{% set temp = user_id %}
{{ temp }}

Those features are useful in trusted authoring environments, but they also make user templates more expressive and stateful. In a remote execution model, SECURE intentionally narrows the language so templates stay closer to data interpolation than to a reusable programming layer.
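A static check for these constructs could look roughly like the following. It is a sketch under assumptions: the mapping and function name are hypothetical, and the node classes are Jinja2's own (`nodes.Macro`, `nodes.Assign`, and so on), which may not be exactly how SECURE detects them.

```python
from jinja2 import nodes
from jinja2.sandbox import ImmutableSandboxedEnvironment

# Jinja2 AST node types mapped to the feature names listed above.
BLOCKED_NODE_TYPES = {
    nodes.Import: "import",
    nodes.FromImport: "import",
    nodes.Macro: "macro",
    nodes.Assign: "set",
    nodes.AssignBlock: "set",
    nodes.Extends: "extends",
    nodes.Block: "block",
}


def blocked_features(source: str) -> list:
    """Return the names of blocked template features that appear in the source."""
    tree = ImmutableSandboxedEnvironment().parse(source)
    found = {
        label
        for node_type, label in BLOCKED_NODE_TYPES.items()
        if next(tree.find_all(node_type), None) is not None
    }
    return sorted(found)


macro_hits = blocked_features(
    "{% macro render_name(name) %}{{ name }}{% endmacro %}{{ render_name(customer_name) }}"
)
# macro_hits == ["macro"]
```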

Loop Restrictions

SECURE rejects recursive loops and nested for loops.

{% for row in rows %}
  {% for item in row %}
    {{ item }}
  {% endfor %}
{% endfor %}

Nested and recursive loops are especially risky in shared execution because they can amplify compute cost and output size in ways that are hard to reason about from the outside.
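One way to detect both cases before rendering is to inspect `For` nodes in the parsed template. This is an illustrative sketch, not the renderer's implementation; `loop_violations` is a hypothetical name, while `nodes.For` and its `recursive` attribute are part of Jinja2's AST.

```python
from jinja2 import nodes
from jinja2.sandbox import ImmutableSandboxedEnvironment


def loop_violations(source: str) -> list:
    """Report recursive loops and loops that contain another loop."""
    tree = ImmutableSandboxedEnvironment().parse(source)
    problems = []
    for loop in tree.find_all(nodes.For):
        if loop.recursive:
            problems.append("recursive loop")
        # find_all() searches descendants only, so a hit means a loop nested inside this one.
        if next(loop.find_all(nodes.For), None) is not None:
            problems.append("nested loop")
    return problems


nested = loop_violations(
    "{% for row in rows %}{% for item in row %}{{ item }}{% endfor %}{% endfor %}"
)
# nested == ["nested loop"]
```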

AST Complexity Limits

SECURE statically analyzes the parsed Jinja AST and rejects templates that exceed the current limits of 600 nodes or depth 10.

{% if a %}
  {% if b %}
    {% if c %}
      {{ value }}
    {% endif %}
  {% endif %}
{% endif %}

This is not about any one feature being unsafe by itself. It is about limiting how much control flow and composition untrusted templates can pack into a single server-side render operation, which helps prevent compute bombs in shared execution.
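The limits can be pictured as a single walk over the parsed AST that tracks node count and maximum depth. The sketch below is an assumption-laden illustration: the function names are hypothetical, and how SECURE actually counts nodes or measures depth (for example, whether the root counts as depth 0 or 1) is not stated in this document. Only the 600 and 10 thresholds come from the text above.

```python
from jinja2 import nodes
from jinja2.sandbox import ImmutableSandboxedEnvironment

MAX_NODES = 600  # limit quoted above
MAX_DEPTH = 10   # limit quoted above


def measure(node: nodes.Node, depth: int = 1) -> tuple:
    """Return (node_count, max_depth) for the subtree rooted at node.

    Assumes the root sits at depth 1; the real renderer may count differently.
    """
    count, deepest = 1, depth
    for child in node.iter_child_nodes():
        child_count, child_depth = measure(child, depth + 1)
        count += child_count
        deepest = max(deepest, child_depth)
    return count, deepest


def within_limits(source: str, max_nodes: int = MAX_NODES, max_depth: int = MAX_DEPTH) -> bool:
    """Check a template against the node-count and depth limits without rendering it."""
    tree = ImmutableSandboxedEnvironment().parse(source)
    count, depth = measure(tree)
    return count <= max_nodes and depth <= max_depth


small_ok = within_limits("{{ value }}")
huge_ok = within_limits("{{ a }}" * 700)  # more than 600 expression nodes
```

Because the check runs on the AST, its cost depends on template size rather than on the data the template would iterate over.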

`self` References Blocked

SECURE rejects references to self.

{{ self }}

The point is to avoid exposing template internals back to the submitter. In a remote setting, even accidental access to those internals is unnecessary surface area.

Rendered Output Guards

SECURE validates rendered output after template execution. It rejects empty output, very large output, and strings that look like Python built-in or function representations.

{{ "" }}

<built-in method ...>
<function ...>

These checks matter because not all bad outcomes come from parse-time behavior. Some templates are syntactically valid but still produce output that is clearly broken, oversized, or revealing internal implementation details.
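A post-render guard along these lines might look like the sketch below. The size limit, the regular expression, and the function name are all hypothetical; the document does not state the actual threshold or the exact patterns SECURE matches, and treating whitespace-only output as empty is an assumption.

```python
import re

MAX_OUTPUT_CHARS = 10_000  # hypothetical limit; the real threshold is not given here

# Matches reprs such as "<built-in function len>" or "<function f at 0x...>".
_PYTHON_REPR = re.compile(
    r"<(?:built-in (?:function|method)|bound method|function) [^>]*>"
)


def check_rendered_output(text: str, max_chars: int = MAX_OUTPUT_CHARS) -> str:
    """Validate rendered output and return it unchanged if it passes."""
    if not text.strip():
        raise ValueError("rendered output is empty")
    if len(text) > max_chars:
        raise ValueError("rendered output is too large")
    if _PYTHON_REPR.search(text):
        raise ValueError("rendered output looks like a Python object representation")
    return text


checked = check_rendered_output("Hello, alice")
```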

Sanitized User-Facing Errors

At the engine boundary, SECURE normalizes most template failures into a generic invalid-template message.

User provided prompt generation template is invalid.

That matters in remote execution because exception details can leak information about server-side implementation, supported objects, or internal execution paths that untrusted users do not need to see.
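The normalization pattern can be sketched as a wrapper at the engine boundary: log the real failure on the server and raise only the generic message to the caller. The names `UserTemplateError` and `render_at_boundary` are hypothetical, and this sketch catches every exception, whereas the text above says only that most failures are normalized.

```python
import logging
from typing import Any, Callable

logger = logging.getLogger("template_boundary")  # hypothetical logger name

GENERIC_MESSAGE = "User provided prompt generation template is invalid."


class UserTemplateError(Exception):
    """Hypothetical user-facing error that carries only the generic message."""


def render_at_boundary(
    render: Callable[[str, dict], str], template: str, record: dict
) -> Any:
    """Call a renderer and replace any failure with a generic error."""
    try:
        return render(template, record)
    except Exception:
        # Full details stay in server-side logs only.
        logger.exception("template rendering failed")
        # "from None" drops the exception chain so internals are not exposed to the caller.
        raise UserTemplateError(GENERIC_MESSAGE) from None
```

Operators can still debug from the server log, while the submitter learns only that the template was rejected.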

These controls exist because the standard sandbox is a good baseline, but shared-service deployments need a narrower and more defensive execution model.

Why This Matters in Multi-User Deployments

The security posture changes as soon as config submission and execution are separated.

Examples:

- A centralized Data Designer service accepts configs from many users.
- An internal platform lets users upload or edit configs that are executed by a background worker.
- A REST API accepts Jinja-containing configs and runs them on server-side infrastructure.

In those environments, templates are no longer just local convenience syntax. They are untrusted input being evaluated by infrastructure the submitter does not control. In practice, that makes Jinja rendering a remote code execution concern, which is why SECURE exists and why it remains the default.

If you are deciding between local library usage and a shared service model, read Deployment Options. The library patterns are often still "trusted" deployments. The shared microservice pattern is not.

When To Use `NATIVE`

Use NATIVE when all of the following are true:

- The person submitting the config is also the person running the engine, or they are in the same trusted operational boundary.
- You want broader standard Jinja behavior than SECURE allows.
- You understand that this is a flexibility tradeoff, not the safer default.

For example, this is often reasonable in a notebook, local script, or other single-user library workflow.
