Elgato_dark/DataDesigner

mirror of https://github.com/NVIDIA-NeMo/DataDesigner synced 2026-05-24 09:48:29 +00:00

Nabin Mulepati f73da1975c

feat(models): deprecate implicit default provider routing (#594)

* feat(models): deprecate implicit default provider routing

Emit DeprecationWarning whenever the legacy "implicit default
provider" path is exercised: `ModelConfig.provider=None`, the
registry-level `ModelProviderRegistry.default`, the YAML
`default:` key in `~/.data-designer/model_providers.yaml`, and
the CLI's "Change default provider" workflow.

`resolve_model_provider_registry` skips passing `default=` in the
single-provider case so the common construction path stays quiet.
Multi-provider registries still pass `default` (per
`check_implicit_default`) and warn accordingly.

Update docs, the package README, and test fixtures to specify
`provider=` explicitly on every `ModelConfig`. New tests cover
each warning entry point and pin the post-deprecation happy paths.

Refs #589

Made-with: Cursor

* fix(models): address PR #594 review feedback

Greptile P1: ProviderRepository.load emitted its DeprecationWarning
inside a `try/except Exception` block. Under
`filterwarnings("error", DeprecationWarning)` the warn would raise,
the except would swallow it, and `load()` would silently return None
(losing the registry). Move the warn outside the catch-all so the
strict-warning path no longer drops valid configs.
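
The failure mode can be reproduced with the standard library alone. The sketch below is illustrative only: `load_swallowing`, `load_fixed`, and `run_strict` are hypothetical stand-ins, not the actual `ProviderRepository.load` code.

```python
import warnings
from typing import Optional


def load_swallowing(config: dict) -> Optional[dict]:
    """Hypothetical loader with the warn inside the catch-all."""
    try:
        if "default" in config:
            warnings.warn("'default' is deprecated", DeprecationWarning, stacklevel=2)
        return config
    except Exception:
        # Under an "error" filter the warning is raised as an exception
        # and lands here, so the valid config is dropped.
        return None


def load_fixed(config: dict) -> Optional[dict]:
    """Same loader with the warn moved outside the catch-all."""
    try:
        parsed = dict(config)
    except Exception:
        return None
    if "default" in parsed:
        warnings.warn("'default' is deprecated", DeprecationWarning, stacklevel=2)
    return parsed


def run_strict(loader, config):
    """Call loader with DeprecationWarning promoted to an error."""
    with warnings.catch_warnings():
        warnings.simplefilter("error", DeprecationWarning)
        try:
            return ("returned", loader(config))
        except DeprecationWarning:
            return ("raised", None)
```

With `run_strict`, the first loader quietly returns `None` for a config that has a `default` key, while the second lets the `DeprecationWarning` propagate to the caller.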

Greptile P2 / johnnygreco: `_warn_on_implicit_provider` and
`_warn_on_explicit_default` use `stacklevel=2`, which lands inside
pydantic v2's validator dispatch rather than at the user's
`ModelConfig(...)` / `ModelProviderRegistry(...)` call. That broke
both attribution (the source line was unhelpful) and Python's
once-per-location dedup (every call collapsed to the same
pydantic-internal key, suppressing all but the first warning).
Introduce `data_designer.config.utils.warning_helpers.warn_at_caller`,
which walks past the helper, validator, and any pydantic frames to
find the user's call site and emits via `warnings.warn_explicit` with
the user frame's `__warningregistry__`. Keeps attribution accurate
and dedup keyed on the user's (filename, lineno).
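
A rough sketch of the frame-walking idea, assuming only the standard library. `warn_at_caller_sketch` and `SKIP_PACKAGES` are hypothetical names for illustration; the real `warn_at_caller` may differ in signature and in how it selects frames.

```python
import sys
import warnings

# Assumed skip-list for this sketch only.
SKIP_PACKAGES = ("pydantic",)


def warn_at_caller_sketch(message, category=DeprecationWarning, skip_packages=SKIP_PACKAGES):
    """Emit a warning attributed to the first frame outside skip_packages."""
    frame = sys._getframe(1)  # the function that called this helper
    while frame.f_back is not None:
        top_level = frame.f_globals.get("__name__", "").split(".")[0]
        if top_level not in skip_packages:
            break
        frame = frame.f_back
    warnings.warn_explicit(
        message,
        category,
        filename=frame.f_code.co_filename,
        lineno=frame.f_lineno,
        module=frame.f_globals.get("__name__"),
        # The target frame's registry keeps dedup keyed on that frame's location.
        registry=frame.f_globals.setdefault("__warningregistry__", {}),
    )
```

Unlike a fixed `stacklevel`, the walk does not depend on how many internal frames sit between the helper and the user's call.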

johnnygreco: align the `provider_repository.py` warning copy with the
sibling site in `default_model_settings.py` ("specify provider=
explicitly on each ModelConfig instead") so both YAML-default warning
sites give the same migration instruction. The previous wording
pointed users at "ModelConfig entries" inside `model_providers.yaml`,
where ModelConfig entries don't actually live.

johnnygreco: dedup the cascade in `DataDesigner.__init__`. With
`model_providers=None` and a YAML `default:`, the user previously saw
two DeprecationWarnings for the same root cause —
`get_default_provider_name()` warns about the YAML key, then
`resolve_model_provider_registry(...)` re-warns from
`_warn_on_explicit_default`. Suppress the registry-level duplicate in
the YAML-fallback branch via `warnings.catch_warnings()` so users see
exactly one warning per user action.
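
The suppression pattern can be sketched as follows. The function names are hypothetical stand-ins for the YAML-key and registry-level warning sites, not the actual `DataDesigner.__init__` code.

```python
import warnings


def warn_registry_default():
    # Stand-in for the registry-level warning.
    warnings.warn("registry default is deprecated", DeprecationWarning, stacklevel=2)


def init_with_yaml_default():
    # Stand-in for the YAML-key warning: the one the user should see.
    warnings.warn("YAML 'default:' key is deprecated", DeprecationWarning, stacklevel=2)
    # Same root cause, so silence the follow-up warning for this call only.
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", DeprecationWarning)
        warn_registry_default()


def collect(func):
    """Return the messages of all warnings emitted while running func."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        func()
    return [str(w.message) for w in caught]
```

The `catch_warnings()` block restores the previous filters on exit, so the suppression stays scoped to the duplicate call.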

johnnygreco: tighten `_warn_on_explicit_default` to fire only when
`default is not None`. Passing `default=None` explicitly is
semantically equivalent to omitting it (caller is opting *out* of a
registry-level default), and shouldn't trigger the deprecation
nudge.

johnnygreco: add a `model_validate({...})` regression test for
`ModelConfig` so the deserialization path (legacy on-disk configs)
is pinned alongside the construction path.

Tests:
- Update `test_load_exists` and `test_save` to omit `default=` so the
  roundtrip stops exercising the deprecated YAML-default path
  unguarded (Greptile note).
- Wrap `test_resolve_model_provider_registry_with_explicit_default`,
  `test_get_provider`, and
  `test_init_user_supplied_providers_preserve_first_wins_over_yaml_default`
  in `pytest.warns` so the suite stays green under
  `-W error::DeprecationWarning` (Greptile note).
- Add `test_explicit_default_none_does_not_emit_deprecation_warning`
  to pin the tightened predicate.
- Add `test_init_yaml_default_emits_single_deprecation_warning` to
  pin the cascade-dedup behavior.

Refs #589

Made-with: Cursor

* fix(models): make deprecation warnings visible under default filters

andreatgretel (PR #594): the YAML-default warning in
`get_default_provider_name` and the registry-default warning emitted
from inside DataDesigner helpers were attributing to data_designer
library frames, not user code. Python's default filter chain includes
`ignore::DeprecationWarning`, so library-attributed entries are
silenced — meaning a normal `DataDesigner()` call with a YAML
`default:` set showed nothing, and `resolve_model_provider_registry`
warnings were similarly invisible. Two related changes:

1. `warn_at_caller`: extend the default skip-list from `("pydantic",)`
   to `("pydantic", "pydantic_core", "data_designer")` so the walk
   escapes both pydantic's validator-dispatch frames and data_designer
   helper frames before attributing. Also tighten the prefix predicate
   to exact-or-dotted-prefix matching (`name == p or
   name.startswith(p + ".")`) so e.g. `pydantic_helpers` is not
   falsely matched as part of `pydantic` (johnnygreco nit). Allow
   callers to pass a custom `skip_prefixes` for flexibility. Drop the
   "skip frame 0+1 unconditionally" guard now that prefix matching
   covers it.

2. `get_default_provider_name`: switch from
   `warnings.warn(stacklevel=2)` to `warn_at_caller`. The previous
   stacklevel pointed into `default_model_settings.py`, which is a
   library file → silenced under default filters. Verified the fix
   empirically with `python -W default`: warning is now attributed to
   the user's call site and rendered.
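
The difference between a plain prefix check and the exact-or-dotted predicate quoted in item 1 can be sketched as below. `naive_match` and `dotted_match` are hypothetical helper names; only the predicate expression itself comes from the text above.

```python
SKIP = ("pydantic", "pydantic_core", "data_designer")


def naive_match(name, prefixes):
    """Plain startswith: also matches unrelated sibling packages."""
    return any(name.startswith(p) for p in prefixes)


def dotted_match(name, prefixes):
    """Exact-or-dotted-prefix: matches the package and its submodules only."""
    return any(name == p or name.startswith(p + ".") for p in prefixes)
```

Under dotted matching, `pydantic_core` is no longer covered by the `pydantic` entry, which is presumably why it appears as its own entry in the skip-list.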

johnnygreco (PR #594): add the missing
`test_explicit_default_none_does_not_emit_deprecation_warning`
regression for the `self.default is not None` predicate landed in
the prior round.

Tests:
- New `test_warning_helpers.py` pins prefix-matching precision
  (rejects `pydantic_helpers` / `data_designer_other`), default
  skip-list contents, attribution past skip-prefix frames, and
  per-call-site dedup behavior.
- `test_get_default_provider_name_warning_attributes_to_user_frame`
  pins andreatgretel's repro for the YAML-default site.
- `test_explicit_default_warning_attributes_to_user_frame` pins the
  multi-frame case: construction goes through
  `resolve_model_provider_registry`, so the walk has to escape both
  pydantic and data_designer before landing on the test file.
- `test_explicit_default_none_does_not_emit_deprecation_warning`
  pins johnnygreco's predicate-tightening regression.

3,124 tests pass (548 config + 1,923 engine + 653 interface; +10 net
from this round).

Refs #589

Made-with: Cursor

* fix(models): apply warn_at_caller to remaining deprecation sites

greptile-apps (PR #594, r3189904028): `ProviderRepository.load`'s
YAML-default `DeprecationWarning` was using `warnings.warn(stacklevel=2)`,
which attributes to whichever data_designer frame called `load()` —
controllers, services, list/reset commands, agent introspection. Every
real call path lands on `data_designer.cli.*`, which falls under
Python's default `ignore::DeprecationWarning` filter and is silenced.
Audit found two more sites with the same problem:

- `DatasetBuilder._resolve_async_compatibility` (`allow_resize` /
  issue #552) — was using `stacklevel=4` to walk past
  `_resolve_async_compatibility -> build/build_preview -> interface ->
  user`. Brittle: any added frame (decorator, async wrapping, the
  `try/except DeprecationWarning: raise` boundary) shifts attribution
  silently. The existing test passed only because it used
  `simplefilter("always") + record=True`, which records warnings
  regardless of attribution.
- `ProviderController._handle_change_default` — was using
  `stacklevel=2`, which lands on the menu dispatcher in the same
  controller module. `print_warning` already shows the message
  visually, but programmatic observers (`pytest.warns`,
  `filterwarnings("error", ...)`) saw a library-attributed entry that
  default filters silenced.

All three migrated to `warn_at_caller` (the helper from 247fa30) so
attribution lands on the user's call site regardless of internal
chain shape. `data_designer` is already in
`DEFAULT_INTERNAL_PREFIXES`, so the walk escapes the entire library
in one pass.

Added attribution regression tests at each site asserting
`warning.filename == __file__`. A future regression to
`warnings.warn(stacklevel=N)` now fails CI instead of silently
silencing the user-facing nudge:

- `test_load_with_yaml_default_attributes_warning_to_caller`
  (test_provider_repository.py)
- `test_resolve_async_compatibility` extended with the same assertion
- `test_handle_change_default_emits_deprecation_warning` rewritten
  from `pytest.warns(...)` to a `catch_warnings(record=True)` block
  that filters for the message and asserts `filename == __file__`
  (`pytest.warns` does not check attribution, so the rewrite is
  required to actually catch the regression).
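
To illustrate why a presence-only check is not enough, here is a standard-library sketch. The "library" is compiled under a made-up filename (`fake_lib/controller.py`) so the two attributions can be told apart; all names are hypothetical and none of this is the actual test code.

```python
import warnings

LIBRARY_SOURCE = '''
import warnings

def warn_in_library():
    # stacklevel=1 attributes the warning to this library line.
    warnings.warn("default provider is deprecated", DeprecationWarning, stacklevel=1)

def warn_at_caller():
    # stacklevel=2 attributes the warning to whoever called this function.
    warnings.warn("default provider is deprecated", DeprecationWarning, stacklevel=2)
'''

# Compile under a made-up filename so attribution is distinguishable.
library = {"__name__": "fake_lib.controller"}
exec(compile(LIBRARY_SOURCE, "fake_lib/controller.py", "exec"), library)


def record(func):
    """Run func and return every DeprecationWarning it emitted."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        func()
    return [w for w in caught if issubclass(w.category, DeprecationWarning)]


THIS_FILE = record.__code__.co_filename
NAMES = ("warn_in_library", "warn_at_caller")

# A presence-only check passes for both variants...
presence = [len(record(library[n])) == 1 for n in NAMES]
# ...while a filename check separates them.
attribution = [record(library[n])[0].filename == THIS_FILE for n in NAMES]
```

Both variants emit exactly one warning, so only the `filename` comparison distinguishes the library-attributed warning from the caller-attributed one.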

3,125 tests pass (548 config + 1,923 engine + 654 interface).

Refs #589

2026-05-05 13:39:12 -06:00

Custom Model Settings

While Data Designer ships with pre-configured model providers and configurations, you can create custom configurations to use different models, adjust inference parameters, or connect to custom API endpoints.

When to Use Custom Settings

Use custom model settings when you need to:

- Use models not included in the defaults
- Adjust inference parameters (`temperature`, `top_p`, `max_tokens`) for specific use cases
- Add distribution-based inference parameters for variability
- Connect to self-hosted or custom model endpoints
- Create multiple variants of the same model with different settings

Creating and Using Custom Settings

Custom Models with Default Providers

Create custom model configurations that use the default providers (no need to define providers yourself):

```python
import data_designer.config as dd
from data_designer.interface import DataDesigner

# Create custom models using default providers
custom_models = [
    # High-temperature for more variability
    dd.ModelConfig(
        alias="creative-writer",
        model="nvidia/nemotron-3-nano-30b-a3b",
        provider="nvidia",  # Uses default NVIDIA provider
        inference_parameters=dd.ChatCompletionInferenceParams(
            temperature=1.2,
            top_p=0.98,
            max_tokens=4096,
        ),
    ),
    # Low-temperature for less variability
    dd.ModelConfig(
        alias="fact-checker",
        model="nvidia/nemotron-3-nano-30b-a3b",
        provider="nvidia",  # Uses default NVIDIA provider
        inference_parameters=dd.ChatCompletionInferenceParams(
            temperature=0.1,
            top_p=0.9,
            max_tokens=2048,
        ),
    ),
]

# Create DataDesigner (uses default providers)
data_designer = DataDesigner()

# Pass custom models to config builder
config_builder = dd.DataDesignerConfigBuilder(model_configs=custom_models)

# Add a topic column using a categorical sampler
config_builder.add_column(
    dd.SamplerColumnConfig(
        name="topic",
        sampler_type=dd.SamplerType.CATEGORY,
        params=dd.CategorySamplerParams(
            values=["Artificial Intelligence", "Space Exploration", "Ancient History", "Climate Science"],
        ),
    )
)

# Use your custom models
config_builder.add_column(
    dd.LLMTextColumnConfig(
        name="creative_story",
        model_alias="creative-writer",
        prompt="Write a creative short story about {{topic}}.",
    )
)

config_builder.add_column(
    dd.LLMTextColumnConfig(
        name="facts",
        model_alias="fact-checker",
        prompt="List 3 facts about {{topic}}.",
    )
)

# Preview your dataset
preview_result = data_designer.preview(config_builder=config_builder)
preview_result.display_sample_record()
```

!!! note "Default Providers Always Available"
    When you only specify `model_configs`, the default model providers (NVIDIA, OpenAI, and OpenRouter) are still available. You only need to create custom providers if you want to connect to different endpoints or modify provider settings.

!!! warning "Always specify provider= on ModelConfig"
    Leaving `provider` unset (or passing `provider=None`) on `ModelConfig` is deprecated. The legacy "implicit default provider" routing, which is used when `provider` is omitted, emits a `DeprecationWarning` and will be removed in a future release. Always reference the intended provider by name, as the examples on this page do. See issue #589.

!!! tip "Mixing Custom and Default Models"
    When you provide custom `model_configs` to `DataDesignerConfigBuilder`, they replace the defaults entirely. To use custom model configs in addition to the default configs, use the `add_model_config` method:

```python
import data_designer.config as dd

# Load defaults first
config_builder = dd.DataDesignerConfigBuilder()

# Add custom model to defaults
config_builder.add_model_config(
    dd.ModelConfig(
        alias="my-custom-model",
        model="nvidia/llama-3.3-nemotron-super-49b-v1.5",
        provider="nvidia",  # Uses default provider
        inference_parameters=dd.ChatCompletionInferenceParams(
            temperature=0.6,
            max_tokens=8192,
        ),
    )
)

# Now you can use both default and custom models
# Default: nvidia-text, nvidia-reasoning, nvidia-vision, etc.
# Custom: my-custom-model
```

Custom Providers with Custom Models

Define both custom providers and custom model configurations when you need to connect to services not included in the defaults:

!!! warning "Network Accessibility"
    The custom provider endpoints must be reachable from where Data Designer runs. Ensure network connectivity, firewall rules, and any VPN requirements are properly configured.

```python
import data_designer.config as dd
from data_designer.interface import DataDesigner

# Step 1: Define custom providers
custom_providers = [
    dd.ModelProvider(
        name="my-custom-provider",
        endpoint="https://api.my-llm-service.com/v1",
        provider_type="openai",  # OpenAI-compatible API
        api_key="MY_SERVICE_API_KEY",  # Environment variable name
    ),
    dd.ModelProvider(
        name="my-self-hosted-provider",
        endpoint="https://my-org.internal.com/llm/v1",
        provider_type="openai",
        api_key="SELF_HOSTED_API_KEY",
    ),
]

# Step 2: Define custom models
custom_models = [
    dd.ModelConfig(
        alias="my-text-model",
        model="openai/some-model-id",
        provider="my-custom-provider",  # References provider by name
        inference_parameters=dd.ChatCompletionInferenceParams(
            temperature=0.85,
            top_p=0.95,
            max_tokens=2048,
        ),
    ),
    dd.ModelConfig(
        alias="my-self-hosted-text-model",
        model="openai/some-hosted-model-id",
        provider="my-self-hosted-provider",
        inference_parameters=dd.ChatCompletionInferenceParams(
            temperature=0.7,
            top_p=0.9,
            max_tokens=1024,
        ),
    ),
]

# Step 3: Create DataDesigner with custom providers
data_designer = DataDesigner(model_providers=custom_providers)

# Step 4: Create config builder with custom models
config_builder = dd.DataDesignerConfigBuilder(model_configs=custom_models)

# Step 5: Add a topic column using a categorical sampler
config_builder.add_column(
    dd.SamplerColumnConfig(
        name="topic",
        sampler_type=dd.SamplerType.CATEGORY,
        params=dd.CategorySamplerParams(
            values=["Technology", "Healthcare", "Finance", "Education"],
        ),
    )
)

# Step 6: Use your custom model by referencing its alias
config_builder.add_column(
    dd.LLMTextColumnConfig(
        name="short_news_article",
        model_alias="my-text-model",  # Reference custom alias
        prompt="Write a short news article about the '{{topic}}' topic in 10 sentences.",
    )
)

config_builder.add_column(
    dd.LLMTextColumnConfig(
        name="long_news_article",
        model_alias="my-self-hosted-text-model",  # Reference custom alias
        prompt="Write a detailed news article about the '{{topic}}' topic.",
    )
)

# Step 7: Preview your dataset
preview_result = data_designer.preview(config_builder=config_builder)
preview_result.display_sample_record()
```
