DataDesigner/packages/data-designer-engine/tests/engine/analysis/conftest.py
Johnny Greco 1439bbea7e
chore: Improve CLI startup with lazy heavy import cleanup (#330)
* perf: defer heavy imports to improve CLI startup time

Move expensive imports (engine, models, controllers) out of the module-level import path so that data-designer --help and other non-generation commands no longer pay the full startup cost.

Key changes:
- Defer controller imports to inside command functions
- Remove eager re-export chains from CLI package __init__ files
- Move default-settings bootstrap into load_config_builder() and DataDesigner.__init__() instead of running at import time
- Add lazy __getattr__ exports in interface/__init__.py
- Replace module-level tokenizer init with cached lazy getter
- Fix ModelProvider import to use config layer instead of engine
- Update test mock paths to match new import locations

Reduces CLI import-time from ~1.67s to ~0.46s.
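
A minimal sketch of the deferral pattern listed above, not the project's actual code: the import sits inside the function body, so importing the module that defines a command does not pay for the dependency. `show_palette` is a hypothetical command, and the stdlib `colorsys` module stands in for a heavy dependency such as the engine or controllers.

```python
import sys


def show_palette(hue: float) -> tuple[float, float, float]:
    """Hypothetical command body; the 'heavy' import happens on first call."""
    # Deferred import: executed only when the command actually runs.
    # `colorsys` is a stand-in for an expensive dependency (assumption).
    import colorsys

    return colorsys.hsv_to_rgb(hue, 1.0, 1.0)


if __name__ == "__main__":
    print("loaded before call:", "colorsys" in sys.modules)
    print(show_palette(0.0))
    print("loaded after call:", "colorsys" in sys.modules)
```

Python caches imported modules in `sys.modules`, so repeated calls only pay a dictionary lookup after the first one.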

* perf: defer pandas/numpy in io_helpers and add config_list benchmark

- Replace eager `from lazy_heavy_imports import pd, np` in io_helpers
  with module-level __getattr__ (for backwards-compatible external
  access / test mocks) and function-level imports in the 3 functions
  that actually use them (read_parquet_dataset, smart_load_dataframe,
  _convert_to_serializable). Importing io_helpers no longer triggers
  pandas/numpy loading.
- Defer heavy imports in list and reset CLI commands into function
  bodies to avoid loading repositories, Rich, and prompt_toolkit at
  module import time.
- Add `config_list` (data-designer config list) measurement to the
  CLI startup benchmark with isolated cold measurement in a separate
  venv and a --skip-config-list-check flag.
- Update test mock paths to match new import locations.

* Refine lazy import usage and TYPE_CHECKING cleanup

* Run license header updater on PR-touched files

* fix: update sqlfluff mock target for lazy imports in test_sql

* perf: cache globals() in lazy __getattr__ to avoid repeated lookups

Add globals() caching and explanatory comment to all three lazy
__getattr__ implementations (lazy_heavy_imports, config/__init__,
interface/__init__) so subsequent attribute accesses bypass __getattr__.
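
As an illustration only, the caching idea might look like the sketch below. It assumes a PEP 562 style `__getattr__`. The alias table, `build_lazy_namespace`, and the `resolved` list are hypothetical, and stdlib modules stand in for pandas and numpy. In a real module the cache write would be `globals()[attr] = value`; here the module is built dynamically so that the example is self-contained.

```python
import importlib
import types

# Hypothetical alias -> module mapping; stdlib modules stand in for pandas/numpy.
_LAZY_MODULES = {"js": "json", "dec": "decimal"}


def build_lazy_namespace(name: str) -> types.ModuleType:
    """Return a module whose listed attributes are imported on first access."""
    module = types.ModuleType(name)
    module.resolved = []  # demo-only: records each import done by __getattr__

    def __getattr__(attr: str):
        if attr in _LAZY_MODULES:
            value = importlib.import_module(_LAZY_MODULES[attr])
            # Cache in the module namespace so later lookups succeed through
            # normal attribute access and never reach __getattr__ again.
            module.__dict__[attr] = value
            module.resolved.append(attr)
            return value
        raise AttributeError(f"module {name!r} has no attribute {attr!r}")

    # PEP 562: a module's __getattr__ is consulted only when normal lookup fails.
    module.__getattr__ = __getattr__
    return module


lazy = build_lazy_namespace("lazy_sketch")
```

Accessing `lazy.js` twice triggers a single import; the second access is served from the module namespace.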

* perf: lazy CLI command loading and deferred heavy import evaluations

- Add LazyTyperGroup to defer command module loading until invocation, allowing module-level imports in all CLI command files

- Split DataFrameSeedSource into seed_source_dataframe.py to isolate pandas dependency from other seed source classes

- Move TypeVar/TypeAlias definitions (DataT, NumpyArray1dT, RadomStateT, EngineT) to TYPE_CHECKING blocks with runtime fallbacks

- Wrap module-level constants in lru_cache (phone_number parquet data, jsonschema validator) to defer I/O and heavy imports to first use

- Update test mock targets to patch at usage-site for module-level imports
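
A rough sketch of the `lru_cache` item above, under the assumption that a former module-level constant is replaced by a cached zero-argument getter. `get_area_codes` and its return values are hypothetical; the real code loads parquet data and builds a jsonschema validator.

```python
from functools import lru_cache


@lru_cache(maxsize=1)
def get_area_codes() -> tuple[str, ...]:
    """Hypothetical getter replacing a constant that was built at import time."""
    # A real implementation would do the expensive work here (file I/O,
    # heavy imports); it runs once, on the first call only.
    return ("212", "415", "617")


if __name__ == "__main__":
    get_area_codes()
    get_area_codes()
    # cache_info() reports how many calls were served from the cache.
    print(get_area_codes.cache_info())
```

Callers switch from reading the constant to calling the getter, and tests can reset the state with `get_area_codes.cache_clear()`.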

* refactor: use direct pandas import in seed_source_dataframe

Drop lazy-loading for pandas in DataFrameSeedSource; use direct import
for simplicity.

* update lazy import pattern

* update tests to use lazy import namespace

Switch test modules to import data_designer.lazy_heavy_imports as lazy
and reference heavy libraries through that namespace. This keeps heavy
imports deferred during module import and aligns tests with the new
lazy-import usage pattern.

* tighten import perf test thresholds

Document recent baseline timings and lower the allowed average
import time and timeout so regressions are detected sooner.

* document pandas import requirement

Clarify that Pydantic needs DataFrame resolved at module load and
that keeping the direct import preserves IDE typing support.

* increase timeout time

* use lazy pandas imports in visualization tests

- replace direct pandas usage with lazy.pd in visualization tests to avoid eager imports
- add TYPE_CHECKING pandas import and keep CLI controller imports sorted

* fix lazy pandas runtime usage and preview mocks

Switch sample-record handling to lazy pandas types so runtime paths no longer
depend on TYPE_CHECKING imports. Align preview controller tests to patch the
module-local DataDesigner symbol, preventing real engine invocation in save
results scenarios.
2026-02-18 16:24:15 -05:00

158 lines
4.7 KiB
Python

# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

from __future__ import annotations

import json
from pathlib import Path
from typing import TYPE_CHECKING
from unittest.mock import Mock, patch

from pytest import fixture

import data_designer.lazy_heavy_imports as lazy
from data_designer.config.analysis.column_statistics import (
    CategoricalHistogramData,
    ColumnDistributionType,
    NumericalDistribution,
)
from data_designer.config.column_configs import LLMJudgeColumnConfig, Score
from data_designer.config.column_types import ColumnConfigT
from data_designer.config.models import ModelConfig
from data_designer.engine.analysis.dataset_profiler import (
    DataDesignerDatasetProfiler,
    DatasetProfilerConfig,
)
from data_designer.engine.analysis.utils.judge_score_processing import JudgeScoreDistributions
from data_designer.engine.models.registry import ModelRegistry
from data_designer.engine.registry.data_designer_registry import DataDesignerRegistry
from data_designer.engine.resources.resource_provider import ResourceProvider
from data_designer.engine.storage.artifact_storage import ArtifactStorage

if TYPE_CHECKING:
    import pandas as pd


@fixture
def test_data_path() -> Path:
    return Path(__file__).parent / "test_data"


@fixture
def stub_artifact_path(test_data_path: Path) -> Path:
    return test_data_path / "artifacts"


@fixture
def stub_dataset_path(stub_artifact_path: Path) -> Path:
    return stub_artifact_path / "dataset"


@fixture
def stub_df(stub_dataset_path: Path) -> pd.DataFrame:
    return lazy.pd.read_json(
        stub_dataset_path / "dataset.json",
        orient="records",
        dtype_backend="pyarrow",
    )


@fixture
def stub_dataset_metadata_path(stub_dataset_path: Path) -> Path:
    return stub_dataset_path / "metadata.json"


@fixture
def column_configs(dataset_profiler: DataDesignerDatasetProfiler) -> list[ColumnConfigT]:
    return dataset_profiler.config.column_configs


@fixture
def dataset_profiler(
    stub_dataset_path: Path,
    artifact_storage: ArtifactStorage,
) -> DataDesignerDatasetProfiler:
    # Load the stub column configs stored alongside the dataset
    with open(stub_dataset_path / "column_configs.json", "r") as f:
        column_configs = json.load(f)

    model_config = Mock(spec=ModelConfig)
    model_config.alias = "nano"

    model_registry = Mock(spec=ModelRegistry)
    model_registry.model_configs = {"nano": model_config}

    profiler = DataDesignerDatasetProfiler(
        config=DatasetProfilerConfig(column_configs=column_configs),
        resource_provider=ResourceProvider(artifact_storage=artifact_storage, model_registry=model_registry),
    )
    return profiler


@fixture
def stub_df_with_mixed_column_types():
    data = {
        "int_column": [1, 2, 3, 4, 5],
        "float_column": [1.1, 2.2, 3.3, 4.4, 5.5],
        "string_column": ["a", "b", "c", "d", "e"],
        "int_with_nulls_column": [1, 2, None, 4, None],
    }
    return lazy.pa.Table.from_pydict(data).to_pandas(types_mapper=lazy.pd.ArrowDtype)


@fixture
def mock_prompt_renderer_render():
    with patch(
        "data_designer.engine.analysis.utils.column_statistics_calculations.RecordBasedPromptRenderer.render"
    ) as mock:
        yield mock


@fixture
def data_designer_registry() -> DataDesignerRegistry:
    return DataDesignerRegistry()


@fixture
def stub_score():
    """Create a sample rubric for testing."""
    return Score(
        name="Quality",
        description="Quality assessment score",
        options={
            4: "Excellent quality",
            3: "Good quality",
            2: "Fair quality",
            1: "Poor quality",
            0: "Very poor quality",
        },
    )


@fixture
def stub_judge_column_config(stub_score):
    """Create a sample LLMJudgeColumnConfig for testing."""
    return LLMJudgeColumnConfig(
        name="judge_scores",
        prompt="Evaluate the quality",
        model_alias="test_model",
        scores=[stub_score],
    )


@fixture
def stub_judge_distributions():
    return JudgeScoreDistributions(
        scores={"Quality": [4, 3, 2, 1, 0]},
        reasoning={"Quality": ["Excellent", "Good", "Fair", "Poor", "Very Poor"]},
        distribution_types={"Quality": ColumnDistributionType.NUMERICAL},
        distributions={"Quality": NumericalDistribution(min=0, max=4, mean=2.0, stddev=1.4, median=2.0)},
        histograms={"Quality": CategoricalHistogramData(categories=[4, 3, 2, 1, 0], counts=[1, 1, 1, 1, 1])},
    )


@fixture
def stub_resource_provider_no_model_registry(tmp_path):
    """Create a ResourceProvider without a model registry for testing."""
    return ResourceProvider(artifact_storage=ArtifactStorage(artifact_path=tmp_path))