*Issue #, if available:*
*Description of changes:*
Rename all variables and methods that refer to "task" in `dataset.py` to
use `input` instead:
- `PreparedTask` → `PreparedInput`
- `self.tasks` → `self.inputs`
- `prepare_tasks` → `prepare_inputs`
- `validate_and_prepare_single_dict_task` →
`validate_and_prepare_single_dict_input`
- All `task_` prefixed local variables renamed (e.g., `task_target` →
`target`, `task_context` → `context`, `task_past_tensor` →
`past_tensor`, etc.)
By submitting this pull request, I confirm that you can use, modify,
copy, and redistribute this contribution, under the terms of your
choice.
*Issue #, if available:*
This PR enables fine-tuning on datasets that don't fit into memory. The
main idea is to decouple the preprocessing logic from the
`Chronos2Dataset` class.
The workflow is as follows:
1. Preprocess raw data once using `prepare_tasks()` → save to any
`Sequence[PreparedTask]` container like `datasets.Dataset`
2. During training, load tasks lazily via memory-mapped Arrow files
3. Pass data to `pipeline.fit(..., convert_inputs=False)` to skip
redundant preprocessing
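The "any `Sequence` container" idea in step 1 can be sketched without any dependencies. Everything below is hypothetical and only for illustration: `LazyRecordSequence`, the one-JSON-file-per-record layout, and the record fields are stand-ins, not part of this PR. In practice a `datasets.Dataset` backed by memory-mapped Arrow files would play this role.

```python
import json
import tempfile
from collections.abc import Sequence
from pathlib import Path


class LazyRecordSequence(Sequence):
    """Hypothetical read-only sequence that loads one record per access."""

    def __init__(self, directory: Path) -> None:
        # Only the file paths live in memory, not the records themselves.
        self._paths = sorted(directory.glob("*.json"))

    def __len__(self) -> int:
        return len(self._paths)

    def __getitem__(self, index):
        if isinstance(index, slice):
            return [self[i] for i in range(*index.indices(len(self)))]
        # Read the record from disk only when it is requested.
        with self._paths[index].open() as f:
            return json.load(f)


def write_records(directory: Path, records: list) -> None:
    """One-off preprocessing step: persist each record to its own file."""
    for i, record in enumerate(records):
        (directory / f"{i:08d}.json").write_text(json.dumps(record))


tmp_dir = Path(tempfile.mkdtemp())
write_records(
    tmp_dir,
    [
        {"context": [1.0, 2.0, 3.0], "n_targets": 1},
        {"context": [4.0, 5.0], "n_targets": 1},
    ],
)
lazy_inputs = LazyRecordSequence(tmp_dir)
```

Because the container only implements `__len__` and `__getitem__`, a training loop can index into it without ever holding the full dataset in memory.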
*Description of changes:*
- Added `PreparedTask` `TypedDict` defining the preprocessed task schema
(context, future_covariates, n_targets, n_covariates,
n_future_covariates as torch.Tensor/int)
- Extracted `prepare_tasks()` as a standalone function so it can be used
in preprocessing scripts
- Added `convert_inputs` parameter to `Chronos2Dataset` and
`Chronos2Pipeline.fit()` to toggle between raw input preprocessing
(default) and pre-processed input passthrough.
By submitting this pull request, I confirm that you can use, modify,
copy, and redistribute this contribution, under the terms of your
choice.
*Issue #, if available:* #425
*Description of changes:*
- Add `freq: str | None` parameter to `predict_df` methods. This can
only be set in combination with `validate_inputs=False`. If specified,
the user-provided `freq` will be used instead of trying to infer the
`freq` from the data.
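A minimal sketch of the intended behavior, not the actual implementation: `resolve_freq` is a hypothetical helper, and raising a `ValueError` when `freq` is combined with `validate_inputs=True` is an assumption about how the restriction is enforced.

```python
from typing import Optional

import pandas as pd


def resolve_freq(
    timestamps: pd.DatetimeIndex, freq: Optional[str], validate_inputs: bool
) -> Optional[str]:
    """Hypothetical helper: prefer a user-provided freq over inference."""
    if freq is not None:
        if validate_inputs:
            # Assumption: the combination is rejected rather than silently ignored.
            raise ValueError("`freq` can only be set together with validate_inputs=False")
        return freq
    # Fall back to inference; pandas needs at least 3 timestamps for this.
    return pd.infer_freq(timestamps)


daily = pd.date_range("2020-01-01", periods=5, freq="D")
inferred = resolve_freq(daily, freq=None, validate_inputs=True)
# With an explicit freq, a series too short for inference can still be handled.
explicit = resolve_freq(daily[:2], freq="D", validate_inputs=False)
```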
By submitting this pull request, I confirm that you can use, modify,
copy, and redistribute this contribution, under the terms of your
choice.
*Issue #, if available:*
*Description of changes:*
- Refactor the notebook to cover real-time inference (CPU & GPU),
serverless inference and batch prediction options for Chronos-2 on
SageMaker
- Update README
By submitting this pull request, I confirm that you can use, modify,
copy, and redistribute this contribution, under the terms of your
choice.
*Issue #, if available:*
*Description of changes:*
- Replace the for-loop with numpy operations and a single `pd.DataFrame`
construction
By submitting this pull request, I confirm that you can use, modify,
copy, and redistribute this contribution, under the terms of your
choice.
*Issue #, if available:*
*Description of changes:* Adds support for custom callbacks after each
batch is processed during prediction. This allows for keeping track of
the time limit in AutoGluon.
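A rough sketch of how a per-batch callback can enforce a time limit. All names here (`predict_in_batches`, `after_batch`, `make_time_limit_callback`, `TimeLimitExceeded`) are hypothetical and do not reflect the actual callback signature added in this PR.

```python
import time
from typing import Any, Callable, Iterable, List, Optional


class TimeLimitExceeded(RuntimeError):
    """Raised by the callback once the time budget is used up."""


def make_time_limit_callback(
    time_limit_s: float, clock: Callable[[], float] = time.monotonic
) -> Callable[[], None]:
    start = clock()

    def after_batch() -> None:
        # Checked once per batch, so the overshoot is at most one batch.
        if clock() - start > time_limit_s:
            raise TimeLimitExceeded(f"exceeded {time_limit_s}s time limit")

    return after_batch


def predict_in_batches(
    batches: Iterable[Any],
    predict_fn: Callable[[Any], Any],
    after_batch: Optional[Callable[[], None]] = None,
) -> List[Any]:
    outputs = []
    for batch in batches:
        outputs.append(predict_fn(batch))
        if after_batch is not None:
            after_batch()  # invoked after each processed batch
    return outputs
```

Injecting `clock` keeps the callback testable without sleeping.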
By submitting this pull request, I confirm that you can use, modify,
copy, and redistribute this contribution, under the terms of your
choice.
*Issue #, if available:*
*Description of changes:* This PR improves test coverage by adding unit
tests for `df_utils`. Previously these methods were only being tested as
part of Chronos-2 integration tests.
By submitting this pull request, I confirm that you can use, modify,
copy, and redistribute this contribution, under the terms of your
choice.
*Issue #, if available:*
*Description of changes:* Previously, only the returned pipeline had the
correct configuration, but that configuration was not being saved to disk.
By submitting this pull request, I confirm that you can use, modify,
copy, and redistribute this contribution, under the terms of your
choice.
*Issue #, if available:*
*Description of changes:*
- Rename `predict_batches_jointly` to `cross_learning`
- Add deprecation warning
- Add cross_learning to predict_df docstring
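One common way to handle such a rename is sketched below. `resolve_cross_learning` is a hypothetical helper, and the choice of `FutureWarning` and the message text are assumptions; the PR may use a different warning category or wording.

```python
import warnings


def resolve_cross_learning(cross_learning: bool = False, **kwargs) -> bool:
    """Hypothetical helper: accept the old keyword but warn about it."""
    if "predict_batches_jointly" in kwargs:
        warnings.warn(
            "`predict_batches_jointly` is deprecated, use `cross_learning` instead.",
            FutureWarning,  # assumption: the actual category may differ
            stacklevel=2,
        )
        # The deprecated keyword takes over the value of the new one.
        cross_learning = kwargs.pop("predict_batches_jointly")
    if kwargs:
        raise TypeError(f"Unexpected keyword arguments: {sorted(kwargs)}")
    return cross_learning
```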
By submitting this pull request, I confirm that you can use, modify,
copy, and redistribute this contribution, under the terms of your
choice.
*Issue #, if available:*
*Description of changes:* 0 is a better default than 1.
By submitting this pull request, I confirm that you can use, modify,
copy, and redistribute this contribution, under the terms of your
choice.
*Issue #, if available:* Fixes #403
*Description of changes:*
- Update the `future_df` validation logic to only check that
`prediction_length` values are provided for each item.
- Update unit tests for DF-based methods in `test_chronos2.py`
- Ignore fine-tuned checkpoint folders with `.gitignore`
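The relaxed check can be illustrated roughly as follows. This is a dependency-free sketch with a hypothetical helper name (`check_future_lengths`); the real validation operates on dataframes and may differ in detail.

```python
from collections import Counter
from typing import Hashable, Iterable


def check_future_lengths(
    context_item_ids: Iterable[Hashable],
    future_item_ids: Iterable[Hashable],
    prediction_length: int,
) -> None:
    """Hypothetical check: every item has exactly `prediction_length` future rows."""
    counts = Counter(future_item_ids)
    # dict.fromkeys de-duplicates the item ids while preserving their order.
    wrong = {
        item: counts[item]
        for item in dict.fromkeys(context_item_ids)
        if counts[item] != prediction_length
    }
    if wrong:
        raise ValueError(
            f"Expected {prediction_length} future rows per item, got: {wrong}"
        )
```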
By submitting this pull request, I confirm that you can use, modify,
copy, and redistribute this contribution, under the terms of your
choice.
*Issue #, if available:*
*Description of changes:* This PR adds a `validate_inputs` argument to
`predict_df` (defaults to `True`), which allows the user to disable
dataframe validation when they know that their dataframe is in the right
format. This reduces runtime by removing the input validation component,
e.g., when calling this method from
[AutoGluon](https://github.com/autogluon/autogluon/pull/5427), and also
handles series with fewer than 3 timesteps.
By submitting this pull request, I confirm that you can use, modify,
copy, and redistribute this contribution, under the terms of your
choice.
*Issue #, if available:*
*Description of changes:* Adds support for LoRA fine-tuning.
- [x] Move peft/pandas dependency to an extra
- [x] Add tests for LoRA
- [x] Update notebook with LoRA info
- [x] Enable automatic recognition and loading of LoRA adapters
By submitting this pull request, I confirm that you can use, modify,
copy, and redistribute this contribution, under the terms of your
choice.
*Issue #, if available:* Addresses #391
*Description of changes:*
- Speed up `convert_df_input_to_list_of_dicts_input` and
`validate_df_inputs` via a few tricks:
- Replace `df.iloc[start_idx:end_idx][col]` with
`df[col].iloc[start_idx:end_idx]` to avoid copying data on each slice
- Vectorize computation of future timestamps using numpy
- Work with `dict[str, np.ndarray]` instead of `pd.DataFrame` when
working with covariates to avoid repeated `.to_numpy()` calls.
**Before**
```
Benchmarking 20000 series, 200 steps, 0 covariates...
Average runtime: 27.33s
Benchmarking 20000 series, 200 steps, 5 covariates...
Average runtime: 44.69s
```
**After**
```
Benchmarking 20000 series, 200 steps, 0 covariates...
Average runtime: 4.60s
Benchmarking 20000 series, 200 steps, 5 covariates...
Average runtime: 8.92s
```
<details>
```python
import time
import numpy as np
import pandas as pd
from chronos.df_utils import convert_df_input_to_list_of_dicts_input
def benchmark_convert_df_input(
num_items: int, num_steps: int, num_covariates: int = 0, num_trials: int = 10, freq: str = "D"
) -> None:
"""
Benchmark convert_df_input_to_list_of_dicts_input function.
Args:
num_items: Number of time series
num_steps: Number of observations per series
num_covariates: Number of covariates to include
num_trials: Number of benchmark trials
freq: Frequency string for timestamps
"""
prediction_length = 24
# Generate context DataFrame
item_ids = np.repeat(np.arange(num_items), num_steps)
timestamps = np.tile(pd.date_range("2020-01-01", periods=num_steps, freq=freq), num_items)
df_data = {"item_id": item_ids, "timestamp": timestamps, "target": np.random.randn(num_items * num_steps)}
df_data.update({f"cov_{i}": np.random.randn(num_items * num_steps) for i in range(num_covariates)})
df = pd.DataFrame(df_data)
# Generate future_df with covariates
future_df = None
if num_covariates > 0:
future_item_ids = np.repeat(np.arange(num_items), prediction_length)
offset = pd.tseries.frequencies.to_offset(freq)
future_start = pd.Timestamp("2020-01-01") + num_steps * offset
future_timestamps = np.tile(pd.date_range(start=future_start, periods=prediction_length, freq=freq), num_items)
future_data = {"item_id": future_item_ids, "timestamp": future_timestamps}
future_data.update({f"cov_{i}": np.random.randn(num_items * prediction_length) for i in range(num_covariates)})
future_df = pd.DataFrame(future_data)
times = []
print(f"Benchmarking {num_items} series, {num_steps} steps, {num_covariates} covariates...")
for _ in range(num_trials):
start = time.perf_counter()
convert_df_input_to_list_of_dicts_input(
df=df,
future_df=future_df,
id_column="item_id",
timestamp_column="timestamp",
target_columns=["target"],
prediction_length=prediction_length,
)
end = time.perf_counter()
times.append(end - start)
print(f"Average runtime: {sum(times) / len(times):.2f}s")
if __name__ == "__main__":
# Test without covariates
benchmark_convert_df_input(20_000, 200, num_covariates=0, num_trials=1)
# Test with covariates
benchmark_convert_df_input(20_000, 200, num_covariates=5, num_trials=1)
```
</details>
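For illustration, the "vectorize computation of future timestamps" trick can look roughly like this. It is a simplified sketch, not the code in this PR: `future_timestamps` is a hypothetical helper, and it assumes a fixed-step frequency expressed as a `np.timedelta64` (calendar-based frequencies such as month-end would need different handling).

```python
import numpy as np


def future_timestamps(
    last_timestamps: np.ndarray, step: np.timedelta64, prediction_length: int
) -> np.ndarray:
    """Return an array of shape (num_items, prediction_length) without a Python loop."""
    # Offsets 1*step, 2*step, ..., prediction_length*step.
    offsets = np.arange(1, prediction_length + 1) * step
    # Broadcasting adds every offset to every item's last timestamp at once.
    return last_timestamps[:, None] + offsets[None, :]


last = np.array(["2020-01-31", "2020-02-27"], dtype="datetime64[D]")
future = future_timestamps(last, np.timedelta64(1, "D"), prediction_length=3)
```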
By submitting this pull request, I confirm that you can use, modify,
copy, and redistribute this contribution, under the terms of your
choice.
*Issue #, if available:*
*Description of changes:* Lower learning rates generally appear to be
working better. This is probably because we are doing full fine-tuning
of a model with 120M params.
By submitting this pull request, I confirm that you can use, modify,
copy, and redistribute this contribution, under the terms of your
choice.
*Issue #, if available:*
*Description of changes:* This PR masks rows corresponding to all
covariates in the future target. Specifically, this is to avoid the
contribution of past-only covariates in loss computation. The previous
setup was correct from the perspective of pretraining but I think this
makes more sense for fine-tuning.
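The effect on the loss can be sketched as below. This is only an illustration under stated assumptions: `masked_loss` is a hypothetical helper, the loss is given per element with shape `(n_variates, horizon)`, and the first `n_targets` rows are assumed to be the targets with all covariate rows after them.

```python
import numpy as np


def masked_loss(per_element_loss: np.ndarray, n_targets: int) -> float:
    """Average the loss over target rows only, ignoring all covariate rows."""
    mask = np.zeros(per_element_loss.shape, dtype=bool)
    mask[:n_targets] = True  # assumption: target rows come first
    return float(per_element_loss[mask].mean())


# One target row followed by two covariate rows, horizon of 2.
loss = np.array([[1.0, 3.0], [10.0, 10.0], [20.0, 20.0]])
value = masked_loss(loss, n_targets=1)  # covariate rows do not contribute
```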
By submitting this pull request, I confirm that you can use, modify,
copy, and redistribute this contribution, under the terms of your
choice.
*Issue #, if available:* #354
*Description of changes:* This PR adds `Chronos2Pipeline.embed` to
enable users to extract embeddings from the last encoder layer in an
easy way. The API and behavior are similar to what Chronos and
Chronos-Bolt provide.
By submitting this pull request, I confirm that you can use, modify,
copy, and redistribute this contribution, under the terms of your
choice.
*Issue #, if available:*
*Description of changes:* This PR adds `predict_df` to the base pipeline
which enables pandas support for the univariate Chronos and Chronos-Bolt
models.
By submitting this pull request, I confirm that you can use, modify,
copy, and redistribute this contribution, under the terms of your
choice.
By default, the `transformers` library sets the `num_workers` argument
of the PyTorch DataLoader to `0`, ensuring out-of-the-box compatibility
across different platforms.
*Issue #, if available:*
*Description of changes:*
Set the DataLoader `num_workers` argument to `0` to improve
cross-platform compatibility, particularly on Windows systems where
multiprocessing requires guarded execution.
By submitting this pull request, I confirm that you can use, modify,
copy, and redistribute this contribution, under the terms of your
choice.
---------
Co-authored-by: Abdul Fatir <Abdulfatirs@gmail.com>