*Description of changes:* This PR removes casting to `fp32` for the
`cumsum` operation and upgrades `mlx` to `~=0.10.0` which adds `bf16`
support for `cumsum`.
Related: https://github.com/ml-explore/mlx/issues/959
By submitting this pull request, I confirm that you can use, modify,
copy, and redistribute this contribution, under the terms of your
choice.
Co-authored-by: Abdul Fatir Ansari <ansarnd@amazon.com>
*Description of changes:* Minor simplification of how the tokenizer is
constructed from the config.
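The following is a minimal sketch of the kind of simplification meant here: a config object that builds its own tokenizer from a class name and keyword arguments it stores. All names (`ModelConfig`, `UniformBinTokenizer`, `TOKENIZER_CLASSES`, `create_tokenizer`) are hypothetical and are not taken from this PR's diff.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


class UniformBinTokenizer:
    """Hypothetical tokenizer stand-in; it only records its settings."""

    def __init__(self, low_limit: float, high_limit: float, config: "ModelConfig") -> None:
        self.low_limit = low_limit
        self.high_limit = high_limit
        self.config = config


# Hypothetical registry mapping the class name stored in the config to the class.
TOKENIZER_CLASSES = {"UniformBinTokenizer": UniformBinTokenizer}


@dataclass
class ModelConfig:
    tokenizer_class: str
    tokenizer_kwargs: Dict[str, Any] = field(default_factory=dict)

    def create_tokenizer(self) -> UniformBinTokenizer:
        # Look up the class by name, then pass the stored kwargs plus the config
        # itself, so callers do not have to assemble these arguments by hand.
        cls = TOKENIZER_CLASSES[self.tokenizer_class]
        return cls(**self.tokenizer_kwargs, config=self)


config = ModelConfig("UniformBinTokenizer", {"low_limit": -15.0, "high_limit": 15.0})
tokenizer = config.create_tokenizer()
```

With this pattern, callers need a single `config.create_tokenizer()` call instead of looking up the tokenizer class and building its arguments themselves.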
*Description of changes:* Speed up the GitHub Actions workflow by
installing the CPU-only version of torch.
*Description of changes:* Fix some type-checking issues, add mypy to the
GitHub workflow, and apply black formatting.
*Description of changes:* This PR adds `pipeline.embed`, which extracts
encoder embeddings from the model. These embeddings may be useful for
downstream tasks such as classification.
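Encoder embeddings are typically a sequence of vectors per series, while a classifier usually needs one fixed-size vector per series. Below is a minimal sketch of one common reduction, masked mean pooling. It assumes a `(batch, seq_len, d_model)` layout plus a padding mask; the helper `mean_pool` is hypothetical and not part of `pipeline.embed`, and plain Python lists are used to keep it dependency-free.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def mean_pool(embeddings: Sequence[Matrix], mask: Sequence[Sequence[bool]]) -> Matrix:
    """Average each series' token embeddings over positions where mask is True.

    embeddings: batch x seq_len x d_model nested lists (assumed layout).
    mask: batch x seq_len, True for real (non-padding) positions.
    Returns a batch x d_model list of fixed-size feature vectors.
    """
    pooled: Matrix = []
    for seq, seq_mask in zip(embeddings, mask):
        # Keep only the vectors at unmasked positions.
        kept = [vec for vec, keep in zip(seq, seq_mask) if keep]
        if not kept:
            raise ValueError("every series needs at least one unmasked position")
        d_model = len(kept[0])
        # Element-wise mean across the kept positions.
        pooled.append([sum(vec[j] for vec in kept) / len(kept) for j in range(d_model)])
    return pooled
```

Mean pooling is only one choice; taking a single designated position or max pooling are common alternatives, and which works best depends on the downstream task.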
---------
Co-authored-by: Abdul Fatir Ansari <ansarnd@amazon.de>
*Issue #, if available:* Unnecessary context padding slows down
inference. We evaluated the models from Hugging Face with this change and
found no concerning accuracy issues.
Benchmark code for a batch of 8 series with context length 200:
```python
import time

import torch
from chronos import ChronosPipeline

# Load the large Chronos model in bfloat16 on the GPU.
pipeline = ChronosPipeline.from_pretrained(
    "amazon/chronos-t5-large",
    device_map="cuda",
    torch_dtype=torch.bfloat16,
)

# Batch of 8 series, each with 200 observations.
context = torch.ones((8, 200))
prediction_length = 24
num_runs = 10

# Time num_runs forecasting calls with 20 sample paths each.
t0 = time.time()
for _ in range(num_runs):
    forecast = pipeline.predict(
        context,
        prediction_length,
        num_samples=20,
    )
t1 = time.time()
print(f"total time: {t1 - t0}")
```
Before the change:
```
total time: 20.005481481552124
```
After the change:
```
total time: 9.82350754737854
```
*Description of changes:* Avoid padding the context up to
`context_length` when the series in the provided batch are shorter than
`context_length`.
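A minimal sketch of the idea, assuming NaN marks missing values and that left-padding is used (the helper name `left_pad_batch` is hypothetical): each series is truncated to its last `context_length` values, and the batch is then padded only to its longest member instead of always to `context_length`.

```python
import math
from typing import List, Sequence


def left_pad_batch(series: Sequence[Sequence[float]], context_length: int) -> List[List[float]]:
    """Left-pad a batch of 1-D series with NaN to a common length.

    Series longer than context_length keep only their most recent values;
    the batch is padded to its longest member, not to context_length.
    """
    truncated = [list(s)[-context_length:] for s in series]
    target = max(len(s) for s in truncated)
    return [[math.nan] * (target - len(s)) + s for s in truncated]


# context_length=512 is only an illustrative value here.
batch = left_pad_batch([[1.0, 2.0, 3.0], [4.0]], context_length=512)
```

For inputs like the benchmark above (eight series of length 200), this produces a batch of length 200 rather than the model's full `context_length`, so the encoder processes fewer positions.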
*Description of changes:* This PR adds optional inference params such as
`num_samples`, `top_k`, etc. to the example in the README for clarity.
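For readers unfamiliar with `top_k` and `top_p`, the sketch below shows the standard filtering these parameters usually refer to in sampling-based generation: keep the `top_k` most likely tokens, then keep the smallest prefix whose cumulative probability reaches `top_p`, and renormalize. This is a generic illustration under that assumption, not Chronos's or Hugging Face's implementation; the helper `filter_probs` is hypothetical, and temperature scaling (usually applied before these filters) is omitted.

```python
from typing import Dict, Optional


def filter_probs(
    probs: Dict[int, float],
    top_k: Optional[int] = None,
    top_p: Optional[float] = None,
) -> Dict[int, float]:
    """Apply top-k, then top-p (nucleus) filtering, and renormalize."""
    # Rank tokens from most to least likely.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]
    if top_p is not None:
        kept, cumulative = [], 0.0
        for token, p in ranked:
            kept.append((token, p))
            cumulative += p
            # Stop once the kept tokens cover at least top_p of the mass.
            if cumulative >= top_p:
                break
        ranked = kept
    total = sum(p for _, p in ranked)
    return {token: p / total for token, p in ranked}


probs = {0: 0.5, 1: 0.3, 2: 0.15, 3: 0.05}
filtered = filter_probs(probs, top_k=2)
```

Smaller `top_k` or `top_p` values concentrate sampling on the most likely tokens, which typically narrows the spread of the sampled forecast paths.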
*Issue #, if available:* N/A
*Description of changes:*
Thanks for the very clean implementation of the Model, Tokenizer, and
Pipeline. I was curious about it and found a minor improvement to the
API; what do you think? Feel free to close. The change is untested.