* fix: check find() return value before adding offset in try_fix_tokenizer
The `str.find()` result was checked for -1 only after adding
`len(find_text)`, turning the guard into dead code. When the substring
is absent, `start` becomes `len(find_text) - 1` (a positive number),
so the `if start == -1: continue` never triggers and the subsequent
slice extracts garbage from the tokenizer string.
Split the find and offset into two steps so the -1 check works correctly.
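A minimal sketch of the corrected pattern (function and variable names here are illustrative, not the actual tokenizer-fixing code):

```python
def extract_after(tokenizer_string, find_text):
    # Buggy version: the offset was added before the -1 check,
    #   start = tokenizer_string.find(find_text) + len(find_text)
    #   if start == -1: continue
    # which can never fire: a failed find() yields len(find_text) - 1.

    # Fixed: check find() first, then add the offset.
    start = tokenizer_string.find(find_text)
    if start == -1:
        return None
    start += len(find_text)
    end = tokenizer_string.find('",', start)
    if end == -1:
        return None
    return tokenizer_string[start:end]
```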
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Add defensive guards for token_id None and end find() returning -1
- Skip loop iteration early when token_id is None to avoid constructing
a find_text that can never match valid JSON
- Guard end = tokenizer_string.find('",', start) against -1 to prevent
silent garbage extraction from malformed tokenizer strings
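Sketched as a helper (the `find_text` format below is an assumption based on the description; the real code may build it differently):

```python
def fixable_entries(vocab_items):
    # Skip entries whose token_id is None up front: a find_text built
    # from them could never match a valid JSON tokenizer string.
    for token, token_id in vocab_items:
        if token_id is None:
            continue
        yield token, f'"{token}": {token_id}'
```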
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Fix raw text paragraph break normalization
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Normalize horizontal whitespace before stripping non-ASCII and collapse leftover doubles
Run the [^\S\n]+ horizontal-whitespace collapse before the non-ASCII strip
so that Unicode whitespace (\u00A0, \u202F, \u2009, \u3000, \v, \f, etc.)
becomes a single ASCII space instead of being deleted outright. The prior
ordering silently merged adjacent words in HTML/PDF/OCR-sourced text:
"hello\u00a0world" used to produce "helloworld"; with this change it
produces "hello world".
Also drop \t from the allow-list since the horizontal-whitespace collapse
already normalizes tabs to a single space, and add a targeted [ ]{2,} pass
right after the non-ASCII strip so that a non-whitespace non-ASCII character
sitting between two spaces ("word1 © word2") does not leave an interior
double space. Without this extra pass, clean_text was not idempotent on
such inputs: the first call produced "word1  word2" (with a double space)
and only the second call collapsed it to "word1 word2". Fuzz testing over
10000 random inputs
now satisfies the idempotence invariant in every case.
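The resulting pass ordering, sketched as a simplified `clean_text` (the real function also handles paragraph breaks, omitted here):

```python
import re

def clean_text(text):
    # 1) Collapse horizontal whitespace (including Unicode: NBSP, thin
    #    space, ideographic space, \v, \f, tabs) to one ASCII space
    #    BEFORE the non-ASCII strip, so separators survive as spaces.
    text = re.sub(r"[^\S\n]+", " ", text)
    # 2) Strip remaining non-ASCII characters.
    text = re.sub(r"[^\x00-\x7f]+", "", text)
    # 3) Targeted pass: a stripped character between two spaces would
    #    otherwise leave an interior double space, breaking idempotence.
    text = re.sub(r"[ ]{2,}", " ", text)
    return text
```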
* Add regression tests for Unicode/control whitespace and non-ASCII edge cases
Cover:
- Unicode horizontal whitespace separators (NBSP, narrow NBSP, thin space,
en/em space, ideographic space, vertical tab, form feed) normalizing to
a single ASCII space instead of being deleted.
- Mixed paragraph + Unicode whitespace realistic input ("Section\u00a01\r\n\r\nBody\ftext\u202Fhere").
- Tab collapsing and space trimming around newlines.
- Non-whitespace non-ASCII characters (copyright, accented letters, emoji)
sitting between spaces: must not leave an interior double space, and
clean_text must be idempotent on these inputs.
- Non-ASCII characters adjacent to a newline: stripping must not leave
stray leading or trailing spaces on the neighbouring line, and must not
swallow an adjacent paragraph break.
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
- Fix test file: use return_tokenized instead of return_tensors
- Fix test file: use text_dataset instead of undefined dataset variable
- Move parameter validation to constructor (fail fast on invalid params)
- Add labels field in tokenized output for causal LM training
- Add empty file handling with clear error message
- Add tests for constructor validation and labels field
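A rough sketch of the fail-fast constructor validation and the `labels` field (class and parameter names are hypothetical; the actual dataset class differs):

```python
class TextDataset:
    def __init__(self, path, max_length=2048):
        # Fail fast: validate parameters in the constructor rather than
        # at first use, so bad configs surface immediately.
        if max_length <= 0:
            raise ValueError(f"max_length must be positive, got {max_length}")
        self.path = path
        self.max_length = max_length

    def tokenize(self, input_ids):
        # For causal LM training, labels start as a copy of input_ids;
        # the one-token shift happens inside the model's loss.
        return {"input_ids": input_ids, "labels": list(input_ids)}
```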