Update tokenizer mappings to use TokenizersBackend for additional models#46091
itazap wants to merge 15 commits into
pad_len = len(expected_input_ids_2) - len(expected_input_ids_1)

- expected_attention_mask = [[0] * pad_len + [1] * len(expected_input_ids_1), [1] * len(expected_input_ids_2)]
+ expected_attention_mask = [[1] * len(expected_input_ids_1) + [0] * pad_len, [1] * len(expected_input_ids_2)]
this test was changed a few months ago and I'm changing it back
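For context, the two expectations in the hunk above differ only in which side the shorter sequence is padded on. A minimal sketch of that difference, where `build_attention_masks` is a hypothetical helper written for illustration and not something from the test file:

```python
def build_attention_masks(lengths, padding_side="right"):
    """Return one 0/1 attention mask per sequence, padded to the longest length.

    `lengths` holds the number of real tokens in each sequence.
    This is an illustrative helper, not part of transformers.
    """
    if padding_side not in ("left", "right"):
        raise ValueError(f"unknown padding_side: {padding_side!r}")
    max_len = max(lengths)
    masks = []
    for length in lengths:
        pad = [0] * (max_len - length)  # positions the model should ignore
        real = [1] * length             # positions holding real tokens
        masks.append(real + pad if padding_side == "right" else pad + real)
    return masks
```

With a two-token and a four-token sequence, right padding puts the zeros after the ones in the first mask, and left padding puts them before; the longer sequence is all ones either way.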
run-slow: aria, auto

This comment contains models: ["models/aria", "models/auto"]

CI Results
The test failure analysis could not be completed. Please check the workflow run for details.
ArthurZucker
left a comment
Nice! Well, given how many used gpt2, I am not really surprised
if (
    tokenizer_auto_map is None
    and TokenizersBackend is not None
    and _config_name_or_path.startswith("deepseek-ai/deepseek-r1")
no no let's support a new kind of entry "deepseek-ai/deepseek-r1" not only this one.
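One way to read this suggestion is a table of repo-id prefixes instead of a hard-coded `startswith` check. The sketch below is only an assumption about what such an entry could look like: `PREFIX_TOKENIZER_OVERRIDES` and `resolve_prefix_override` are hypothetical names, and the case-insensitive, longest-prefix matching is a design choice made here, not something the PR confirms.

```python
# Hypothetical table: repo-id prefix -> name of the tokenizer class to use.
PREFIX_TOKENIZER_OVERRIDES = {
    "deepseek-ai/deepseek-r1": "TokenizersBackend",
}


def resolve_prefix_override(name_or_path, overrides=PREFIX_TOKENIZER_OVERRIDES):
    """Return the override for the longest matching prefix, or None if none match.

    Matching is case-insensitive (an assumption of this sketch).
    """
    lowered = name_or_path.lower()
    matches = [prefix for prefix in overrides if lowered.startswith(prefix.lower())]
    if not matches:
        return None
    # Prefer the most specific (longest) prefix when several match.
    return overrides[max(matches, key=len)]
```

Adding another family of checkpoints would then be a one-line change to the table rather than another branch in the condition.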
"rhymes-ai/Aria",
"Salesforce/blip2-flan-t5-xl",
"google/bigbird-pegasus-large-pubmed",
"microsoft/kosmos-2-patch14-224",
"allenai/OLMo-2-0425-1B",
"stabilityai/tiny-random-stablelm-2",
these are the ones we should have in PER_MODEL_ID_TOKENIZER_FIX no?
these are ones that don't have a dedicated tokenizer class, so the mapping was updated to always map the model_types (aria, blip, bigbird_pegasus, etc.) to TokenizersBackend. so it's not just these checkpoints in theory! these are just the checkpoints that we use in transformers to test these model_types. lmk if that makes sense!
FYI these are all the model checkpoints we use in transformers that we don't directly test tokenization for. We should add a test somewhere that won't clog the CI to do a simple check, would have saved us a lot of trouble!

    self.assertEqual(
        tokenizer_tok(text, add_special_tokens=False)["input_ids"],
        tokenizer_auto(text, add_special_tokens=False)["input_ids"],
    )

BAAI/Emu3-Chat-hf
…tokenizer_class Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
View the CircleCI Test Summary for this PR: https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=46091&sha=76ff95
^ idk how the new PR CI git works bc it is green https://app.circleci.com/pipelines/github/huggingface/transformers?branch=tokenizers_backend_update |
ArthurZucker
left a comment
Ty! 🤗 I would go as far as to add tests for the checkpoints in TOKENIZERS_BACKEND_AUTO_MAPPING_CHECKPOINTS! 🤗
@slow
@require_tokenizers
@parameterized.expand(TOKENIZERS_BACKEND_AUTO_MAPPING_CHECKPOINTS)
[For maintainers] Suggested jobs to run (before merge)
run-slow: aria, auto
see the description / report here: #45936
this PR is just the auto changes from the PR above. it's for models that don't have their own Tokenizer class, so we don't have test_tokenization_*.py for these models.

fixes: #45920, #46710, #45488, #46489