tokenizers 0.22.2_2
textproc/py-tokenizers
Fast state-of-the-art tokenizers optimized for research and production
Description
Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.

Main features:
- Train new vocabularies and tokenize, using today's most used tokenizers.
- Extremely fast (both training and tokenization), thanks to the Rust implementation: tokenizing a GB of text takes less than 20 seconds on a server's CPU.
- Easy to use, but also extremely versatile.
- Designed for research and production.
- Normalization comes with alignment tracking, so it is always possible to recover the part of the original sentence that corresponds to a given token.
- Does all the pre-processing: truncate, pad, and add the special tokens your model needs.
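A minimal sketch of those features using the Python bindings this port installs. It assumes the package imports as tokenizers; the corpus, length limit, and special-token names are illustrative only:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Build a tokenizer around a BPE model with an unknown-token fallback.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Train a new vocabulary from an in-memory corpus (illustrative data).
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]"])
corpus = ["FreeBSD is an operating system.", "Tokenizers split text into tokens."]
tokenizer.train_from_iterator(corpus, trainer=trainer)

# Pre-processing is handled by the tokenizer itself: truncation and padding.
tokenizer.enable_truncation(max_length=16)
tokenizer.enable_padding(pad_token="[PAD]", pad_id=tokenizer.token_to_id("[PAD]"))

# Offsets map every token back to its span in the original sentence,
# which is the alignment tracking mentioned above.
text = "FreeBSD is an operating system."
encoding = tokenizer.encode(text)
for token, (start, end) in zip(encoding.tokens, encoding.offsets):
    print(token, "->", text[start:end])
```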
Dependencies
- build devel/pkgconf
- build devel/py-build
- build devel/py-installer
- build devel/py-maturin
- build lang/python311
- build lang/rust
- lib devel/oniguruma
- run lang/python311
- run misc/py-huggingface-hub
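The run dependency on misc/py-huggingface-hub supports fetching pretrained tokenizer definitions from the Hugging Face Hub. A minimal sketch, assuming network access; the repository name is only an example:

```python
from tokenizers import Tokenizer

# Fetch a pretrained tokenizer definition from the Hugging Face Hub
# ("bert-base-uncased" is an example repository; network access required).
tok = Tokenizer.from_pretrained("bert-base-uncased")
print(tok.encode("Hello from FreeBSD").tokens)
```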
Commit History
Full history is available in the freebsd-ports repository on GitHub.