tokenizers 0.22.2_2

textproc/py-tokenizers

Fast state-of-the-art tokenizers optimized for research and production

Category: textproc
Maintainer: tagattie@FreeBSD.org
WWW: https://github.com/huggingface/tokenizers
License: APACHE20
USES: cargo python

Description

Provides an implementation of today's most used tokenizers, with a
focus on performance and versatility.

Main features:
- Train new vocabularies and tokenize, using today's most used
  tokenizers.
- Extremely fast (both training and tokenization), thanks to the Rust
  implementation. Takes less than 20 seconds to tokenize a GB of text
  on a server's CPU.
- Easy to use, but also extremely versatile.
- Designed for research and production.
- Normalization comes with alignments tracking. It's always possible
  to get the part of the original sentence that corresponds to a given
  token.
- Does all the pre-processing: Truncate, Pad, add the special tokens
  your model needs.

Dependencies

Commit History

may be incomplete — full history at freebsd-ports on GitHub

Loading commit history — this may take up to a minute on first view. Reload the page in a moment.