Skip to content

goliajp/rust-tiktoken

Repository files navigation

rust-tiktoken

tiktoken on crates.io tiktoken-wasm on npm CI License

English | 简体中文 | 日本語

The fastest Rust BPE tokenizer, plus its WebAssembly bindings. Drop-in compatible with OpenAI tiktoken and the mainstream open models (Llama 3, DeepSeek, Qwen, Mistral). A hand-written ASCII fast-path makes ASCII text 15–40x faster than tiktoken-rs (≈2x on CJK/Unicode).

Crates in this workspace

Path Crate / Package Description Version
tiktoken/ tiktoken Rust BPE tokenizer — 11 encodings, 68 models, multi-provider pricing crates.io
tiktoken-wasm/ tiktoken-wasm (Rust) WASM binding crate for the above crates.io
tiktoken-wasm/ @goliapkg/tiktoken-wasm (npm) Same, published to npm via wasm-pack npm

The two crates live in one workspace and are versioned in lockstep — every release bumps and publishes both at the same version number.

Highlights

  • ASCII fast-path pre-tokenizer — resolves the common ASCII pieces (letters, digits, punctuation, contractions) without invoking the regex engine; 2.3–5.5x faster encode / count on ASCII text for cl100k / o200k / qwen2 / deepseek. Unicode/CJK transparently falls back to the regex.
  • 11 encodings · 68 models · 7 providers — OpenAI (GPT-4/4o/4.1/4.5, GPT-5.x, o1/o3/o4-mini, gpt-oss), Llama 3/4, DeepSeek, Qwen, Mistral; plus USD cost estimation (Anthropic & Google included for pricing).
  • Lean & portable — arena-based vocabulary, hybrid linear/heap BPE merge, optional rayon parallelism, allocation-free count(), pure Rust with zero C dependencies, a tiny wasm build, and zstd-compressed vocab embedded at compile time.

Full API, supported-model tables, and benchmarks live in the per-crate READMEs: tiktoken/ · tiktoken-wasm/.

Quick start

Rust

[dependencies]
tiktoken = "3.5"
// by encoding name
let enc = tiktoken::get_encoding("cl100k_base").unwrap();
let tokens = enc.encode("hello world");
assert_eq!(enc.decode_to_string(&tokens).unwrap(), "hello world");

// count without allocating a token vector
let n = enc.count("The quick brown fox.");

// or resolve by model name
let enc = tiktoken::encoding_for_model("gpt-4o").unwrap();

WebAssembly (browser / Node.js)

npm install @goliapkg/tiktoken-wasm
import init, { getEncoding } from '@goliapkg/tiktoken-wasm'
await init()
const enc = getEncoding('o200k_base')
const tokens = enc.encode('hello world')

Performance

On an Apple M4 Mac mini, ASCII encode / count is 15–40x faster than tiktoken-rs and ~20–40x faster than Python tiktoken (≈2x on CJK/Unicode, which defers to the regex). Full per-encoding tables and methodology: tiktoken/README.md#performance.

Build

cargo test -p tiktoken
cargo fmt --all --check
cargo clippy --workspace --lib -- -D warnings

# WASM (requires wasm-pack: cargo install wasm-pack)
cd tiktoken-wasm
wasm-pack build --target web --release --scope goliapkg

Release

tiktoken and tiktoken-wasm are versioned in lockstep and released together via git-flow (no PRs):

git flow release start X.Y.Z
# bump versions to X.Y.Z: tiktoken/Cargo.toml, tiktoken-wasm/Cargo.toml
# (and its tiktoken path-dep); finalize both CHANGELOGs
git flow release finish X.Y.Z                       # merge → master, tag vX.Y.Z, back-merge develop
git tag -a tiktoken-wasm-vX.Y.Z vX.Y.Z^{commit} -m "tiktoken-wasm X.Y.Z"
git push origin master develop vX.Y.Z tiktoken-wasm-vX.Y.Z
# tag `v*` publishes the tiktoken crate; `tiktoken-wasm-v*` publishes the wasm crate + npm

License

Licensed under either of MIT or Apache-2.0, at your option.

About

High-performance Rust tiktoken BPE tokenizer + WASM bindings (@goliapkg/tiktoken-wasm)

Resources

License

Apache-2.0, MIT licenses found

Licenses found

Apache-2.0
LICENSE-APACHE
MIT
LICENSE-MIT

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages