Tokens are artificial abstractions that prevent LLMs from seeing or reasoning about text below the token level.
Tokenization is a separate pre-processing step requiring standalone training, which can bias models toward languages (like English) dominant in the training data.
Non-English languages suffer from oversegmentation, leading to longer, less meaningful sequences and reduced performance.
Minor typos or language variations can drastically change tokenization, confusing the model and making jailbreaks easier (see the sketch after this list).
Compute is allocated equally to every token, which is inefficient: simple punctuation receives the same resources as information-dense content.
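As a quick illustration of the typo point above, here is a minimal sketch assuming the tiktoken library and its cl100k_base encoding are available; the exact token ids and counts depend on the tokenizer.

```python
# Illustration: a one-character typo can change the whole token sequence.
# Assumes the tiktoken package is installed; ids and counts depend on the encoding.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for text in ["The transformer architecture", "The transfromer architecture"]:
    tokens = enc.encode(text)
    print(f"{text!r} -> {len(tokens)} tokens: {tokens}")
```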
Introduction to BLT (Byte Latent Transformer) 03:59
BLT is a tokenizer-free architecture that processes raw bytes directly instead of tokens.
It forms "patches," dynamically grouped units without a fixed vocabulary, allocating more computation to semantically significant content.
Patches are defined dynamically so compute is focused where necessary, improving efficiency.
BLT segments byte sequences into patches based on the entropy of predicting the next byte (entropy-based patching).
Global constraint: starts a new patch if entropy exceeds a threshold (high uncertainty).
Approximate monotonic constraint: starts a new patch if entropy rises relative to the previous byte, breaking the roughly decreasing trend expected within a patch (both rules are sketched in the code after this list).
Predictable sequences get longer patches (fewer steps); unpredictable, information-rich segments get shorter patches (more computational focus).
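A minimal sketch of entropy-based patching under these two rules, assuming per-byte entropies have already been computed by the small byte-level language model; the threshold values and function name are illustrative, and the two rules are combined here purely for demonstration.

```python
# Minimal sketch of entropy-based patch boundaries (thresholds and names
# are illustrative assumptions, not the paper's exact hyperparameters).
def patch_boundaries(entropies, theta_global=2.0, theta_relative=0.5):
    """Return byte indices where a new patch starts.

    entropies[i] is the entropy of the small byte LM's next-byte prediction
    at position i. A new patch begins when either:
      * global constraint: entropy exceeds theta_global, or
      * monotonic constraint: entropy rises by more than theta_relative
        over the previous byte.
    """
    starts = [0]  # the first byte always opens a patch
    for i in range(1, len(entropies)):
        if entropies[i] > theta_global:
            starts.append(i)
        elif entropies[i] - entropies[i - 1] > theta_relative:
            starts.append(i)
    return starts

# Example: a low-entropy (predictable) run followed by surprising bytes.
print(patch_boundaries([1.9, 0.4, 0.3, 0.2, 2.6, 0.8, 1.5]))  # -> [0, 4, 6]
```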
Building Meaningful Patches and Architecture Details 05:51
Each byte is embedded alongside its local context (n-grams) for richer semantic representation.
N-grams are mapped into a fixed-size hash table using a rolling polynomial hash (RollPolyHash), preventing an unmanageable n-gram vocabulary (see the sketch below).
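A minimal sketch of how byte n-grams could be hashed into a fixed-size embedding table with a rolling polynomial hash; the base, table size, and function name here are assumptions, not the paper's exact choices.

```python
# Illustrative rolling polynomial hash for byte n-grams (base, table size,
# and names are assumptions, not the paper's exact values).
def ngram_hash_ids(byte_seq: bytes, n: int,
                   table_size: int = 1 << 18, base: int = 257) -> list:
    """Map every length-n byte window to an index into a fixed hash table,
    so n-grams share `table_size` embedding rows instead of needing a
    vocabulary over all possible n-grams."""
    outgoing = pow(base, n, table_size)  # weight of the byte leaving the window
    ids, h = [], 0
    for i, b in enumerate(byte_seq):
        h = (h * base + b) % table_size            # roll the new byte in
        if i >= n:                                 # roll the oldest byte out
            h = (h - byte_seq[i - n] * outgoing) % table_size
        if i >= n - 1:                             # a full window is available
            ids.append(h)
    return ids

# Each id selects a row of an n-gram embedding table that is added to the
# corresponding byte embedding.
print(ngram_hash_ids("patching".encode("utf-8"), n=3))
```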
The architecture pairs a lightweight local encoder, a small transformer that pools byte representations into patches, with a large latent global transformer that models the sequence at the patch level.
Instead of predicting byte by byte at the global level, the latent transformer predicts the representation of the next patch.
A lightweight local decoder then maps patch representations back into the output byte sequence (a shape-level sketch of the full flow follows).
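Below is a shape-level sketch of the end-to-end flow, with mean pooling and plain linear maps standing in for the real cross-attention and transformer blocks; every dimension, weight, and boundary here is a made-up placeholder.

```python
# Shape-level sketch of the BLT data flow (pooling and linear maps stand in
# for the real cross-attention / transformer blocks; all sizes are made up).
import numpy as np

rng = np.random.default_rng(0)
d_local, d_global, vocab = 32, 64, 256        # the byte "vocab" is just 256 values

byte_seq = "patching".encode("utf-8")
starts = [0, 4]                                # patch boundaries from the entropy step
ends = starts[1:] + [len(byte_seq)]

byte_emb_table = rng.normal(size=(vocab, d_local))
W_up = rng.normal(size=(d_local, d_global))       # local encoder: bytes -> patch rep
W_global = rng.normal(size=(d_global, d_global))  # stand-in for the latent global transformer
W_down = rng.normal(size=(d_global, d_local))     # local decoder: patch rep -> byte states
W_out = rng.normal(size=(d_local, vocab))         # byte-level output head

# 1. Local encoder: embed bytes and pool each patch into one representation.
byte_states = byte_emb_table[list(byte_seq)]                  # (num_bytes, d_local)
patch_reps = np.stack([byte_states[s:e].mean(0) @ W_up        # (num_patches, d_global)
                       for s, e in zip(starts, ends)])

# 2. Latent global transformer: model the sequence at the patch level.
patch_reps = np.tanh(patch_reps @ W_global)

# 3. Local decoder: broadcast each patch representation back over its bytes
#    and produce next-byte logits.
logits = np.concatenate([np.tile(p @ W_down, (e - s, 1))      # (num_bytes, d_local)
                         for p, (s, e) in zip(patch_reps, zip(starts, ends))]) @ W_out
print(logits.shape)                                           # (len(byte_seq), 256)
```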
This research represents the first FLOPs-controlled scaling study for byte-level models up to 8B parameters and 4 trillion training bytes, with an open-source release.
BLT improves performance on tasks needing subword awareness: orthographic knowledge, phonology, and low-resource machine translation.
BLT avoids tokenization issues, improving rare word handling and multilingual capabilities—areas problematic for current models.