What If We Remove Tokenization In LLMs?

Introduction to Tokenization in LLMs 00:00

  • AI chatbots process text as a sequence of tokens—discrete units from a predefined vocabulary—not as raw characters.
  • Using individual characters is semantically weak, while whole words are inefficient for handling rare words, typos, or new terms.
  • Tokenization breaks text into manageable subword units, balancing semantic meaning against vocabulary size (see the sketch after this list).
  • The subword approach enables efficient processing, but it introduces challenges such as difficulty with character counting and basic arithmetic.
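To make the subword idea concrete, here is a minimal, hypothetical sketch of greedy longest-match subword tokenization. The vocabulary and the `tokenize` helper are invented for illustration; production tokenizers (e.g., trained BPE or SentencePiece models) learn their vocabularies from large corpora.

```python
# Toy greedy longest-match subword tokenizer (illustration only).
# The vocabulary below is invented; real tokenizers learn theirs from data.
VOCAB = {"un", "token", "iz", "ation", "able", "believ",
         "a", "b", "e", "i", "l", "n", "o", "s", "t", "u", "z"}

def tokenize(text: str, vocab=VOCAB) -> list[str]:
    """Greedily match the longest vocabulary entry at each position."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):       # longest candidate first
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])              # unknown-character fallback
            i += 1
    return tokens

print(tokenize("untokenizable"))   # ['un', 'token', 'iz', 'able']
print(tokenize("unbelievable"))    # ['un', 'believ', 'able']
```

Rare or novel words decompose into known pieces instead of a single unknown token, which is exactly the trade-off described above: the pieces carry some meaning, but the model never sees the raw characters directly.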

Issues with Tokenization 00:43

  • Tokens are artificial abstractions that limit an LLM's understanding of anything below the token level, such as individual characters.
  • Tokenization is a separate pre-processing step requiring standalone training, which can bias models toward languages (like English) dominant in the training data.
  • Non-English languages suffer from oversegmentation, producing longer, less meaningful token sequences and reduced performance (a rough illustration follows after this list).
  • Minor typos or language variations can drastically change the tokenization, confusing the model and making jailbreaks easier.
  • Computational resources are allocated equally across tokens, which is inefficient: a simple punctuation mark gets the same compute as an information-dense token.
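A rough way to see the oversegmentation and unequal-information effects (not shown in the video) is to count tokens per character for the same sentence in different languages with an off-the-shelf tokenizer. The tiktoken library and the cl100k_base encoding are convenience assumptions; exact counts vary by tokenizer.

```python
# Requires: pip install tiktoken
# Counts vary by tokenizer, but English-centric vocabularies usually spend
# noticeably more tokens per character on non-Latin scripts.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # assumed off-the-shelf encoding

samples = {
    "English": "The weather is nice today.",
    "German":  "Das Wetter ist heute schön.",
    "Hindi":   "आज मौसम अच्छा है।",
    "Thai":    "วันนี้อากาศดีมาก",
}

for lang, text in samples.items():
    n_tokens = len(enc.encode(text))
    print(f"{lang:8s} chars={len(text):3d} tokens={n_tokens:3d} "
          f"tokens/char={n_tokens / len(text):.2f}")
```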

Introduction to BLT (Byte Latent Transformer) 03:59

  • BLT is a tokenizer-free architecture that processes raw bytes directly instead of tokens.
  • It groups bytes into "patches": dynamically formed units with no fixed vocabulary, which lets the model allocate more computation to semantically significant content.
  • Because patch boundaries are chosen dynamically, compute is focused where it is needed, improving efficiency.

BLT Patch Segmentation Mechanism 04:54

  • BLT segments byte sequences into patches based on the entropy of predicting the next byte (entropy-based patching).
  • Global constraint: a new patch starts when the entropy of the next byte exceeds a fixed threshold (high uncertainty).
  • Monotonic constraint: a new patch also starts when the entropy jumps sharply relative to the previous byte, breaking the roughly monotonic decrease seen within a predictable span.
  • Predictable sequences get longer patches (fewer steps); unpredictable, information-rich segments get shorter patches (more computational focus). A minimal sketch of this rule follows below.
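A minimal sketch of the entropy-based patching rule, assuming per-byte entropies have already been produced by a small byte-level language model (not shown here). The function name and threshold values are illustrative, not the paper's settings.

```python
# Entropy-based patch segmentation (sketch). entropies[t] is the entropy of
# the small byte LM's prediction for byte t; thresholds are illustrative.
def segment_into_patches(entropies, theta_global=2.5, theta_relative=0.5):
    """Return the start index of each patch over a byte sequence.

    A new patch starts at byte t if either
      * H[t] > theta_global             (the model is very uncertain), or
      * H[t] - H[t-1] > theta_relative  (entropy jumps relative to the
                                         previous byte).
    """
    boundaries = [0]
    for t in range(1, len(entropies)):
        if (entropies[t] > theta_global
                or entropies[t] - entropies[t - 1] > theta_relative):
            boundaries.append(t)
    return boundaries

# Entropy spikes at positions 4 and 9 open new patches; the low, flat region
# in between is absorbed into one long patch.
H = [3.1, 1.2, 0.8, 0.5, 2.9, 1.0, 0.7, 0.6, 0.5, 2.8, 1.1]
starts = segment_into_patches(H)
patches = list(zip(starts, starts[1:] + [len(H)]))
print(starts)   # [0, 4, 9]
print(patches)  # [(0, 4), (4, 9), (9, 11)]
```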

Building Meaningful Patches and Architecture Details 05:51

  • Each byte is embedded alongside its local context (n-grams) for richer semantic representation.
  • N-grams are mapped into a fixed number of hash buckets using a rolling polynomial hash, which keeps the effective vocabulary from growing unmanageably (sketched after this list).
  • The architecture combines a lightweight local encoder, which builds patch representations from byte embeddings, with a large latent global transformer that models the sequence at the patch level.
  • Instead of next-byte prediction, the model predicts the next patch representation.
  • The model concludes with a local decoder to reconstruct the byte sequence back to text.
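Below is a minimal sketch of hashed n-gram byte embeddings in the spirit of BLT's local encoder. The hash constants, table size, n-gram sizes, and module layout are illustrative assumptions, not the released implementation.

```python
# Hashed n-gram byte embeddings (sketch). Each byte's embedding is enriched
# with embeddings of the n-grams that end at it, looked up via a rolling
# polynomial hash into a fixed-size table instead of an open vocabulary.
import torch
import torch.nn as nn

NUM_BUCKETS = 2 ** 15     # fixed hash-table size (assumed value)
EMBED_DIM = 64
PRIME = 1_000_003         # multiplier for the rolling polynomial hash

def poly_hash(ngram: bytes, num_buckets: int = NUM_BUCKETS) -> int:
    """Rolling polynomial hash: h = (h * PRIME + byte) mod num_buckets."""
    h = 0
    for b in ngram:
        h = (h * PRIME + b) % num_buckets
    return h

class HashedNgramEmbedding(nn.Module):
    """Embed each byte together with the n-grams ending at its position."""
    def __init__(self, ngram_sizes=(3, 4, 5)):
        super().__init__()
        self.ngram_sizes = ngram_sizes
        self.byte_embed = nn.Embedding(256, EMBED_DIM)
        self.ngram_embed = nn.Embedding(NUM_BUCKETS, EMBED_DIM)

    def forward(self, byte_seq: bytes) -> torch.Tensor:
        ids = torch.tensor(list(byte_seq))
        out = self.byte_embed(ids)                      # (T, EMBED_DIM)
        for n in self.ngram_sizes:
            # Hash every n-gram that ends at position t >= n - 1.
            buckets = torch.tensor(
                [poly_hash(byte_seq[t - n + 1:t + 1])
                 for t in range(n - 1, len(byte_seq))]
            )
            ngram_vecs = self.ngram_embed(buckets)      # (T - n + 1, EMBED_DIM)
            out = torch.cat([out[:n - 1], out[n - 1:] + ngram_vecs], dim=0)
        return out

emb = HashedNgramEmbedding()
print(emb(b"tokenization-free").shape)  # torch.Size([17, 64])
```

These per-byte embeddings are what a local encoder would then pool into patch representations before handing them to the global transformer.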

Efficiency and Performance of BLT 07:20

  • BLT matches Llama 3 performance while using up to 50% fewer FLOPs at inference.
  • Dynamic patching and patch-representation prediction deliver substantial inference-efficiency gains without a loss of quality.
  • Average patch size can be increased for further efficiency gains, a lever that fixed-vocabulary, token-based LLMs do not have (rough arithmetic below).
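The patch-size lever can be made concrete with back-of-envelope arithmetic (the bytes-per-token and patch-size numbers below are illustrative assumptions, not the paper's measurements): the large latent transformer runs once per patch, so longer average patches mean fewer expensive forward steps for the same text.

```python
# Back-of-envelope comparison of how many "expensive" global-transformer steps
# are needed for the same text (illustrative numbers only).
text_bytes = 1_000_000              # size of the input in bytes

avg_bytes_per_token = 4             # rough average for a subword tokenizer
token_steps = text_bytes / avg_bytes_per_token

for avg_patch_bytes in (4, 6, 8):
    patch_steps = text_bytes / avg_patch_bytes
    print(f"avg patch {avg_patch_bytes} bytes: {patch_steps:,.0f} global steps "
          f"({patch_steps / token_steps:.0%} of the token-based count)")
```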

Research Outcomes and Advantages 08:04

  • This research represents the first FLOPs-controlled scaling study for byte-level models up to 8B parameters and 4 trillion training bytes, with an open-source release.
  • BLT improves performance on tasks needing subword awareness: orthographic knowledge, phonology, and low-resource machine translation.
  • BLT avoids tokenization issues, improving rare word handling and multilingual capabilities—areas problematic for current models.

Related Research and Closing Remarks 08:34

  • Newer research also proposes compressed chunks of raw bytes as processing units, but such approaches struggle with languages that lack explicit spaces (e.g., Chinese).
  • These architectures offer the ability to predict multiple bytes or words at once.
  • For more details, viewers are directed to the findmypapers.ai website.