
How Tokenization Works: BPE and the Algorithm Behind Your LLM

Learn how Byte Pair Encoding (BPE) actually works — the algorithm that powers GPT, Claude, and LLaMA tokenizers. Step-by-step with examples.


Every time you send a message to GPT-4 or Claude, an algorithm from 1994 decides how much you'll pay.

That algorithm is Byte Pair Encoding — BPE for short. It's not glamorous, but it's running under the hood of nearly every modern LLM. Once you understand how it works, a lot of tokenization mysteries start making sense.


Why You Should Care About the Algorithm

In the previous post, we covered what tokens are and why they matter for costs. But we left a question hanging: how does the tokenizer decide that "playing" becomes ['play', 'ing'] instead of ['pla', 'ying'] or just ['playing']?

The answer is BPE. And understanding it helps you:

  • Debug weird tokenization behavior

  • Understand why newer models are more efficient

  • Know why some text costs more than expected

  • Make sense of vocabulary sizes like "50,257" or "200,000"

A note on model references

This post mentions GPT-2, GPT-4, and GPT-4o rather than the very latest releases. That's intentional.

Tokenization internals — vocabulary size, merge strategies, encodings — are only reliable when publicly documented or verifiable via tooling like tiktoken. For newer models, those details are often abstracted away.

The core mechanics haven't changed: modern models still use subword vocabularies learned via BPE, and newer encodings generally expand vocabulary to reduce token counts (especially for code and multilingual text).

Model names evolve. The principles here stay accurate.


A Brief History

BPE wasn't invented for language models. Philip Gage introduced it in 1994 as a data compression technique — a way to shrink files by replacing common byte sequences with shorter codes.

In 2015, researchers adapted it for machine translation. The insight: instead of compressing files, use BPE to break words into subword pieces. This let translation models handle rare and compound words without an exploding vocabulary.

Then OpenAI used it for GPT. And GPT-2. And GPT-3. And GPT-4. Today, nearly every major LLM — GPT, Claude, LLaMA, Mistral — uses some form of BPE.


How BPE Works: The Core Idea

BPE has two phases:

  1. Training — Learn which character pairs to merge by analyzing a corpus

  2. Tokenization — Apply those learned merges to new text

The training phase is where the magic happens. Let's walk through it.


Phase 1: Training the Tokenizer

Step 1: Pre-tokenize

Start with some training text. Let's use a simple example:

"low low low low low lower lower newest newest newest newest newest newest widest widest widest"

First, split into words and add an end-of-word marker _. This marker prevents the algorithm from merging across word boundaries.

(low_: 5, lower_: 2, newest_: 6, widest_: 3)

The numbers are frequencies — how often each word appears.

Step 2: Create Base Vocabulary

Start with every unique character as a separate token:

vocab = {l, o, w, e, r, n, s, t, i, d, _}

Now represent each word as a sequence of these characters:

(l, o, w, _): 5
(l, o, w, e, r, _): 2
(n, e, w, e, s, t, _): 6
(w, i, d, e, s, t, _): 3

Step 3: Merge the Most Frequent Pair

This is the heart of BPE. Count every adjacent pair of characters across all words, weighted by frequency.

The pair (e, s) appears in "newest_" (6 times) and "widest_" (3 times) = 9 total occurrences. It actually ties with (s, t) and (t, _), which also occur 9 times; implementations break such ties deterministically, and here we pick (e, s).

Merge (e, s) into a new token es. Update the vocabulary and all word representations:

vocab = {l, o, w, e, r, n, s, t, i, d, _, es}

(l, o, w, _): 5
(l, o, w, e, r, _): 2
(n, e, w, es, t, _): 6
(w, i, d, es, t, _): 3
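The weighted pair counting in Step 3 fits in a few lines of Python — a minimal sketch using the word frequencies from our running example:

```python
from collections import Counter

# Word frequencies from the running example, each word as a symbol tuple.
word_freqs = {
    ("l", "o", "w", "_"): 5,
    ("l", "o", "w", "e", "r", "_"): 2,
    ("n", "e", "w", "e", "s", "t", "_"): 6,
    ("w", "i", "d", "e", "s", "t", "_"): 3,
}

# Count every adjacent symbol pair, weighted by how often the word occurs.
pair_counts = Counter()
for symbols, freq in word_freqs.items():
    for pair in zip(symbols, symbols[1:]):
        pair_counts[pair] += freq

print(pair_counts[("e", "s")])  # 9
```

Note that (s, t) and (t, _) also reach 9 here, which is why tie-breaking matters in real tokenizer training.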

Step 4: Repeat

Keep merging the most frequent pair:

Merge 2: (es, t) → est (appears 9 times)

vocab = {..., es, est}

Merge 3: (est, _) → est_ (appears 9 times)

vocab = {..., est, est_}

Merge 4: (l, o) → lo (appears 7 times)

vocab = {..., lo}

Merge 5: (lo, w) → low (appears 7 times)

vocab = {..., low}

After 5 merges, our vocabulary and merge rules are:

vocab = {l, o, w, e, r, n, s, t, i, d, _, es, est, est_, lo, low}

Merge rules (in order):
(e, s) → es
(es, t) → est
(est, _) → est_
(l, o) → lo
(lo, w) → low

This continues until the vocabulary reaches a target size — 50,000 merges for GPT-2 (a 50,257-token vocabulary), and roughly 100,000 tokens for GPT-4's cl100k_base.
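The whole training phase can be condensed into one loop. This is a didactic sketch, not production tokenizer code — it assumes greedy most-frequent merging with first-seen tie-breaking, which happens to reproduce the five merges above:

```python
from collections import Counter

def train_bpe(word_freqs, num_merges):
    """Learn BPE merge rules from {symbol-tuple: frequency} word counts."""
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        counts = Counter()
        for symbols, freq in word_freqs.items():
            for pair in zip(symbols, symbols[1:]):
                counts[pair] += freq
        if not counts:
            break
        # Most frequent pair; max() keeps the first-seen pair on ties.
        a, b = max(counts, key=counts.get)
        merges.append((a, b))
        # Rewrite every word with the newly merged symbol.
        new_freqs = {}
        for symbols, freq in word_freqs.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                    out.append(a + b)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_freqs[tuple(out)] = freq
        word_freqs = new_freqs
    return merges

corpus = {
    ("l", "o", "w", "_"): 5,
    ("l", "o", "w", "e", "r", "_"): 2,
    ("n", "e", "w", "e", "s", "t", "_"): 6,
    ("w", "i", "d", "e", "s", "t", "_"): 3,
}
print(train_bpe(corpus, 5))
# [('e', 's'), ('es', 't'), ('est', '_'), ('l', 'o'), ('lo', 'w')]
```

Real implementations use incremental pair-count updates instead of recounting from scratch each iteration, but the logic is the same.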


Phase 2: Tokenizing New Text

Now we have a trained tokenizer. Let's use it on new text:

"newest binded lowers"

Step 1: Pre-tokenize

(newest_, binded_, lowers_)

Step 2: Break into Characters

(n, e, w, e, s, t, _)
(b, i, n, d, e, d, _)
(l, o, w, e, r, s, _)

Step 3: Apply Merge Rules in Order

Apply the learned merges in the exact order they were learned:

Apply (e, s) → es:
(n, e, w, es, t, _), (b, i, n, d, e, d, _), (l, o, w, e, r, s, _)

Apply (es, t) → est:
(n, e, w, est, _), (b, i, n, d, e, d, _), (l, o, w, e, r, s, _)

Apply (est, _) → est_:
(n, e, w, est_), (b, i, n, d, e, d, _), (l, o, w, e, r, s, _)

Apply (l, o) → lo:
(n, e, w, est_), (b, i, n, d, e, d, _), (lo, w, e, r, s, _)

Apply (lo, w) → low:
(n, e, w, est_), (b, i, n, d, e, d, _), (low, e, r, s, _)

Step 4: Handle Unknown Characters

Notice b wasn't in our vocabulary. In word-level BPE, unknown characters become [UNK]:

Final tokens: [n, e, w, est_, [UNK], i, n, d, e, d, _, low, e, r, s, _]

And that's BPE. Deceptively simple, but it explains a lot about how tokenization works.
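Phase 2 is mechanical enough to sketch directly: apply each merge rule, in training order, to a symbol sequence. A minimal version, using the merge list learned above:

```python
def apply_merges(symbols, merges):
    # Apply each learned merge rule, in training order, to a symbol list.
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)  # merge the pair into one symbol
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

merges = [("e", "s"), ("es", "t"), ("est", "_"), ("l", "o"), ("lo", "w")]
print(apply_merges(list("newest") + ["_"], merges))  # ['n', 'e', 'w', 'est_']
print(apply_merges(list("lowers") + ["_"], merges))  # ['low', 'e', 'r', 's', '_']
```

Production tokenizers like tiktoken do the same thing with heavily optimized data structures, but the order-of-merges rule is identical.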


Byte-Level BPE: What Modern LLMs Actually Use

The example above was "word-level" BPE — starting from characters. But GPT-2 and later models use byte-level BPE, which is slightly different.

Instead of starting with characters, byte-level BPE starts with 256 raw bytes (0x00 to 0xFF) as the base vocabulary.

text = "Hello"
bytes_list = list(text.encode("utf-8"))
# [72, 101, 108, 108, 111]

Every byte is a number from 0-255. This means:

  • No unknown tokens — any text can be represented as bytes

  • Works for any language without special handling

  • Emojis, special characters — all just byte sequences

The tradeoff: non-ASCII characters use multiple bytes. An emoji like 😀 is 4 bytes in UTF-8, which means more base tokens before merging.

"Hello".encode("utf-8")   # 5 bytes
"你好".encode("utf-8")     # 6 bytes for 2 characters
"😀".encode("utf-8")       # 4 bytes for 1 emoji

This is why non-English text and emojis cost more tokens — they start with more bytes before BPE merging happens.
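The "no unknown tokens" claim is easy to verify: every string round-trips through its UTF-8 bytes, and every one of those bytes sits in the 256-entry base vocabulary.

```python
# A byte-level base vocabulary means any string round-trips through
# bytes 0-255, so there is never an unknown token.
for text in ["Hello", "你好", "😀"]:
    byte_ids = list(text.encode("utf-8"))
    assert all(0 <= b <= 255 for b in byte_ids)
    assert bytes(byte_ids).decode("utf-8") == text

print(len("你好".encode("utf-8")))  # 6
```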


The Regex Trick That Makes BPE Actually Work

Here's something most tutorials skip: raw byte-level BPE creates garbage tokens.

Consider this text appearing many times in training:

"barking. barking. barking."

Without any preprocessing, BPE might learn to merge g and . into a g. token — because they appear together frequently. But g. is useless. It's not a meaningful subword.

GPT-2 solved this with regex pre-tokenization. Before applying BPE, split the text into chunks using a regex pattern:

import re
pattern = r"""'s|'t|'re|'ve|'m|'ll|'d| ?\w+| ?\d+| ?[^\s\w\d]+|\s+"""

text = "Dog is barking. barking."
chunks = re.findall(pattern, text)
# ['Dog', ' is', ' barking', '.', ' barking', '.']

Now BPE merges happen within each chunk, not across them:

  • barking is one chunk

  • . is a separate chunk

  • The pair (g, .) never appears together — no garbage g. token

This is why GPT tokenizers learn useful subwords like ing, ed, pre instead of junk like g. that straddles a word and its punctuation.
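You can check the claim directly: with the simplified pattern above (the real GPT-2 pattern uses Unicode categories via the third-party regex module), no within-chunk pair ever crosses the word/punctuation boundary.

```python
import re

# Simplified GPT-2-style pre-tokenization pattern (sketch; real GPT-2
# uses \p{L}/\p{N} Unicode categories instead of \w).
pattern = r"'s|'t|'re|'ve|'m|'ll|'d| ?\w+| ?\d+| ?[^\s\w\d]+|\s+"

chunks = re.findall(pattern, "barking. barking. barking.")
# ['barking', '.', ' barking', '.', ' barking', '.']

# BPE pair counting runs inside each chunk, so (g, .) never forms a pair.
pairs = {(c[i], c[i + 1]) for c in chunks for i in range(len(c) - 1)}
print(("g", ".") in pairs)  # False
```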


Why Different Models Have Different Tokenizers

A question that confused me early on: why can't you use GPT-2's tokenizer for GPT-4?

Because the merge rules are learned from training data. Different training data → different merges → different tokenization.

GPT-2 was trained on web text from 2019. GPT-4 was trained on much more data, including tons of code. So GPT-4's tokenizer learned merges for patterns like "def ", "import ", and "return " that GPT-2 never saw often enough to merge.

Same code, fewer tokens in GPT-4:

code = "def calculate():"

# GPT-2 tokenizer: ~6 tokens
# GPT-4 tokenizer: ~3 tokens (learned "def " as one token)

Also, tokenizers are frozen with model weights. During training, the model learned that token ID 256 means a specific thing. If you swap tokenizers, ID 256 now means something else — complete gibberish.


Other Tokenization Algorithms (Brief Overview)

BPE isn't the only approach. Here's the landscape:

WordPiece — Used by BERT. Similar to BPE, but chooses merges by likelihood score instead of raw frequency. Uses ## to mark subword continuations (un, ##believ, ##able). Mostly legacy now.

SentencePiece — Not an algorithm, but a library. Implements BPE and Unigram. Key feature: treats text as a raw stream, so it works for languages without spaces (Chinese, Japanese). Used by T5, LLaMA, Mistral.

Unigram — The opposite of BPE. Starts with a huge vocabulary and prunes down instead of building up. Niche use.

The production reality: the vast majority of modern LLMs use some form of BPE. If you understand BPE, you understand most tokenizers.


Vocabulary Sizes Explained

You'll see numbers like 50,257 or 200,000 for vocabulary sizes. Here's what they mean:

GPT-2: 50,257 tokens

  • 256 base bytes

  • 50,000 learned merges

  • 1 special token (<|endoftext|>)
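That 50,257 isn't arbitrary — the three parts above sum to it exactly:

```python
# GPT-2's vocabulary size decomposes exactly into its three parts.
base_bytes = 256         # byte-level base vocabulary
learned_merges = 50_000  # merge rules learned during training
special_tokens = 1       # <|endoftext|>

vocab_size = base_bytes + learned_merges + special_tokens
print(vocab_size)  # 50257
```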

GPT-4 (cl100k_base): ~100,000 tokens

  • More training data → more merges learned

  • Better coverage of code, multilingual text

GPT-4o (o200k_base): ~200,000 tokens

  • Even more merges

  • Significantly better for code and non-English

  • Same text = fewer tokens = lower cost

Larger vocabulary = more merges = longer tokens for common patterns = fewer tokens for the same text. The cost: a bigger embedding matrix and a more expensive output softmax. It's a tradeoff.


Things to Ponder

Take a moment to think through these. They're designed to check if the core ideas stuck, and you'll find the answers in what we covered above.

  1. BPE merges by frequency. WordPiece merges by likelihood. Both work. Why did most modern LLMs pick BPE?

  2. GPT-4 uses cl100k_base encoding. GPT-4o uses o200k_base. Why create a new encoding instead of updating the old one?

  3. LLaMA has 32K vocabulary. GPT-4 has 100K. Smaller vocab means what trade-off?

  4. You fine-tune a model on medical documents. Can you teach the tokenizer to treat "myocardial infarction" as one token?

  5. GPT-4o made Chinese significantly cheaper than GPT-4. Same BPE algorithm. What changed?


Key Takeaways

  • BPE learns merge rules by repeatedly combining the most frequent adjacent pairs in training data

  • Tokenization applies those merges in order to new text

  • Modern LLMs use byte-level BPE — starting with 256 bytes instead of characters

  • Regex pre-tokenization prevents garbage tokens that straddle words and punctuation, like g.

  • Different models have different tokenizers because they learned different merges from different training data

  • Vocabulary size = 256 base bytes + N merges + special tokens

That 1994 compression algorithm is still deciding your API bill. Now you know how.


Want to discuss this further or have questions? Hit me up on LinkedIn.

LLM Foundations

Part 4 of 5

Breaking down how LLMs actually work — tokens, embeddings, context windows, and the fundamentals you need before building anything serious.
