
How Tokenization Works: BPE and the Algorithm Behind Your LLM

Learn how Byte Pair Encoding (BPE) actually works — the algorithm that powers GPT, Claude, and LLaMA tokenizers. Step-by-step with examples.


Every time you send a message to GPT-4 or Claude, an algorithm from 1994 decides how much you'll pay.

That algorithm is Byte Pair Encoding — BPE for short. It's not glamorous, but it's running under the hood of nearly every modern LLM. Once you understand how it works, a lot of tokenization mysteries start making sense.


Why You Should Care About the Algorithm

In the previous post, we covered what tokens are and why they matter for costs. But we left a question hanging: how does the tokenizer decide that "playing" becomes ['play', 'ing'] instead of ['pla', 'ying'] or just ['playing']?

The answer is BPE. And understanding it helps you:

  • Debug weird tokenization behavior

  • Understand why newer models are more efficient

  • Know why some text costs more than expected

  • Make sense of vocabulary sizes like "50,257" or "200,000"

A note on model references

This post mentions GPT-2, GPT-4, and GPT-4o rather than the very latest releases. That's intentional.

Tokenization internals — vocabulary size, merge strategies, encodings — are only reliable when publicly documented or verifiable via tooling like tiktoken. For newer models, those details are often abstracted away.

The core mechanics haven't changed: modern models still use subword vocabularies learned via BPE, and newer encodings generally expand vocabulary to reduce token counts (especially for code and multilingual text).

Model names evolve. The principles here stay accurate.


A Brief History

BPE wasn't invented for language models. Philip Gage introduced it in 1994 as a data compression technique — a way to shrink files by replacing common byte sequences with shorter codes.

In 2015, researchers adapted it for machine translation. The insight: instead of compressing files, use BPE to break words into subword pieces. This let translation models handle rare and compound words without an exploding vocabulary.

Then OpenAI used it for GPT. And GPT-2. And GPT-3. And GPT-4. Today, nearly every major LLM — GPT, Claude, LLaMA, Mistral — uses some form of BPE.


How BPE Works: The Core Idea

BPE has two phases:

  1. Training — Learn which character pairs to merge by analyzing a corpus

  2. Tokenization — Apply those learned merges to new text

The training phase is where the magic happens. Let's walk through it.


Phase 1: Training the Tokenizer

Step 1: Pre-tokenize

Start with some training text. Let's use a simple example:

"low low low low low lower lower newest newest newest newest newest newest widest widest widest"

First, split into words and add an end-of-word marker _. This marker prevents the algorithm from merging across word boundaries.

(low_: 5, lower_: 2, newest_: 6, widest_: 3)

The numbers are frequencies — how often each word appears.

Step 2: Create Base Vocabulary

Start with every unique character as a separate token:

vocab = {l, o, w, e, r, n, s, t, i, d, _}

Now represent each word as a sequence of these characters:

(l, o, w, _): 5
(l, o, w, e, r, _): 2
(n, e, w, e, s, t, _): 6
(w, i, d, e, s, t, _): 3

Step 3: Merge the Most Frequent Pair

This is the heart of BPE. Count every adjacent pair of characters across all words, weighted by frequency.

The pair (e, s) appears in "newest_" (6 times) and "widest_" (3 times) = 9 total occurrences. It actually ties with (s, t) and (t, _), which also occur 9 times; implementations break such ties deterministically, and here we pick (e, s).

Merge (e, s) into a new token es. Update the vocabulary and all word representations:

vocab = {l, o, w, e, r, n, s, t, i, d, _, es}

(l, o, w, _): 5
(l, o, w, e, r, _): 2
(n, e, w, es, t, _): 6
(w, i, d, es, t, _): 3
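The weighted pair counting in Step 3 fits in a few lines of Python — a minimal sketch using the word frequencies from our running example:

```python
from collections import Counter

# Word frequencies from the running example, each word as a symbol tuple.
word_freqs = {
    ("l", "o", "w", "_"): 5,
    ("l", "o", "w", "e", "r", "_"): 2,
    ("n", "e", "w", "e", "s", "t", "_"): 6,
    ("w", "i", "d", "e", "s", "t", "_"): 3,
}

# Count every adjacent symbol pair, weighted by how often the word occurs.
pair_counts = Counter()
for symbols, freq in word_freqs.items():
    for pair in zip(symbols, symbols[1:]):
        pair_counts[pair] += freq

print(pair_counts[("e", "s")])  # 9
```

Note that (s, t) and (t, _) also reach 9 here, which is why tie-breaking matters in real tokenizer training.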

Step 4: Repeat

Keep merging the most frequent pair:

Merge 2: (es, t) → est (appears 9 times)

vocab = {..., es, est}

Merge 3: (est, _) → est_ (appears 9 times)

vocab = {..., est, est_}

Merge 4: (l, o) → lo (appears 7 times)

vocab = {..., lo}

Merge 5: (lo, w) → low (appears 7 times)

vocab = {..., low}

After 5 merges, our vocabulary and merge rules are:

vocab = {l, o, w, e, r, n, s, t, i, d, _, es, est, est_, lo, low}

Merge rules (in order):
(e, s) → es
(es, t) → est
(est, _) → est_
(l, o) → lo
(lo, w) → low

This continues until the vocabulary reaches a target size — 50,000 merges for GPT-2 (a 50,257-token vocabulary), and roughly 100,000 tokens for GPT-4's cl100k_base.
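The whole training phase can be condensed into one loop. This is a didactic sketch, not production tokenizer code — it assumes greedy most-frequent merging with first-seen tie-breaking, which happens to reproduce the five merges above:

```python
from collections import Counter

def train_bpe(word_freqs, num_merges):
    """Learn BPE merge rules from {symbol-tuple: frequency} word counts."""
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        counts = Counter()
        for symbols, freq in word_freqs.items():
            for pair in zip(symbols, symbols[1:]):
                counts[pair] += freq
        if not counts:
            break
        # Most frequent pair; max() keeps the first-seen pair on ties.
        a, b = max(counts, key=counts.get)
        merges.append((a, b))
        # Rewrite every word with the newly merged symbol.
        new_freqs = {}
        for symbols, freq in word_freqs.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                    out.append(a + b)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_freqs[tuple(out)] = freq
        word_freqs = new_freqs
    return merges

corpus = {
    ("l", "o", "w", "_"): 5,
    ("l", "o", "w", "e", "r", "_"): 2,
    ("n", "e", "w", "e", "s", "t", "_"): 6,
    ("w", "i", "d", "e", "s", "t", "_"): 3,
}
print(train_bpe(corpus, 5))
# [('e', 's'), ('es', 't'), ('est', '_'), ('l', 'o'), ('lo', 'w')]
```

Real implementations use incremental pair-count updates instead of recounting from scratch each iteration, but the logic is the same.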


Phase 2: Tokenizing New Text

Now we have a trained tokenizer. Let's use it on new text:

"newest binded lowers"

Step 1: Pre-tokenize

(newest_, binded_, lowers_)

Step 2: Break into Characters

(n, e, w, e, s, t, _)
(b, i, n, d, e, d, _)
(l, o, w, e, r, s, _)

Step 3: Apply Merge Rules in Order

Apply the learned merges in the exact order they were learned:

Apply (e, s) → es:
(n, e, w, es, t, _), (b, i, n, d, e, d, _), (l, o, w, e, r, s, _)

Apply (es, t) → est:
(n, e, w, est, _), (b, i, n, d, e, d, _), (l, o, w, e, r, s, _)

Apply (est, _) → est_:
(n, e, w, est_), (b, i, n, d, e, d, _), (l, o, w, e, r, s, _)

Apply (l, o) → lo:
(n, e, w, est_), (b, i, n, d, e, d, _), (lo, w, e, r, s, _)

Apply (lo, w) → low:
(n, e, w, est_), (b, i, n, d, e, d, _), (low, e, r, s, _)

Step 4: Handle Unknown Characters

Notice b wasn't in our vocabulary. In word-level BPE, unknown characters become [UNK]:

Final tokens: [n, e, w, est_, [UNK], i, n, d, e, d, _, low, e, r, s, _]

And that's BPE. Deceptively simple, but it explains a lot about how tokenization works.
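Phase 2 is mechanical enough to sketch directly: apply each merge rule, in training order, to a symbol sequence. A minimal version, using the merge list learned above:

```python
def apply_merges(symbols, merges):
    # Apply each learned merge rule, in training order, to a symbol list.
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)  # merge the pair into one symbol
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

merges = [("e", "s"), ("es", "t"), ("est", "_"), ("l", "o"), ("lo", "w")]
print(apply_merges(list("newest") + ["_"], merges))  # ['n', 'e', 'w', 'est_']
print(apply_merges(list("lowers") + ["_"], merges))  # ['low', 'e', 'r', 's', '_']
```

Production tokenizers like tiktoken do the same thing with heavily optimized data structures, but the order-of-merges rule is identical.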


Byte-Level BPE: What Modern LLMs Actually Use

The example above was "word-level" BPE — starting from characters. But GPT-2 and later models use byte-level BPE, which is slightly different.

Instead of starting with characters, byte-level BPE starts with 256 raw bytes (0x00 to 0xFF) as the base vocabulary.

text = "Hello"
bytes_list = list(text.encode("utf-8"))
# [72, 101, 108, 108, 111]

Every byte is a number from 0-255. This means:

  • No unknown tokens — any text can be represented as bytes

  • Works for any language without special handling

  • Emojis, special characters — all just byte sequences

The tradeoff: non-ASCII characters use multiple bytes. An emoji like 😀 is 4 bytes in UTF-8, which means more base tokens before merging.

"Hello".encode("utf-8")   # 5 bytes
"你好".encode("utf-8")     # 6 bytes for 2 characters
"😀".encode("utf-8")       # 4 bytes for 1 emoji

This is why non-English text and emojis cost more tokens — they start with more bytes before BPE merging happens.
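The "no unknown tokens" claim is easy to verify: every string round-trips through its UTF-8 bytes, and every one of those bytes sits in the 256-entry base vocabulary.

```python
# A byte-level base vocabulary means any string round-trips through
# bytes 0-255, so there is never an unknown token.
for text in ["Hello", "你好", "😀"]:
    byte_ids = list(text.encode("utf-8"))
    assert all(0 <= b <= 255 for b in byte_ids)
    assert bytes(byte_ids).decode("utf-8") == text

print(len("你好".encode("utf-8")))  # 6
```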


The Regex Trick That Makes BPE Actually Work

Here's something most tutorials skip: raw byte-level BPE creates garbage tokens.

Consider this text appearing many times in training:

"barking. barking. barking."

Without any preprocessing, BPE might learn to merge g and . into a g. token — because they appear together frequently. But g. is useless. It's not a meaningful subword.

GPT-2 solved this with regex pre-tokenization. Before applying BPE, split the text into chunks using a regex pattern:

import re
pattern = r"""'s|'t|'re|'ve|'m|'ll|'d| ?\w+| ?\d+| ?[^\s\w\d]+|\s+"""

text = "Dog is barking. barking."
chunks = re.findall(pattern, text)
# ['Dog', ' is', ' barking', '.', ' barking', '.']

Now BPE merges happen within each chunk, not across them:

  • barking is one chunk

  • . is a separate chunk

  • The pair (g, .) never appears together — no garbage g. token

This is why GPT tokenizers learn useful subwords like ing, ed, pre instead of junk like g. that straddles a word and its punctuation.
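You can check the claim directly: with the simplified pattern above (the real GPT-2 pattern uses Unicode categories via the third-party regex module), no within-chunk pair ever crosses the word/punctuation boundary.

```python
import re

# Simplified GPT-2-style pre-tokenization pattern (sketch; real GPT-2
# uses \p{L}/\p{N} Unicode categories instead of \w).
pattern = r"'s|'t|'re|'ve|'m|'ll|'d| ?\w+| ?\d+| ?[^\s\w\d]+|\s+"

chunks = re.findall(pattern, "barking. barking. barking.")
# ['barking', '.', ' barking', '.', ' barking', '.']

# BPE pair counting runs inside each chunk, so (g, .) never forms a pair.
pairs = {(c[i], c[i + 1]) for c in chunks for i in range(len(c) - 1)}
print(("g", ".") in pairs)  # False
```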


Why Different Models Have Different Tokenizers

A question that confused me early on: why can't you use GPT-2's tokenizer for GPT-4?

Because the merge rules are learned from training data. Different training data → different merges → different tokenization.

GPT-2 was trained on web text from 2019. GPT-4 was trained on much more data, including tons of code. So GPT-4's tokenizer learned merges for patterns like "def ", "import ", and "return " that GPT-2 never saw often enough to merge.

Same code, fewer tokens in GPT-4:

code = "def calculate():"

# GPT-2 tokenizer: ~6 tokens
# GPT-4 tokenizer: ~3 tokens (learned "def " as one token)

Also, tokenizers are frozen with model weights. During training, the model learned that token ID 256 means a specific thing. If you swap tokenizers, ID 256 now means something else — complete gibberish.


Other Tokenization Algorithms (Brief Overview)

BPE isn't the only approach. Here's the landscape:

WordPiece — Used by BERT. Similar to BPE, but chooses merges by likelihood score instead of raw frequency. Uses ## to mark subword continuations (un, ##believ, ##able). Mostly legacy now.

SentencePiece — Not an algorithm, but a library. Implements BPE and Unigram. Key feature: treats text as a raw stream, so it works for languages without spaces (Chinese, Japanese). Used by T5, LLaMA, Mistral.

Unigram — The opposite of BPE. Starts with a huge vocabulary and prunes down instead of building up. Niche use.

The production reality: the vast majority of modern LLMs use some form of BPE. If you understand BPE, you understand most tokenizers.


Vocabulary Sizes Explained

You'll see numbers like 50,257 or 200,000 for vocabulary sizes. Here's what they mean:

GPT-2: 50,257 tokens

  • 256 base bytes

  • 50,000 learned merges

  • 1 special token (<|endoftext|>)
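That 50,257 isn't arbitrary — the three parts above sum to it exactly:

```python
# GPT-2's vocabulary size decomposes exactly into its three parts.
base_bytes = 256         # byte-level base vocabulary
learned_merges = 50_000  # merge rules learned during training
special_tokens = 1       # <|endoftext|>

vocab_size = base_bytes + learned_merges + special_tokens
print(vocab_size)  # 50257
```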

GPT-4 (cl100k_base): ~100,000 tokens

  • More training data → more merges learned

  • Better coverage of code, multilingual text

GPT-4o (o200k_base): ~200,000 tokens

  • Even more merges

  • Significantly better for code and non-English

  • Same text = fewer tokens = lower cost

Larger vocabulary = more merges = longer tokens for common patterns = fewer tokens for the same text. The cost: a bigger embedding matrix and a more expensive output softmax. It's a tradeoff.


Things to Ponder

Take a moment to think through these. They're designed to check if the core ideas stuck, and you'll find the answers in what we covered above.

  1. BPE merges by frequency. WordPiece merges by likelihood. Both work. Why did most modern LLMs pick BPE?

  2. GPT-4 uses cl100k_base encoding. GPT-4o uses o200k_base. Why create a new encoding instead of updating the old one?

  3. LLaMA has 32K vocabulary. GPT-4 has 100K. Smaller vocab means what trade-off?

  4. You fine-tune a model on medical documents. Can you teach the tokenizer to treat "myocardial infarction" as one token?

  5. GPT-4o made Chinese significantly cheaper than GPT-4. Same BPE algorithm. What changed?


Key Takeaways

  • BPE learns merge rules by repeatedly combining the most frequent adjacent pairs in training data

  • Tokenization applies those merges in order to new text

  • Modern LLMs use byte-level BPE — starting with 256 bytes instead of characters

  • Regex pre-tokenization prevents garbage tokens that straddle words and punctuation, like g.

  • Different models have different tokenizers because they learned different merges from different training data

  • Vocabulary size = 256 base bytes + N merges + special tokens

That 1994 compression algorithm is still deciding your API bill. Now you know how.


Want to discuss this further or have questions? Hit me up on LinkedIn.

LLM Foundations

Part 4 of 5

Breaking down how LLMs actually work — tokens, embeddings, context windows, and the fundamentals you need before building anything serious.
