What Are Tokens and Why Your LLM Bill Depends on Them
Learn what tokens really are, why they're not words, and how understanding tokenization saves you money on LLM API costs.

"Hello" is 1 token. "你好" is 2 tokens. Same meaning. Double the cost.
That little fact tripped me up when I first started working with LLMs. I assumed tokens were just... words. They're not. And that misunderstanding quietly inflates API bills everywhere.
The Problem: We Think in Words, LLMs Don't
Here's what happens to most of us when we start out:
Someone asks for a cost estimate. We count words. We multiply by the API's price-per-token, assuming 1 word ≈ 1 token. We're confident in our math.
Then the actual bill arrives. It's 30% higher. Sometimes 300% higher.
The issue? LLMs don't see words. They see tokens. And tokens follow their own rules, rules that have nothing to do with how we read text.
Once this clicked for me, a lot of other things started making sense: why context windows fill up faster than expected, why non-English apps cost more, why some prompts are mysteriously expensive.
So What Are Tokens, Really?
Think of tokens as the atoms of text for an LLM. The smallest units it works with.
But here's the thing that confused me initially: tokens aren't words, and they aren't characters. They sit somewhere in between.
Character level: "playing" = ['p','l','a','y','i','n','g'] → 7 units
Token level: "playing" = ['play', 'ing'] → 2 units
Word level: "playing" = ['playing'] → 1 unit
See that middle row? That's what the model actually sees. Not the full word "playing", but two pieces: "play" and "ing".
The technical definition: A token is a subword unit from a fixed vocabulary that the model learned during training. This vocabulary is typically 32,000 to 200,000 tokens, depending on the model.
Why subwords? Because it's a sweet spot. Pure character-level tokenization creates absurdly long sequences. Pure word-level tokenization can't handle new or rare words. Subwords give you the best of both — common words stay whole, rare words get split into recognizable pieces.
What Actually Happens When You Send Text to an LLM
When you send "I heard a dog bark loudly at a cat" to GPT, your text goes on a little journey:
The tokenizer converts your text into a sequence of integers:
"I heard a dog bark loudly at a cat"
↓
[40, 5765, 257, 3290, 14187, 27967, 379, 257, 3797]
Each number is a token ID — basically a lookup into the model's vocabulary. The model crunches these numbers, spits out new numbers, and a detokenizer converts them back to text.
The model never sees your actual text. It only sees numbers. This is why tokenization matters so much — it's the translation layer between human language and what the model actually processes.
The Common Ways Text Gets Tokenized
Not all tokenizers work the same way. Here's the landscape:
Word Tokenization — Split on spaces and punctuation. Simple and intuitive. Falls apart with compound words, technical terms, and languages that don't use spaces (like Chinese or Japanese).
Character Tokenization — Split into individual letters. Handles anything, but "Hello" becomes 5 tokens instead of 1. Sequences get very long, very fast.
Subword Tokenization — What modern LLMs actually use. Common words stay whole. Rare words get broken into meaningful pieces. "unbelievable" becomes ["un", "believ", "able"] — each piece still carries meaning.
Sentence Tokenization — Keeps full sentences as units. You'll see this in RAG pipelines where preserving semantic boundaries matters more than character-level precision.
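To make the trade-off concrete, here are toy word-level and character-level tokenizers in plain Python. These are illustrative sketches only — not what any production LLM uses — but they show why character sequences get long so fast:

```python
import re

def word_tokenize(text: str) -> list[str]:
    # Split on word boundaries; keep punctuation as separate tokens.
    return re.findall(r"\w+|[^\w\s]", text)

def char_tokenize(text: str) -> list[str]:
    # One token per character.
    return list(text)

sentence = "Tokenization isn't free."
words = word_tokenize(sentence)
chars = char_tokenize(sentence)
print(words)        # a handful of word/punctuation units
print(len(chars))   # one unit per character: a much longer sequence
```

Subword tokenization sits between these two extremes, which is why it needs a learned vocabulary rather than a simple rule.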
Most production LLMs — GPT-4, Claude, LLaMA, Mistral — use a subword algorithm called BPE (Byte Pair Encoding). How BPE actually works is a topic for another post. For now, just know it learns which character sequences appear together frequently and merges them into single tokens.
Not All Tokens Are Visible Text
This one surprised me at first. Some tokens aren't words at all — they're control signals.
Text tokens — Your actual content. Words, numbers, punctuation.
Special tokens — Behind-the-scenes markers:
<|endoftext|> or </s> — "This is the end of the input"
[PAD] — Filler when batching inputs of different lengths
[MASK] — Used during training to hide tokens the model must predict
[UNK] — "I've never seen this character before"
Different models use different special tokens. BERT has [CLS] and [SEP]. GPT models have <|endoftext|>. LLaMA uses <s> and </s>.
Why does this matter? Because these tokens count toward your context limit and your bill, even though you never typed them.
Why This Token ≠ Word Thing Actually Matters
Okay, theory is nice. Here's where it hits your wallet.
You're Charged Per Token, Not Per Word
text = "The developer implemented antidisestablishmentarianism"
# Word count: 4 words
# Token count: 9 tokens
# ['The', ' developer', ' implemented', ' ant', 'idis', 'establish', 'ment', 'arian', 'ism']
That one long word costs 6 tokens by itself. A word-based estimate would be off by 125%.
Context Windows Are Token Budgets
GPT-4's "128K context window" isn't 128,000 words. It's 128,000 tokens. (Newer models like GPT-4.1 and Claude Sonnet 4 now support up to 1 million tokens — but the same principle applies.)
Rough math: 100,000 words ≈ 133,000 tokens. That document you thought would fit? Might not.
Non-English Text Is Token-Expensive
This is the one that catches people building global products:
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4")
english = "Hello, how are you today?"
chinese = "你好,你今天好吗?" # Same meaning
print(len(enc.encode(english))) # 7 tokens
print(len(enc.encode(chinese))) # 11 tokens
Same meaning. 57% more tokens. 57% higher cost.
Why? BPE tokenizers are trained mostly on English text. English words get efficiently merged into single tokens. Chinese characters appear less frequently in training data, so they stay as separate tokens or get split further.
If you're building for multiple markets: Chinese, Japanese, Arabic, Thai — all cost more per equivalent message. Your pricing model might need to account for this.
A Quick Note on "Encodings"
You'll see terms like cl100k_base and o200k_base in tokenizer code. These confused me at first.
They're not algorithms. They're vocabulary names.
| Encoding | Used By | Vocab Size |
| --- | --- | --- |
| cl100k_base | GPT-4, GPT-3.5-turbo | ~100K tokens |
| o200k_base | GPT-4o, GPT-4o-mini | ~200K tokens |
| p50k_base | Codex, text-davinci-002 | ~50K tokens |
Newer models like GPT-4.1, GPT-5, and the o-series reasoning models (o1, o3, o4-mini) all use o200k_base as well. When in doubt, let encoding_for_model() figure it out for you.
All of these use the same BPE algorithm under the hood. The difference is which vocabulary — learned from which training data — they're using.
GPT-4o's larger vocabulary means it learned more merges, especially for code and non-English text. Same text, fewer tokens, lower cost. That's why o200k_base exists.
import tiktoken
# Let tiktoken pick the right encoding automatically:
enc = tiktoken.encoding_for_model("gpt-4") # Uses cl100k_base
enc = tiktoken.encoding_for_model("gpt-4o") # Uses o200k_base
tokens = enc.encode("your text here")
print(len(tokens))
Always use encoding_for_model(). Don't hardcode encoding names unless you have a specific reason.
Different Models, Different Tokenizers
Using GPT-2's tokenizer to estimate GPT-4 costs? The numbers won't match.
Why?
Each tokenizer is trained on that model's training data. Different data → different vocabulary → different token counts for the same text.
GPT-4 saw way more code than GPT-2, so it learned efficient merges for programming patterns. Same Python snippet might be 13 tokens in GPT-2's tokenizer and 8 tokens in GPT-4's.
Also important: You can't swap tokenizers between models. The model's weights are tied to specific token IDs. Token ID 256 in GPT-4 might mean "ing". In LLaMA, it might mean "the". Mix them up and you get nonsense output.
# WRONG — using GPT-2 tokenizer for GPT-4 API:
from transformers import GPT2Tokenizer
tok = GPT2Tokenizer.from_pretrained("gpt2")
tokens = tok.encode("Hello world") # ❌ Wrong count for GPT-4!
# RIGHT — use the model's actual tokenizer:
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4")
tokens = enc.encode("Hello world") # ✓ Accurate
How to Actually Count Tokens
For OpenAI models:
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4")
text = "Your text here"
print(f"Tokens: {len(enc.encode(text))}")
For open-source models (LLaMA, Mistral, etc.):
from transformers import AutoTokenizer
# Note: Meta's Llama repos are gated — you'll need Hugging Face access approval.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tokens = tokenizer.encode("Your text here")
print(f"Tokens: {len(tokens)}")
For Claude: Use Anthropic's /v1/messages/count_tokens API endpoint. They don't publish their tokenizer publicly.
For Gemini: Use Google's countTokens endpoint. Same situation — no public tokenizer.
The rule I follow: Never estimate when I can count. It takes two lines of code.
Quick Estimation (When You're in a Hurry)
Sometimes you just need a ballpark. For English text:
1 token ≈ 4 characters
1 token ≈ 0.75 words
Quick math:
- 100 words ≈ 133 tokens
- 1000 characters ≈ 250 tokens
But treat these as rough guides, not facts. Code, technical jargon, and non-English text will blow these estimates apart. Use actual token counts for anything where accuracy matters.
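If you want the heuristics in function form, here's a back-of-the-envelope estimator built from the two rules of thumb above. It's only a ballpark for English prose — count for real when it matters:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate for English prose. Not for code or CJK text."""
    by_chars = len(text) / 4             # 1 token ~= 4 characters
    by_words = len(text.split()) / 0.75  # 1 token ~= 0.75 words
    return round((by_chars + by_words) / 2)

print(estimate_tokens("The quick brown fox jumps over the lazy dog."))
```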
Mistakes I've Learned to Avoid
Counting words for cost estimates — The "1000 words ≈ 1000 tokens" assumption is wrong. Actual ratio varies wildly depending on content.
Using the wrong tokenizer — GPT-2 tokenizer for GPT-4 estimates. LLaMA tokenizer for Mistral. Every model needs its own tokenizer.
Forgetting message overhead — API messages have hidden tokens: role markers, separators. A 10-message conversation might add 30+ invisible tokens.
ALL CAPS for emphasis —
enc = tiktoken.encoding_for_model("gpt-4")
print(len(enc.encode("hello"))) # 1 token
print(len(enc.encode("HELLO"))) # 2 tokens — uppercase words often get split into pieces
The model understands emphasis just fine in lowercase. Save your tokens.
Emoji overuse —
text1 = "I am happy" # 3 tokens
text2 = "I am happy 😀" # 5 tokens (the emoji alone is 2 tokens)
Most emojis are 3-4 bytes in UTF-8, and tokenizers often split them into multiple tokens. A chatbot that uses 👍 and ❌ everywhere is paying 2-3x more for those symbols than one that uses "yes" and "no".
Things to Ponder
Take a moment to think through these. They're designed to check if the core ideas stuck, and you'll find the answers in what we covered above.
Two sentences: "I love AI" and "I LOVE AI" — same words, same meaning. Why might one cost more than the other?
Your app serves users in English and Chinese. Same conversation length, same features. Why might Chinese users cost you 2x more?
A support bot uses 👍 and ❌ in responses. Your colleague suggests switching to "yes" and "no". Overthinking or real savings?
You're counting tokens using GPT-2's tokenizer but calling the GPT-4 API. Your estimates are always off. Why?
"1000 words ≈ 1000 tokens" — your PM uses this for cost estimation. What's the flaw in this thinking?
Key Takeaways
Tokens are subword units — not words, not characters. They're what LLMs actually process.
APIs charge per token. Context limits are in tokens. Everything that costs you money is measured in tokens.
Different models have different tokenizers. Always use the right one for accurate counts.
Non-English text and emojis are token-expensive. Plan for this in multilingual products.
Don't estimate when you can count.
tiktoken for OpenAI, AutoTokenizer for open-source.
"Hello" is 1 token. "你好" is 2 tokens. Now you know why — and what to do about it.
Want to discuss the "Things to Ponder" questions further, or have questions of your own? Hit me up on LinkedIn.