What Are Tokens and Why Your LLM Bill Depends on Them
Learn what tokens really are, why they're not words, and how understanding tokenization saves you money on LLM API costs.

"Hello" is 1 token. "你好" is 2 tokens. Same meaning. Double the cost.
That little fact tripped me up when I first started working with LLMs. I assumed tokens were just... words. They're not. And that misunderstanding quietly inflates API bills everywhere.
The Problem: We Think in Words, LLMs Don't
Here's what happens to most of us when we start out:
Someone asks for a cost estimate. We count words. We multiply by the API's price-per-token, assuming 1 word ≈ 1 token. We're confident in our math.
Then the actual bill arrives. It's 30% higher. Sometimes 300% higher.
The issue? LLMs don't see words. They see tokens. And tokens follow their own rules, rules that have nothing to do with how we read text.
Once this clicked for me, a lot of other things started making sense: why context windows fill up faster than expected, why non-English apps cost more, why some prompts are mysteriously expensive.
So What Are Tokens, Really?
Think of tokens as the atoms of text for an LLM. The smallest units it works with.
But here's the thing that confused me initially: tokens aren't words, and they aren't characters. They sit somewhere in between.
Character level: "playing" = ['p','l','a','y','i','n','g'] → 7 units
Token level: "playing" = ['play', 'ing'] → 2 units
Word level: "playing" = ['playing'] → 1 unit
See that middle row? That's what the model actually sees. Not the full word "playing", but two pieces: "play" and "ing".
The technical definition: A token is a subword unit from a fixed vocabulary that the model learned during training. This vocabulary is typically 32,000 to 200,000 tokens, depending on the model.
Why subwords? Because it's a sweet spot. Pure character-level tokenization creates absurdly long sequences. Pure word-level tokenization can't handle new or rare words. Subwords give you the best of both — common words stay whole, rare words get split into recognizable pieces.
What Actually Happens When You Send Text to an LLM
When you send "I heard a dog bark loudly at a cat" to GPT, your text goes on a little journey:
The tokenizer converts your text into a sequence of integers:
"I heard a dog bark loudly at a cat"
↓
[40, 5765, 257, 3290, 14187, 27967, 379, 257, 3797]
Each number is a token ID — basically a lookup into the model's vocabulary. The model crunches these numbers, spits out new numbers, and a detokenizer converts them back to text.
The model never sees your actual text. It only sees numbers. This is why tokenization matters so much — it's the translation layer between human language and what the model actually processes.
The Common Ways Text Gets Tokenized
Not all tokenizers work the same way. Here's the landscape:
Word Tokenization — Split on spaces and punctuation. Simple and intuitive. Falls apart with compound words, technical terms, and languages that don't use spaces (like Chinese or Japanese).
Character Tokenization — Split into individual letters. Handles anything, but "Hello" becomes 5 tokens instead of 1. Sequences get very long, very fast.
Subword Tokenization — What modern LLMs actually use. Common words stay whole. Rare words get broken into meaningful pieces. "unbelievable" becomes ["un", "believ", "able"] — each piece still carries meaning.
Sentence Tokenization — Keeps full sentences as units. You'll see this in RAG pipelines where preserving semantic boundaries matters more than character-level precision.
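To make the trade-off concrete, here are toy word-level and character-level tokenizers in plain Python. These are illustrative sketches only — not what any production LLM uses — but they show why character sequences get long so fast:

```python
import re

def word_tokenize(text: str) -> list[str]:
    # Split on word boundaries; keep punctuation as separate tokens.
    return re.findall(r"\w+|[^\w\s]", text)

def char_tokenize(text: str) -> list[str]:
    # One token per character.
    return list(text)

sentence = "Tokenization isn't free."
words = word_tokenize(sentence)
chars = char_tokenize(sentence)
print(words)        # a handful of word/punctuation units
print(len(chars))   # one unit per character: a much longer sequence
```

Subword tokenization sits between these two extremes, which is why it needs a learned vocabulary rather than a simple rule.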
Most production LLMs — GPT-4, Claude, LLaMA, Mistral — use a subword algorithm called BPE (Byte Pair Encoding). How BPE actually works is a topic for another post. For now, just know it learns which character sequences appear together frequently and merges them into single tokens.
Not All Tokens Are Visible Text
This one surprised me at first. Some tokens aren't words at all — they're control signals.
Text tokens — Your actual content. Words, numbers, punctuation.
Special tokens — Behind-the-scenes markers:
<|endoftext|> or </s> — "This is the end of the input"
[PAD] — Filler when batching inputs of different lengths
[MASK] — Used during training to hide tokens the model must predict
[UNK] — "I've never seen this character before"
Different models use different special tokens. BERT has [CLS] and [SEP]. GPT models have <|endoftext|>. LLaMA uses <s> and </s>.
Why does this matter? Because these tokens count toward your context limit and your bill, even though you never typed them.
Why This Token ≠ Word Thing Actually Matters
Okay, theory is nice. Here's where it hits your wallet.
You're Charged Per Token, Not Per Word
text = "The developer implemented antidisestablishmentarianism"
# Word count: 4 words
# Token count: 9 tokens
# ['The', ' developer', ' implemented', ' ant', 'idis', 'establish', 'ment', 'arian', 'ism']
That one long word costs 6 tokens by itself. A word-based estimate would be off by 125%.
Context Windows Are Token Budgets
GPT-4's "128K context window" isn't 128,000 words. It's 128,000 tokens. (Newer models like GPT-4.1 and Claude Sonnet 4 now support up to 1 million tokens — but the same principle applies.)
Rough math: 100,000 words ≈ 133,000 tokens. That document you thought would fit? Might not.
Non-English Text Is Token-Expensive
This is the one that catches people building global products:
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4")
english = "Hello, how are you today?"
chinese = "你好,你今天好吗?" # Same meaning
print(len(enc.encode(english))) # 7 tokens
print(len(enc.encode(chinese))) # 11 tokens
Same meaning. 57% more tokens. 57% higher cost.
Why? BPE tokenizers are trained mostly on English text. English words get efficiently merged into single tokens. Chinese characters appear less frequently in training data, so they stay as separate tokens or get split further.
If you're building for multiple markets: Chinese, Japanese, Arabic, Thai — all cost more per equivalent message. Your pricing model might need to account for this.
A Quick Note on "Encodings"
You'll see terms like cl100k_base and o200k_base in tokenizer code. These confused me at first.
They're not algorithms. They're vocabulary names.
| Encoding | Used By | Vocab Size |
| --- | --- | --- |
| cl100k_base | GPT-4, GPT-3.5-turbo | ~100K tokens |
| o200k_base | GPT-4o, GPT-4o-mini | ~200K tokens |
| p50k_base | Codex, text-davinci-002 | ~50K tokens |
Newer models like GPT-4.1, GPT-5, and the o-series reasoning models (o1, o3, o4-mini) all use o200k_base as well. When in doubt, let encoding_for_model() figure it out for you.
All of these use the same BPE algorithm under the hood. The difference is which vocabulary — learned from which training data — they're using.
GPT-4o's larger vocabulary means it learned more merges, especially for code and non-English text. Same text, fewer tokens, lower cost. That's why o200k_base exists.
import tiktoken
# Let tiktoken pick the right encoding automatically:
enc = tiktoken.encoding_for_model("gpt-4") # Uses cl100k_base
enc = tiktoken.encoding_for_model("gpt-4o") # Uses o200k_base
tokens = enc.encode("your text here")
print(len(tokens))
Always use encoding_for_model(). Don't hardcode encoding names unless you have a specific reason.
Different Models, Different Tokenizers
Using GPT-2's tokenizer to estimate GPT-4 costs? The numbers won't match.
Why?
Each tokenizer is trained on that model's training data. Different data → different vocabulary → different token counts for the same text.
GPT-4 saw way more code than GPT-2, so it learned efficient merges for programming patterns. Same Python snippet might be 13 tokens in GPT-2's tokenizer and 8 tokens in GPT-4's.
Also important: You can't swap tokenizers between models. The model's weights are tied to specific token IDs. Token ID 256 in GPT-4 might mean "ing". In LLaMA, it might mean "the". Mix them up and you get nonsense output.
# WRONG — using GPT-2 tokenizer for GPT-4 API:
from transformers import GPT2Tokenizer
tok = GPT2Tokenizer.from_pretrained("gpt2")
tokens = tok.encode("Hello world") # ❌ Wrong count for GPT-4!
# RIGHT — use the model's actual tokenizer:
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4")
tokens = enc.encode("Hello world") # ✓ Accurate
How to Actually Count Tokens
For OpenAI models:
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4")
text = "Your text here"
print(f"Tokens: {len(enc.encode(text))}")
For open-source models (LLaMA, Mistral, etc.):
from transformers import AutoTokenizer
# Note: Meta's Llama repos are gated — you'll need Hugging Face access approval.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tokens = tokenizer.encode("Your text here")
print(f"Tokens: {len(tokens)}")
For Claude: Use Anthropic's /v1/messages/count_tokens API endpoint. They don't publish their tokenizer publicly.
For Gemini: Use Google's countTokens endpoint. Same situation — no public tokenizer.
The rule I follow: Never estimate when I can count. It takes two lines of code.
Quick Estimation (When You're in a Hurry)
Sometimes you just need a ballpark. For English text:
1 token ≈ 4 characters
1 token ≈ 0.75 words
Quick math:
- 100 words ≈ 133 tokens
- 1000 characters ≈ 250 tokens
But treat these as rough guides, not facts. Code, technical jargon, and non-English text will blow these estimates apart. Use actual token counts for anything where accuracy matters.
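If you want the heuristics in function form, here's a back-of-the-envelope estimator built from the two rules of thumb above. It's only a ballpark for English prose — count for real when it matters:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate for English prose. Not for code or CJK text."""
    by_chars = len(text) / 4             # 1 token ~= 4 characters
    by_words = len(text.split()) / 0.75  # 1 token ~= 0.75 words
    return round((by_chars + by_words) / 2)

print(estimate_tokens("The quick brown fox jumps over the lazy dog."))
```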
Mistakes I've Learned to Avoid
Counting words for cost estimates — The "1000 words ≈ 1000 tokens" assumption is wrong. Actual ratio varies wildly depending on content.
Using the wrong tokenizer — GPT-2 tokenizer for GPT-4 estimates. LLaMA tokenizer for Mistral. Every model needs its own tokenizer.
Forgetting message overhead — API messages have hidden tokens: role markers, separators. A 10-message conversation might add 30+ invisible tokens.
ALL CAPS for emphasis —
enc = tiktoken.encoding_for_model("gpt-4")
print(len(enc.encode("hello"))) # 1 token
print(len(enc.encode("HELLO"))) # 2 tokens — uppercase words often get split into pieces
The model understands emphasis just fine in lowercase. Save your tokens.
Emoji overuse —
text1 = "I am happy" # 3 tokens
text2 = "I am happy 😀" # 5 tokens (the emoji alone is 2 tokens)
Most emojis are 3-4 bytes in UTF-8, and tokenizers often split them into multiple tokens. A chatbot that uses 👍 and ❌ everywhere is paying 2-3x more for those symbols than one that uses "yes" and "no".
Things to Ponder
Take a moment to think through these. They're designed to check if the core ideas stuck, and you'll find the answers in what we covered above.
Two sentences: "I love AI" and "I LOVE AI" — same words, same meaning. Why might one cost more than the other?
Your app serves users in English and Chinese. Same conversation length, same features. Why might Chinese users cost you 2x more?
A support bot uses 👍 and ❌ in responses. Your colleague suggests switching to "yes" and "no". Overthinking or real savings?
You're counting tokens using GPT-2's tokenizer but calling the GPT-4 API. Your estimates are always off. Why?
"1000 words ≈ 1000 tokens" — your PM uses this for cost estimation. What's the flaw in this thinking?
Key Takeaways
Tokens are subword units — not words, not characters. They're what LLMs actually process.
APIs charge per token. Context limits are in tokens. Everything that costs you money is measured in tokens.
Different models have different tokenizers. Always use the right one for accurate counts.
Non-English text and emojis are token-expensive. Plan for this in multilingual products.
Don't estimate when you can count.
tiktoken for OpenAI, AutoTokenizer for open-source.
"Hello" is 1 token. "你好" is 2 tokens. Now you know why — and what to do about it.
Want to discuss the "Things to Ponder" questions further, or have questions of your own? Hit me up on LinkedIn.