
Choosing Embedding Models and Dimensions: Why 1536 Isn't Always Better Than 384

Learn how to choose embedding models and dimensions for production RAG systems. Compare OpenAI, Voyage AI, and Google's free options for embeddings.

You're building a RAG system and need to pick an embedding model. The options are overwhelming: OpenAI, Voyage, Google, Cohere, or self-hosted open-source. Prices range from free to $0.13 per million tokens. Dimensions range from 256 to 3072.

How do you choose?

This post breaks down the actual options, the real costs, and the trade-offs you need to understand.


The Problem

Most teams don't actually choose an embedding model. They default.

Default to whatever the tutorial uses (usually OpenAI). Default to the maximum dimensions the API returns. Default to assumptions about what "production-ready" means.

Here's what gets missed:

Cost varies by more than 6x between providers for similar tasks. At scale, this compounds.

Dimensions and quality aren't linearly related. You can reduce dimensions significantly with minimal quality loss using techniques like Matryoshka learning.

There are free options that work. Google offers a completely free embedding API. Voyage gives 200M free tokens. These aren't prototypes; they're production-grade.

The goal isn't to find the "best" model. It's to understand what you're optimizing for and make an informed choice.


The Model Landscape

Embedding models fall into two categories: managed APIs (you pay per token) and open-source models (you run yourself).

Managed API Options

OpenAI

OpenAI offers two current embedding models, both supporting Matryoshka Representation Learning (dimension flexibility):

text-embedding-3-small

  • Default: 1536 dimensions

  • Can reduce to: 256-1536 dimensions via dimensions parameter

  • Cost: $0.02 per 1M tokens

  • Quality: 62.3% MTEB, 44.0% MIRACL

  • Context: 8,191 tokens

text-embedding-3-large

  • Default: 3072 dimensions

  • Can reduce to: 256-3072 dimensions via dimensions parameter

  • Cost: $0.13 per 1M tokens

  • Quality: 64.6% MTEB, 54.9% MIRACL

  • Context: 8,191 tokens

Note for beginners: MTEB and MIRACL are benchmark suites used to compare embedding models across many tasks. A higher score usually means a stronger model, but what matters most is how it performs on your data.

The key feature: dimension flexibility. You can request 768-dim embeddings from text-embedding-3-small and get the same per-token cost ($0.02/1M) with half the storage. A 256-dim version of text-embedding-3-large outperforms the full 1536-dim ada-002 model on benchmarks.

OpenAI's embedding models are the most widely deployed in production. The ecosystem support is extensive: every vector database has examples, every framework has built-in integration, and troubleshooting resources are abundant.

Voyage AI

Voyage AI specializes in embeddings optimized for retrieval. They're Anthropic's recommended partner for embeddings.

voyage-4-large (1024 dimensions)

  • Uses mixture-of-experts architecture

  • Cost: ~$0.12 per 1M tokens

  • Quality: 72.3% MTEB (state-of-the-art)

  • Free tier: First 200M tokens

voyage-4 (1024 dimensions)

  • Balanced performance

  • Same pricing and free tier

voyage-4-lite (1024 dimensions)

  • Optimized for speed

  • Same pricing and free tier

voyage-3.5 (1024 dimensions)

  • Previous generation

  • Same pricing structure

Voyage's v4 series introduces shared embedding spaces. You can index with voyage-4-large and query with voyage-4-lite without re-indexing. The embeddings are compatible across the v4 family.

Benchmark performance is strong, particularly on retrieval tasks. The 200M free token tier covers most initial projects entirely.

Google Gemini

text-embedding-004 (768 dimensions)

  • Cost: Completely free

  • Quality: 61.2% MTEB

  • Good multilingual support

gemini-embedding-001 (3072 dimensions, supports 768/1536/3072)

  • Cost: $0.15 per 1M tokens

  • Matryoshka support for dimension flexibility

  • 100+ languages

Google's free tier is production-grade, not a prototype. The trade-off: no SLAs, no guaranteed uptime, and terms could change.

Cohere

embed-v4 (1536 dimensions, supports 256/512/1024/1536)

  • Cost: $0.12 per 1M text tokens

  • Multimodal: supports text and images ($0.47/1M image tokens)

  • Strong multilingual performance

  • Matryoshka support

Cohere targets enterprise use cases and offers multimodal capabilities for visual search applications.

Open-Source Options

Open-source models are free to use but require infrastructure. Expect GPU costs for acceptable performance.

BGE (BAAI General Embedding)

BGE-M3 (1024 dimensions)

  • Multi-lingual (100+ languages)

  • Multi-functionality (dense, sparse, multi-vector retrieval)

  • Context: 8192 tokens

  • Quality: 68.9% MTEB

bge-large-en-v1.5 (1024 dimensions)

  • English-only

  • High quality for open-source

bge-small-en-v1.5 (384 dimensions)

  • Lightweight, fast inference

E5 (Microsoft)

Multiple sizes (384-1024 dimensions), strong MTEB performance, well-documented.

Nomic Embed

nomic-embed-text (768 dimensions)

  • Apache 2.0 license

  • Fully open-source

  • Good for transparency requirements

Open-source makes sense when you have privacy requirements, massive scale where API costs become prohibitive, or ML ops expertise with existing GPU infrastructure.


How to Decide

Start with your primary constraint.

If Privacy Is Non-Negotiable

Use open-source models on your infrastructure.

If your data can't leave your servers (healthcare, finance, government), you're self-hosting. BGE-M3 is a strong default: multilingual, actively maintained, proven in production.

Expect GPU costs of $100-300/month depending on query volume. This is often cheaper than APIs at scale, but you're trading money for operational complexity.

If You're Optimizing for Cost

Test the free options first.

For moderate scale (5M documents, 100K queries/month):

  • Google free: ~$20/year (storage only)

  • Voyage (within free tier): $0/year

  • OpenAI 3-small: ~$160/year

  • Self-hosted BGE: ~$1,200/year (GPU costs)

The free tiers aren't toys. They're production-capable. Test them before paying.

At high scale (100M+ documents, 1M+ queries/month), API costs compound. Self-hosting becomes cheaper, but only if you have the team to run it.

If You're Optimizing for Quality

Check benchmarks, then test on your data.

February 2026 MTEB scores:

  • Voyage-4-large (1024-dim): 72.3%

  • BGE-M3 (1024-dim): 68.9%

  • OpenAI 3-large (3072-dim): 64.6%

  • OpenAI 3-small (1536-dim): 62.3%

  • Google text-embedding-004 (768-dim): 61.2%

Benchmarks are averages across many tasks. Your domain might differ. A model scoring 68% overall might score 73% on your legal documents, or 63% on your customer support tickets.

Test on a sample of your actual documents before committing.

If You're Optimizing for Speed to Production

Use OpenAI text-embedding-3-small.

It's in every tutorial. Every vector database has examples. Every framework has built-in support. When you hit issues, Stack Overflow has answers.

The ecosystem support reduces risk. For teams shipping products, "just works" has real value.

If You're Prototyping

Use free tiers: Google or Voyage.

Embedding costs should be zero during validation. Google's completely free. Voyage gives 200M tokens free.

Once validated, reevaluate based on production requirements.


Understanding Dimensions and Cost

Dimensions affect three things: storage, query speed, and retrieval quality.

The Matryoshka Advantage

Modern embedding models (OpenAI 3-series, Cohere v4, Voyage v4, gemini-embedding-001) support Matryoshka Representation Learning. This means you can reduce dimensions without retraining.

How it works: earlier dimensions encode more important information, later dimensions add refinement. You can truncate to smaller sizes with minimal quality loss.
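In code, the truncation step is simple: cut the vector and re-normalize. This is a minimal sketch of the client-side version of what the `dimensions` parameter does for you; note that plain truncation only preserves quality for models trained with Matryoshka learning.

```python
import numpy as np

def truncate_embedding(vec, dims):
    # Keep the first `dims` components, then re-normalize to unit length
    # so cosine similarity stays meaningful after truncation.
    v = np.asarray(vec, dtype=np.float32)[:dims]
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

# Example: cut a 1536-dim vector down to 256 dims
full = np.random.default_rng(0).normal(size=1536)
small = truncate_embedding(full, 256)
```

If your provider supports a dimensions parameter, prefer requesting the reduced size directly; the sketch above is for vectors you already have on hand.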

Example from OpenAI's data: text-embedding-3-large at 256 dimensions outperforms ada-002 at 1536 dimensions on MTEB benchmarks. That's a 6x reduction in size with better quality.

This changes the cost calculation fundamentally. You're not locked into default dimensions.

Storage Math

Embeddings are arrays of floating-point numbers (4 bytes each).

For 1 million documents:

  • 256 dimensions: 1 GB

  • 384 dimensions: 1.5 GB

  • 512 dimensions: 2 GB

  • 768 dimensions: 3 GB

  • 1024 dimensions: 4 GB

  • 1536 dimensions: 6 GB

  • 3072 dimensions: 12 GB

Vector database storage typically costs $0.10-0.20/GB/month. At 10M documents with 1536-dim embeddings, that's $6-12/month in storage. Cut to 768-dim and it's $3-6/month.

Storage scales linearly with documents × dimensions.
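That linear relationship is easy to sanity-check with a one-line helper (a sketch; real vector databases add index and metadata overhead on top of the raw vectors):

```python
def embedding_storage_gb(num_docs, dims, bytes_per_value=4):
    # Raw vector storage in decimal GB: docs x dims x 4-byte floats.
    # Index structures and metadata come on top of this.
    return num_docs * dims * bytes_per_value / 1e9

# 1M documents at 1536 dims -> ~6 GB of raw vectors
print(round(embedding_storage_gb(1_000_000, 1536), 1))  # prints 6.1
```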

Speed Impact

Higher-dimensional vectors take longer to compare during similarity search.

| Dimensions | Typical Query Latency | Relative Speed |
|------------|-----------------------|----------------|
| 256        | <5ms                  | Fastest        |
| 384        | <10ms                 | Very fast      |
| 512        | 10-20ms               | Fast           |
| 768        | 10-30ms               | Good           |
| 1024       | 30-50ms               | Moderate       |
| 1536       | 50-100ms              | Slower         |
| 3072       | 100-500ms             | Slowest        |

These numbers vary based on vector database, hardware, and ANN algorithm. The trend holds: more dimensions = slower queries unless you add compute.
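The reason the trend holds is visible in a brute-force (exact, non-ANN) search, sketched below: the scan does O(n_docs × dims) multiply-adds, so halving dimensions roughly halves the work.

```python
import numpy as np

def top_k(query, index, k=5):
    # Exact nearest-neighbor scan. With unit-normalized vectors, the dot
    # product equals cosine similarity. Cost is O(n_docs * dims).
    scores = index @ query
    top = np.argpartition(-scores, k)[:k]   # unordered top-k candidates
    return top[np.argsort(-scores[top])]    # sorted best-first

# Toy index: 1,000 random 768-dim unit vectors
rng = np.random.default_rng(1)
index = rng.normal(size=(1000, 768))
index /= np.linalg.norm(index, axis=1, keepdims=True)

# Querying with document 42's own vector should return 42 first
print(top_k(index[42], index, k=3)[0])  # prints 42
```

ANN indexes (HNSW, IVF) avoid the full scan, but their node comparisons still cost O(dims) each, so dimensionality shows up in their latency too.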

Cost Comparison: Real Scenario

Scenario: 5 million documents, 100,000 queries per month, 768 dimensions.

Note: these are approximate costs, provided for context. The arithmetic assumes roughly 1,000 tokens per document and per query.

OpenAI text-embedding-3-small (reduced to 768-dim)

  • Indexing: 5M docs (~5B tokens) × $0.02/1M = $100 (one-time)

  • Queries: 100K queries (~100M tokens) × $0.02/1M = $2/month

  • Storage: 15 GB × $0.10/GB = $1.50/month

  • First year: $142

Voyage-4 (1024-dim, close to 768)

  • Indexing: FREE (200M token free tier)

  • Queries (after free tier): 100K queries (~100M tokens) × $0.12/1M = $12/month

  • Storage: 20 GB × $0.10/GB = $2/month

  • First year: $168

Google text-embedding-004 (768-dim)

  • Indexing: FREE

  • Queries: FREE

  • Storage: 15 GB × $0.10/GB = $1.50/month

  • First year: $18

BGE-M3 self-hosted (1024-dim)

  • Indexing: FREE (you run it)

  • Queries: FREE (you run it)

  • Storage: 20 GB × $0.10/GB = $2/month

  • GPU: ~$100/month (AWS g4dn.xlarge)

  • First year: $1,224

At this scale, Google is cheapest. OpenAI and Voyage are similar. Self-hosting is most expensive until you hit massive scale.

The break-even for self-hosting: around 100M documents or when compliance requirements justify the infrastructure cost.
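The scenario figures above can be reproduced with a small first-year-cost helper. This is a sketch under the same assumptions as the scenario (about 1,000 tokens per document and query, $0.10/GB/month storage, 4-byte floats); plug in your own volumes to compare providers:

```python
def first_year_cost(
    docs,
    queries_per_month,
    price_per_m_tokens,        # $ per 1M tokens (0 for free tiers)
    dims,
    tokens_per_item=1_000,     # assumed average tokens per doc and per query
    storage_per_gb_month=0.10, # typical vector DB storage pricing
    fixed_monthly=0.0,         # e.g. GPU rental for self-hosting
):
    indexing = docs * tokens_per_item / 1e6 * price_per_m_tokens
    queries = queries_per_month * tokens_per_item / 1e6 * price_per_m_tokens * 12
    storage = docs * dims * 4 / 1e9 * storage_per_gb_month * 12
    return indexing + queries + storage + fixed_monthly * 12

# OpenAI 3-small at 768 dims for the 5M-doc scenario -> ~$142
print(round(first_year_cost(5_000_000, 100_000, 0.02, 768)))  # prints 142
```

Setting `price_per_m_tokens=0` models the Google free tier (~$18/year, storage only); adding `fixed_monthly=100` with zero token price models self-hosted BGE.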

Quality vs Dimensions

More dimensions don't automatically mean better quality.

Voyage-4-large at 1024-dim scores 72.3% MTEB. OpenAI text-embedding-3-large at 3072-dim scores 64.6% MTEB. The 1024-dim model wins because it's trained specifically for retrieval.

Even within the same model family, dimension reduction works surprisingly well. OpenAI's data shows text-embedding-3-large at 256-dim beating ada-002 at 1536-dim.

Test dimension trade-offs on your data. You might find 512-dim performs identically to 1536-dim for your use case.

Practical Guidance

Start with 768-1024 dimensions. This range balances quality, cost, and speed for most production systems.

Use 256-512 dimensions when:

  • Optimizing for speed and storage

  • Domain is narrow (not general search)

  • You've tested and confirmed quality is acceptable

Use 1536+ dimensions when:

  • You're in specialized domains (legal, medical, research)

  • You've tested and measured quality improvement

  • Storage and compute aren't constraints

Test dimension reduction. If you're using OpenAI or another Matryoshka-enabled model, try reducing dimensions by 50% and measure quality impact. Often it's negligible.
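One cheap way to run that test, sketched below: treat the full-dimension top-k results as a proxy ground truth and measure how often a truncated (Matryoshka-style) index retrieves the same documents. The function name and toy data are illustrative; swap in your own documents and queries.

```python
import numpy as np

def _normalize(mat):
    return mat / np.linalg.norm(mat, axis=1, keepdims=True)

def overlap_at_k(index_full, queries_full, dims, k=10):
    # Compare full-dim retrieval against a truncated, re-normalized index.
    # Returns the average fraction of shared top-k documents (0.0 to 1.0).
    index_small = _normalize(index_full[:, :dims])
    queries_small = _normalize(queries_full[:, :dims])
    hits = 0.0
    for qf, qs in zip(queries_full, queries_small):
        top_full = set(np.argsort(-(index_full @ qf))[:k])
        top_small = set(np.argsort(-(index_small @ qs))[:k])
        hits += len(top_full & top_small) / k
    return hits / len(queries_full)

# Toy data: 500 docs and 20 queries at 512 dims, unit-normalized
rng = np.random.default_rng(0)
docs = _normalize(rng.normal(size=(500, 512)))
queries = _normalize(rng.normal(size=(20, 512)))
score = overlap_at_k(docs, queries, dims=256, k=10)
```

Overlap is a proxy, not a quality metric in itself; if you have labeled relevant documents, measure recall against those labels instead.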


Common Mistakes

Mistake 1: Not testing dimension reduction

If your model supports Matryoshka (OpenAI 3-series, Cohere v4, Google gemini-001), you can often cut dimensions in half with minimal quality loss.

Test this before committing to default dimensions. The storage and speed savings compound at scale.

Mistake 2: Mixing embedding models

❌ WRONG: Index with Model A → Query with Model B
Result: Garbage (different vector spaces)

✅ CORRECT: Index with Model A → Query with Model A
Result: Works

Different models create different vector spaces. Vectors from different models aren't comparable. If you switch models, you must re-index everything.

Exception: Models with shared embedding spaces (Voyage v4 series). You can index with voyage-4-large and query with voyage-4-lite.
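A cheap guard against this mistake is to store the model name alongside the index and validate it at write and query time. The class below is a hypothetical pattern sketch, not a real library API:

```python
class EmbeddingIndex:
    # Tiny in-memory index that records which embedding model produced its
    # vectors and rejects vectors or queries from an incompatible model.

    def __init__(self, model_name, compatible=None):
        self.model_name = model_name
        # Families with a shared embedding space (e.g. Voyage v4) can
        # list several compatible model names here.
        self.compatible = compatible or {model_name}
        self.vectors = {}

    def add(self, doc_id, vector, model_name):
        if model_name not in self.compatible:
            raise ValueError(f"index built with {self.model_name}; got {model_name}")
        self.vectors[doc_id] = vector

    def check_query_model(self, model_name):
        if model_name not in self.compatible:
            raise ValueError(f"query model {model_name} is incompatible")

# Voyage v4 family: index with -large, query with -lite is allowed
idx = EmbeddingIndex("voyage-4-large", {"voyage-4-large", "voyage-4-lite"})
idx.check_query_model("voyage-4-lite")  # passes
```

In production the same idea usually lives in vector DB metadata (a `model` field per collection) rather than application code.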

Mistake 3: Assuming embeddings and LLMs must match

They're independent pieces:

✅ OpenAI embeddings + Claude for generation
✅ Voyage embeddings + GPT-4 for generation
✅ BGE embeddings + Llama for generation
✅ Google embeddings + Any LLM

The embedding model finds documents. The LLM reads them and generates answers. They don't need to match providers.

Mistake 4: Ignoring total cost of ownership

People see "$0.02 per million tokens" and stop there.

Calculate over 12 months including:

  • Storage (documents × dimensions × $0.10/GB/month)

  • Re-indexing frequency (weekly updates = 52× indexing cost)

  • Query volume growth

  • Infrastructure costs for self-hosted

Do the full TCO, not just first month.

Mistake 5: Choosing based on benchmarks alone

MTEB scores are averaged across many tasks. Your specific domain might behave differently.

A model scoring 68% overall might score 73% on your data, while a 72% model might score 65%.

Benchmarks narrow your options. Testing on your data makes the final decision.

Mistake 6: Not evaluating free options

If you're budget-constrained or at high volume, test Google's free tier before assuming you need a paid option.

The quality might be sufficient. If not, you've lost a few hours. If it works, you've saved real money.


Things to Ponder

Take a moment to think through these. They're designed to check if the core ideas stuck, and you'll find the answers in what we covered above.

  1. You're using OpenAI text-embedding-3-small at its default 1536 dimensions. Your vector DB storage costs are $300/month. Could reducing dimensions help? What would you test first, and what's the potential savings?

  2. Your RAG system indexes documents with BGE-M3, but you decide to query using OpenAI text-embedding-3-small to "get better quality." What breaks, and why? How would you fix it?

  3. A 256-dimension version of text-embedding-3-large outperforms 1536-dimension ada-002 on benchmarks. Does this mean 256 dimensions is always better than 1536 dimensions? What's missing from that conclusion?

  4. You're embedding 10M documents. OpenAI costs $200/year total. Google free tier costs $30/year (storage only). What factors beyond cost would influence which one you choose for production?

  5. Voyage-4 offers a "shared embedding space" across v4 models. You index with voyage-4-large and query with voyage-4-lite. Why does this work? Why can't you do the same with OpenAI 3-small and 3-large?


Key Takeaways

Modern embedding models support dimension reduction via Matryoshka learning. You can often cut dimensions in half with minimal quality loss; test this before defaulting to maximum dimensions.

Cost per token varies by 6.5x ($0.02 to $0.13 for APIs), but free options exist (Google, Voyage free tier). Calculate total cost including storage, not just API calls.

For most production systems, 768-1024 dimensions balance quality, cost, and speed. Go higher only after testing confirms the improvement is worth it.

Embeddings and LLMs are independent—mix and match based on what works best for each piece. Use any embedding model with any LLM.

If you switch embedding models, you must re-index everything. Different models create different vector spaces (exception: shared spaces like Voyage v4).

Benchmark scores help narrow options, but your domain might perform differently. Test on your actual data before deciding.

Quality doesn't scale linearly with dimensions. A well-trained 1024-dim model can beat a poorly-trained 3072-dim model. Training matters more than size.

OpenAI is the safest choice for speed to production (ecosystem support, proven reliability). Google is best for budget-constrained projects. Voyage offers strong quality with generous free tier. Open-source makes sense for privacy requirements or massive scale.

There's no universal "best" embedding model. Choose based on your constraints: cost, quality, latency, privacy, operational complexity.


Want to discuss this further or have questions? Hit me up on LinkedIn.

LLM Foundations

Part 2 of 5

Breaking down how LLMs actually work — tokens, embeddings, context windows, and the fundamentals you need before building anything serious.
