In Part 1, I mentioned that tokens become "embeddings" before the model processes them. In Part 3, we covered how output is generated.
Now we reach the missing bridge: how text becomes something a system can compare.
Embeddings are the answer. And understanding them is essential for building anything with semantic search, RAG, or vector databases.
What Is an Embedding?
An embedding is a list of numbers (a vector) that represents the meaning of a piece of text.
The idea: text that means similar things ends up as similar vectors.
"The cat sat on the mat" → [0.12, -0.34, 0.56, ..., 0.89] (hundreds to thousands of numbers)
"A feline rested on a rug" → [0.11, -0.33, 0.55, ..., 0.88] (similar numbers)
"Stock market performance" → [-0.45, 0.78, -0.12, ..., 0.23] (very different numbers)
The "distance" between vectors corresponds (roughly) to semantic similarity.
Important: For a fixed model snapshot and preprocessing pipeline, embeddings are typically deterministic: the same input yields the same vector. (Hosted APIs can change behavior across model updates.) This contrasts with LLM generation, where sampling introduces randomness.
Why This Matters
Embeddings unlock:
- Semantic search - Find relevant documents even without keyword matches
- RAG systems - Retrieve context for LLM prompts
- Clustering - Group similar content automatically
- Classification - Categorize text by meaning
- Deduplication - Find near-duplicates in large datasets
Without embeddings, you're stuck with keyword matching. With embeddings, you can search by meaning.
How Similarity Works: Cosine Similarity
The most common way to measure embedding similarity is cosine similarity.
The Intuition
Think of vectors as arrows pointing in a direction. Cosine similarity measures how similar the directions are:
- Same direction (cosine ≈ 1) → Very similar meaning
- Perpendicular (cosine ≈ 0) → Unrelated
- Opposite (cosine ≈ -1) → Opposite meaning (rarely cleanly represented in practice)
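The three cases above can be checked with a minimal cosine similarity in plain Python (standard library only), which makes the geometry concrete:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Same direction (one vector is a scaled copy of the other) -> ~1.0
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))

# Perpendicular -> ~0.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))

# Opposite direction -> ~-1.0
print(cosine_similarity([1.0, 2.0], [-1.0, -2.0]))
```

Note that magnitude is ignored: `[1, 2]` and `[2, 4]` score 1.0 because they point the same way.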
Why Cosine (and Sometimes Dot Product)
Cosine similarity compares direction, not magnitude.
In practice:
- Many systems use cosine similarity directly.
- Some use dot product, especially when embeddings are normalized to unit length (then dot product and cosine similarity are equivalent).
- Defaults vary by provider and database, so confirm what metric you're actually using.
Engineer takeaway: Treat similarity thresholds (like 0.85) as model- and domain-specific. Calibrate them on your own data.
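The cosine/dot-product equivalence is easy to verify: once vectors are scaled to unit length, their dot product is their cosine similarity (the denominator |a|·|b| becomes 1). A quick sketch with made-up vectors:

```python
import math

def normalize(v):
    """Scale a vector to unit length (L2 norm = 1)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

a = normalize([0.12, -0.34, 0.56])
b = normalize([0.11, -0.33, 0.55])

# For unit vectors: cos(a, b) = dot(a, b) / (|a| * |b|) = dot(a, b) / 1
print(dot(a, b))  # close to 1.0 -- the toy vectors point in nearly the same direction
```

This is why some vector databases default to dot product: if your embeddings are pre-normalized, it gives the same ranking as cosine at lower cost.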
Embedding Models: The Landscape
Embeddings come from specialized models trained specifically for this task.
Key Options (Examples)
Provider/model offerings change frequently, so treat this as a starting map (verify current docs before committing to a choice):
| Model / Family | Dimensions | Notes |
|---|---|---|
| OpenAI text-embedding-3-small | 1536 (default) | Supports a dimensions parameter to reduce vector size |
| OpenAI text-embedding-3-large | 3072 (default) | Supports a dimensions parameter to reduce vector size |
| Cohere embed-v4.0 | Configurable output dimension | Supports different input_type values (e.g., query vs document); supports multilingual inputs per provider docs |
| Sentence Transformers (local) | Varies by model | Open source; runs locally; no API cost |
| Other providers (Voyage, Jina, etc.) | Varies | Useful for specialized constraints (domain, latency, licensing, deployment) |
Dimensions Matter
Higher dimensions = richer representation = more storage/compute.
The tradeoff:
- More dimensions → potentially better retrieval accuracy → higher costs
- Fewer dimensions → faster, cheaper → acceptable accuracy for many tasks
Some providers let you choose the output dimension, which can be useful for balancing quality vs cost.
Engineer takeaway: Start with a strong general-purpose embedding model. Increase dimensions only if your evaluation shows retrieval quality is the bottleneck.
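The storage side of the tradeoff is easy to quantify. Assuming float32 vectors (4 bytes per dimension) and ignoring index-structure overhead, a back-of-the-envelope sketch:

```python
def index_size_bytes(num_vectors, dimensions, bytes_per_value=4):
    """Raw vector storage for an embedding index (float32 by default),
    ignoring metadata and index-structure overhead."""
    return num_vectors * dimensions * bytes_per_value

ONE_GB = 1024 ** 3

# One million chunks at two common dimension counts:
print(index_size_bytes(1_000_000, 1536) / ONE_GB)  # roughly 5.7 GB
print(index_size_bytes(1_000_000, 3072) / ONE_GB)  # double that
```

Doubling dimensions doubles storage (and roughly doubles similarity-computation cost), so the "richer representation" has a concrete price tag.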
Where Semantic Search Fails
Embeddings are powerful, but they have blind spots:
1. Negation
"I love this product" and "I don't love this product" can end up with surprisingly similar embeddings.
Why? The words are almost identical - only "don't" differs. The embedding model may not capture the semantic flip.
2. Rare Terms and Proper Nouns
If a term is rare in training data, its embedding may not be meaningful.
An internal product code, SKU, or rare proper noun can embed poorly if it rarely appeared in training.
3. Short Queries vs Long Documents
A 3-word query and a 500-word document live in the same vector space. But their embeddings are qualitatively different.
Some models handle this better than others. Query-document asymmetry is a real issue.
4. Conceptual Similarity ≠ Answer Similarity
"What is the capital of France?" is semantically similar to "What is the capital of Germany?"
But they have different answers. Semantic similarity isn't necessarily what you want.
Engineer takeaway: Test your specific failure modes. Don't assume "similar" means "useful."
Embedding for RAG: What Goes In
In a RAG system, you embed:
- Documents (chunked) → stored in vector database
- Queries → embedded at query time, compared to document embeddings
Critical decisions:
What to Embed
- Full document? Usually too long - embeddings have a max input length
- Chunks? Yes - but chunking strategy matters (covered in Post 06)
- Metadata? Some systems embed metadata alongside content
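Chunking gets its own post, but the basic mechanic behind "chunks, not full documents" is simple: split text into bounded, optionally overlapping pieces before embedding. A minimal character-based sketch (real systems often split on tokens or sentence boundaries instead):

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into fixed-size character chunks. Overlap means content
    near a boundary appears in two adjacent chunks, so it can't be lost
    to an unlucky split point."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

doc = "word " * 300  # a 1500-character stand-in for a long document
pieces = chunk_text(doc, chunk_size=500, overlap=50)
print(len(pieces), len(pieces[0]))
```

Each chunk then gets its own embedding and its own row in the vector database, usually alongside a pointer back to the source document.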
Consistency
Use the same embedding model for documents and queries. Different models produce incompatible vector spaces.
Refresh Strategy
If you update your embedding model, you need to re-embed your entire corpus. Plan for this.
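One way to make that plan enforceable is to store the embedding model's identifier next to every vector and refuse to compare across models. A sketch with a hypothetical record layout (the field names are illustrative, not any particular database's schema):

```python
# Keep the model identifier alongside each vector so a model upgrade is
# detected explicitly instead of silently producing garbage matches.
CURRENT_MODEL = "all-MiniLM-L6-v2"  # example identifier; any stable string works

index = [
    {"id": "doc-1", "model": "all-MiniLM-L6-v2", "vector": [0.1, 0.2]},
    {"id": "doc-2", "model": "text-embedding-3-small", "vector": [0.3, 0.4]},
]

def needs_reembedding(index, current_model):
    """Return the ids of records embedded with a different model."""
    return [rec["id"] for rec in index if rec["model"] != current_model]

stale = needs_reembedding(index, CURRENT_MODEL)
print(stale)  # doc-2 was embedded with a different model and must be redone
```

A re-embedding job can then iterate over the stale ids instead of guessing which vectors are compatible.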
Local vs API Embeddings
| Aspect | API (OpenAI, Cohere) | Local (Sentence Transformers) |
|---|---|---|
| Setup | API key, pay per token | Install library, run on CPU/GPU |
| Cost | Per-request pricing | Compute cost only |
| Latency | Network + inference | Inference only |
| Privacy | Data leaves your system | Stays local |
| Quality | Generally higher | Good, improving rapidly |
For production systems with sensitive data, local embeddings may be required.
For quick prototyping or when quality is paramount, API models are convenient.
Engineer takeaway: Local models (like all-MiniLM-L6-v2) are surprisingly good for many use cases. Don't assume you need paid APIs.
Debug Checklist: Embedding Issues
When semantic search isn't working:
- Are you using the same model for indexing and querying? (Mismatch = broken)
- Is your similarity threshold appropriate? (e.g., 0.7 might be too strict or too loose; thresholds are model-specific)
- Are queries too short? (Add context or use query expansion)
- Is the failure a negation or rare term issue? (Keyword hybrid search helps)
- Are your chunks too long or too short? (Chunking affects embedding quality)
- Did the embedding model see this domain? (Technical/niche content may embed poorly)
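The keyword-hybrid fix from the checklist can be sketched as a simple blend: combine a vector-similarity score with a keyword-overlap score so exact terms (product codes, negation words, rare names) still count. Everything below is toy data; the overlap function and the 0.7 weight are placeholders for a real BM25-plus-vector setup:

```python
def keyword_overlap(query, doc):
    """Fraction of query terms that appear verbatim in the document."""
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / len(q_terms) if q_terms else 0.0

def hybrid_score(vector_score, keyword_score, alpha=0.7):
    """Blend semantic and keyword scores; alpha weights the vector side."""
    return alpha * vector_score + (1 - alpha) * keyword_score

# Toy example: pretend vector scores slightly favor the wrong document,
# but exact keyword matches on the rare product code correct the ranking.
query = "FrogWidget3000 manual"
docs = {
    "FrogWidget3000 user manual and setup guide": 0.55,
    "General widget maintenance handbook": 0.60,
}
ranked = sorted(
    docs,
    key=lambda d: hybrid_score(docs[d], keyword_overlap(query, d)),
    reverse=True,
)
print(ranked[0])
```

Production systems typically run keyword and vector retrieval separately and merge the result lists (e.g., reciprocal rank fusion), but the principle is the same: don't rely on embeddings alone for exact-match signals.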
Try This Yourself
Experiment 1: Visualize Similarity
Using any embedding API or library:
- Embed these sentences:
  - "The quick brown fox jumps over the lazy dog"
  - "A fast auburn fox leaps above a sleepy canine"
  - "Stock prices rose sharply yesterday"
- Calculate pairwise cosine similarity
- Verify: the first two should be similar, the third should be different
Experiment 2: Test Failure Modes
- Embed: "I love this product" and "I hate this product"
- Calculate similarity - how close are they?
- Try: "The FrogWidget3000 is excellent" - does it cluster with positive sentiment?
Experiment 3: Build Mini Search
```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "Python is a programming language",
    "Machine learning uses algorithms to learn from data",
    "The weather is sunny and warm",
]

# Embed the corpus once; normalizing makes cosine similarity a plain dot product
doc_embeddings = model.encode(docs, convert_to_tensor=True, normalize_embeddings=True)

# Embed the query at search time with the SAME model
query = "What is ML?"
query_embedding = model.encode(query, convert_to_tensor=True, normalize_embeddings=True)

scores = util.cos_sim(query_embedding, doc_embeddings)[0]
best_idx = int(scores.argmax())
print(f"Best match: {docs[best_idx]}")
```
About a dozen lines of code. That's semantic search.
Key Takeaways
- Embeddings turn text into vectors that capture semantic meaning
- Cosine similarity measures how similar two embeddings are
- Embedding models vary in dimensions, quality, and cost - choose based on needs
- Semantic search has blind spots - negation, rare terms, query-document asymmetry
- Same model for indexing and querying - mismatched models = broken retrieval
- Local models work well for many use cases - don't assume you need APIs
Key Terms
| Term | Meaning |
|---|---|
| Embedding | A vector (list of numbers) representing the meaning of text |
| Vector | An ordered list of numbers, representing a point in high-dimensional space |
| Cosine Similarity | Measure of how similar two vectors are (based on angle, not distance) |
| Dimensions | The number of values in an embedding vector (e.g., 1024, 3072) |
| Semantic Search | Finding relevant items by meaning, not keyword matching |
| Vector Database | Database optimized for storing and searching embeddings |
What's Next
Now you understand how text becomes searchable geometry. But how do you actually use this in a full system?
In the next post, we'll cover RAG End-to-End - the complete pipeline from query to cited answer, and how all these pieces fit together.
Further Reading
- OpenAI embeddings guide (dimensions + normalization notes): https://platform.openai.com/docs/guides/embeddings
- Cohere embeddings docs (input_type, multilingual support, output dimensions): https://docs.cohere.com/docs/embeddings
- Sentence-Transformers similarity helpers (cos_sim, normalization): https://www.sbert.net/docs/package_reference/util.html
In This Series
- What is an LLM? - the fundamentals
- Tokenization - why wording matters
- Decoding & Sampling - temperature, top-p, determinism
- Embeddings (You are here) - text as searchable geometry
- RAG End-to-End - query to cited answer (coming soon)