The GPS Coordinates Analogy
How do you describe where a city is? You could list landmarks, but that's hard to compare. GPS coordinates are better — every location becomes a small set of numbers that you can measure and compare.
Embeddings do the same thing for words, sentences, images, or any data.
An embedding converts something human-understandable (like the word "dog") into a list of numbers (like [x1, x2, x3, ...]) that computers can compare and calculate with.
Things that are similar end up close together in this number space. "Dog" and "puppy" are close. "Dog" and "refrigerator" are far apart.
How Embeddings Actually Work
An embedding model takes an input (text, image, or audio) and outputs a fixed-size vector: a list of numbers called the embedding or representation.
What an Embedding Looks Like
```python
from sentence_transformers import SentenceTransformer  # one concrete model choice

embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
embedding = embedding_model.encode("The quick brown fox")
print(len(embedding))  # number of dimensions (384 for this model; varies by model)
print(embedding[:5])   # first five values of the vector
```
Each dimension captures some aspect of meaning. The model learned these dimensions from millions of examples during training.
Measuring Similarity
To find how similar two texts are, compare their embeddings using cosine similarity:
```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# get_embedding stands in for any embedding model's encode call
embed1 = get_embedding("dog")
embed2 = get_embedding("puppy")
embed3 = get_embedding("car")

print(cosine_similarity(embed1, embed2))  # higher = more similar
print(cosine_similarity(embed1, embed3))  # lower = less similar
```
Types of Embeddings
| Type | Input | Output Size | Use Case |
|---|---|---|---|
| Word embeddings | Single word | Smaller | Word similarity, analogies |
| Sentence embeddings | Sentence/paragraph | Medium | Semantic search, RAG |
| Document embeddings | Full document | Medium | Document clustering |
| Image embeddings | Image | Medium to large | Image search, classification |
| Multimodal | Text + Image | Medium to large | Image-text matching |
Evolution of Text Embeddings
Word2Vec → One vector per word, context-free
GloVe → Global word statistics
BERT → Context-aware embeddings
Sentence-BERT → Efficient sentence embeddings
Modern embeddings → High-quality embedding APIs and open-source models
Real-World Examples
1. Semantic Search
```python
# Index documents
documents = [
    "How to train a neural network",
    "Great pizza recipes in Italy",
    "Machine learning fundamentals",
]
doc_embeddings = [get_embedding(doc) for doc in documents]

# Search by meaning, not keywords
query = "deep learning basics"
query_embedding = get_embedding(query)

# Find the most similar document
similarities = [cosine_similarity(query_embedding, de) for de in doc_embeddings]
best_match = documents[np.argmax(similarities)]
# Returns: "Machine learning fundamentals" (semantically similar)
```
2. Recommendation System
```python
# Find similar products (vector_db stands in for any vector database client)
product_embedding = get_embedding("wireless bluetooth headphones")
all_product_embeddings = load_product_embeddings()

# Find the 5 most similar products
similar_products = vector_db.search(product_embedding, top_k=5)
```
3. Duplicate Detection
```python
# Check whether two support tickets are duplicates
ticket1 = "My order hasn't arrived"
ticket2 = "I haven't received my package"

similarity = cosine_similarity(get_embedding(ticket1), get_embedding(ticket2))
if similarity > 0.8:  # threshold is illustrative; tune for your model and data
    print("Likely duplicate")
```
4. Clustering Documents
```python
from sklearn.cluster import KMeans

# Group news articles by topic
embeddings = [get_embedding(article) for article in articles]

kmeans = KMeans(n_clusters=5)
clusters = kmeans.fit_predict(embeddings)
# Each article is assigned to a topic cluster
```
Common Mistakes and Gotchas
Using the Wrong Embedding Model
Different models excel at different tasks:
- For semantic search: sentence-style embedding models
- For clustering: general-purpose embedding models
- For multilingual text: multilingual embedding models
- For code: code-focused embedding models
Ignoring Dimension Costs
More dimensions means:
- More storage required
- Slower similarity search
- Higher memory usage
A smaller embedding can be enough for many use cases. Start small and scale up if needed.
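To get a sense of scale, storage grows linearly with dimension count. A quick back-of-the-envelope calculation (float32 vectors, illustrative dimension sizes):

```python
def storage_mb(num_vectors: int, dims: int, bytes_per_float: int = 4) -> float:
    """Raw storage for float32 embeddings, in megabytes."""
    return num_vectors * dims * bytes_per_float / (1024 * 1024)

# One million embeddings at two illustrative dimension sizes:
print(round(storage_mb(1_000_000, 384), 1))   # 1464.8 MB (~1.4 GB)
print(round(storage_mb(1_000_000, 1536), 1))  # 5859.4 MB (~5.7 GB)
```

Index structures and metadata add overhead on top of this, so the raw numbers are a lower bound.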
Not Normalizing Embeddings
Some models return unnormalized vectors. Normalize before using cosine similarity:
```python
embedding = embedding / np.linalg.norm(embedding)
```
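Once vectors are unit-normalized, cosine similarity reduces to a plain dot product, which is what many vector databases actually compute. A small check with toy vectors:

```python
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])

# Cosine similarity on the raw vectors...
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# ...equals a plain dot product once both vectors are unit-normalized
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

print(np.isclose(cosine, np.dot(a_unit, b_unit)))  # True
```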
Truncating Long Text
Embedding models have token limits (which vary by model). Long documents can get truncated. Handle this by:
- Chunking documents
- Using models with longer context
- Summarizing before embedding
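Chunking can be as simple as a sliding window over words, with some overlap so sentences that straddle a boundary appear in at least one chunk. A minimal sketch (word-based for clarity; production systems usually chunk by tokens):

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping word-based chunks."""
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

chunks = chunk_text(" ".join(f"word{i}" for i in range(500)))
print(len(chunks))  # 3 chunks of up to 200 words, overlapping by 50
```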
Word Embedding Arithmetic
One fascinating property of embeddings: you can do math with them.
king - man + woman ≈ queen
Paris - France + Italy ≈ Rome
walking - walk + swim ≈ swimming
This works because embeddings capture relationships, not just definitions.
```python
# embeddings is a word -> vector lookup; find_nearest stands in for a
# nearest-neighbor search over the vocabulary
result = embeddings["king"] - embeddings["man"] + embeddings["woman"]
most_similar = find_nearest(result)  # Returns "queen"
```
FAQ
Q: What is the difference between embeddings and tokens?
Tokens are how text is split up (words or subwords). Embeddings are the numerical representations. An embedding model takes tokens as input and produces a single embedding vector as output.
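To make the distinction concrete, here is a toy illustration (not a real model): a whitespace "tokenizer" plus a fixed pseudo-random vector per token, mean-pooled into one embedding. Inputs with different token counts still come out as vectors of the same length:

```python
import zlib
import numpy as np

DIM = 8  # toy embedding size

def toy_token_vector(token: str) -> np.ndarray:
    # A fixed pseudo-random vector per token (a stand-in for learned weights)
    rng = np.random.default_rng(zlib.crc32(token.encode()))
    return rng.standard_normal(DIM)

def toy_embed(text: str) -> np.ndarray:
    # "Tokenize" by whitespace, then mean-pool token vectors into one vector
    return np.mean([toy_token_vector(t) for t in text.split()], axis=0)

short = toy_embed("dog")
longer = toy_embed("the quick brown fox jumps over the lazy dog")
print(short.shape, longer.shape)  # both (8,): fixed size regardless of token count
```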
Q: How do I choose embedding dimensions?
Higher dimensions can capture more nuance but cost more. A smaller embedding size is often enough, and you can increase it if you need extra quality and can afford the cost.
Q: Can I train my own embedding model?
Yes, but it requires significant data and compute. For most use cases, fine-tuning an existing model or using a commercial API is more practical.
Q: Why do embeddings work for so many things?
Because training pushes the model to learn representations that help with its task. For many tasks (like predicting text or matching text to images), learning semantics tends to help, not just surface-level patterns.
Q: What is the difference between dense and sparse embeddings?
Dense embeddings are fixed-size vectors (like what we've discussed). Sparse embeddings (like TF-IDF or BM25) have mostly zeros and match exact words. Many systems combine both.
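A quick sketch of the sparse side: a bag-of-words vector over a fixed vocabulary is mostly zeros and only matches exact words.

```python
import numpy as np

vocab = ["dog", "puppy", "car", "pizza", "train"]

def sparse_vector(text: str) -> np.ndarray:
    # Bag-of-words counts over a fixed vocabulary: mostly zeros
    counts = np.zeros(len(vocab))
    for word in text.lower().split():
        if word in vocab:
            counts[vocab.index(word)] += 1
    return counts

print(sparse_vector("the dog chased the car"))  # [1. 0. 1. 0. 0.]
# "dog" and "puppy" share no dimensions, so their sparse similarity is zero;
# that is exactly the gap dense embeddings fill, and hybrid systems use both
print(np.dot(sparse_vector("dog"), sparse_vector("puppy")))  # 0.0
```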
Q: How do I store embeddings efficiently?
Use a vector database (Pinecone, Weaviate, Qdrant) for search, or simple formats (numpy, parquet) for batch processing. Vector databases provide fast approximate nearest neighbor search.
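For batch workloads, plain numpy files are often enough. A minimal save/load round trip, assuming an array of shape `(num_docs, dims)`:

```python
import numpy as np

# 100 documents, 384 dimensions each, stored as float32 to halve the size
embeddings = np.random.rand(100, 384).astype(np.float32)
np.save("embeddings.npy", embeddings)

loaded = np.load("embeddings.npy")
print(loaded.shape)                        # (100, 384)
print(np.array_equal(embeddings, loaded))  # True
```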
Summary
Embeddings are the bridge between human language and mathematical computation. They convert words, sentences, and images into numbers that preserve meaning and allow similarity comparisons.
Key Points:
- Embeddings are fixed-size vectors representing meaning
- Similar things have similar embeddings (close in vector space)
- Essential for semantic search, RAG, recommendations, clustering
- Choose embedding model based on your use case
- Cosine similarity measures how similar two embeddings are
- Store and search embeddings with vector databases
Understanding embeddings unlocks the ability to build AI applications that truly understand meaning, not just keywords.