
🔮 Vector DBs

Finding needles in a haystack by meaning

The Constellation Analogy

Traditional databases find exact matches:

"Show me books titled 'The Great Gatsby'"
→ Exact match on title field

Vector databases find similar things:

"Show me books like The Great Gatsby"
→ Jazz Age, American Dream, tragedy...

Like stars in the sky, similar items are positioned close together in a multi-dimensional space. Finding "similar" means finding the nearest neighbors.


Why Vector Databases?

User searches: "How do I fix my cracked laptop screen?"

Keyword search:
  Matches: "cracked", "laptop", "screen"
  Misses: "display repair", "monitor replacement"

User searches: "good restaurants for a date"

Keyword search:
  Matches documents with "date" (calendar date?)
  Misses: "romantic dining", "candlelit dinner"

Semantic Search with Vectors

User searches: "How do I fix my cracked laptop screen?"

Vector search:
  Understands MEANING
  Finds: "laptop display replacement guide"
         "screen repair tutorial"
         "monitor fix instructions"

Even without matching keywords!

How It Works

Step 1: Convert to Vectors (Embeddings)

"The cat sat on the mat"
        ↓ AI Model
[v1, v2, v3, v4, v5, ...]
  hundreds of dimensions

"A kitten rested on the rug"
        ↓ AI Model
[w1, w2, w3, w4, w5, ...]
        Similar vectors! (similar meaning)

"The stock market crashed today"
        ↓ AI Model
[z1, z2, z3, z4, z5, ...]
        Very different vector (different meaning)

Step 2: Store Vectors in Database

ID    | Text                     | Vector
------|--------------------------|--------------------
1     | "The cat sat on the mat" | [v1, v2, ...]
2     | "Dogs love to play"      | [u1, u2, ...]
3     | "The weather is nice"    | [t1, t2, ...]

Step 3: Search by Similarity

Query: "feline on floor covering"
        ↓ Same AI Model
Query vector: [q1, q2, q3, ...]

Find nearest neighbors:
  Vector 1: very close
  Vector 2: somewhat close
  Vector 3: far away

Result: "The cat sat on the mat"

Distance Metrics

How do we measure "closeness"?

Cosine Similarity

Measures angle between vectors:

    A →   θ   → B

cos(θ) = 1: Same direction (identical meaning)
cos(θ) = 0: Perpendicular (unrelated)
cos(θ) = -1: Opposite (opposite meaning)

Euclidean Distance

Straight-line distance between points:

    A ●───────────● B
         distance

Smaller = more similar

Which to Use?

Text/semantics: Cosine (direction matters more than magnitude)
Images/features: Euclidean often works well
Recommendations: Experiment with both!
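Both metrics are a few lines of standard-library Python. The example pair below shows the key difference: two vectors pointing the same way but with different lengths are "identical" to cosine similarity yet still some distance apart to Euclidean:

```python
import math

def cosine_similarity(a, b):
    # Angle-based: ignores vector length, compares direction only.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def euclidean_distance(a, b):
    # Straight-line distance: sensitive to magnitude as well as direction.
    return math.hypot(*(x - y for x, y in zip(a, b)))

a, b = [2.0, 0.0], [1.0, 0.0]    # same direction, different length
print(cosine_similarity(a, b))   # → 1.0 (identical "meaning")
print(euclidean_distance(a, b))  # → 1.0 (still a nonzero distance)
```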

The Indexing Challenge

Brute Force Doesn't Scale

1 million vectors, 512 dimensions
Query time: compare against ALL 1 million

That's 1 million × 512 = 512 million operations!
Too slow for real-time.

Approximate Nearest Neighbors (ANN)

Trade-off: 99% accuracy for 100x speed

Algorithms:
  HNSW: Hierarchical graph navigation
  IVF: Cluster-based partitioning
  LSH: Hash similar vectors together

Result: Millisecond queries over billions of vectors
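To make the "narrow the search" principle concrete, here is a toy sketch of one ANN idea, LSH with random hyperplanes: vectors whose dot products with a fixed set of random hyperplanes share the same sign pattern land in the same bucket, and a query is compared only against its own bucket. Real indexes (HNSW, IVF) are far more sophisticated; this is illustration only:

```python
import random

# Toy LSH sketch: one bit per random hyperplane, bits form a bucket key.
random.seed(0)
DIM, NUM_PLANES = 8, 4
planes = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_PLANES)]

def bucket_key(vector):
    # Which side of each hyperplane does the vector fall on?
    return tuple(sum(p * v for p, v in zip(plane, vector)) >= 0
                 for plane in planes)

buckets = {}

def add(vec_id, vector):
    buckets.setdefault(bucket_key(vector), []).append((vec_id, vector))

def candidates(query_vector):
    # Compare only against vectors in the query's bucket: approximate but fast.
    return buckets.get(bucket_key(query_vector), [])
```

With enough hyperplanes, each bucket holds a small fraction of the collection, so query cost stops scaling with total vector count — at the price of occasionally missing a true neighbor that hashed into an adjacent bucket.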

Common Use Cases

1. Semantic Search

User: "romantic comedy movies from the nineties"
Find: Movies SIMILAR in meaning, not just keywords
Results: "You've Got Mail", "Notting Hill", etc.

2. RAG (Retrieval-Augmented Generation)

Store documents as vectors
User asks question → Find relevant chunks
Feed to LLM → Generate answer with context

This powers ChatGPT-like apps with custom knowledge!
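The retrieval half of that pipeline can be sketched as follows. `embed` is a stand-in for a real embedding model (OpenAI, Sentence-BERT, etc.); here it is a crude bag-of-words count over a tiny made-up vocabulary so the example is self-contained:

```python
import re

# Toy stand-in for an embedding model: counts of a few vocabulary words.
VOCAB = ["screen", "laptop", "repair", "pasta", "boil", "minutes"]

def embed(text):
    words = re.findall(r"[a-z]+", text.lower())
    return [words.count(w) for w in VOCAB]

def top_chunks(question, chunks, k=2):
    # Rank stored chunks by similarity to the question vector.
    q = embed(question)
    overlap = lambda chunk: sum(a * b for a, b in zip(q, embed(chunk)))
    return sorted(chunks, key=overlap, reverse=True)[:k]

chunks = [
    "To repair a laptop screen, first remove the bezel.",
    "Boil the pasta for eight minutes.",
    "Order a replacement screen that matches your laptop model.",
]
question = "How do I repair my laptop screen?"
context = top_chunks(question, chunks)
prompt = ("Answer using only this context:\n" + "\n".join(context)
          + "\nQuestion: " + question)
# `prompt` would now go to an LLM, which answers grounded in the chunks.
print(context)
```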

3. Image Search

Upload image of a red dress
→ Convert to vector
→ Find similar product images
→ Show matching items for sale

4. Recommendation Systems

User likes: Movie A, Movie B, Movie C
→ Average their vectors
→ Find movies near that average
→ Recommend: Movie D, Movie E
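That averaging trick is a one-liner per dimension. The movie vectors below are made up for illustration (think of the axes as something like [romance, action]):

```python
# Sketch of "average the liked vectors, recommend the nearest unseen item".
def recommend(liked, catalog, n=1):
    dims = len(next(iter(liked.values())))
    # Centroid of everything the user liked.
    centroid = [sum(v[d] for v in liked.values()) / len(liked)
                for d in range(dims)]
    def dist(vec):
        return sum((a - b) ** 2 for a, b in zip(centroid, vec)) ** 0.5
    # Never recommend something the user already rated.
    unseen = {t: v for t, v in catalog.items() if t not in liked}
    return sorted(unseen, key=lambda t: dist(unseen[t]))[:n]

liked = {"Movie A": [0.9, 0.1], "Movie B": [0.8, 0.2], "Movie C": [0.7, 0.3]}
catalog = {**liked, "Movie D": [0.8, 0.2], "Movie E": [0.1, 0.9]}
print(recommend(liked, catalog))  # → ['Movie D']
```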

5. Anomaly Detection

Normal transactions cluster together
Outliers (far from cluster) = suspicious

Fraud detection, network security, quality control
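A minimal version of the clustering idea: flag any vector that sits far from the centroid of known-normal examples. The data and threshold here are made up for illustration; real systems use density estimates or learned models:

```python
# Toy anomaly check: distance from the centroid of "normal" vectors.
normal = [[1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [1.0, 0.9]]
centroid = [sum(v[d] for v in normal) / len(normal) for d in range(2)]

def is_anomalous(vector, threshold=1.0):
    dist = sum((a - b) ** 2 for a, b in zip(vector, centroid)) ** 0.5
    return dist > threshold

print(is_anomalous([1.0, 1.0]))  # → False (inside the cluster)
print(is_anomalous([9.0, 9.0]))  # → True  (far from the cluster)
```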

Popular Vector Databases

Database | Key Feature           | Often Used For
---------|-----------------------|------------------------
Pinecone | Fully managed         | Production apps
Weaviate | Built-in ML models    | Easy integration
Milvus   | High performance      | Large scale
Qdrant   | Rust-based, fast      | Performance-critical
Chroma   | Simple, Python-native | Prototyping
pgvector | PostgreSQL extension  | Existing Postgres users

Vector DB Architecture

┌─────────────────────────────────────────┐
│              Application                 │
└───────────────────┬─────────────────────┘
                    │ Query: "Find similar"
                    ▼
┌─────────────────────────────────────────┐
│          Embedding Model                 │
│      (OpenAI, Sentence-BERT, etc.)       │
└───────────────────┬─────────────────────┘
                    │ Vector: [v1, v2, ...]
                    ▼
┌─────────────────────────────────────────┐
│          Vector Database                 │
│    ┌─────────────────────────────┐      │
│    │  ANN Index (HNSW, IVF, etc) │      │
│    └─────────────────────────────┘      │
│    ┌─────────────────────────────┐      │
│    │  Vectors + Metadata Storage │      │
│    └─────────────────────────────┘      │
└───────────────────┬─────────────────────┘
                    │ Top K similar results
                    ▼
            Response to application

Metadata Filtering

Combine Vector Search with Filters

Find similar to "romantic comedy"
  WHERE year > YYYY
  AND language = "English"

Vector search narrows by meaning
Filters narrow by attributes

Both work together!
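One simple way to combine them (pre-filtering) is to apply the attribute filter first, then rank only the survivors by similarity. The records and vectors below are made up for illustration; production databases push the filter into the ANN index itself:

```python
import math

# Sketch of metadata filtering + vector ranking over toy movie records.
records = [
    {"title": "You've Got Mail", "year": 1998, "language": "English",
     "vector": [0.9, 0.1]},
    {"title": "Notting Hill", "year": 1999, "language": "English",
     "vector": [0.85, 0.15]},
    {"title": "Amélie", "year": 2001, "language": "French",
     "vector": [0.8, 0.2]},
]

def filtered_search(query_vector, min_year, language, k=2):
    # Step 1: filters narrow by attributes.
    survivors = [r for r in records
                 if r["year"] > min_year and r["language"] == language]
    # Step 2: vector similarity narrows by meaning.
    def similarity(r):
        dot = sum(a * b for a, b in zip(query_vector, r["vector"]))
        return dot / (math.hypot(*query_vector) * math.hypot(*r["vector"]))
    return [r["title"] for r in
            sorted(survivors, key=similarity, reverse=True)[:k]]

print(filtered_search([0.9, 0.1], min_year=1990, language="English"))
```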

Common Mistakes

1. Wrong Embedding Model

Different models for different use cases. Code embeddings ≠ text embeddings.

2. Not Chunking Long Documents

Bad: Embed entire 100-page document as one vector
Good: Chunk into paragraphs, embed each
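A minimal paragraph-level chunker looks like this; a real pipeline would typically also enforce a token limit and overlap adjacent chunks:

```python
# Split a document into paragraph chunks before embedding, capping each
# chunk's size so every piece stays embeddable.
def chunk_paragraphs(document, max_chars=500):
    chunks = []
    for paragraph in document.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        # Split oversized paragraphs into max_chars-sized pieces.
        for start in range(0, len(paragraph), max_chars):
            chunks.append(paragraph[start:start + max_chars])
    return chunks

doc = "First paragraph.\n\nSecond paragraph.\n\n\n\nThird."
print(chunk_paragraphs(doc))
# → ['First paragraph.', 'Second paragraph.', 'Third.']
```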

3. Ignoring Metadata

Combine vector similarity with traditional filters for better results.

4. Expecting Exact Matches

Vector search finds SIMILAR, not exact. Use traditional DB for exact lookups.


FAQ

Q: Vector database vs regular database?

Regular: exact matches, structured queries (SQL). Vector: similarity search, unstructured data (text, images).

Q: Do I need a separate vector database?

Not necessarily. PostgreSQL with pgvector, or Elasticsearch with dense vectors, can work well at smaller scale.

Q: How much does it cost?

Depends on vector count and dimensions. Cloud services charge per million vectors stored.

Q: Can I update vectors?

Yes, but re-embedding might be needed if source content changes.


Summary

Vector databases store data as high-dimensional vectors and find similar items through nearest neighbor search.

Key Takeaways:

  • Embeddings convert text/images to vectors
  • Similar meaning = similar vectors
  • Enables semantic search (meaning, not keywords)
  • Core technology for RAG and AI applications
  • ANN algorithms enable fast search at scale
  • Combine with metadata filtering for precise results

Vector databases power the AI revolution in search and recommendations!
