Eric TechBlog

Embeddings

A practical introduction to embeddings, cosine similarity, and how they compare with BM25

Embeddings are one of the core ideas behind semantic search and modern RAG. Instead of treating text as plain strings, embeddings convert text into vectors so systems can compare meaning mathematically.

In this series, embeddings come after BM25. BM25 explains lexical retrieval, while embeddings explain semantic retrieval. Together, they form the most common first-stage retrieval pair in modern RAG systems.

In short:

  • keyword search asks: do the same words appear?
  • embeddings ask: do the meanings match?

For example, these two sentences may be semantically close even if they do not share the same wording:

  • Cats like sleeping
  • Kittens often rest during the day

With embeddings, a model can place them near each other in vector space.

What Are Embeddings?

An embedding is a vector representation of data such as a word, sentence, paragraph, or document.

[0.12, -0.83, 0.44, 0.91, ...]

The individual dimensions are usually not interpretable by humans, but the overall position captures semantic information.

Common properties:

  • semantically similar text tends to have nearby vectors
  • semantically different text tends to have distant vectors
  • embeddings can be used for search, clustering, recommendation, and classification

Why Embeddings Matter

Traditional keyword search works well when the query and the document share the same terms. But real users often search with:

  • synonyms
  • different phrasing
  • natural language questions
  • vague descriptions

Example:

Query:

How can I improve React initial page load performance?

Document:

Next.js initial rendering optimization strategies

The wording is different, but the meaning is similar. This is where embeddings are useful.
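To make the gap concrete, here is a minimal sketch that measures plain word overlap between the query and the document above. The `tokenize` helper and the use of Jaccard overlap are illustrative assumptions, not part of any particular search engine.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


query = "How can I improve React initial page load performance?"
document = "Next.js initial rendering optimization strategies"

query_terms = tokenize(query)
doc_terms = tokenize(document)

# Terms that appear in both texts: the only signal a purely lexical match has.
shared = query_terms & doc_terms

# Jaccard overlap: shared terms divided by all distinct terms.
jaccard = len(shared) / len(query_terms | doc_terms)

print(sorted(shared))     # ['initial']
print(round(jaccard, 3))  # 0.071
```

Only one word is shared, so a purely lexical comparison sees these two texts as nearly unrelated even though a human reader would consider them close.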

Typical Embedding Search Flow

The basic flow looks like this:

  1. split documents into chunks
  2. convert each chunk into a vector with an embedding model
  3. store those vectors in a vector index
  4. embed the user query
  5. compare the query vector with stored vectors and rank the closest matches
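
The five steps above can be sketched in a few lines. This is a toy under stated assumptions: `toy_embed` is a hypothetical stand-in that hashes words into buckets, so it only captures word overlap, not meaning, and a plain Python list plays the role of the vector index. In a real system you would replace both with an embedding model and a vector store.

```python
import hashlib
import math
import re

DIM = 256  # hypothetical vector size chosen for this sketch


def chunk(text: str, max_words: int = 5) -> list[str]:
    """Step 1: split a document into fixed-size word windows."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def toy_embed(text: str) -> list[float]:
    """Steps 2 and 4: stand-in for an embedding model (hashed bag of words)."""
    vec = [0.0] * DIM
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        bucket = int(hashlib.md5(word.encode()).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    # Scale to unit length, so the dot product equals cosine similarity.
    return [x / norm for x in vec]


def search(query: str, index: list[tuple[str, list[float]]], top_k: int = 2) -> list[tuple[float, str]]:
    """Step 5: score every stored vector against the query and rank."""
    q = toy_embed(query)
    scored = [(sum(a * b for a, b in zip(q, vec)), text) for text, vec in index]
    return sorted(scored, reverse=True)[:top_k]


documents = [
    "Cats like sleeping in warm places during the day.",
    "BM25 ranks documents by matching query terms exactly.",
]

# Step 3: store (chunk text, vector) pairs; a list stands in for a vector index.
index = [(c, toy_embed(c)) for doc in documents for c in chunk(doc)]

results = search("cats sleeping", index)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

A brute-force scan like this is fine for a handful of chunks. Dedicated vector stores exist because larger collections need approximate nearest-neighbor indexes to keep step 5 fast.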

Common vector stores include:

  • pgvector
  • Pinecone
  • Qdrant
  • Weaviate
  • Milvus

A very common similarity metric here is cosine similarity.

What Is Cosine Similarity?

Cosine similarity measures how similar two vectors are by comparing their direction rather than their magnitude.

Intuition:

  • close to 1: very similar direction
  • close to 0: unrelated or orthogonal
  • close to -1: opposite direction

For text embeddings, similar direction often means similar meaning.

Cosine Similarity Formula

Given two vectors $A$ and $B$:

$$\text{cosineSimilarity}(A, B) = \frac{A \cdot B}{\|A\| \cdot \|B\|}$$

  • $A \cdot B$ is the dot product
  • $\|A\|$ is the magnitude of $A$
  • $\|B\|$ is the magnitude of $B$
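
As a minimal sketch, the formula translates almost directly into plain Python. The function name `cosine_similarity` and the choice to raise an error for zero vectors are assumptions made for this example; libraries differ in how they handle that edge case.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the dot product of a and b divided by the product of their magnitudes."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # The formula divides by the magnitudes, so a zero vector has no direction.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [3.0, 0.0]))   # 1.0  (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # 0.0  (orthogonal)
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # -1.0 (opposite direction)
```

The three calls correspond to the three cases in the intuition list: same direction, orthogonal, and opposite direction.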

Simple Example

Let:

$$A = [1, 2, 3], \quad B = [2, 4, 6]$$

First, compute the dot product:

$$A \cdot B = 1 \times 2 + 2 \times 4 + 3 \times 6 = 28$$

Then compute magnitudes:

$$\|A\| = \sqrt{1^2 + 2^2 + 3^2} = \sqrt{14}, \quad \|B\| = \sqrt{2^2 + 4^2 + 6^2} = \sqrt{56}$$

So:

$$\text{cosineSimilarity}(A, B) = \frac{28}{\sqrt{14} \cdot \sqrt{56}} = \frac{28}{\sqrt{784}} = \frac{28}{28} = 1$$

This means the two vectors point in exactly the same direction: B is simply A scaled by 2, so only the magnitude differs, and cosine similarity ignores magnitude. In practice, text vectors are higher-dimensional, but the intuition is the same: closer direction usually means closer meaning.
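
The same arithmetic can be checked step by step in Python. This is just the worked example above written out; the variable names are chosen for this sketch.

```python
import math

A = [1, 2, 3]
B = [2, 4, 6]

# Dot product: 1*2 + 2*4 + 3*6
dot = sum(a * b for a, b in zip(A, B))

# Magnitudes: sqrt(14) and sqrt(56)
norm_a = math.sqrt(sum(a * a for a in A))
norm_b = math.sqrt(sum(b * b for b in B))

similarity = dot / (norm_a * norm_b)

print(dot)                            # 28
print(math.isclose(similarity, 1.0))  # True
```

The comparison uses `math.isclose` because floating-point square roots can leave a tiny rounding error, so the computed value may not be exactly 1.0.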

Why Cosine Similarity Is Common

Cosine similarity is popular for embeddings because in many cases we care more about semantic direction than vector length.

Benefits:

  • good for semantic comparison
  • less sensitive to magnitude
  • widely supported in vector search systems

In practice, some systems also use:

  • dot product (also called inner product)
  • Euclidean distance

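The best choice depends on the model and whether vectors are normalized. When vectors are normalized to unit length, cosine similarity and dot product give identical scores, and Euclidean distance produces the same ranking.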
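
To illustrate why normalization matters for this choice, the sketch below ranks three made-up vectors against a query using each metric. The vectors and helper names are hypothetical and hand-picked for the example. For unit-length vectors the squared Euclidean distance equals 2 minus twice the cosine similarity, which is why all three orderings agree.

```python
import math


def normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine(a: list[float], b: list[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def euclidean(a: list[float], b: list[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hand-picked toy vectors, normalized to unit length.
query = normalize([0.2, 0.9, 0.1])
docs = {
    "a": normalize([0.1, 0.8, 0.3]),
    "b": normalize([0.9, 0.1, 0.2]),
    "c": normalize([0.3, 0.7, 0.0]),
}

# Higher is better for cosine and dot product; lower is better for distance.
by_cosine = sorted(docs, key=lambda k: cosine(query, docs[k]), reverse=True)
by_dot = sorted(docs, key=lambda k: dot(query, docs[k]), reverse=True)
by_euclidean = sorted(docs, key=lambda k: euclidean(query, docs[k]))

print(by_cosine, by_dot, by_euclidean)  # each is ['c', 'a', 'b']
```

Without the `normalize` step, dot product would also reward longer vectors, so the three rankings could diverge.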

Embeddings vs BM25

BM25 is a classic lexical search algorithm used in systems like Elasticsearch and OpenSearch. It ranks documents based on term matching, term frequency, inverse document frequency, and document length normalization.
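
For reference, here is a minimal sketch of BM25 scoring that shows how those four ingredients fit together. It assumes a simple regex tokenizer, the commonly used defaults `k1 = 1.2` and `b = 0.75`, and an IDF variant with `+ 1` inside the logarithm so scores stay non-negative. Production engines add analyzers, inverted indexes, and their own parameter choices, so treat this as an illustration rather than what any specific engine does.

```python
import math
import re
from collections import Counter

K1 = 1.2   # term-frequency saturation (assumed default)
B = 0.75   # strength of document length normalization (assumed default)


def tokenize(text: str) -> list[str]:
    """Lowercase and split on non-word characters; keeps identifiers like ERR_X whole."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def bm25_scores(query: str, docs: list[str]) -> list[float]:
    """Return one BM25 score per document for the given query."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        length_norm = 1.0 - B + B * len(tokens) / avg_len
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue  # term matching: absent terms contribute nothing
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            score += idf * tf[term] * (K1 + 1.0) / (tf[term] + K1 * length_norm)
        scores.append(score)
    return scores


docs = [
    "ERR_CONNECTION_RESET appears when the server closes the connection",
    "How to speed up page rendering",
    "Connection pooling tips for databases",
]

exact = bm25_scores("ERR_CONNECTION_RESET", docs)
paraphrase = bm25_scores("website loads slowly", docs)

print(exact)       # only the first document scores above zero
print(paraphrase)  # [0.0, 0.0, 0.0]
```

The exact identifier finds its document, while the paraphrased query shares no terms with any document and scores zero everywhere.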

The key difference is:

  • BM25 focuses on word overlap
  • embeddings focus on semantic similarity

Strengths of Embeddings

  • better for synonyms and paraphrases
  • better for natural language queries
  • useful for knowledge bases, FAQs, and RAG
  • stronger for semantic retrieval

Weaknesses of Embeddings

  • not always reliable for exact keyword matches
  • harder to explain ranking
  • higher infrastructure and model cost
  • requires tuning such as chunking, top-k, and reranking

Strengths of BM25

  • excellent for exact matches
  • easy to explain
  • mature and relatively cheap to deploy
  • strong for error codes, model numbers, API names, and identifiers

Weaknesses of BM25

  • poor at handling paraphrases
  • may miss relevant content with different wording
  • less effective for semantic search and question-style queries

Which One Should You Use?

Use BM25 first when queries are mostly precise keywords, such as:

  • ERR_CONNECTION_RESET
  • useEffect cleanup
  • SKU-12345

Use embeddings when users ask broader questions, such as:

  • How do I reduce Next.js initial load time?
  • How can I avoid unnecessary React rerenders?

In many real systems, the best answer is not to choose one over the other but to combine both with hybrid search, often using a fusion method such as Reciprocal Rank Fusion (RRF):

  • BM25 for exact keyword matching
  • embeddings for semantic matching

This usually gives more stable retrieval quality across both keyword-style and question-style queries.
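
As a preview, here is a minimal sketch of RRF. Each document receives `1 / (k + rank)` from every ranked list it appears in, and the contributions are summed. The document IDs and both input rankings are made up for illustration, and `k = 60` is a commonly cited default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs into one fused ranking."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Documents ranked higher (smaller rank) contribute more.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical first-stage results for the same query.
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
embedding_ranking = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([bm25_ranking, embedding_ranking])
print(fused)  # ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

Because RRF only looks at ranks, it avoids having to put BM25 scores and cosine similarities on a common scale.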

Conclusion

Embeddings turn text into vectors that capture meaning, making semantic search possible. Cosine similarity is one of the most common ways to compare these vectors by measuring how closely their directions align.

Compared with BM25:

  • BM25 is stronger for exact term matching
  • embeddings are stronger for semantic matching
  • hybrid search often works best in practice

If you are building search or RAG, the practical question is usually not "Embeddings or BM25?", but "How should I combine them for my data and query patterns?"

That is exactly why the next step in this series is RRF: once you understand lexical and semantic retrieval separately, the natural question becomes how to merge both ranked lists into one strong candidate set.
