Cosine Similarity - My notes
When working with LLM embeddings, being able to compare them effectively is crucial. If you've worked with LLMs and vector embeddings, you've probably come across cosine similarity as the standard way to compare two vectors. I had been using it on my own projects without fully grasping how it works internally, so I decided to put my learnings together and share them.
To deepen my understanding, I asked myself the following questions:
- What exactly do we mean by vectors in this context?
- How does the cosine similarity formula work?
- What do the different components of the formula represent?
- How can I effectively leverage cosine similarity to compare vectors?
Understanding Vectors & Embeddings
The concept of embeddings is a byproduct of LLM tech that powers tools like ChatGPT and many search engines. The idea is that you can transform a piece of text (or images, etc.) into a vector (array of numbers). This vector, called an embedding, represents the meaning of the text. So, in essence, "embeddings are vectors," and vectors are lists of numbers. If you're thinking in Python terms, we're talking about lists or NumPy arrays:
For example:
import numpy as np
vector = np.array([0.1, 0.2, 0.3, 0.4, 0.5])
However, LLM embeddings consist of very long arrays of numbers. For example, an embedding created by OpenAI's ada-002 model contains 1536 numbers. Mathematically, it can be described as a vector in 1536-dimensional space. While 2D vectors (arrays of 2 numbers) are less practical for embeddings, they serve well to understand how cosine similarity works.
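For reference, this is roughly how you would request such an embedding with the openai Python SDK (v1.x); it assumes you have an OPENAI_API_KEY set and access to the ada-002 model:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.embeddings.create(
    model="text-embedding-ada-002",
    input="Cosine similarity compares the direction of two vectors.",
)
embedding = response.data[0].embedding  # a Python list of 1536 floats
print(len(embedding))  # 1536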
To visualize this, imagine a 2D space where all our vectors have only two values. The angle between two vectors is the angle between the lines they represent. These lines are drawn from the origin (0,0) to the end of the vector, treating the two vector numbers as x/y coordinates.
Note: The principles of cosine similarity remain consistent regardless of the number of dimensions. Whether we're working with 2D, 3D, or 1,536D vectors, the concept applies uniformly.
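As a small illustration of that 2D picture, here is one way to compute the angle between two such vectors with NumPy, treating each pair of numbers as x/y coordinates measured from the origin:

import numpy as np

a = np.array([1.0, 0.0])  # points along the x-axis
b = np.array([1.0, 1.0])  # points diagonally up and to the right

# Angle of each vector relative to the x-axis, then the difference between them
angle_a = np.arctan2(a[1], a[0])
angle_b = np.arctan2(b[1], b[0])
theta = angle_b - angle_a

print(f"Angle between a and b: {np.degrees(theta):.1f} degrees")  # 45.0 degrees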
Being able to compare how similar two vectors are is a key part of working with embeddings. Cosine similarity is the recommended way to do this.
The Cosine Similarity Formula
Let's examine the mathematical formula for cosine similarity:
$\cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|}$
Where:
- A · B is the dot product of vectors A and B
- ||A|| is the magnitude of vector A
- ||B|| is the magnitude of vector B
- θ is the angle between the two vectors
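Written out element-wise for vectors with n components, the same formula becomes:

$\cos(\theta) = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}}$

The numerator is the dot product and the two square roots in the denominator are the magnitudes, which are exactly the pieces we'll implement below.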
The θ (theta) value represents the angle between two vectors: the angle you would need to rotate one vector through to align it with the other. The cosine of this angle, cos(θ), gives us the cosine similarity: a value between -1 and 1.
- If the vectors point in the same direction, the cosine similarity is 1.
- If they're perpendicular (at right angles), the cosine similarity is 0.
- If they point in opposite directions, the cosine similarity is -1.
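To make these three cases concrete, here is a quick check with simple 2D vectors, computed directly with NumPy (we'll build a reusable function step by step below):

import numpy as np

a = np.array([1.0, 0.0])

# Same direction, perpendicular, and opposite direction relative to a
for b in [np.array([3.0, 0.0]), np.array([0.0, 2.0]), np.array([-5.0, 0.0])]:
    sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    print(f"{b} -> cosine similarity {sim:+.1f}")
# Prints +1.0, +0.0 and -1.0 respectively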
To implement this in Python, we need to understand a few concepts:
What is the "dot product" of two vectors?
The dot product is the sum of the products of corresponding elements in two vectors.
For example:
import numpy as np

def dot_product(a, b):
    # Multiply corresponding elements and sum the results
    return np.dot(a, b)

vector1 = np.array([1, 2, 3])
vector2 = np.array([4, 5, 6])
print(f"Dot product: {dot_product(vector1, vector2)}")
# Output: 32  (1*4 + 2*5 + 3*6)
What is the "magnitude" of a vector?
The magnitude of a vector is its length, calculated as the square root of the sum of squared elements:
def magnitude(v):
    # Euclidean length: the square root of the sum of squared elements
    return np.linalg.norm(v)

print(f"Magnitude of vector1: {magnitude(vector1):.4f}")
# Output: 3.7417  (sqrt(1 + 4 + 9) = sqrt(14))
Implementing Cosine Similarity in Python
Now, let's put it all together:
def cosine_similarity(a, b):
    # Dot product of the two vectors
    dot_prod = np.dot(a, b)
    # Magnitude (length) of each vector
    mag_a = np.linalg.norm(a)
    mag_b = np.linalg.norm(b)
    return dot_prod / (mag_a * mag_b)

vector1 = np.array([1, 2, 3])
vector2 = np.array([4, 5, 6])
similarity = cosine_similarity(vector1, vector2)
print(f"Cosine similarity: {similarity:.4f}")  # Output: 0.9746
Why is Cosine Similarity Ideal for LLM Embeddings?
Cosine similarity is the preferred method for comparing LLM embeddings for several reasons:
- It focuses on the direction of the embedding, not its magnitude. In LLMs, the direction of an embedding vector represents its "meaning".
- It's computationally efficient, especially for high-dimensional vectors like those used in LLMs.
- Because it ignores magnitude, it effectively normalizes away scale differences (for example, those arising from texts of different lengths), making it suitable for comparing documents of different sizes.
The power of embeddings lies in their multidimensionality. While the vectors are long and the relationships between their numbers are complex, the principles of cosine similarity remain consistent across all dimensions.
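To illustrate the first point above, scaling a vector changes its magnitude but not its direction, so the cosine similarity is unchanged. Using the cosine_similarity function and vectors from the previous section:

scaled_vector1 = vector1 * 10  # same direction, ten times the magnitude
print(f"Original: {cosine_similarity(vector1, vector2):.4f}")  # 0.9746
print(f"Scaled:   {cosine_similarity(scaled_vector1, vector2):.4f}")  # 0.9746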
Real-World Applications
In my work, I've found cosine similarity particularly useful for:
- Semantic search: Ranking search results based on their similarity to a query embedding (see the sketch after this list).
- Document clustering: Grouping similar documents together based on their embedding similarities.
- Recommendation systems: Finding similar items or content based on their embedding representations.
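As a rough sketch of the semantic search case, ranking boils down to scoring every document embedding against the query embedding and sorting. The tiny 3-number embeddings below are stand-ins for illustration; real ones would come from an embedding model:

import numpy as np

query_embedding = np.array([0.9, 0.1, 0.3])
doc_embeddings = np.array([
    [0.8, 0.2, 0.4],  # doc 0
    [0.1, 0.9, 0.2],  # doc 1
    [0.7, 0.0, 0.5],  # doc 2
])

# Cosine similarity of the query against every document at once
sims = doc_embeddings @ query_embedding / (
    np.linalg.norm(doc_embeddings, axis=1) * np.linalg.norm(query_embedding)
)

ranking = np.argsort(sims)[::-1]  # document indices, best match first
print(f"Similarities: {np.round(sims, 4)}")
print(f"Ranking (best first): {ranking}")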
For example, in a recent project I used cosine similarity to build an intent routing system for AI agents. By comparing the embedding of a user's query with the embeddings of the available intents in our database, we could route the query to the action that was semantically closest to it. It's critical to establish thresholds for the cosine similarity scores to ensure accurate routing. In our case, we set a primary threshold of 0.85 for high-confidence matches and a secondary threshold of 0.80 for potential matches that required further verification. Here's how we implemented it:
- We created embeddings for all our predefined intents and stored them in a vector database.
- When a user query came in, we generated its embedding on the fly.
- We then calculated the cosine similarity between the query embedding and all intent embeddings.
- If the highest similarity score was above 0.85, we automatically routed to that intent's action.
- For scores between 0.80 and 0.85, we presented the user with a confirmation prompt before routing.
- Anything below 0.80 was treated as a low-confidence match, triggering a fallback mechanism or human intervention.
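In code, the routing logic can be sketched roughly like this. It's a simplified illustration rather than our production implementation: route_intent, the intent names, and the stand-in embeddings are hypothetical, and a real system would fetch the intent embeddings from a vector database rather than a dict:

import numpy as np

HIGH_CONFIDENCE = 0.85  # route automatically
LOW_CONFIDENCE = 0.80   # ask the user to confirm before routing

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def route_intent(query_embedding, intent_embeddings):
    # Score every intent against the query and keep the best one
    scores = {name: cosine_similarity(query_embedding, emb)
              for name, emb in intent_embeddings.items()}
    best_intent = max(scores, key=scores.get)
    best_score = scores[best_intent]

    if best_score >= HIGH_CONFIDENCE:
        return "route", best_intent, best_score    # act immediately
    elif best_score >= LOW_CONFIDENCE:
        return "confirm", best_intent, best_score  # confirmation prompt first
    return "fallback", None, best_score            # low confidence

# Stand-in embeddings for illustration only
intents = {
    "check_order_status": np.array([0.9, 0.1, 0.2]),
    "cancel_order": np.array([0.2, 0.9, 0.1]),
}
query = np.array([0.85, 0.15, 0.25])
print(route_intent(query, intents))  # ('route', 'check_order_status', ...)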
This approach allowed us to handle a wide range of user inputs effectively, even when they didn't exactly match our predefined intents. It was fascinating to see how the system could understand the semantic meaning behind queries and route them appropriately. However, fine-tuning these thresholds was an iterative process that required careful analysis of user interactions and feedback to strike the right balance between accuracy and user experience.
Understanding cosine similarity has been an interesting journey for me in working with LLM embeddings. It's a powerful tool that, once grasped, opens up a world of possibilities in natural language processing and machine learning applications.
Happy coding!
Posted on: Sun Oct 20 2024