A measure of angular similarity between two vectors, ranging from -1 to 1, used to compare embeddings regardless of their magnitude.
In Plain English
Picture two arrows drawn from the origin of a graph. The arrows might be long or short, pointing in various directions. Cosine similarity ignores how long the arrows are and only asks: how closely do they point in the same direction?
If both arrows point exactly the same way, the angle between them is 0°, and cosine(0°) = 1 — perfect similarity. If they point in opposite directions, that is 180°, cosine(180°) = -1 — maximum dissimilarity. Perpendicular arrows give cosine(90°) = 0 — no relationship.
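These three cases can be checked directly with NumPy (a minimal sketch using toy 2-D vectors, not real embeddings):

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

same = cosine(np.array([1.0, 0.0]), np.array([2.0, 0.0]))           # 0° apart
opposite = cosine(np.array([1.0, 0.0]), np.array([-1.0, 0.0]))      # 180° apart
perpendicular = cosine(np.array([1.0, 0.0]), np.array([0.0, 1.0]))  # 90° apart

print(same, opposite, perpendicular)  # 1.0 -1.0 0.0
```

Note that the first pair of arrows has different lengths (1 vs. 2) but still scores a perfect 1.0, since only direction matters.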
This directional focus is ideal for text embeddings. A short tweet and a long article about the same topic should be considered similar even though their raw vectors might differ greatly in magnitude. Cosine similarity strips out that length bias, so you compare meaning, not verbosity.
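The length-invariance is easy to demonstrate: scaling a vector by any positive constant leaves its cosine similarity unchanged. The 3-dim "tweet" and "article" vectors below are hypothetical stand-ins, not real model output:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings: the "article" vector is just a longer
# (scaled) copy of the "tweet" vector, pointing the same way.
tweet = np.array([0.2, 0.5, 0.1])
article = 25.0 * tweet

print(cosine(tweet, article))                            # ≈ 1.0: same direction
print(np.linalg.norm(article) / np.linalg.norm(tweet))   # ≈ 25: very different magnitudes
```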
In practice, almost all embedding-based search and retrieval systems use cosine similarity (or equivalently, dot product on unit-normalized vectors) as their core distance metric. When you ask "which past earnings call is most similar to this quarter's call?", you are computing cosine similarity between the query embedding and every stored embedding, then ranking by score.
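The query-against-store ranking can be sketched in a few lines; the store and query below are random unit vectors standing in for real call embeddings, and the dimension is shrunk to 8 for readability:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical store of 5 unit-normalized earnings-call embeddings (dim 8)
store = rng.normal(size=(5, 8))
store /= np.linalg.norm(store, axis=1, keepdims=True)

# Unit-normalized query embedding for "this quarter's call"
query = rng.normal(size=8)
query /= np.linalg.norm(query)

# On unit vectors, cosine similarity against every stored embedding
# is a single matrix-vector product
scores = store @ query
ranking = np.argsort(scores)[::-1]  # best match first

print(ranking, scores[ranking])
```

Because everything is pre-normalized, the whole search reduces to one matrix multiply, which is exactly what vector indexes accelerate at scale.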
Technical Definition
For two vectors a, b ∈ ℝⁿ, cosine similarity is defined as:
cos(θ) = (a · b) / (‖a‖ · ‖b‖)

where a · b = Σᵢ aᵢbᵢ is the dot product and ‖a‖ = √(Σᵢ aᵢ²) is the L2 norm.
The value is bounded to [−1, 1] for any real-valued vectors. For embeddings produced by most language models, observed similarities tend to fall in [0, 1] in practice, because learned embedding spaces are typically anisotropic — vectors cluster within a narrow cone rather than spreading over the full sphere. The exact distribution is model-dependent.
Cosine distance is defined as 1 − cos(θ) and maps the range to [0, 2], where 0 = identical direction and 2 = opposite. When embeddings are L2-normalized (‖e‖ = 1), cosine similarity equals the dot product, enabling fast approximate nearest-neighbor search via FAISS, ScaNN, or vector database indexes.
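The normalization equivalence is straightforward to verify with random vectors (a minimal sketch; the 768-dim size mirrors the embeddings discussed below, but the vectors here are synthetic):

```python
import numpy as np

rng = np.random.default_rng(42)
a = rng.normal(size=768)  # synthetic stand-ins for two 768-dim embeddings
b = rng.normal(size=768)

# Cosine similarity from the definition
cos_sim = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# After L2-normalization, the plain dot product gives the same value
a_hat = a / np.linalg.norm(a)
b_hat = b / np.linalg.norm(b)
print(np.isclose(np.dot(a_hat, b_hat), cos_sim))  # True

# Cosine distance maps similarity into [0, 2]
cos_dist = 1.0 - cos_sim
```

This is why normalizing once at indexing time pays off: every subsequent comparison becomes a cheap dot product.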
How VectorFin Uses This
Every VectorFin embedding endpoint returns 768-dim vectors normalized for direct dot-product comparison. The sentiment_drift signal is computed nightly by measuring cosine similarity between consecutive quarter mean-embeddings for each ticker:
gs://vectorfinancials-data/warehouse/signals/sentiment_drift/

A low cosine similarity (high drift) between Q3 and Q4 embeddings signals a meaningful shift in how management discusses the business — a leading indicator of guidance changes, analyst downgrades, or strategic pivots. You can retrieve the pre-computed drift vector via:
GET https://api.vectorfinancials.com/v1/signals/{ticker}/sentiment_drift
X-API-Key: <your-key>

Code Example
import numpy as np
import requests

API_KEY = "vf_your_api_key_here"
HEADERS = {"X-API-Key": API_KEY}

def get_mean_embedding(ticker: str, period: str) -> np.ndarray:
    r = requests.get(
        f"https://api.vectorfinancials.com/v1/embeddings/{ticker}",
        params={"period": period},
        headers=HEADERS,
    )
    r.raise_for_status()
    vectors = [chunk["embedding"] for chunk in r.json()["chunks"]]
    return np.mean(vectors, axis=0)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Compare two consecutive quarters for MSFT
q3 = get_mean_embedding("MSFT", "2024-Q3")
q4 = get_mean_embedding("MSFT", "2024-Q4")

sim = cosine_similarity(q3, q4)
print(f"Q3→Q4 cosine similarity: {sim:.4f}")
print(f"Semantic drift score: {1 - sim:.4f}")

Put Cosine Similarity to work in your pipeline
Access AI-ready financial data — embeddings, signals, Iceberg tables.