A measure of angular similarity between two vectors, ranging from -1 to 1, used to compare embeddings regardless of their magnitude.
In Plain English
Picture two arrows drawn from the origin of a graph. The arrows might be long or short, pointing in various directions. Cosine similarity ignores how long the arrows are and only asks: how closely do they point in the same direction?
If both arrows point exactly the same way, the angle between them is 0°, and cosine(0°) = 1 — perfect similarity. If they point in opposite directions, that is 180°, cosine(180°) = -1 — maximum dissimilarity. Perpendicular arrows give cosine(90°) = 0 — no relationship.
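These three cases can be checked directly with NumPy (a minimal sketch using toy 2-D vectors, not real embeddings):

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

same = cosine(np.array([1.0, 0.0]), np.array([2.0, 0.0]))           # 0° apart
opposite = cosine(np.array([1.0, 0.0]), np.array([-1.0, 0.0]))      # 180° apart
perpendicular = cosine(np.array([1.0, 0.0]), np.array([0.0, 1.0]))  # 90° apart

print(same, opposite, perpendicular)  # 1.0 -1.0 0.0
```

Note that the first pair of arrows has different lengths (1 vs. 2) but still scores a perfect 1.0, since only direction matters.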
This directional focus is ideal for text embeddings. A short tweet and a long article about the same topic should be considered similar even though their raw vectors might differ greatly in magnitude. Cosine similarity strips out that length bias, so you compare meaning, not verbosity.
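The length-invariance is easy to demonstrate: scaling a vector by any positive constant leaves its cosine similarity unchanged. The 3-dim "tweet" and "article" vectors below are hypothetical stand-ins, not real model output:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings: the "article" vector is just a longer
# (scaled) copy of the "tweet" vector, pointing the same way.
tweet = np.array([0.2, 0.5, 0.1])
article = 25.0 * tweet

print(cosine(tweet, article))                            # ≈ 1.0: same direction
print(np.linalg.norm(article) / np.linalg.norm(tweet))   # ≈ 25: very different magnitudes
```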
In practice, almost all embedding-based search and retrieval systems use cosine similarity (or equivalently, dot product on unit-normalized vectors) as their core distance metric. When you ask "which past earnings call is most similar to this quarter's call?", you are computing cosine similarity between the query embedding and every stored embedding, then ranking by score.
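The query-against-store ranking can be sketched in a few lines; the store and query below are random unit vectors standing in for real call embeddings, and the dimension is shrunk to 8 for readability:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical store of 5 unit-normalized earnings-call embeddings (dim 8)
store = rng.normal(size=(5, 8))
store /= np.linalg.norm(store, axis=1, keepdims=True)

# Unit-normalized query embedding for "this quarter's call"
query = rng.normal(size=8)
query /= np.linalg.norm(query)

# On unit vectors, cosine similarity against every stored embedding
# is a single matrix-vector product
scores = store @ query
ranking = np.argsort(scores)[::-1]  # best match first

print(ranking, scores[ranking])
```

Because everything is pre-normalized, the whole search reduces to one matrix multiply, which is exactly what vector indexes accelerate at scale.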
Technical Definition
For two vectors a, b ∈ ℝⁿ, cosine similarity is defined as:
cos(θ) = (a · b) / (‖a‖ · ‖b‖)

where a · b = Σᵢ aᵢbᵢ is the dot product and ‖a‖ = √(Σᵢ aᵢ²) is the L2 norm.
The value is bounded to [−1, 1] for any real-valued vectors. For embeddings produced by most language models, observed similarities tend to fall in [0, 1] in practice, because learned embedding spaces are typically anisotropic — vectors cluster within a narrow cone rather than spreading over the full sphere. The exact distribution is model-dependent.
Cosine distance is defined as 1 − cos(θ) and maps the range to [0, 2], where 0 = identical direction and 2 = opposite. When embeddings are L2-normalized (‖e‖ = 1), cosine similarity equals the dot product, enabling fast approximate nearest-neighbor search via FAISS, ScaNN, or vector database indexes.
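The normalization equivalence is straightforward to verify with random vectors (a minimal sketch; the 768-dim size mirrors the embeddings discussed below, but the vectors here are synthetic):

```python
import numpy as np

rng = np.random.default_rng(42)
a = rng.normal(size=768)  # synthetic stand-ins for two 768-dim embeddings
b = rng.normal(size=768)

# Cosine similarity from the definition
cos_sim = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# After L2-normalization, the plain dot product gives the same value
a_hat = a / np.linalg.norm(a)
b_hat = b / np.linalg.norm(b)
print(np.isclose(np.dot(a_hat, b_hat), cos_sim))  # True

# Cosine distance maps similarity into [0, 2]
cos_dist = 1.0 - cos_sim
```

This is why normalizing once at indexing time pays off: every subsequent comparison becomes a cheap dot product.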
How VectorFin Uses This
Every VectorFin embedding endpoint returns 768-dim vectors normalized for direct dot-product comparison. The sentiment_drift signal is computed nightly by measuring cosine similarity between consecutive quarter mean-embeddings for each ticker:
gs://vectorfinancials-data/warehouse/signals/sentiment_drift/

A low cosine similarity (high drift) between Q3 and Q4 embeddings signals a meaningful shift in how management discusses the business — a leading indicator of guidance changes, analyst downgrades, or strategic pivots. You can retrieve the pre-computed drift vector via:
GET https://api.vectorfinancials.com/v1/signals/{ticker}/sentiment_drift
X-API-Key: <your-key>

Code Example
import numpy as np
import requests

API_KEY = "vf_your_api_key_here"
HEADERS = {"X-API-Key": API_KEY}

def get_mean_embedding(ticker: str, period: str) -> np.ndarray:
    r = requests.get(
        f"https://api.vectorfinancials.com/v1/embeddings/{ticker}",
        params={"period": period},
        headers=HEADERS,
    )
    r.raise_for_status()
    vectors = [chunk["embedding"] for chunk in r.json()["chunks"]]
    return np.mean(vectors, axis=0)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Compare two consecutive quarters for MSFT
q3 = get_mean_embedding("MSFT", "2024-Q3")
q4 = get_mean_embedding("MSFT", "2024-Q4")

sim = cosine_similarity(q3, q4)
print(f"Q3→Q4 cosine similarity: {sim:.4f}")
print(f"Semantic drift score: {1 - sim:.4f}")

Put Cosine Similarity to work in your pipeline
Access AI-ready financial data — embeddings, signals, Iceberg tables.