A dense numerical vector that encodes the semantic meaning of text, enabling machines to compare and retrieve content by meaning rather than keywords.
In Plain English
Imagine you had to describe every book in a library using only a list of numbers. Not a summary, not keywords — just 768 numbers between roughly -1 and 1. Sounds absurd, yet that is exactly what a text embedding does. A neural network reads your text and outputs a vector of floating-point numbers that captures, in a surprisingly rich way, what the text means.
The magic is in the geometry. Two chunks of text that mean roughly the same thing will produce vectors that point in similar directions in that high-dimensional space. A paragraph from an earnings call where the CFO warns about rising input costs will sit geometrically close to another paragraph where a different CFO says margins are under pressure — even though the exact words differ.
This property transforms "find text similar to X" into a simple math problem: find the vectors nearest to the vector for X. You no longer need to match keywords or write elaborate rules. The model has already encoded the meaning into coordinates.
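To make the geometry concrete, here is a minimal sketch of "find the vectors nearest to the vector for X" using toy 4-dimensional vectors (real models use hundreds of dimensions, and the phrases and numbers below are invented for illustration):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional "embeddings" -- real models output hundreds of dimensions.
vectors = {
    "rising input costs": np.array([0.9, 0.1, 0.0, 0.2]),
    "margins under pressure": np.array([0.6, 0.4, 0.2, 0.1]),
    "new product launch": np.array([0.1, 0.9, 0.3, 0.0]),
}

# Pretend this is the embedding of the query "cost inflation".
query = np.array([0.85, 0.15, 0.05, 0.25])

# "Semantic search" reduces to picking the vector with the highest cosine similarity.
nearest = max(vectors, key=lambda k: cosine_similarity(query, vectors[k]))
print(nearest)  # -> rising input costs
```

The same nearest-vector loop, run over millions of stored embeddings (usually via an approximate-nearest-neighbor index rather than brute force), is all a semantic search engine fundamentally does.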
For financial data, this is transformative. Analysts have always known that qualitative language in earnings calls and filings carries signal — but extracting that signal at scale was laborious. Embeddings make it automatic. You can cluster every earnings call paragraph by topic, detect when a company's tone shifts quarter over quarter, or build a semantic search across thousands of 10-K filings without touching a keyword index.
Technical Definition
An embedding is a function f: T → ℝⁿ that maps a discrete input (token, sentence, document) to a continuous vector in n-dimensional Euclidean space. Modern transformer-based embedding models learn this function by training on large corpora with objectives such as contrastive learning (e.g. SimCSE) or masked language modeling followed by fine-tuning.
Formally, for a sentence s, the embedding e = f(s) ∈ ℝⁿ is typically the pooled hidden state from the final transformer layer (CLS token or mean-pooled). The model is trained so that semantically similar inputs produce vectors with high cosine similarity, while dissimilar inputs produce low or negative cosine similarity.
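The mean-pooling step mentioned above can be sketched in a few lines. The hidden states here are random stand-ins (5 tokens, 8 dimensions instead of a realistic 768), and the attention-mask handling follows the common convention of excluding padding tokens from the average:

```python
import numpy as np

# Hypothetical final-layer hidden states for a 5-token sentence, 8 dims each.
# (A real transformer emits one vector per token; BERT-base uses 768 dims.)
rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(5, 8))
attention_mask = np.array([1, 1, 1, 1, 0])  # last position is padding

# Mean pooling: average only the non-padding token vectors.
masked = hidden_states * attention_mask[:, None]
sentence_embedding = masked.sum(axis=0) / attention_mask.sum()

print(sentence_embedding.shape)  # -> (8,)
```

CLS pooling would instead take `hidden_states[0]` directly; which strategy a model expects depends on how it was trained.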
Embedding dimensionality n varies by model: 384 (MiniLM), 768 (BERT-base, Gemini embedding preview), 1536 (OpenAI text-embedding-3-small), 3072 (text-embedding-3-large). Higher dimensionality generally captures finer semantic distinctions at the cost of storage and compute.
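The storage cost scales linearly with dimensionality, which is easy to quantify. Assuming float32 vectors (4 bytes per component) and ignoring index overhead:

```python
def storage_bytes(n_vectors: int, dim: int, bytes_per_float: int = 4) -> int:
    """Raw storage for float32 embedding vectors, excluding index overhead."""
    return n_vectors * dim * bytes_per_float

# One million chunks at each common dimensionality, in gigabytes.
for dim in (384, 768, 1536, 3072):
    gb = storage_bytes(1_000_000, dim) / 1e9
    print(f"{dim:>5} dims: {gb:.2f} GB")
```

A million 768-dimensional vectors is about 3 GB raw, so doubling dimensionality doubles both storage and the per-query dot-product cost.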
How VectorFin Uses This
VectorFin computes 768-dimensional embeddings for every chunk of every earnings call transcript and SEC filing using the gemini-embedding-2-preview model. These are stored as Apache Iceberg tables at:
gs://vectorfinancials-data/warehouse/embeddings/transcripts/
gs://vectorfinancials-data/warehouse/embeddings/filings/

Each row stores the ticker, fiscal period, chunk index, and a 768-float array alongside effective_ts and knowledge_ts for bitemporal querying. You can retrieve embeddings directly via the API:
GET https://api.vectorfinancials.com/v1/embeddings/{ticker}?period=2024-Q4
X-API-Key: <your-key>

The returned vectors can be used for semantic search across filings, quarter-over-quarter drift analysis (see sentiment_drift signals), or feeding into your own downstream models.
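The bitemporal columns (effective_ts, knowledge_ts) let you reconstruct what was known at any point in time. A minimal sketch of an "as of" view, using a pandas DataFrame with invented rows in place of the real Iceberg table (the `as_of` helper and the timestamps are illustrative, not part of the VectorFin API):

```python
import pandas as pd

# Illustrative rows mimicking the embeddings table schema; the 768-float
# embedding array is elided for brevity.
rows = pd.DataFrame({
    "ticker": ["AAPL", "AAPL", "AAPL"],
    "fiscal_period": ["2024-Q3", "2024-Q3", "2024-Q4"],
    "chunk_idx": [0, 0, 0],
    "effective_ts": pd.to_datetime(["2024-08-01", "2024-08-01", "2024-11-01"]),
    "knowledge_ts": pd.to_datetime(["2024-08-02", "2024-09-15", "2024-11-02"]),
})

def as_of(df: pd.DataFrame, knowledge_time: pd.Timestamp) -> pd.DataFrame:
    """Bitemporal 'as of' view: latest version of each chunk known by then."""
    known = df[df["knowledge_ts"] <= knowledge_time]
    return (known.sort_values("knowledge_ts")
                 .groupby(["ticker", "fiscal_period", "chunk_idx"], as_index=False)
                 .last())

# As of 2024-09-01, only the original Q3 version had been published.
view = as_of(rows, pd.Timestamp("2024-09-01"))
print(len(view))  # -> 1
```

The same filter expressed against the Iceberg tables lets you backtest without look-ahead bias: a model evaluated "as of" a date only ever sees embeddings whose knowledge_ts precedes it.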
Code Example
import requests
import numpy as np

API_BASE = "https://api.vectorfinancials.com"
API_KEY = "vf_your_api_key_here"

def get_embedding(ticker: str, period: str) -> list[dict]:
    """Fetch all embedding chunks for a ticker/period."""
    resp = requests.get(
        f"{API_BASE}/v1/embeddings/{ticker}",
        params={"period": period},
        headers={"X-API-Key": API_KEY},
    )
    resp.raise_for_status()
    return resp.json()["chunks"]  # list of {chunk_idx, text, embedding: [768 floats]}

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Compare mean embeddings across two quarters
q3_chunks = get_embedding("AAPL", "2024-Q3")
q4_chunks = get_embedding("AAPL", "2024-Q4")

q3_mean = np.mean([c["embedding"] for c in q3_chunks], axis=0)
q4_mean = np.mean([c["embedding"] for c in q4_chunks], axis=0)

drift = 1.0 - cosine_similarity(q3_mean, q4_mean)
print(f"AAPL Q3→Q4 semantic drift: {drift:.4f}")
# Higher drift = language/tone shifted significantly quarter over quarter
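The chunks returned by get_embedding() can also power semantic search directly. A sketch of ranking chunks against a query embedding; the data here is synthetic (8-dimensional random vectors standing in for real 768-dimensional chunk embeddings), and `top_k_chunks` is a hypothetical helper, not part of the API:

```python
import numpy as np

def top_k_chunks(query_vec: np.ndarray, chunks: list[dict], k: int = 3) -> list[tuple]:
    """Rank chunk embeddings by cosine similarity to a query embedding."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    scored = [(cos(query_vec, np.array(c["embedding"])), c) for c in chunks]
    scored.sort(key=lambda t: t[0], reverse=True)
    return scored[:k]

# Synthetic stand-ins for chunks returned by get_embedding() (8 dims for brevity).
rng = np.random.default_rng(42)
chunks = [{"chunk_idx": i, "embedding": rng.normal(size=8).tolist()} for i in range(20)]
query = np.array(chunks[7]["embedding"])  # pretend this is an embedded query string

for score, chunk in top_k_chunks(query, chunks):
    print(f"chunk {chunk['chunk_idx']}: similarity {score:.3f}")
```

In production you would embed the query string with the same model that produced the chunk embeddings, since vectors from different models live in incompatible spaces.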