A dense numerical vector that encodes the semantic meaning of text, enabling machines to compare and retrieve content by meaning rather than keywords.
In Plain English
Imagine you had to describe every book in a library using only a list of numbers. Not a summary, not keywords — just 768 numbers between roughly -1 and 1. Sounds absurd, yet that is exactly what a text embedding does. A neural network reads your text and outputs a vector of floating-point numbers that captures, in a surprisingly rich way, what the text means.
The magic is in the geometry. Two chunks of text that mean roughly the same thing will produce vectors that point in similar directions in that high-dimensional space. A paragraph from an earnings call where the CFO warns about rising input costs will sit geometrically close to another paragraph where a different CFO says margins are under pressure — even though the exact words differ.
This property transforms "find text similar to X" into a simple math problem: find the vectors nearest to the vector for X. You no longer need to match keywords or write elaborate rules. The model has already encoded the meaning into coordinates.
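To make the geometry concrete, here is a minimal sketch of "find the vectors nearest to the vector for X" using toy 4-dimensional vectors (real models use hundreds of dimensions, and the phrases and numbers below are invented for illustration):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional "embeddings" -- real models output hundreds of dimensions.
vectors = {
    "rising input costs": np.array([0.9, 0.1, 0.0, 0.2]),
    "margins under pressure": np.array([0.6, 0.4, 0.2, 0.1]),
    "new product launch": np.array([0.1, 0.9, 0.3, 0.0]),
}

# Pretend this is the embedding of the query "cost inflation".
query = np.array([0.85, 0.15, 0.05, 0.25])

# "Semantic search" reduces to picking the vector with the highest cosine similarity.
nearest = max(vectors, key=lambda k: cosine_similarity(query, vectors[k]))
print(nearest)  # -> rising input costs
```

The same nearest-vector loop, run over millions of stored embeddings (usually via an approximate-nearest-neighbor index rather than brute force), is all a semantic search engine fundamentally does.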
For financial data, this is transformative. Analysts have always known that qualitative language in earnings calls and filings carries signal — but extracting that signal at scale was laborious. Embeddings make it automatic. You can cluster every earnings call paragraph by topic, detect when a company's tone shifts quarter over quarter, or build a semantic search across thousands of 10-K filings without touching a keyword index.
Technical Definition
An embedding is a function f: T → ℝⁿ that maps a discrete input (token, sentence, document) to a continuous vector in n-dimensional Euclidean space. Modern transformer-based embedding models learn this function by training on large corpora with objectives such as contrastive learning (e.g. SimCSE) or masked language modeling followed by fine-tuning.
Formally, for a sentence s, the embedding e = f(s) ∈ ℝⁿ is typically the pooled hidden state from the final transformer layer (CLS token or mean-pooled). The model is trained so that semantically similar inputs produce vectors with high cosine similarity, while dissimilar inputs produce low or negative cosine similarity.
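The mean-pooling step mentioned above can be sketched in a few lines. The hidden states here are random stand-ins (5 tokens, 8 dimensions instead of a realistic 768), and the attention-mask handling follows the common convention of excluding padding tokens from the average:

```python
import numpy as np

# Hypothetical final-layer hidden states for a 5-token sentence, 8 dims each.
# (A real transformer emits one vector per token; BERT-base uses 768 dims.)
rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(5, 8))
attention_mask = np.array([1, 1, 1, 1, 0])  # last position is padding

# Mean pooling: average only the non-padding token vectors.
masked = hidden_states * attention_mask[:, None]
sentence_embedding = masked.sum(axis=0) / attention_mask.sum()

print(sentence_embedding.shape)  # -> (8,)
```

CLS pooling would instead take `hidden_states[0]` directly; which strategy a model expects depends on how it was trained.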
Embedding dimensionality n varies by model: 384 (MiniLM), 768 (BERT-base, Gemini embedding preview), 1536 (OpenAI text-embedding-3-small), 3072 (text-embedding-3-large). Higher dimensionality generally captures finer semantic distinctions at the cost of storage and compute.
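The storage cost scales linearly with dimensionality, which is easy to quantify. Assuming float32 vectors (4 bytes per component) and ignoring index overhead:

```python
def storage_bytes(n_vectors: int, dim: int, bytes_per_float: int = 4) -> int:
    """Raw storage for float32 embedding vectors, excluding index overhead."""
    return n_vectors * dim * bytes_per_float

# One million chunks at each common dimensionality, in gigabytes.
for dim in (384, 768, 1536, 3072):
    gb = storage_bytes(1_000_000, dim) / 1e9
    print(f"{dim:>5} dims: {gb:.2f} GB")
```

A million 768-dimensional vectors is about 3 GB raw, so doubling dimensionality doubles both storage and the per-query dot-product cost.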
How VectorFin Uses This
VectorFin computes 768-dimensional embeddings for every chunk of every earnings call transcript and SEC filing using the gemini-embedding-2-preview model. These are stored as Apache Iceberg tables at:
gs://vectorfinancials-data/warehouse/embeddings/transcripts/
gs://vectorfinancials-data/warehouse/embeddings/filings/

Each row stores the ticker, fiscal period, chunk index, and a 768-float array alongside effective_ts and knowledge_ts for bitemporal querying. You can retrieve embeddings directly via the API:
GET https://api.vectorfinancials.com/v1/embeddings/{ticker}?period=2024-Q4
X-API-Key: <your-key>

The returned vectors can be used for semantic search across filings, quarter-over-quarter drift analysis (see sentiment_drift signals), or feeding into your own downstream models.
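The bitemporal columns (effective_ts, knowledge_ts) let you reconstruct what was known at any point in time. A minimal sketch of an "as of" view, using a pandas DataFrame with invented rows in place of the real Iceberg table (the `as_of` helper and the timestamps are illustrative, not part of the VectorFin API):

```python
import pandas as pd

# Illustrative rows mimicking the embeddings table schema; the 768-float
# embedding array is elided for brevity.
rows = pd.DataFrame({
    "ticker": ["AAPL", "AAPL", "AAPL"],
    "fiscal_period": ["2024-Q3", "2024-Q3", "2024-Q4"],
    "chunk_idx": [0, 0, 0],
    "effective_ts": pd.to_datetime(["2024-08-01", "2024-08-01", "2024-11-01"]),
    "knowledge_ts": pd.to_datetime(["2024-08-02", "2024-09-15", "2024-11-02"]),
})

def as_of(df: pd.DataFrame, knowledge_time: pd.Timestamp) -> pd.DataFrame:
    """Bitemporal 'as of' view: latest version of each chunk known by then."""
    known = df[df["knowledge_ts"] <= knowledge_time]
    return (known.sort_values("knowledge_ts")
                 .groupby(["ticker", "fiscal_period", "chunk_idx"], as_index=False)
                 .last())

# As of 2024-09-01, only the original Q3 version had been published.
view = as_of(rows, pd.Timestamp("2024-09-01"))
print(len(view))  # -> 1
```

The same filter expressed against the Iceberg tables lets you backtest without look-ahead bias: a model evaluated "as of" a date only ever sees embeddings whose knowledge_ts precedes it.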
Code Example
import requests
import numpy as np

API_BASE = "https://api.vectorfinancials.com"
API_KEY = "vf_your_api_key_here"

def get_embedding(ticker: str, period: str) -> list[dict]:
    """Fetch all embedding chunks for a ticker/period."""
    resp = requests.get(
        f"{API_BASE}/v1/embeddings/{ticker}",
        params={"period": period},
        headers={"X-API-Key": API_KEY},
    )
    resp.raise_for_status()
    return resp.json()["chunks"]  # list of {chunk_idx, text, embedding: [768 floats]}

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Compare mean embeddings across two quarters
q3_chunks = get_embedding("AAPL", "2024-Q3")
q4_chunks = get_embedding("AAPL", "2024-Q4")

q3_mean = np.mean([c["embedding"] for c in q3_chunks], axis=0)
q4_mean = np.mean([c["embedding"] for c in q4_chunks], axis=0)

drift = 1.0 - cosine_similarity(q3_mean, q4_mean)
print(f"AAPL Q3→Q4 semantic drift: {drift:.4f}")
# Higher drift = language/tone shifted significantly quarter over quarter
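The chunks returned by get_embedding() can also power semantic search directly. A sketch of ranking chunks against a query embedding; the data here is synthetic (8-dimensional random vectors standing in for real 768-dimensional chunk embeddings), and `top_k_chunks` is a hypothetical helper, not part of the API:

```python
import numpy as np

def top_k_chunks(query_vec: np.ndarray, chunks: list[dict], k: int = 3) -> list[tuple]:
    """Rank chunk embeddings by cosine similarity to a query embedding."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    scored = [(cos(query_vec, np.array(c["embedding"])), c) for c in chunks]
    scored.sort(key=lambda t: t[0], reverse=True)
    return scored[:k]

# Synthetic stand-ins for chunks returned by get_embedding() (8 dims for brevity).
rng = np.random.default_rng(42)
chunks = [{"chunk_idx": i, "embedding": rng.normal(size=8).tolist()} for i in range(20)]
query = np.array(chunks[7]["embedding"])  # pretend this is an embedded query string

for score, chunk in top_k_chunks(query, chunks):
    print(f"chunk {chunk['chunk_idx']}: similarity {score:.3f}")
```

In production you would embed the query string with the same model that produced the chunk embeddings, since vectors from different models live in incompatible spaces.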