A training technique that produces embeddings whose leading prefixes are themselves valid, usable vectors — letting you trade dimensions for cost at query time without retraining.
In Plain English
A Matryoshka doll is a nesting doll — the big doll opens to reveal a smaller one inside, which opens to reveal a smaller one, and so on. Matryoshka Representation Learning applies the same idea to embeddings: a single 3,072-dimensional vector contains, inside its leading numbers, a perfectly usable 1,536-dim vector, a 768-dim vector, a 256-dim vector, and even a 128-dim vector.
With ordinary embedding models, if you want a smaller vector you have to train a whole separate model. With an MRL-trained model, you store one large vector and simply slice off the prefix you want at query time. The shorter prefixes are not an arbitrary truncation that happens to work — the model was explicitly trained so they remain meaningful on their own.
Why care? Cost and latency. A 128-dim vector is 24× smaller to store than a 3,072-dim one, and cosine similarity over 128 floats is much faster than over 3,072. For many retrieval tasks the recall difference is tiny — often a fraction of a percent. So you can run cheap 128-dim search for the first pass, then re-rank the top candidates with the full 3,072-dim vector for precision. One model, many operating points.
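The two-stage pattern can be sketched with a toy numpy corpus — random stand-in vectors rather than real MRL embeddings, but the slicing mechanics are identical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in corpus: 10,000 unit-norm vectors at the full 3,072 dims.
docs = rng.standard_normal((10_000, 3072)).astype(np.float32)
docs /= np.linalg.norm(docs, axis=1, keepdims=True)

# A query that is a lightly perturbed copy of document 42.
noise = rng.standard_normal(3072).astype(np.float32)
query = docs[42] + 0.05 * noise / np.linalg.norm(noise)
query /= np.linalg.norm(query)

def prefix(v, k):
    """Take the leading k dims and renormalize (the Matryoshka slice)."""
    p = v[..., :k]
    return p / np.linalg.norm(p, axis=-1, keepdims=True)

# Pass 1: cheap 128-dim scan over the whole corpus.
coarse = prefix(docs, 128) @ prefix(query, 128)
candidates = np.argsort(-coarse)[:50]

# Pass 2: exact full-width rerank over just the 50 survivors.
fine = docs[candidates] @ query
top5 = candidates[np.argsort(-fine)[:5]]
```

Only the 50 reranked candidates ever touch the full 3,072-dim vectors; the bulk scan runs at 1/24 of the width.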
For RAG and semantic search over financial text, MRL makes it practical to run interactive, sub-millisecond retrieval on embedded earnings calls and SEC filings without paying the full memory cost of high-dimensional vectors everywhere.
Technical Definition
Standard contrastive embedding training optimises a loss L over the full d-dim vector v. MRL instead optimises a weighted sum of losses over nested prefixes:
L_MRL = Σᵢ wᵢ · L(vᵢ) where vᵢ is the first dᵢ dimensions of v, for some set {d₁ < d₂ < ... < d_k} of nested dimensions (e.g. {128, 256, 512, 768, 1536, 3072}).
The consequence is that each prefix vᵢ must independently solve the contrastive retrieval task. The model has no incentive to pack critical information past position d₁; it has a strong incentive to pack the most important information into the lowest indices. The original MRL paper (Kusupati et al. 2022) showed this could be done with essentially no loss in downstream performance at the full dimension.
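As a minimal sketch of the objective — plain numpy with no encoder or autograd, and an illustrative InfoNCE stand-in for L — the nesting and weighting look like this:

```python
import numpy as np

def info_nce(q, d, temperature=0.05):
    """Toy InfoNCE loss: query i's positive is document i, all others negatives."""
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    logits = (q @ d.T) / temperature
    logits -= logits.max(axis=1, keepdims=True)          # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

def mrl_loss(q, d, dims=(128, 256, 512, 768, 1536, 3072), weights=None):
    """Weighted sum of the same loss over nested leading prefixes."""
    weights = weights if weights is not None else [1.0] * len(dims)
    return sum(w * info_nce(q[:, :k], d[:, :k]) for w, k in zip(weights, dims))

rng = np.random.default_rng(0)
q_batch = rng.standard_normal((32, 3072))
d_batch = q_batch + 0.1 * rng.standard_normal((32, 3072))  # noisy positives
loss = mrl_loss(q_batch, d_batch)
```

Because every prefix term shares the same leading coordinates, gradient pressure from the small-dim losses forces the most discriminative information into the lowest indices.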
At query time the operations are trivial — slice and optionally renormalize:
v_slice = v[:k]
v_norm = v_slice / ||v_slice||₂
Renormalization matters because cosine similarity assumes unit norm. With a truncated-but-not-renormalized vector, raw dot-product scores shrink — the prefix carries only part of the vector's energy — so absolute scores and thresholds drift, and ranking is only preserved to the extent that document prefix norms are similar. Modern MRL-trained embedding APIs — including Google's gemini-embedding-2 — renormalize the output automatically when you request a non-default dimension via an output_dimensionality parameter. If you truncate client-side after retrieval, renormalize yourself.
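A quick numeric check of the drift, with random unit vectors standing in for embeddings:

```python
import numpy as np

rng = np.random.default_rng(1)
v = rng.standard_normal(3072)
v /= np.linalg.norm(v)                     # full vector is unit-norm
q = rng.standard_normal(3072)
q /= np.linalg.norm(q)

vp, qp = v[:256], q[:256]                  # raw prefixes, NOT unit-norm
prefix_norm = np.linalg.norm(vp)           # well below 1: only part of the energy

raw_score = vp @ qp                        # what you get without renormalizing
true_cos = raw_score / (np.linalg.norm(vp) * np.linalg.norm(qp))
```

The raw score is the true prefix cosine scaled down by both prefix norms, so any similarity threshold tuned on full vectors stops being meaningful until you renormalize.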
One architectural caveat: approximate-nearest-neighbor (ANN) indexes such as HNSW or IVF-PQ are built for a specific dimension. Serving multiple prefix lengths from the same corpus means either one index per dimension, or a single full-dim index with query-side truncation applied only at the similarity computation step. For exact retrieval (no ANN) over moderate corpora this is a non-issue.
How VectorFin Uses This
VectorFin embeddings are produced by Google's gemini-embedding-2-preview, which was trained with Matryoshka Representation Learning and exposes an output_dimensionality parameter ranging from 128 to 3,072. Google's published MTEB numbers show the drop from 2,048-dim to 768-dim is roughly 0.17 points (68.16 → 67.99) — negligible for almost every downstream task.
Our ingestion jobs call embed_content(output_dimensionality=768), so the vectors materialised in Iceberg are MRL-trained 768-dim vectors with the property preserved: you can slice any leading prefix {128, 256, 384, 512, 640, 768} and still get a coherent embedding of the same chunk. Because gemini-embedding-2 renormalizes at the API layer, the stored 768-dim vectors are already unit-norm. Further client-side slicing (e.g. to 256 dims) loses the unit-norm property and should be renormalized before similarity scoring.
What this unlocks:
- Cheap first-pass retrieval. For interactive RAG over the full 5,000-ticker corpus, query at 256 dims — the ranking of top-k candidates is almost identical to 768 at a fraction of the memory and FLOPs. Re-rank the top 50 at the full 768 dims if you need the extra precision.
- Budget-appropriate deployment. A quant team running in-memory retrieval across 200K+ chunks in a notebook can keep the whole corpus in RAM at 128 dims where 768 would not fit.
- No lock-in to one width. Because the nesting property is in the training, not the serving layer, you can change your mind about dimension next quarter without re-embedding anything.
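A back-of-envelope check of the in-memory claim, assuming dense float32 storage with no index overhead:

```python
def corpus_mib(n_chunks: int, dim: int, bytes_per_dim: int = 4) -> float:
    """RAM for a dense float32 matrix of embeddings, in MiB."""
    return n_chunks * dim * bytes_per_dim / 2**20

full_mib = corpus_mib(200_000, 768)   # ≈ 586 MiB
small_mib = corpus_mib(200_000, 128)  # ≈ 98 MiB
```

Dropping from 768 to 128 dims is a straight 6× reduction, which is the difference between a corpus that fits comfortably in a notebook kernel and one that does not.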
Crucially, MRL composes cleanly with VectorFin's bitemporal schema. The effective_ts and knowledge_ts columns are orthogonal to the embedding column — dimension choice has no effect on point-in-time correctness. Your RAG system can answer "what did Apple's CFO say about margins as of 2023-02-01?" using 256-dim embeddings just as correctly as with 768-dim embeddings; the temporal filter is applied before vector similarity.
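The shape of that flow can be sketched as follows — the chunk field names are hypothetical, and a plain in-memory filter stands in for the real Iceberg predicate:

```python
from datetime import datetime

import numpy as np

def pit_search(chunks, query_vec, as_of: str, dim: int = 256, k: int = 5):
    """Point-in-time retrieval: apply the temporal filter first, then score.

    Hypothetical chunk shape: {'id', 'knowledge_ts' (ISO string), 'embedding'}.
    The dimension choice only affects the similarity step, never the filter.
    """
    cutoff = datetime.fromisoformat(as_of)
    visible = [c for c in chunks
               if datetime.fromisoformat(c["knowledge_ts"]) <= cutoff]
    if not visible:
        return []
    mat = np.stack([np.asarray(c["embedding"], dtype=float)[:dim] for c in visible])
    mat /= np.linalg.norm(mat, axis=1, keepdims=True)   # re-unit-norm the slices
    q = np.asarray(query_vec, dtype=float)[:dim]
    q /= np.linalg.norm(q)
    order = np.argsort(-(mat @ q))[:k]
    return [visible[i] for i in order]

# Toy data: the chunk that best matches the query is not yet known as of the date.
rng = np.random.default_rng(0)
e_old, e_new = rng.standard_normal(256), rng.standard_normal(256)
chunks = [
    {"id": "old", "knowledge_ts": "2023-01-15", "embedding": e_old.tolist()},
    {"id": "new", "knowledge_ts": "2023-03-01", "embedding": e_new.tolist()},
]
hits = pit_search(chunks, e_new, as_of="2023-02-01")
```

Even though the "new" chunk is the better similarity match, it is invisible as of 2023-02-01 — the temporal predicate runs before any vector math, at any dimension.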
Code Example
import os
import numpy as np
from google import genai
from google.genai import types
import requests
API_BASE = "https://api.vectorfinancials.com"
API_KEY = os.environ["VECTORFIN_API_KEY"]
client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
def normalize(v: np.ndarray) -> np.ndarray:
    return v / (np.linalg.norm(v) + 1e-12)
# --- Option A: embed the query natively at a smaller dim ---
# Gemini renormalizes automatically when output_dimensionality != default.
q256 = client.models.embed_content(
    model="gemini-embedding-2-preview",
    contents="management commentary on gross margin pressure",
    config=types.EmbedContentConfig(
        task_type="RETRIEVAL_QUERY",
        output_dimensionality=256,
    ),
).embeddings[0].values
q256 = np.asarray(q256)
# Fetch stored document vectors (these are 768-dim, MRL-trained, unit-norm).
resp = requests.get(
    f"{API_BASE}/v1/embeddings/AAPL",
    params={"period": "2024-Q4"},
    headers={"X-API-Key": API_KEY},
)
chunks = resp.json()["chunks"]
# --- Option B: slice the stored 768-dim vectors down to 256 dims client-side ---
# This is the prefix — MRL guarantees it is a valid 256-dim embedding of the chunk.
# MUST renormalize: slicing breaks unit norm.
doc_slices = np.stack([normalize(np.asarray(c["embedding"])[:256]) for c in chunks])
# Cosine similarity (vectors are unit-norm → dot product = cosine)
scores = doc_slices @ q256
top_k = np.argsort(-scores)[:5]
for rank, idx in enumerate(top_k, 1):
    c = chunks[idx]
    print(f"{rank}. chunk_idx={c['chunk_idx']} score={scores[idx]:.3f}")
    print(f"   {c['text'][:160]}...")
# For the precision-critical final re-rank, rerun with the full 768-dim
# vectors — no re-embedding, just use the un-sliced stored values.