Earnings Call Embeddings
AI-ready 768-dimensional vector embeddings for US equity earnings call transcripts. The beta dataset covers all S&P 500 constituents (503 tickers) from fiscal year 2020 to present, with expansion to 5,000+ tickers and a backfill to 2018 planned for general availability (GA).
Beta tickers: S&P 500 (503)
Beta coverage: 2020–present
Embedding dims: 768
Model: gemini-embedding-2-preview
Latency: <24h post-call
Delivery: REST · Iceberg · BQ · Snowflake
Current beta dataset: S&P 500 earnings transcripts, 2020–present
503 S&P 500 constituents · ~10,000 transcript/quarter pairs · 768-dim Gemini embeddings · Updated weekly (Sunday). Roadmap: expand to 5,000+ tickers and backfill to 2018 on GA.
What are earnings call embeddings?
Each earnings call transcript is chunked into semantically coherent segments and vectorized using Google's gemini-embedding-2-preview model, producing 768-dimensional dense vectors optimized for cosine similarity search.
All data is bitemporal: every embedding chunk carries an effective_ts (when the earnings call occurred) and a knowledge_ts (when VectorFin ingested and vectorized the data), enabling point-in-time backtesting without look-ahead bias.
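A point-in-time read comes down to filtering on knowledge_ts. The sketch below is a minimal illustration, not VectorFin's client: the `Chunk` record and the `as_of` helper are hypothetical names, and the ticker `XYZ` and its timestamps are made-up sample values. Only the field names effective_ts and knowledge_ts come from the description above.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Chunk:
    """Hypothetical record shape; only the two timestamp names come from the dataset."""
    ticker: str
    fiscal_period: str
    chunk_idx: int
    effective_ts: datetime  # when the earnings call occurred
    knowledge_ts: datetime  # when the chunk was ingested and vectorized


def as_of(chunks, cutoff):
    """Keep only the chunks that were already known at `cutoff`."""
    return [c for c in chunks if c.knowledge_ts <= cutoff]


def utc(year, month, day):
    return datetime(year, month, day, tzinfo=timezone.utc)


# Made-up sample rows for a fictional ticker.
chunks = [
    Chunk("XYZ", "2022Q2", 0, utc(2022, 4, 28), utc(2022, 4, 29)),
    Chunk("XYZ", "2022Q3", 0, utc(2022, 7, 28), utc(2022, 7, 29)),
]

# A backtest running "as of" 30 June 2022 must not see the Q3 call.
visible = as_of(chunks, utc(2022, 6, 30))
print([c.fiscal_period for c in visible])
```

Filtering on knowledge_ts rather than effective_ts is the conservative choice for backtests: a chunk only becomes visible once it was actually available, not when the call took place.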
Typical use cases: semantic search across earnings calls, cross-ticker similarity, earnings surprise detection, quant factor construction, and grounding LLM answers through Retrieval-Augmented Generation (RAG).
Drop-in retrieval layer for financial RAG
Retrieval-Augmented Generation systems are only as good as their retrieval step. Generic embedding models don't understand financial language, and building your own retrieval corpus over thousands of earnings calls takes months. VectorFin ships that retrieval layer as a data product.
A typical financial RAG pipeline with VectorFin:
- Embed the user query with gemini-embedding-2-preview.
- Run vector similarity over VectorFin's transcript and filing chunks (REST or Iceberg).
- Pass the top-k chunks as context to your LLM with citations.
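The pipeline steps above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not VectorFin's API: the row fields `text` and `embedding`, the helpers `top_k` and `build_prompt`, the tickers, and the tiny 3-dimensional vectors are all hypothetical stand-ins. In practice the query vector would come from the embedding model and the rows from the REST or Iceberg delivery.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def top_k(query_vec, rows, k=3):
    """Return the k rows whose embeddings are most similar to the query."""
    ranked = sorted(rows, key=lambda r: cosine(query_vec, r["embedding"]), reverse=True)
    return ranked[:k]


def build_prompt(question, hits):
    """Prefix each chunk with a citation built from its key fields."""
    lines = []
    for r in hits:
        cite = f'[{r["ticker"]} {r["fiscal_period"]} #{r["chunk_idx"]}]'
        lines.append(f'{cite} {r["text"]}')
    return "Context:\n" + "\n".join(lines) + f"\n\nQuestion: {question}"


# Toy corpus: made-up tickers, text, and 3-dim vectors (real vectors are 768-dim).
rows = [
    {"ticker": "XYZ", "fiscal_period": "2022Q2", "chunk_idx": 0,
     "text": "Gross margin expanded.", "embedding": [0.9, 0.1, 0.0]},
    {"ticker": "XYZ", "fiscal_period": "2022Q2", "chunk_idx": 1,
     "text": "We opened two offices.", "embedding": [0.0, 1.0, 0.0]},
    {"ticker": "ABC", "fiscal_period": "2022Q2", "chunk_idx": 4,
     "text": "Margins were pressured by freight.", "embedding": [0.8, 0.0, 0.6]},
]

query_vec = [1.0, 0.0, 0.0]  # stands in for the embedded user query (step 1)
hits = top_k(query_vec, rows, k=2)
prompt = build_prompt("What did management say about margins?", hits)
print(prompt)
```

The resulting prompt string is what would be handed to the LLM in the final step; the bracketed prefixes let the model cite its sources by ticker, fiscal period, and chunk index.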
Because every chunk is keyed by ticker, fiscal_period, and chunk_idx, citations come free. And because everything is bitemporal, you can safely answer historical questions (“what did management say about margins in Q2 2022?”) without future-information leakage — critical for backtested RAG evaluations and regulated use cases.
One stored vector, many retrieval widths
gemini-embedding-2-preview is trained with Matryoshka Representation Learning. Our stored 768-dim vectors have a nesting property: slicing to the first 512, 256, or 128 dimensions yields a valid embedding of the same chunk. No re-embedding, no separate model.
Two ways to use it: (1) query Gemini at a smaller dimension with output_dimensionality=256 — Gemini renormalizes automatically; (2) slice the stored 768-dim vectors client-side and renormalize yourself. Either way, a cheap first-pass retrieval at 128 or 256 dims can be re-ranked against the full 768-dim vectors for precision — all from one corpus.
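Option (2) and the coarse-then-re-rank pattern can be sketched as follows. This is a minimal pure-Python illustration, assuming stored vectors are plain lists of floats; `truncate` and `two_stage` are hypothetical helper names, and the 4-dimensional toy vectors stand in for 768-dimensional ones.

```python
import math


def truncate(vec, dims):
    """Slice to the first `dims` components and renormalize to unit length."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm else head


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def two_stage(query, corpus, coarse_dims, shortlist, k):
    """Shortlist with truncated vectors, then re-rank the shortlist at full width."""
    q_small = truncate(query, coarse_dims)
    coarse = sorted(
        range(len(corpus)),
        key=lambda i: dot(q_small, truncate(corpus[i], coarse_dims)),
        reverse=True,
    )[:shortlist]
    q_full = truncate(query, len(query))
    return sorted(
        coarse,
        key=lambda i: dot(q_full, truncate(corpus[i], len(corpus[i]))),
        reverse=True,
    )[:k]


# Toy data: vectors 0 and 1 look identical in the first 2 dims,
# and only the full-width re-rank tells them apart.
corpus = [
    [1.0, 0.0, 0.0, -1.0],
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0, 0.0],
]
query = [1.0, 0.0, 0.0, 1.0]

best = two_stage(query, corpus, coarse_dims=2, shortlist=2, k=1)
print(best)
```

Because the dot product of unit vectors equals their cosine similarity, renormalizing after slicing keeps scores comparable at every width. Truncating 768-dim vectors to 128 dims keeps one sixth of the components per vector.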
Interactive RAG: 128–256 dims
Re-rank / default: 768 dims
MTEB delta: ~0.2 pts*
Storage ratio: 6× smaller
*Google-reported MTEB drop between 2048-dim and 768-dim on gemini-embedding-2. Dimension is orthogonal to effective_ts / knowledge_ts — point-in-time RAG works identically at any width.
Browse by ticker
Top 100 tickers below — all available in beta. Full S&P 500 accessible via API or by navigating to /embeddings/TICKER.
Start exploring earnings embeddings
Free tier: top 100 tickers, 1,000 API calls/month. No credit card required.