embeddingsBeta · S&P 500

Earnings Call Embeddings

AI-ready 768-dimensional vector embeddings for US equity earnings call transcripts. The beta dataset covers the S&P 500 (~500 tickers) from fiscal year 2020 to present, expanding to 5,000+ tickers and 2018 history at GA.

Get API Access View Docs

Beta tickers

S&P 500 (~500)

Beta coverage

2020 to present

Embedding dims

768

Model

gemini-embedding-2-preview

Latency

<24h post-call

Delivery

REST · Iceberg · BQ · Snowflake

BETA

Current beta dataset: S&P 500 earnings transcripts, 2020 to present

2025-01-312024-12-31

The S&P 500 (~500 constituents) · 768-dim Gemini embeddings · updated weekly. Roadmap: expand to 5,000+ tickers and backfill to 2018 at GA.

What are earnings call embeddings?

Each earnings call transcript is chunked into semantically coherent segments and vectorized using Google's gemini-embedding-2-preview model, producing 768-dimensional dense vectors optimized for cosine similarity search.

All data is bitemporal: every embedding chunk carries an effective_ts (when the earnings call occurred) and a knowledge_ts (when VectorFin ingested and vectorized the data), enabling point-in-time backtesting without look-ahead bias.

Typical use cases: semantic search across earnings calls, cross-ticker similarity, earnings surprise detection, quant factor construction, and grounding LLM answers through Retrieval-Augmented Generation (RAG).

PRIMARY USE CASE

Drop-in retrieval layer for financial RAG

Retrieval-Augmented Generation systems are only as good as their retrieval step. Generic embedding models miss a lot of financial language, and building your own over thousands of earnings calls takes months. VectorFin ships that retrieval layer as a data product.

A typical financial RAG pipeline with VectorFin:

Embed the user query with gemini-embedding-2-preview.
Run vector similarity over VectorFin's transcript and filing chunks (REST or Iceberg).
Pass the top-k chunks as context to your LLM with citations.

Because every chunk is keyed by ticker, fiscal_period, and chunk_idx, citations come free. And because everything is bitemporal, you can safely answer historical questions (“what did management say about margins in Q2 2022?”) without future-information leakage. That matters for backtested RAG evaluations and regulated use cases.

MATRYOSHKA

One stored vector, many retrieval widths

The model documents Matryoshka-style truncation, so a shorter prefix of a 768-dim vector still works as a valid embedding of the same chunk. No re-embedding, no separate model.

Two ways to use it: (1) query Gemini at a smaller dimension with output_dimensionality=256, where Gemini renormalizes automatically; (2) slice the stored 768-dim vectors client-side and renormalize yourself. Either way, a cheap first-pass retrieval at a smaller width can be re-ranked against the full 768-dim vectors for precision, all from one corpus.

Interactive RAG

smaller width

Re-rank / default

768 dims

MTEB delta

small*

Storage

smaller

*A small benchmark delta between higher and lower dimensions, per Google's model card. Dimension is orthogonal to effective_ts / knowledge_ts, so point-in-time RAG works identically at any width.

Browse by ticker

Top 100 tickers below, all available in beta. Full S&P 500 accessible via API or by navigating to /embeddings/TICKER.

AAPL MSFT GOOGL AMZN NVDA META TSLA BRK.B JPM JNJ V UNH XOM WMT LLY MA AVGO HD CVX PG MRK ABBV COST BAC PEP KO ADBE AMD CRM WFC TMO MCD CSCO ACN ABT DHR NFLX DIS INTC VZ CMCSA INTU IBM AMGN PFE NKE PM TXN RTX SPGI GE MS GS LOW UPS HON BLK COP ELV ISRG CAT SYK ADP GILD MDT MDLZ T DE BMY REGN MMC VRTX CB CVS MO SCHW ZTS ADI PLD SO LRCX BSX CI PANW MU SLB DUK ETN KLAC ANET SNPS CDNS APH AJG TJX AON ICE WM FCX PSA

Start exploring earnings embeddings

Free tier: top 100 tickers, 250 API calls/month. No credit card required.

Get API Access View Pricing