A technique that grounds LLM outputs by fetching relevant documents from a knowledge base before generating an answer.
In Plain English
Large language models are trained on enormous amounts of text, which makes them fluent and knowledgeable — but their knowledge is frozen at training time. Ask an LLM about a company's most recent earnings call and it will either hallucinate details or admit it doesn't know. Retrieval-Augmented Generation, universally called RAG, solves this by giving the model a library to consult before it answers.
The process works in two steps. First, a retrieval system finds the most relevant passages from a knowledge base — using vector similarity, keyword search, or both. Second, those passages are stuffed into the prompt as context, and the LLM generates an answer grounded in what it just read. The model is not recalling from training; it is reading and summarizing.
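The two steps can be sketched end to end on a toy corpus. Everything below is illustrative: the "embeddings" are hand-made unit vectors standing in for a real embedding model, and the final prompt would be handed to an LLM.

```python
import numpy as np

# Toy corpus with made-up embeddings; a real system would embed these
# passages with a learned model.
corpus = {
    "doc_a": "Revenue grew 12% year over year.",
    "doc_b": "Supply chain costs pressured gross margin.",
    "doc_c": "The board approved a new buyback program.",
}
doc_vecs = {
    "doc_a": np.array([1.0, 0.0, 0.0]),
    "doc_b": np.array([0.0, 1.0, 0.0]),
    "doc_c": np.array([0.0, 0.0, 1.0]),
}

def retrieve(query_vec, k=2):
    """Step 1: rank documents by cosine similarity to the query vector."""
    sims = {
        doc_id: float(v @ query_vec / (np.linalg.norm(v) * np.linalg.norm(query_vec)))
        for doc_id, v in doc_vecs.items()
    }
    return sorted(sims, key=sims.get, reverse=True)[:k]

# A query "about margins" points mostly in doc_b's direction.
query_vec = np.array([0.1, 0.9, 0.1])
top = retrieve(query_vec)

# Step 2: stuff the retrieved passages into the prompt as context.
prompt = (
    "Context:\n"
    + "\n".join(corpus[d] for d in top)
    + "\n\nQuestion: What happened to margins?"
)
```

Note that the model never sees the whole corpus, only the top-k passages the retriever selected.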
Think of RAG as the difference between asking an expert to answer from memory versus handing them the source documents and asking them to respond. The second approach is more reliable, verifiable, and updatable without retraining anyone.
For financial analysis, RAG unlocks powerful workflows. You can build a system that answers "What did management say about supply chain risk in Q3 2024?" by retrieving the relevant earnings call chunks, then having an LLM synthesize a concise answer with citations. The retrieval layer is where VectorFin plugs in: pre-computed, high-quality embeddings over thousands of earnings calls and filings, ready for your retrieval step.
Retrieval quality sets the ceiling for any RAG system: garbage in, garbage out. If the embedding model doesn't understand financial language, the retrieved chunks will be irrelevant, and the LLM's answer will be wrong no matter how capable the model is.
Technical Definition
RAG combines a retriever R and a generator G. Given a query q:
1. Encode q into a query vector v_q = f(q) using an embedding model
2. Retrieve the top-k documents D = {d₁, ..., d_k} with the highest sim(v_q, v_dᵢ)
3. Construct a prompt p = [system instructions | d₁ | ... | d_k | q]
4. Generate the answer a = G(p)
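The four steps compose naturally into a single function. A minimal skeleton, where `embed`, `search`, and `generate` are placeholders for a real embedding model, retriever, and LLM client:

```python
def rag_answer(q, embed, search, generate, k=5):
    """Compose the RAG pipeline: encode, retrieve, build prompt, generate.

    `embed`, `search`, and `generate` are caller-supplied callables
    standing in for an embedding model, a retriever, and an LLM.
    """
    v_q = embed(q)                                      # 1. v_q = f(q)
    docs = search(v_q, k)                               # 2. D = top-k documents
    parts = ["Answer from the context below only."]     # system instructions
    prompt = "\n\n".join(parts + list(docs) + [q])      # 3. p = [sys | d_1..d_k | q]
    return generate(prompt)                             # 4. a = G(p)
```

Keeping the three components as swappable callables mirrors how production systems evolve: the retriever and generator are typically upgraded independently.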
The retriever is typically a dense bi-encoder (two encoders: one for queries, one for documents) or a hybrid system combining dense retrieval with BM25 sparse retrieval. Approximate nearest-neighbor (ANN) search (HNSW, IVF-PQ) enables sub-millisecond retrieval over millions of embeddings.
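The operation an ANN index approximates is an exact top-k similarity scan. A brute-force sketch over random unit vectors (HNSW and IVF-PQ trade a little recall for avoiding exactly this O(n) pass at scale):

```python
import numpy as np

# Synthetic stand-in for a real embedding matrix: 10,000 documents,
# 768 dimensions, unit-normalized so dot product equals cosine similarity.
rng = np.random.default_rng(0)
docs = rng.normal(size=(10_000, 768)).astype(np.float32)
docs /= np.linalg.norm(docs, axis=1, keepdims=True)

def top_k(query, k=5):
    """Exact top-k retrieval by cosine similarity (brute-force scan)."""
    query = query / np.linalg.norm(query)
    scores = docs @ query                   # one dot product per document
    idx = np.argpartition(-scores, k)[:k]   # O(n) partial selection of the top k
    return idx[np.argsort(-scores[idx])]    # order the k winners by score
```

Querying with a vector that is itself in the index returns that document first, since its self-similarity is 1.0.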
Advanced RAG variants include re-ranking (a cross-encoder re-scores the top-k before passing to the generator), iterative retrieval (the generator requests additional documents mid-generation), and self-RAG (the model decides when to retrieve).
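Re-ranking reduces to re-scoring the retriever's candidates with a stronger but slower model. A sketch where `cross_score` is a placeholder for a real cross-encoder, which reads query and passage jointly rather than comparing independently computed vectors:

```python
def rerank(query, candidates, cross_score, keep=3):
    """Re-score retriever candidates and keep the best few.

    `cross_score(query, passage)` is a placeholder for a cross-encoder;
    it is only called on the small candidate set, so it can afford to
    be much more expensive than the first-stage retriever.
    """
    scored = sorted(candidates, key=lambda p: cross_score(query, p), reverse=True)
    return scored[:keep]
```

The usual pattern is to over-retrieve (say, top-50 from the ANN index) and let the re-ranker pick the handful of passages that actually enter the prompt.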
How VectorFin Uses This
VectorFin's embeddings API is designed as a drop-in retrieval layer for financial RAG systems. The chunked, 768-dimensional embeddings over earnings calls cover 5,000+ tickers from 2018 onward. Each chunk maps to a specific passage with its source ticker, fiscal period, and chunk index, making citations trivial to generate.
A typical RAG pipeline using VectorFin:
1. Embed the user's query with the same model family (Gemini embedding)
2. Call POST /v1/embeddings/search with the query vector to retrieve the top-k chunks
3. Pass the retrieved chunks and the query to your LLM of choice
4. Return a grounded answer with source attribution
The bitemporal design (effective_ts + knowledge_ts) means you can build RAG systems that answer historical questions correctly — "what did management say about margins in Q2 2022?" — without risk of future information contaminating the retrieved context.
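The point-in-time guarantee comes down to a filter on knowledge_ts. A hypothetical sketch: the `as_of` field below is illustrative, not a documented parameter of the VectorFin API, and the filter shows the predicate such a constraint would apply.

```python
# Hypothetical request shape: restrict retrieval to what was knowable
# on a given date. `as_of` is an illustrative field, not a confirmed
# API parameter.
payload = {
    "query": "management commentary on margins",
    "top_k": 5,
    "as_of": "2022-08-01",
}

def knowable(chunk, as_of):
    """Bitemporal filter: keep only chunks published on or before as_of.

    ISO-8601 date strings compare correctly as plain strings.
    """
    return chunk["knowledge_ts"] <= as_of
```

Applying this predicate at retrieval time is what prevents, say, a Q4 2022 transcript from leaking into the context for a question about Q2 2022.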
Code Example
```python
import requests

API_BASE = "https://api.vectorfinancials.com"
API_KEY = "vf_your_api_key_here"

def semantic_search(query: str, tickers: list[str] | None = None, top_k: int = 5) -> list[dict]:
    """Retrieve the most relevant earnings call chunks for a query."""
    payload = {
        "query": query,
        "top_k": top_k,
    }
    if tickers:
        payload["tickers"] = tickers
    resp = requests.post(
        f"{API_BASE}/v1/embeddings/search",
        json=payload,
        headers={"X-API-Key": API_KEY},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["results"]  # [{ticker, period, chunk_idx, text, score}, ...]

# Step 1: Retrieve relevant chunks
query = "management commentary on supply chain disruptions and input cost inflation"
chunks = semantic_search(query, tickers=["AAPL", "MSFT", "AMZN"], top_k=5)

# Step 2: Build context for the LLM
context = "\n\n".join(
    f"[{c['ticker']} {c['period']}]: {c['text']}"
    for c in chunks
)

print("Retrieved context for RAG:")
for c in chunks:
    print(f"  {c['ticker']} {c['period']} (score: {c['score']:.3f}): {c['text'][:120]}...")
```
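The retrieved context then feeds steps 3 and 4 of the pipeline. A minimal sketch of the prompt construction, provider-agnostic: `build_prompt` is a helper introduced here for illustration, and the resulting string goes to whatever LLM client you use.

```python
def build_prompt(context: str, question: str) -> str:
    """Step 3 (sketch): wrap retrieved context and the user's question
    into a grounded prompt. The instruction to cite ticker and period
    leans on the [TICKER PERIOD] tags built into the context string."""
    return (
        "Answer using only the context below. "
        "Cite the ticker and fiscal period for every claim.\n\n"
        f"{context}\n\nQuestion: {question}"
    )
```

From here, `build_prompt(context, query)` is passed to your LLM of choice, and the answer can quote the `[TICKER PERIOD]` tags as citations.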