A technique that grounds LLM outputs by fetching relevant documents from a knowledge base before generating an answer.
In Plain English
Large language models are trained on enormous amounts of text, which makes them fluent and knowledgeable — but their knowledge is frozen at training time. Ask an LLM about a company's most recent earnings call and it will either hallucinate details or admit it doesn't know. Retrieval-Augmented Generation, universally called RAG, solves this by giving the model a library to consult before it answers.
The process works in two steps. First, a retrieval system finds the most relevant passages from a knowledge base — using vector similarity, keyword search, or both. Second, those passages are stuffed into the prompt as context, and the LLM generates an answer grounded in what it just read. The model is not recalling from training; it is reading and summarizing.
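The two steps can be sketched end to end on a toy corpus. Everything below is illustrative: the "embeddings" are hand-made unit vectors standing in for a real embedding model, and the final prompt would be handed to an LLM.

```python
import numpy as np

# Toy corpus with made-up embeddings; a real system would embed these
# passages with a learned model.
corpus = {
    "doc_a": "Revenue grew 12% year over year.",
    "doc_b": "Supply chain costs pressured gross margin.",
    "doc_c": "The board approved a new buyback program.",
}
doc_vecs = {
    "doc_a": np.array([1.0, 0.0, 0.0]),
    "doc_b": np.array([0.0, 1.0, 0.0]),
    "doc_c": np.array([0.0, 0.0, 1.0]),
}

def retrieve(query_vec, k=2):
    """Step 1: rank documents by cosine similarity to the query vector."""
    sims = {
        doc_id: float(v @ query_vec / (np.linalg.norm(v) * np.linalg.norm(query_vec)))
        for doc_id, v in doc_vecs.items()
    }
    return sorted(sims, key=sims.get, reverse=True)[:k]

# A query "about margins" points mostly in doc_b's direction.
query_vec = np.array([0.1, 0.9, 0.1])
top = retrieve(query_vec)

# Step 2: stuff the retrieved passages into the prompt as context.
prompt = (
    "Context:\n"
    + "\n".join(corpus[d] for d in top)
    + "\n\nQuestion: What happened to margins?"
)
```

Note that the model never sees the whole corpus, only the top-k passages the retriever selected.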
Think of RAG as the difference between asking an expert to answer from memory versus handing them the source documents and asking them to respond. The second approach is more reliable, verifiable, and updatable without retraining anyone.
For financial analysis, RAG unlocks powerful workflows. You can build a system that answers "What did management say about supply chain risk in Q3 2024?" by retrieving the relevant earnings call chunks, then having an LLM synthesize a concise answer with citations. The retrieval layer is where VectorFin plugs in: pre-computed, high-quality embeddings over thousands of earnings calls and filings, ready for your retrieval step.
Retrieval quality sets the ceiling for any RAG system: garbage in, garbage out. If the embedding model doesn't understand financial language, the retrieved chunks will be irrelevant, and the LLM's answer will be wrong no matter how capable the model is.
Technical Definition
RAG combines a retriever R and a generator G. Given a query q:
1. Encode q into a query vector v_q = f(q) using an embedding model
2. Retrieve the top-k documents D = {d₁, ..., d_k} with the highest sim(v_q, v_dᵢ)
3. Construct a prompt p = [system instructions | d₁ | ... | d_k | q]
4. Generate the answer a = G(p)
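The four steps compose naturally into a single function. A minimal skeleton, where `embed`, `search`, and `generate` are placeholders for a real embedding model, retriever, and LLM client:

```python
def rag_answer(q, embed, search, generate, k=5):
    """Compose the RAG pipeline: encode, retrieve, build prompt, generate.

    `embed`, `search`, and `generate` are caller-supplied callables
    standing in for an embedding model, a retriever, and an LLM.
    """
    v_q = embed(q)                                      # 1. v_q = f(q)
    docs = search(v_q, k)                               # 2. D = top-k documents
    parts = ["Answer from the context below only."]     # system instructions
    prompt = "\n\n".join(parts + list(docs) + [q])      # 3. p = [sys | d_1..d_k | q]
    return generate(prompt)                             # 4. a = G(p)
```

Keeping the three components as swappable callables mirrors how production systems evolve: the retriever and generator are typically upgraded independently.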
The retriever is typically a dense bi-encoder (two encoders: one for queries, one for documents) or a hybrid system combining dense retrieval with BM25 sparse retrieval. Approximate nearest-neighbor (ANN) search (HNSW, IVF-PQ) enables sub-millisecond retrieval over millions of embeddings.
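The operation an ANN index approximates is an exact top-k similarity scan. A brute-force sketch over random unit vectors (HNSW and IVF-PQ trade a little recall for avoiding exactly this O(n) pass at scale):

```python
import numpy as np

# Synthetic stand-in for a real embedding matrix: 10,000 documents,
# 768 dimensions, unit-normalized so dot product equals cosine similarity.
rng = np.random.default_rng(0)
docs = rng.normal(size=(10_000, 768)).astype(np.float32)
docs /= np.linalg.norm(docs, axis=1, keepdims=True)

def top_k(query, k=5):
    """Exact top-k retrieval by cosine similarity (brute-force scan)."""
    query = query / np.linalg.norm(query)
    scores = docs @ query                   # one dot product per document
    idx = np.argpartition(-scores, k)[:k]   # O(n) partial selection of the top k
    return idx[np.argsort(-scores[idx])]    # order the k winners by score
```

Querying with a vector that is itself in the index returns that document first, since its self-similarity is 1.0.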
Advanced RAG variants include re-ranking (a cross-encoder re-scores the top-k before passing to the generator), iterative retrieval (the generator requests additional documents mid-generation), and self-RAG (the model decides when to retrieve).
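Re-ranking reduces to re-scoring the retriever's candidates with a stronger but slower model. A sketch where `cross_score` is a placeholder for a real cross-encoder, which reads query and passage jointly rather than comparing independently computed vectors:

```python
def rerank(query, candidates, cross_score, keep=3):
    """Re-score retriever candidates and keep the best few.

    `cross_score(query, passage)` is a placeholder for a cross-encoder;
    it is only called on the small candidate set, so it can afford to
    be much more expensive than the first-stage retriever.
    """
    scored = sorted(candidates, key=lambda p: cross_score(query, p), reverse=True)
    return scored[:keep]
```

The usual pattern is to over-retrieve (say, top-50 from the ANN index) and let the re-ranker pick the handful of passages that actually enter the prompt.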
How VectorFin Uses This
VectorFin's embeddings API is designed as a drop-in retrieval layer for financial RAG systems. The chunked, 768-dimensional embeddings over earnings calls cover 5,000+ tickers from 2018 onward. Each chunk maps to a specific passage with its source ticker, fiscal period, and chunk index, making citations trivial to generate.
A typical RAG pipeline using VectorFin:
1. Embed the user's query with the same model family (Gemini embedding)
2. Call POST /v1/embeddings/search with the query vector to retrieve the top-k chunks
3. Pass the retrieved chunks and the query to your LLM of choice
4. Return a grounded answer with source attribution
The bitemporal design (effective_ts + knowledge_ts) means you can build RAG systems that answer historical questions correctly — "what did management say about margins in Q2 2022?" — without risk of future information contaminating the retrieved context.
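The point-in-time guarantee comes down to a filter on knowledge_ts. A hypothetical sketch: the `as_of` field below is illustrative, not a documented parameter of the VectorFin API, and the filter shows the predicate such a constraint would apply.

```python
# Hypothetical request shape: restrict retrieval to what was knowable
# on a given date. `as_of` is an illustrative field, not a confirmed
# API parameter.
payload = {
    "query": "management commentary on margins",
    "top_k": 5,
    "as_of": "2022-08-01",
}

def knowable(chunk, as_of):
    """Bitemporal filter: keep only chunks published on or before as_of.

    ISO-8601 date strings compare correctly as plain strings.
    """
    return chunk["knowledge_ts"] <= as_of
```

Applying this predicate at retrieval time is what prevents, say, a Q4 2022 transcript from leaking into the context for a question about Q2 2022.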
Code Example
```python
import requests

API_BASE = "https://api.vectorfinancials.com"
API_KEY = "vf_your_api_key_here"

def semantic_search(query: str, tickers: list[str] | None = None, top_k: int = 5) -> list[dict]:
    """Retrieve the most relevant earnings call chunks for a query."""
    payload = {
        "query": query,
        "top_k": top_k,
    }
    if tickers:
        payload["tickers"] = tickers
    resp = requests.post(
        f"{API_BASE}/v1/embeddings/search",
        json=payload,
        headers={"X-API-Key": API_KEY},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["results"]  # [{ticker, period, chunk_idx, text, score}, ...]

# Step 1: Retrieve relevant chunks
query = "management commentary on supply chain disruptions and input cost inflation"
chunks = semantic_search(query, tickers=["AAPL", "MSFT", "AMZN"], top_k=5)

# Step 2: Build context for the LLM
context = "\n\n".join(
    f"[{c['ticker']} {c['period']}]: {c['text']}"
    for c in chunks
)

print("Retrieved context for RAG:")
for c in chunks:
    print(f"  {c['ticker']} {c['period']} (score: {c['score']:.3f}): {c['text'][:120]}...")
```
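The retrieved context then feeds steps 3 and 4 of the pipeline. A minimal sketch of the prompt construction, provider-agnostic: `build_prompt` is a helper introduced here for illustration, and the resulting string goes to whatever LLM client you use.

```python
def build_prompt(context: str, question: str) -> str:
    """Step 3 (sketch): wrap retrieved context and the user's question
    into a grounded prompt. The instruction to cite ticker and period
    leans on the [TICKER PERIOD] tags built into the context string."""
    return (
        "Answer using only the context below. "
        "Cite the ticker and fiscal period for every claim.\n\n"
        f"{context}\n\nQuestion: {question}"
    )
```

From here, `build_prompt(context, query)` is passed to your LLM of choice, and the answer can quote the `[TICKER PERIOD]` tags as citations.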