VectorFin/Glossary/Semantic Search
ML & AI

What is Semantic Search?

Search that retrieves results based on conceptual meaning rather than exact keyword matching, powered by embedding similarity.

In Plain English

Traditional keyword search works like an exact match game: you type "supply chain disruption," and the engine looks for documents containing exactly those words. If a filing says "logistics bottlenecks" instead, you miss it — even though the meaning is identical. This is the fundamental limitation that semantic search solves.

Semantic search converts your query and every document into embeddings — compact numerical fingerprints of meaning — and then finds the documents whose fingerprints most closely resemble your query's fingerprint. It does not care whether the words match; it cares whether the meaning matches. "Supply chain disruption," "logistics bottlenecks," "delivery delays impacting margins," and "fulfillment constraints" will all cluster together in embedding space.

For financial research, this is a significant capability upgrade. An analyst researching "interest rate sensitivity" can find every earnings call passage where management discusses refinancing risk, floating-rate debt exposure, or Fed policy impact — even if those exact words never appear. The semantic layer bridges the gap between how analysts think about a topic and the varied language executives use to describe it.

The process is: (1) embed the query, (2) compute cosine similarity against all stored document embeddings, (3) return the top-k most similar chunks. When done with approximate nearest-neighbor indexes (HNSW, IVF), this scales to millions of documents with sub-second latency.
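Steps (2) and (3) can be sketched in a few lines of NumPy. The vectors below are random stand-ins for real embeddings; the point is the mechanics — one matrix-vector product scores the whole corpus, and `argpartition` pulls the top-k without fully sorting it:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_docs, top_k = 768, 10_000, 5

# Stand-in embeddings; in practice these come from an embedding model
docs = rng.standard_normal((n_docs, dim))
query = rng.standard_normal(dim)

# Normalize so that a dot product equals cosine similarity
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
query /= np.linalg.norm(query)

# Step 2: score every document against the query in one operation
scores = docs @ query

# Step 3: argpartition finds the top-k in O(n); sort only those k
top = np.argpartition(scores, -top_k)[-top_k:]
top = top[np.argsort(scores[top])[::-1]]
```

At this scale the brute-force scan is already fast; the ANN indexes mentioned above take over when the corpus grows to millions of vectors.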

Technical Definition

Given a query q and a corpus of documents D = {d₁, d₂, ..., dₙ}, semantic search finds:

argmax_{dᵢ ∈ D} cos(f(q), f(dᵢ))

where f is an encoder that maps text to a vector. With a bi-encoder, query and documents are encoded in separate passes, so document embeddings can be pre-computed and retrieval at inference time reduces to a fast similarity lookup. A cross-encoder instead scores each (query, document) pair jointly in a single forward pass; it is more accurate but cannot pre-compute anything, so it is typically reserved for re-ranking a bi-encoder's top candidates.
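The retrieve-then-rerank pipeline can be sketched as follows. This is a structural illustration only: `cross_encoder_score` is a placeholder (here just a dot product on random stand-in vectors), where a real system would run a cross-encoder model over the raw (query, document) text pair:

```python
import numpy as np

rng = np.random.default_rng(1)
dim, n_docs = 64, 1_000

# Random stand-ins for pre-computed bi-encoder document embeddings
doc_vecs = rng.standard_normal((n_docs, dim))
doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)
query_vec = rng.standard_normal(dim)
query_vec /= np.linalg.norm(query_vec)

def cross_encoder_score(doc_id: int) -> float:
    """Placeholder: a real cross-encoder jointly encodes the raw
    (query, document) text and returns a relevance score."""
    return float(doc_vecs[doc_id] @ query_vec)

# Stage 1: cheap bi-encoder retrieval narrows the corpus to 100 candidates
candidates = np.argsort(doc_vecs @ query_vec)[::-1][:100]

# Stage 2: the expensive scorer re-ranks only those candidates
reranked = sorted(candidates, key=cross_encoder_score, reverse=True)
top_5 = reranked[:5]
```

The design point is the asymmetry: the bi-encoder touches all n documents cheaply, while the cross-encoder touches only the shortlist.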

Approximate nearest-neighbor (ANN) algorithms such as HNSW (Hierarchical Navigable Small World graphs) allow sub-linear retrieval over millions of vectors. Vector databases (Pinecone, Weaviate, Qdrant, pgvector) implement these indexes as first-class primitives.
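To make the speed/recall trade-off concrete, here is a toy IVF-style index in plain NumPy: vectors are partitioned into coarse clusters, and a query scans only the few clusters nearest to it instead of the full corpus. Real IVF implementations (in the libraries named above) train k-means centroids; this sketch just samples random vectors as centroids:

```python
import numpy as np

rng = np.random.default_rng(2)
dim, n_docs, n_clusters, n_probe = 32, 5_000, 50, 5

vecs = rng.standard_normal((n_docs, dim))
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)

# Coarse quantizer: random docs stand in for trained k-means centroids
centroids = vecs[rng.choice(n_docs, n_clusters, replace=False)]

# Build the inverted file: each vector is listed under its nearest centroid
assign = np.argmax(vecs @ centroids.T, axis=1)
lists = {c: np.where(assign == c)[0] for c in range(n_clusters)}

def ivf_search(query, top_k=5):
    q = query / np.linalg.norm(query)
    # Probe only the n_probe closest clusters, not the whole corpus
    probed = np.argsort(q @ centroids.T)[::-1][:n_probe]
    cand = np.concatenate([lists[c] for c in probed])
    scores = vecs[cand] @ q
    order = np.argsort(scores)[::-1][:top_k]
    return cand[order], scores[order]

ids, sims = ivf_search(rng.standard_normal(dim))
```

With n_probe of 5 out of 50 clusters, each query scans roughly a tenth of the corpus; HNSW achieves a similar effect through graph traversal rather than partitioning.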

How VectorFin Uses This

VectorFin's 768-dim embeddings stored in gs://vectorfinancials-data/warehouse/embeddings/ are designed for semantic search use cases. Customers on Starter and Pro tiers can pull raw embeddings and build their own semantic search layers over earnings call transcript chunks. Common patterns include:

  • Cross-company thematic search: find all S&P 500 executives who discussed AI capital expenditure in 2024-Q4
  • Historical analogy retrieval: find past quarters whose language most resembles the current macro environment
  • Risk factor extraction: retrieve all 10-K risk factor sections mentioning cybersecurity exposure

# Retrieve embeddings for building a semantic search index
curl -s "https://api.vectorfinancials.com/v1/embeddings/META?period=2024-Q4" \
  -H "X-API-Key: $VF_API_KEY" | jq '.chunks | length'

Code Example

import os

import requests
import numpy as np
from typing import List, Dict

# Read the key from the environment, matching the $VF_API_KEY used above
API_KEY = os.environ.get("VF_API_KEY", "vf_your_api_key_here")
HEADERS = {"X-API-Key": API_KEY}

def fetch_embeddings(ticker: str, period: str) -> List[Dict]:
    r = requests.get(
        f"https://api.vectorfinancials.com/v1/embeddings/{ticker}",
        params={"period": period},
        headers=HEADERS,
        timeout=30,
    )
    r.raise_for_status()
    return r.json()["chunks"]

def semantic_search(query_embedding: np.ndarray, corpus: List[Dict], top_k: int = 5):
    """Find the most semantically similar chunks to a query embedding."""
    q = query_embedding / np.linalg.norm(query_embedding)
    scored = []
    for chunk in corpus:
        v = np.array(chunk["embedding"])
        v = v / np.linalg.norm(v)
        score = float(np.dot(q, v))
        scored.append((score, chunk["text"]))
    scored.sort(key=lambda pair: pair[0], reverse=True)  # sort by score only
    return scored[:top_k]

# Build a cross-company corpus for Q4 2024
tickers = ["AAPL", "MSFT", "AMZN", "GOOGL", "META"]
corpus = []
for t in tickers:
    corpus.extend(fetch_embeddings(t, "2024-Q4"))

# Embed a natural-language query (use your own embedding model for the query)
# For demo, we use the first chunk of NVDA as a proxy query
query_chunks = fetch_embeddings("NVDA", "2024-Q4")
query_vec = np.array(query_chunks[0]["embedding"])

results = semantic_search(query_vec, corpus, top_k=5)
for rank, (score, text) in enumerate(results, 1):
    print(f"[{rank}] score={score:.4f} — {text[:100]}...")

Put Semantic Search to work in your pipeline

Access AI-ready financial data — embeddings, signals, Iceberg tables.