VectorFin/Glossary/Semantic Search
ML & AI

What is Semantic Search?

Search that retrieves results based on conceptual meaning rather than exact keyword matching, powered by embedding similarity.

In Plain English

Traditional keyword search works like an exact match game: you type "supply chain disruption," and the engine looks for documents containing exactly those words. If a filing says "logistics bottlenecks" instead, you miss it — even though the meaning is identical. This is the fundamental limitation that semantic search solves.

Semantic search converts your query and every document into embeddings — compact numerical fingerprints of meaning — and then finds the documents whose fingerprints most closely resemble your query's fingerprint. It does not care whether the words match; it cares whether the meaning matches. "Supply chain disruption," "logistics bottlenecks," "delivery delays impacting margins," and "fulfillment constraints" will all cluster together in embedding space.

For financial research, this is a significant capability upgrade. An analyst researching "interest rate sensitivity" can find every earnings call passage where management discusses refinancing risk, floating-rate debt exposure, or Fed policy impact — even if those exact words never appear. The semantic layer bridges the gap between how analysts think about a topic and the varied language executives use to describe it.

The process is: (1) embed the query, (2) compute cosine similarity against all stored document embeddings, (3) return the top-k most similar chunks. When done with approximate nearest-neighbor indexes (HNSW, IVF), this scales to millions of documents with sub-second latency.
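Steps (2) and (3) can be sketched in a few lines of NumPy. The vectors below are random stand-ins for real embeddings; the point is the mechanics — one matrix-vector product scores the whole corpus, and `argpartition` pulls the top-k without fully sorting it:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_docs, top_k = 768, 10_000, 5

# Stand-in embeddings; in practice these come from an embedding model
docs = rng.standard_normal((n_docs, dim))
query = rng.standard_normal(dim)

# Normalize so that a dot product equals cosine similarity
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
query /= np.linalg.norm(query)

# Step 2: score every document against the query in one operation
scores = docs @ query

# Step 3: argpartition finds the top-k in O(n); sort only those k
top = np.argpartition(scores, -top_k)[-top_k:]
top = top[np.argsort(scores[top])[::-1]]
```

At this scale the brute-force scan is already fast; the ANN indexes mentioned above take over when the corpus grows to millions of vectors.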

Technical Definition

Given a query q and a corpus of documents D = {d₁, d₂, ..., dₙ}, semantic search finds:

argmax_{dᵢ ∈ D} cos(f(q), f(dᵢ))

where f is an encoder that maps text to a vector. With a bi-encoder, query and documents are encoded in separate passes, so document embeddings can be pre-computed and retrieval at inference time reduces to a fast similarity lookup. A cross-encoder instead scores each (query, document) pair jointly in a single forward pass; it is more accurate but cannot pre-compute anything, so it is typically reserved for re-ranking a bi-encoder's top candidates.
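The retrieve-then-rerank pipeline can be sketched as follows. This is a structural illustration only: `cross_encoder_score` is a placeholder (here just a dot product on random stand-in vectors), where a real system would run a cross-encoder model over the raw (query, document) text pair:

```python
import numpy as np

rng = np.random.default_rng(1)
dim, n_docs = 64, 1_000

# Random stand-ins for pre-computed bi-encoder document embeddings
doc_vecs = rng.standard_normal((n_docs, dim))
doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)
query_vec = rng.standard_normal(dim)
query_vec /= np.linalg.norm(query_vec)

def cross_encoder_score(doc_id: int) -> float:
    """Placeholder: a real cross-encoder jointly encodes the raw
    (query, document) text and returns a relevance score."""
    return float(doc_vecs[doc_id] @ query_vec)

# Stage 1: cheap bi-encoder retrieval narrows the corpus to 100 candidates
candidates = np.argsort(doc_vecs @ query_vec)[::-1][:100]

# Stage 2: the expensive scorer re-ranks only those candidates
reranked = sorted(candidates, key=cross_encoder_score, reverse=True)
top_5 = reranked[:5]
```

The design point is the asymmetry: the bi-encoder touches all n documents cheaply, while the cross-encoder touches only the shortlist.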

Approximate nearest-neighbor (ANN) algorithms such as HNSW (Hierarchical Navigable Small World graphs) allow sub-linear retrieval over millions of vectors. Vector databases (Pinecone, Weaviate, Qdrant, pgvector) implement these indexes as first-class primitives.
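To make the speed/recall trade-off concrete, here is a toy IVF-style index in plain NumPy: vectors are partitioned into coarse clusters, and a query scans only the few clusters nearest to it instead of the full corpus. Real IVF implementations (in the libraries named above) train k-means centroids; this sketch just samples random vectors as centroids:

```python
import numpy as np

rng = np.random.default_rng(2)
dim, n_docs, n_clusters, n_probe = 32, 5_000, 50, 5

vecs = rng.standard_normal((n_docs, dim))
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)

# Coarse quantizer: random docs stand in for trained k-means centroids
centroids = vecs[rng.choice(n_docs, n_clusters, replace=False)]

# Build the inverted file: each vector is listed under its nearest centroid
assign = np.argmax(vecs @ centroids.T, axis=1)
lists = {c: np.where(assign == c)[0] for c in range(n_clusters)}

def ivf_search(query, top_k=5):
    q = query / np.linalg.norm(query)
    # Probe only the n_probe closest clusters, not the whole corpus
    probed = np.argsort(q @ centroids.T)[::-1][:n_probe]
    cand = np.concatenate([lists[c] for c in probed])
    scores = vecs[cand] @ q
    order = np.argsort(scores)[::-1][:top_k]
    return cand[order], scores[order]

ids, sims = ivf_search(rng.standard_normal(dim))
```

With n_probe of 5 out of 50 clusters, each query scans roughly a tenth of the corpus; HNSW achieves a similar effect through graph traversal rather than partitioning.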

How VectorFin Uses This

VectorFin's 768-dim embeddings stored in gs://vectorfinancials-data/warehouse/embeddings/ are designed for semantic search use cases. Customers on Starter and Pro tiers can pull raw embeddings and build their own semantic search layers over earnings call transcript chunks. Common patterns include:

  • Cross-company thematic search: find all S&P 500 executives who discussed AI capital expenditure in 2024-Q4
  • Historical analogy retrieval: find past quarters whose language most resembles the current macro environment
  • Risk factor extraction: retrieve all 10-K risk factor sections mentioning cybersecurity exposure

# Retrieve embeddings for building a semantic search index
curl -s "https://api.vectorfinancials.com/v1/embeddings/META?period=2024-Q4" \
  -H "X-API-Key: $VF_API_KEY" | jq '.chunks | length'

Code Example

import os

import requests
import numpy as np
from typing import List, Dict

# Read the key from the environment, matching the $VF_API_KEY used above
API_KEY = os.environ.get("VF_API_KEY", "vf_your_api_key_here")
HEADERS = {"X-API-Key": API_KEY}

def fetch_embeddings(ticker: str, period: str) -> List[Dict]:
    r = requests.get(
        f"https://api.vectorfinancials.com/v1/embeddings/{ticker}",
        params={"period": period},
        headers=HEADERS,
        timeout=30,
    )
    r.raise_for_status()
    return r.json()["chunks"]

def semantic_search(query_embedding: np.ndarray, corpus: List[Dict], top_k: int = 5):
    """Find the most semantically similar chunks to a query embedding."""
    q = query_embedding / np.linalg.norm(query_embedding)
    scored = []
    for chunk in corpus:
        v = np.array(chunk["embedding"])
        v = v / np.linalg.norm(v)
        score = float(np.dot(q, v))
        scored.append((score, chunk["text"]))
    scored.sort(key=lambda pair: pair[0], reverse=True)  # sort by score only
    return scored[:top_k]

# Build a cross-company corpus for Q4 2024
tickers = ["AAPL", "MSFT", "AMZN", "GOOGL", "META"]
corpus = []
for t in tickers:
    corpus.extend(fetch_embeddings(t, "2024-Q4"))

# Embed a natural-language query (use your own embedding model for the query)
# For demo, we use the first chunk of NVDA as a proxy query
query_chunks = fetch_embeddings("NVDA", "2024-Q4")
query_vec = np.array(query_chunks[0]["embedding"])

results = semantic_search(query_vec, corpus, top_k=5)
for rank, (score, text) in enumerate(results, 1):
    print(f"[{rank}] score={score:.4f} — {text[:100]}...")

Put Semantic Search to work in your pipeline

Access AI-ready financial data — embeddings, signals, Iceberg tables.