
What is a Vector Database?

A specialized database that stores and queries high-dimensional embedding vectors using approximate nearest-neighbor search.

In Plain English

Traditional databases are built around exact lookups: find the row where user_id = 42, or find all orders placed in January. This works perfectly when you know exactly what you're searching for. But when you want to find things that are similar to something else — semantically similar documents, images that look alike, songs that sound like another song — exact lookup is useless. You need a different kind of database.

A vector database stores data as high-dimensional numerical vectors (embeddings) and is optimized to answer the question: "What are the most similar items to this vector?" This is called approximate nearest-neighbor (ANN) search, and it's the foundation of semantic search, recommendation systems, and RAG pipelines.

The "approximate" part is a deliberate trade-off. Finding the mathematically exact nearest neighbor in a space of millions of 768-dimensional vectors would require comparing every vector, which is prohibitively slow. Algorithms like HNSW (Hierarchical Navigable Small World graphs) build index structures that get you to 95-99% accuracy with 10-100x faster queries.
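To build intuition for how graph-based indexes like HNSW avoid the full scan, here is a deliberately simplified single-layer sketch in NumPy: link each vector to its nearest neighbors at build time, then greedily walk the graph toward the query, stopping when no neighbor improves on the current node. Real HNSW adds a hierarchy of layers and smarter neighbor selection; all data and names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
vecs = rng.normal(size=(500, 32)).astype(np.float32)
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)  # unit vectors: dot product = cosine sim

# Build a crude proximity graph: each node links to its M most similar nodes.
M = 8
sims = vecs @ vecs.T
np.fill_diagonal(sims, -np.inf)            # a node is not its own neighbor
neighbors = np.argsort(-sims, axis=1)[:, :M]

def greedy_search(query, entry=0):
    """Walk the graph, always moving to the neighbor most similar to the query.

    Each step strictly increases similarity, so the walk terminates -- possibly
    at a local optimum, which is exactly where the 'approximate' comes from.
    """
    current = entry
    while True:
        cand = neighbors[current]
        best = cand[np.argmax(vecs[cand] @ query)]
        if vecs[best] @ query <= vecs[current] @ query:
            return current
        current = best

q = rng.normal(size=32).astype(np.float32)
q /= np.linalg.norm(q)
approx = greedy_search(q)                  # a handful of graph hops
exact = int(np.argmax(vecs @ q))           # the full O(n) scan, for comparison
```

The greedy walk touches only a few dozen vectors instead of all 500; HNSW's layered structure is what keeps that hop count roughly logarithmic as n grows into the millions.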

Think of a vector database as a GPS for meaning-space. Instead of finding the exact address, it finds the building closest to a given coordinate — almost certainly the right answer, found in milliseconds.

Popular vector databases include Pinecone, Weaviate, Qdrant, Milvus, and pgvector (an extension for PostgreSQL). Each has different trade-offs around throughput, recall accuracy, and operational complexity.

Technical Definition

A vector database maintains an index over a collection of vectors V = {v₁, ..., vₙ} where each vᵢ ∈ ℝᵈ. The core operation is k-nearest-neighbor (kNN) search: given a query vector q, return the k vectors from V with the highest similarity score sim(q, vᵢ), typically cosine similarity or inner product.
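As a reference point, the exact (non-approximate) version of this operation is a few lines of NumPy — the brute-force scan that ANN indexes exist to avoid. The data here is random and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)
d, n, k = 768, 10_000, 5
V = rng.normal(size=(n, d)).astype(np.float32)  # the collection V = {v_1, ..., v_n}
q = rng.normal(size=d).astype(np.float32)       # the query vector

def knn_cosine(q, V, k):
    """Exact k-nearest-neighbor search under cosine similarity.

    Computes sim(q, v_i) = (q . v_i) / (|q| |v_i|) for every vector,
    then returns the indices of the k highest scores.
    """
    sims = (V @ q) / (np.linalg.norm(V, axis=1) * np.linalg.norm(q))
    top_k = np.argpartition(-sims, k)[:k]        # unordered top-k in O(n)
    return top_k[np.argsort(-sims[top_k])], sims # sorted by descending similarity

idx, sims = knn_cosine(q, V, k)
```

Even at only 10,000 vectors this scans every row; the index structures below trade a small amount of recall to skip most of that work.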

The index structure determines performance. HNSW builds a multi-layer proximity graph enabling logarithmic-time approximate search. IVF-PQ (Inverted File with Product Quantization) clusters vectors and compresses them to reduce memory footprint. Both achieve sub-millisecond p99 latency at millions of vectors with appropriate hardware.
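The IVF half of IVF-PQ can be sketched in NumPy as well (the PQ compression step is omitted here, and the k-means is deliberately naive — everything in this snippet is illustrative): cluster the vectors once at build time, then at query time probe only the few clusters whose centroids lie nearest the query.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, n_clusters, n_probe = 2000, 64, 16, 4
V = rng.normal(size=(n, d)).astype(np.float32)

# --- Build: plain k-means produces the coarse centroids (the "inverted file").
centroids = V[rng.choice(n, n_clusters, replace=False)].copy()
for _ in range(10):
    assign = ((V[:, None, :] - centroids[None, :, :]) ** 2).sum(-1).argmin(1)
    for c in range(n_clusters):
        members = V[assign == c]
        if len(members):
            centroids[c] = members.mean(0)

# Inverted lists: cluster id -> indices of the vectors assigned to it.
assign = ((V[:, None, :] - centroids[None, :, :]) ** 2).sum(-1).argmin(1)
lists = {c: np.where(assign == c)[0] for c in range(n_clusters)}

# --- Search: probe only the n_probe nearest clusters instead of all n vectors.
q = rng.normal(size=d).astype(np.float32)
probe = np.argsort(((centroids - q) ** 2).sum(1))[:n_probe]
cand = np.concatenate([lists[c] for c in probe])
best = int(cand[((V[cand] - q) ** 2).sum(1).argmin()])
```

Production IVF-PQ additionally compresses each vector into a few bytes of quantization codes, which is what shrinks the memory footprint enough to keep large indexes in RAM.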

Metadata filtering (find similar vectors where ticker='AAPL' AND period='2024-Q4') is a first-class feature in modern vector DBs, typically implemented via pre-filtering or post-filtering the ANN results.
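The difference between the two filtering strategies can be sketched in NumPy (random illustrative data; real vector databases implement this inside the index rather than over raw arrays):

```python
import numpy as np

rng = np.random.default_rng(7)
n, d, k = 1000, 32, 5
V = rng.normal(size=(n, d)).astype(np.float32)
V /= np.linalg.norm(V, axis=1, keepdims=True)
tickers = rng.choice(["AAPL", "MSFT", "NVDA"], size=n)  # per-vector metadata
q = rng.normal(size=d).astype(np.float32)
q /= np.linalg.norm(q)
sims = V @ q

# Pre-filtering: restrict the candidate set first, then rank.
# Guarantees k matching results, but the ANN index can't always be reused
# on an arbitrary subset, so it may degrade to a scan of the filtered rows.
pre_idx = np.where(tickers == "AAPL")[0]
pre = pre_idx[np.argsort(-sims[pre_idx])[:k]]

# Post-filtering: take an oversized ANN result set, then drop non-matching rows.
# Cheap and index-friendly, but can return fewer than k results when the
# filter is selective -- hence the oversampling factor.
top = np.argsort(-sims)[: k * 4]
post = [i for i in top if tickers[i] == "AAPL"][:k]
```

Which strategy a given database uses (and whether it oversamples adaptively) is one of the operational differences worth checking when comparing vector DBs.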

How VectorFin Uses This

VectorFin's architecture differs from a typical vector database in an important way: the primary storage layer is Apache Iceberg tables on GCS, not a purpose-built vector database. This is intentional. Iceberg optimizes for batch analytics at scale — the ability to query billions of rows with DuckDB or BigQuery, join signals with embeddings in SQL, and maintain full audit history via bitemporal timestamps.

For real-time similarity search, VectorFin's API layer handles the ANN computation at query time using DuckDB's vector similarity functions over the Iceberg tables. Pro and Enterprise customers who need sub-millisecond latency at very high QPS can use the raw vector exports to populate their own vector database.

The key insight: for systematic research workflows — backtesting, signal generation, cross-ticker analysis — batch analytics over Iceberg far outperforms a real-time vector database in both throughput and cost. The vector database is the right tool for user-facing semantic search. Iceberg is the right tool for quant research.

Code Example

import requests
import numpy as np

API_BASE = "https://api.vectorfinancials.com"
API_KEY = "vf_your_api_key_here"

# Fetch a reference embedding (NVDA Q3 2024 earnings call, chunk 0)
resp = requests.get(
    f"{API_BASE}/v1/embeddings/NVDA",
    params={"period": "2024-Q3", "chunk_idx": 0},
    headers={"X-API-Key": API_KEY},
)
resp.raise_for_status()  # fail fast on auth or quota errors
reference_vector = resp.json()["chunks"][0]["embedding"]

# Search for semantically similar chunks across all tickers
# (VectorFin runs ANN search server-side over the Iceberg tables)
search_resp = requests.post(
    f"{API_BASE}/v1/embeddings/search",
    json={
        "vector": reference_vector,
        "top_k": 10,
        "filters": {"period": "2024-Q3"},  # same time window
    },
    headers={"X-API-Key": API_KEY},
)
search_resp.raise_for_status()

results = search_resp.json()["results"]
for r in results:
    print(f"{r['ticker']} {r['period']} chunk {r['chunk_idx']}: {r['score']:.4f}")
    print(f"  {r['text'][:100]}...")
    print()

Put Vector Database to work in your pipeline

Access AI-ready financial data — embeddings, signals, Iceberg tables.