
VectorFin Apache Iceberg Financial Data

Access VectorFin data directly via pyiceberg — open format, no vendor lock-in, bitemporal time-travel built in.

5 min setup time
7 Iceberg tables
5K+ US tickers
Nightly updates

Prerequisites

📋VectorFin Starter plan
🔑API key from app.vectorfinancials.com
☁️Python environment with pip (no separate Iceberg account needed; Iceberg is an open table format)

Connection Guide

1

Install pyiceberg and configure GCS catalog

Install dependencies and set up catalog connection to VectorFin's Polaris REST catalog.

bash
pip install "pyiceberg[gcs]" pandas numpy
2

Connect to the Polaris catalog

Initialize a pyiceberg catalog pointing to VectorFin's REST catalog endpoint.

python
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "vectorfin",
    **{
        "type": "rest",
        "uri": "https://catalog.vectorfinancials.com",
        "credential": "client_id:client_secret",  # from your VectorFin dashboard
        "warehouse": "vectorfin_warehouse",
    }
)

# List available namespaces and tables
print(catalog.list_namespaces())
# [('embeddings',), ('signals',)]

print(catalog.list_tables("signals"))
# [('signals', 'whystock_score'), ('signals', 'regime'), ...]
3

Load a table and scan data

Open a VectorFin Iceberg table and scan into a pandas DataFrame.

python
import pandas as pd

# Open the whystock_score signals table
table = catalog.load_table("signals.whystock_score")

# Scan with filters (pushdown predicates)
from pyiceberg.expressions import And, GreaterThanOrEqual, EqualTo

df = table.scan(
    row_filter=And(
        EqualTo("ticker", "AAPL"),
        GreaterThanOrEqual("date", "2024-01-01"),
    ),
    selected_fields=("ticker", "date", "score", "components"),
).to_pandas()

print(df.head(10))
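Once the scan lands in pandas, downstream work is ordinary DataFrame manipulation. As a minimal sketch, with synthetic rows standing in for a real multi-ticker scan result (column names follow the scan above), here is one common step: the latest score per ticker, ranked best-first.

```python
import pandas as pd

# Synthetic stand-in for a multi-ticker scan result
df = pd.DataFrame({
    "ticker": ["AAPL", "AAPL", "MSFT", "MSFT"],
    "date":   ["2024-01-02", "2024-01-03", "2024-01-02", "2024-01-03"],
    "score":  [0.61, 0.64, 0.72, 0.70],
})

# Latest score per ticker, then rank best-first
latest = (
    df.sort_values("date")
      .groupby("ticker", as_index=False)
      .tail(1)
      .sort_values("score", ascending=False)
)
print(latest[["ticker", "score"]].to_string(index=False))
```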
4

Time-travel: point-in-time query

Use VectorFin's bitemporal knowledge_ts column to reconstruct what was known as of a specific date.

python
from pyiceberg.expressions import And, LessThanOrEqual, EqualTo
from datetime import datetime

# Load embeddings table
emb_table = catalog.load_table("embeddings.transcripts")

# What did we know about AAPL as of Jan 1, 2024?
df = emb_table.scan(
    row_filter=And(
        EqualTo("ticker", "AAPL"),
        LessThanOrEqual("knowledge_ts", datetime(2024, 1, 1).isoformat()),
    ),
    selected_fields=("ticker", "fiscal_period", "chunk_idx", "embedding"),
).to_pandas()

import numpy as np
E = np.stack(df["embedding"].values)
print(f"Loaded {len(E)} embeddings with shape {E.shape}")
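With the embedding matrix stacked, nearest-neighbor lookup is a couple of numpy operations. A minimal sketch, using a random matrix as a stand-in for the loaded embeddings (the real embedding dimension depends on the table schema):

```python
import numpy as np

rng = np.random.default_rng(0)
E = rng.normal(size=(100, 8))     # stand-in for the stacked embedding matrix
q = E[0]                          # query vector: the first chunk

# Cosine similarity of the query against every row
En = E / np.linalg.norm(E, axis=1, keepdims=True)
qn = q / np.linalg.norm(q)
sims = En @ qn

top5 = np.argsort(sims)[::-1][:5]  # indices of the 5 most similar chunks
print(top5, sims[top5])
```

The query vector itself scores a cosine similarity of 1.0, so it always ranks first; in practice the query would be an embedding of new text produced by the same model.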

Available Tables

All 7 VectorFin data tables — bitemporal (effective_ts + knowledge_ts), append-only, nightly updates.
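Because the tables are append-only, the same (ticker, effective_ts) key can appear under several knowledge_ts revisions, and a point-in-time view keeps only the newest revision known as of the query date. A minimal pandas sketch of that dedup pattern, with synthetic rows (column names follow the bitemporal schema above):

```python
import pandas as pd

rows = pd.DataFrame({
    "ticker":       ["AAPL", "AAPL", "AAPL"],
    "effective_ts": ["2024-01-05", "2024-01-05", "2024-01-05"],
    "knowledge_ts": ["2024-01-06", "2024-02-01", "2024-03-01"],
    "score":        [0.55, 0.58, 0.60],  # later revisions restate the same day
})

as_of = "2024-02-15"

# Keep only what was known as of the query date,
# then the newest revision per (ticker, effective_ts) key
pit = (
    rows[rows["knowledge_ts"] <= as_of]
    .sort_values("knowledge_ts")
    .groupby(["ticker", "effective_ts"], as_index=False)
    .tail(1)
)
print(pit)
```

ISO-8601 timestamps compare correctly as strings, which keeps the sketch simple; with real data you would typically apply the knowledge_ts filter in the Iceberg row_filter, as in step 4, so the pruning happens before rows reach pandas.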

The snippets below assume EqualTo and GreaterThan are imported from pyiceberg.expressions.

embeddings.transcripts: Earnings call chunk embeddings
python
catalog.load_table("embeddings.transcripts").scan(row_filter=EqualTo("ticker", "AAPL")).to_pandas()
embeddings.filings: SEC filing section embeddings
python
catalog.load_table("embeddings.filings").scan(row_filter=EqualTo("filing_type", "10-K")).to_pandas()
signals.whystock_score: Composite quant score
python
catalog.load_table("signals.whystock_score").scan().to_pandas().sort_values("score", ascending=False)
signals.regime: Market regime classification
python
catalog.load_table("signals.regime").scan(row_filter=EqualTo("ticker", "NVDA")).to_pandas()
signals.volatility: GARCH volatility forecasts
python
catalog.load_table("signals.volatility").scan().to_pandas()
signals.sentiment_drift: Earnings sentiment drift
python
catalog.load_table("signals.sentiment_drift").scan().to_pandas()
signals.anomaly: Anomaly detection scores
python
catalog.load_table("signals.anomaly").scan(row_filter=GreaterThan("anomaly_score", 0.8)).to_pandas()
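Signals from different tables share the (ticker, date) key, so two scan results can be combined with an ordinary pandas merge. A minimal sketch, with synthetic frames standing in for scans of signals.whystock_score and signals.regime:

```python
import pandas as pd

scores = pd.DataFrame({
    "ticker": ["AAPL", "NVDA"],
    "date":   ["2024-01-03", "2024-01-03"],
    "score":  [0.64, 0.81],
})
regimes = pd.DataFrame({
    "ticker": ["AAPL", "NVDA"],
    "date":   ["2024-01-03", "2024-01-03"],
    "regime": ["neutral", "bull"],
})

# Inner join on the shared key
joined = scores.merge(regimes, on=["ticker", "date"], how="inner")
print(joined)
```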

Start querying in 5 minutes

Sign up for VectorFin and get immediate API access.