VectorFin

Connect Databricks to VectorFin Financial Iceberg Data

Register VectorFin Iceberg tables in Unity Catalog and query with PySpark or SQL — with full ML pipeline support via MLflow.

15 min
Setup time
7
Iceberg tables
5K+
US tickers
Nightly
Updates

Prerequisites

📋VectorFin Pro plan
🔑API key from app.vectorfinancials.com
☁️Databricks account

Connection Guide

1

Configure GCS credential and external location

Set up Unity Catalog access to VectorFin's GCS bucket.

sql
-- Create storage credential (service account JSON from VectorFin)
CREATE STORAGE CREDENTIAL vf_gcs_cred
  WITH GCS SERVICE ACCOUNT KEY = '<base64-encoded-service-account-json>';

-- Create external location
CREATE EXTERNAL LOCATION vf_warehouse
  URL 'gs://vectorfinancials-data/warehouse'
  WITH (STORAGE CREDENTIAL vf_gcs_cred);

-- Verify access by listing the location's contents
LIST 'gs://vectorfinancials-data/warehouse';
2

Register VectorFin Iceberg tables in Unity Catalog

Create a catalog and register all VectorFin tables.

sql
-- Create a catalog for VectorFin data
CREATE CATALOG IF NOT EXISTS vectorfin;
CREATE SCHEMA IF NOT EXISTS vectorfin.embeddings;
CREATE SCHEMA IF NOT EXISTS vectorfin.signals;

-- Register transcript embeddings table
CREATE TABLE IF NOT EXISTS vectorfin.embeddings.transcripts
USING ICEBERG
LOCATION 'gs://vectorfinancials-data/warehouse/embeddings/transcripts/';

-- Register signals tables
CREATE TABLE IF NOT EXISTS vectorfin.signals.whystock_score
USING ICEBERG
LOCATION 'gs://vectorfinancials-data/warehouse/signals/whystock_score/';

CREATE TABLE IF NOT EXISTS vectorfin.signals.volatility
USING ICEBERG
LOCATION 'gs://vectorfinancials-data/warehouse/signals/volatility/';
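The remaining signal tables follow the same pattern, so the DDL can be generated from a notebook cell instead of written out by hand. A minimal sketch; the `signal_table_ddl` helper and the assumption that each table name mirrors its GCS path are illustrative (on Databricks, each generated statement would be executed with `spark.sql`):

```python
# Generate CREATE TABLE statements for the VectorFin signal tables.
# Assumes each table name mirrors its warehouse path.
SIGNAL_TABLES = ["whystock_score", "volatility", "regime",
                 "sentiment_drift", "anomaly"]
BASE = "gs://vectorfinancials-data/warehouse/signals"

def signal_table_ddl(name: str) -> str:
    """Build the Unity Catalog DDL for one signal table."""
    return (
        f"CREATE TABLE IF NOT EXISTS vectorfin.signals.{name}\n"
        f"USING ICEBERG\n"
        f"LOCATION '{BASE}/{name}/'"
    )

ddl_statements = [signal_table_ddl(t) for t in SIGNAL_TABLES]

# In a Databricks notebook, run each statement:
# for ddl in ddl_statements:
#     spark.sql(ddl)
print(ddl_statements[0])
```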
3

Query with PySpark

Read VectorFin data in a Databricks notebook with PySpark.

python
# Query transcript embeddings
df = spark.table("vectorfin.embeddings.transcripts")
aapl_df = df.filter(df.ticker == "AAPL").filter(df.fiscal_period.startswith("2024"))
display(aapl_df.select("ticker", "fiscal_period", "chunk_idx").limit(20))

# Load quant signals
signals = spark.table("vectorfin.signals.whystock_score")
top_signals = signals.filter(signals.date >= "2024-01-01") \
  .orderBy(signals.score.desc()) \
  .limit(50)
display(top_signals)
4

Semantic search with NumPy

Run embedding similarity search across the full corpus using NumPy.

python
import numpy as np
from pyspark.sql.functions import col

# Load embeddings as numpy arrays
df = spark.table("vectorfin.embeddings.transcripts") \
  .filter(col("ticker") == "NVDA") \
  .select("fiscal_period", "chunk_idx", "embedding") \
  .toPandas()

embeddings = np.stack(df["embedding"].values)

# Your query vector (e.g., from Gemini text-embedding-004)
query_vec = np.array([...])  # 768-dim

# Cosine similarity
scores = embeddings @ query_vec / (np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query_vec))
top_k = df.iloc[scores.argsort()[::-1][:5]]
print(top_k[["fiscal_period", "chunk_idx"]].to_string())
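When running many queries against the same set of chunks, it can be cheaper to L2-normalize the embedding matrix once and use `np.argpartition` to find the top-k in linear time instead of fully sorting the scores. A self-contained sketch with synthetic vectors (the shapes and random data are illustrative stand-ins for the transcript embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 768))  # stand-in for transcript chunk embeddings
query_vec = rng.normal(size=768)           # stand-in for the query embedding

# Normalize rows once; cosine similarity is then a plain dot product.
unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
scores = unit @ (query_vec / np.linalg.norm(query_vec))

# argpartition isolates the k largest scores in O(n); only those k get sorted.
k = 5
top_k_idx = np.argpartition(scores, -k)[-k:]
top_k_idx = top_k_idx[np.argsort(scores[top_k_idx])[::-1]]
print(top_k_idx, scores[top_k_idx])
```

The normalized matrix can be cached and reused across queries, which is the main saving over recomputing norms per query.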

Available Tables

All 7 VectorFin data tables — bitemporal (effective_ts + knowledge_ts), append-only, nightly updates.

vectorfin.embeddings.transcripts: Earnings call chunk embeddings (768-dim)
sql
SELECT ticker, fiscal_period, chunk_idx FROM vectorfin.embeddings.transcripts WHERE ticker = 'GOOGL' LIMIT 10
vectorfin.embeddings.filings: SEC filing section embeddings
sql
SELECT ticker, filing_type, section FROM vectorfin.embeddings.filings WHERE filing_type = '10-K'
vectorfin.signals.whystock_score: Composite quant score (0–100)
sql
SELECT * FROM vectorfin.signals.whystock_score ORDER BY score DESC LIMIT 20
vectorfin.signals.regime: Market regime classification
sql
SELECT ticker, date, regime, confidence FROM vectorfin.signals.regime WHERE confidence > 0.8
vectorfin.signals.volatility: GARCH volatility forecasts
sql
SELECT ticker, date, garch_vol_1d, garch_vol_21d FROM vectorfin.signals.volatility
vectorfin.signals.sentiment_drift: Earnings sentiment drift
sql
SELECT * FROM vectorfin.signals.sentiment_drift WHERE fiscal_period >= '2024-Q1'
vectorfin.signals.anomaly: Anomaly detection scores
sql
SELECT * FROM vectorfin.signals.anomaly WHERE anomaly_score > 0.8 ORDER BY date DESC
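Because the tables are bitemporal and append-only, a point-in-time ("as-of") read filters on knowledge_ts and keeps the latest revision of each row. A minimal pandas sketch with made-up rows; the `as_of` helper is illustrative, and on Databricks the same logic maps to a PySpark filter plus window function:

```python
import pandas as pd

# Toy append-only history: AAPL's 2024-03-01 score was revised on 2024-03-10.
rows = pd.DataFrame({
    "ticker":       ["AAPL", "AAPL", "MSFT"],
    "effective_ts": pd.to_datetime(["2024-03-01", "2024-03-01", "2024-03-01"]),
    "knowledge_ts": pd.to_datetime(["2024-03-02", "2024-03-10", "2024-03-02"]),
    "score":        [61.0, 58.5, 72.0],
})

def as_of(df: pd.DataFrame, knowledge_cutoff: str) -> pd.DataFrame:
    """Latest revision of each (ticker, effective_ts) known by the cutoff."""
    known = df[df["knowledge_ts"] <= knowledge_cutoff]
    return known.sort_values("knowledge_ts").groupby(
        ["ticker", "effective_ts"], as_index=False).last()

# What was known on 2024-03-05 vs. after the revision:
print(as_of(rows, "2024-03-05")[["ticker", "score"]])
print(as_of(rows, "2024-03-15")[["ticker", "score"]])
```

Queries that omit the knowledge_ts filter see the most recent revisions; fixing the cutoff reproduces exactly what was known on a given date, which is what makes the signals usable for leakage-free backtests.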

Start querying in 15 minutes

Sign up for VectorFin and get immediate API access.