VectorFin
Connect Databricks to VectorFin Financial Iceberg Data
Register VectorFin Iceberg tables in Unity Catalog and query with PySpark, SQL, or MLflow — full ML pipeline support.
15 min setup · 7 Iceberg tables · 5K+ US tickers · Nightly updates
Prerequisites
📋 VectorFin Pro plan
🔑 API key from app.vectorfinancials.com
☁️ Databricks account
Connection Guide
1. Configure GCS credential and external location
Set up Unity Catalog access to VectorFin's GCS bucket.
sql
-- Create storage credential (service account JSON from VectorFin)
CREATE STORAGE CREDENTIAL vf_gcs_cred
WITH GCS SERVICE ACCOUNT KEY = '<base64-encoded-service-account-json>';
-- Create external location
CREATE EXTERNAL LOCATION vf_warehouse
URL 'gs://vectorfinancials-data/warehouse'
WITH (STORAGE CREDENTIAL vf_gcs_cred);
-- Validate access
VALIDATE STORAGE CREDENTIAL vf_gcs_cred;
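The `<base64-encoded-service-account-json>` placeholder above is the GCS service-account key file encoded as one base64 string. A minimal sketch of producing that string locally, assuming the key file is wherever VectorFin delivered it (the helper name `encode_service_account` is invented here for illustration):

```python
import base64
import json


def encode_service_account(path: str) -> str:
    """Read a GCS service-account JSON key file and return the base64
    string expected by the CREATE STORAGE CREDENTIAL statement."""
    with open(path, "r") as f:
        payload = json.load(f)  # fail early if the file is not valid JSON
    return base64.b64encode(json.dumps(payload).encode("utf-8")).decode("ascii")
```

Paste the returned string into the SQL statement in place of the placeholder.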
2. Register VectorFin Iceberg tables in Unity Catalog
Create a catalog and register all VectorFin tables.
sql
-- Create a catalog for VectorFin data
CREATE CATALOG IF NOT EXISTS vectorfin;
CREATE SCHEMA IF NOT EXISTS vectorfin.embeddings;
CREATE SCHEMA IF NOT EXISTS vectorfin.signals;
-- Register transcript embeddings table
CREATE TABLE IF NOT EXISTS vectorfin.embeddings.transcripts
USING ICEBERG
LOCATION 'gs://vectorfinancials-data/warehouse/embeddings/transcripts/';
-- Register signals tables
CREATE TABLE IF NOT EXISTS vectorfin.signals.whystock_score
USING ICEBERG
LOCATION 'gs://vectorfinancials-data/warehouse/signals/whystock_score/';
CREATE TABLE IF NOT EXISTS vectorfin.signals.volatility
USING ICEBERG
LOCATION 'gs://vectorfinancials-data/warehouse/signals/volatility/';
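Every table follows the same `CREATE TABLE ... USING ICEBERG LOCATION ...` pattern, so the DDL for the remaining tables can be generated rather than hand-written. A sketch, assuming the bucket layout shown in this guide (the helper name and the table dict are illustrative, and the dict lists only a few of the seven tables):

```python
# Relative paths under the warehouse bucket, keyed by schema.table name.
TABLES = {
    "embeddings.transcripts": "embeddings/transcripts",
    "signals.whystock_score": "signals/whystock_score",
    "signals.volatility": "signals/volatility",
}
BASE = "gs://vectorfinancials-data/warehouse"


def register_table_sql(qualified_name: str, rel_path: str) -> str:
    """Build the CREATE TABLE DDL for one VectorFin Iceberg table."""
    return (
        f"CREATE TABLE IF NOT EXISTS vectorfin.{qualified_name}\n"
        f"USING ICEBERG\n"
        f"LOCATION '{BASE}/{rel_path}/';"
    )


statements = [register_table_sql(name, path) for name, path in TABLES.items()]
# In a Databricks notebook: for s in statements: spark.sql(s)
```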
3. Query with PySpark
Read VectorFin data in a Databricks notebook with PySpark.
python
# Query transcript embeddings
df = spark.table("vectorfin.embeddings.transcripts")
aapl_df = df.filter(df.ticker == "AAPL").filter(df.fiscal_period.startswith("2024"))
display(aapl_df.select("ticker", "fiscal_period", "chunk_idx").limit(20))
# Load quant signals
signals = spark.table("vectorfin.signals.whystock_score")
top_signals = (
    signals.filter(signals.date >= "2024-01-01")
    .orderBy(signals.score.desc())
    .limit(50)
)
display(top_signals)
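The same filter-and-rank logic can be prototyped locally in pandas on made-up rows before running it against the live table; a sketch (the sample data here is invented, not real VectorFin output):

```python
import pandas as pd

# Toy stand-in for vectorfin.signals.whystock_score
signals = pd.DataFrame({
    "ticker": ["AAPL", "MSFT", "NVDA"],
    "date": ["2024-03-01", "2023-12-15", "2024-02-10"],
    "score": [71.2, 88.0, 93.5],
})

# Same shape as the PySpark query: filter by date, rank by score, take the top rows.
top_signals = (
    signals[signals["date"] >= "2024-01-01"]
    .sort_values("score", ascending=False)
    .head(50)
)
```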
4. Semantic search with NumPy
Run cosine-similarity search over transcript embeddings with NumPy.
python
import numpy as np
from pyspark.sql.functions import col
# Load embeddings as numpy arrays
df = (
    spark.table("vectorfin.embeddings.transcripts")
    .filter(col("ticker") == "NVDA")
    .select("fiscal_period", "chunk_idx", "embedding")
    .toPandas()
)
embeddings = np.stack(df["embedding"].values)
# Your query vector (e.g., from Gemini text-embedding-004)
query_vec = np.array([...]) # 768-dim
# Cosine similarity
scores = embeddings @ query_vec / (np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query_vec))
top_k = df.iloc[scores.argsort()[::-1][:5]]
print(top_k[["fiscal_period", "chunk_idx"]].to_string())
Available Tables
All 7 VectorFin data tables — bitemporal (effective_ts + knowledge_ts), append-only, nightly updates.
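Because every row carries both `effective_ts` (when the fact was true) and `knowledge_ts` (when it was recorded), point-in-time "as-of" queries need to bound both timestamps. A sketch of building such a query, assuming each table also has a `ticker` column as the samples below suggest (`as_of_query` is a helper name invented here; `QUALIFY` is supported in Databricks SQL):

```python
def as_of_query(table: str, effective_date: str, knowledge_time: str) -> str:
    """Build a point-in-time query: per ticker, the latest row whose
    effective_ts is on or before effective_date, as it was known at
    knowledge_time (ignoring later restatements)."""
    return f"""
SELECT *
FROM {table}
WHERE effective_ts <= '{effective_date}'
  AND knowledge_ts <= '{knowledge_time}'
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY ticker ORDER BY effective_ts DESC, knowledge_ts DESC
) = 1
""".strip()

# In a Databricks notebook:
# display(spark.sql(as_of_query("vectorfin.signals.whystock_score",
#                               "2024-03-31", "2024-04-01T00:00:00")))
```

Bounding `knowledge_ts` is what makes backtests reproducible: the query returns only what was known as of that moment, even after later corrections are appended.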
vectorfin.embeddings.transcripts: Earnings call chunk embeddings (768-dim)
sql
SELECT ticker, fiscal_period, chunk_idx FROM vectorfin.embeddings.transcripts WHERE ticker = 'GOOGL' LIMIT 10;
vectorfin.embeddings.filings: SEC filing section embeddings
sql
SELECT ticker, filing_type, section FROM vectorfin.embeddings.filings WHERE filing_type = '10-K';
vectorfin.signals.whystock_score: Composite quant score (0–100)
sql
SELECT * FROM vectorfin.signals.whystock_score ORDER BY score DESC LIMIT 20;
vectorfin.signals.regime: Market regime classification
sql
SELECT ticker, date, regime, confidence FROM vectorfin.signals.regime WHERE confidence > 0.8;
vectorfin.signals.volatility: GARCH volatility forecasts
sql
SELECT ticker, date, garch_vol_1d, garch_vol_21d FROM vectorfin.signals.volatility;
vectorfin.signals.sentiment_drift: Earnings sentiment drift
sql
SELECT * FROM vectorfin.signals.sentiment_drift WHERE fiscal_period >= '2024-Q1';
vectorfin.signals.anomaly: Anomaly detection scores
sql
SELECT * FROM vectorfin.signals.anomaly WHERE anomaly_score > 0.8 ORDER BY date DESC;
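Signal tables that share `ticker` and `date` columns, as the sample queries above suggest, can be joined directly. A sketch of building such a join (column names are assumed from the samples, not a confirmed schema; `signal_join_query` is a helper name invented here):

```python
def signal_join_query(start_date: str) -> str:
    """Join the composite score with GARCH volatility forecasts on
    (ticker, date), ranked by score."""
    return f"""
SELECT s.ticker, s.date, s.score, v.garch_vol_21d
FROM vectorfin.signals.whystock_score AS s
JOIN vectorfin.signals.volatility AS v
  ON s.ticker = v.ticker AND s.date = v.date
WHERE s.date >= '{start_date}'
ORDER BY s.score DESC
""".strip()

# In a Databricks notebook:
# display(spark.sql(signal_join_query("2024-01-01")))
```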
Start querying in 15 minutes
Sign up for VectorFin and get immediate API access.