
VectorFin Apache Iceberg Financial Data

Access VectorFin data directly via pyiceberg — open format, no vendor lock-in, bitemporal time-travel built in.

5 min setup time
7 Iceberg tables
5K+ US tickers
Nightly updates

Prerequisites

📋VectorFin Starter plan
🔑API key from app.vectorfinancials.com
☁️Python environment with pip (no separate Iceberg account needed; Iceberg is an open table format)

Connection Guide

1

Install pyiceberg and configure GCS catalog

Install dependencies and set up catalog connection to VectorFin's Polaris REST catalog.

bash
pip install "pyiceberg[gcs]" pandas numpy
2

Connect to the Polaris catalog

Initialize a pyiceberg catalog pointing to VectorFin's REST catalog endpoint.

python
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "vectorfin",
    **{
        "type": "rest",
        "uri": "https://catalog.vectorfinancials.com",
        "credential": "client_id:client_secret",  # from your VectorFin dashboard
        "warehouse": "vectorfin_warehouse",
    }
)

# List available namespaces and tables
print(catalog.list_namespaces())
# [('embeddings',), ('signals',)]

print(catalog.list_tables("signals"))
# [('signals', 'whystock_score'), ('signals', 'regime'), ...]
3

Load a table and scan data

Open a VectorFin Iceberg table and scan into a pandas DataFrame.

python
import pandas as pd

# Open the whystock_score signals table
table = catalog.load_table("signals.whystock_score")

# Scan with filters (pushdown predicates)
from pyiceberg.expressions import And, GreaterThanOrEqual, EqualTo

df = table.scan(
    row_filter=And(
        EqualTo("ticker", "AAPL"),
        GreaterThanOrEqual("date", "2024-01-01"),
    ),
    selected_fields=("ticker", "date", "score", "components"),
).to_pandas()

print(df.head(10))
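Once the scan lands in pandas, downstream work is ordinary DataFrame manipulation. As a minimal sketch, with synthetic rows standing in for a real multi-ticker scan result (column names follow the scan above), here is one common step: the latest score per ticker, ranked best-first.

```python
import pandas as pd

# Synthetic stand-in for a multi-ticker scan result
df = pd.DataFrame({
    "ticker": ["AAPL", "AAPL", "MSFT", "MSFT"],
    "date":   ["2024-01-02", "2024-01-03", "2024-01-02", "2024-01-03"],
    "score":  [0.61, 0.64, 0.72, 0.70],
})

# Latest score per ticker, then rank best-first
latest = (
    df.sort_values("date")
      .groupby("ticker", as_index=False)
      .tail(1)
      .sort_values("score", ascending=False)
)
print(latest[["ticker", "score"]].to_string(index=False))
```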
4

Time-travel: point-in-time query

Use VectorFin's bitemporal knowledge_ts column to reconstruct what was known as of a specific date.

python
from pyiceberg.expressions import And, LessThanOrEqual, EqualTo
from datetime import datetime

# Load embeddings table
emb_table = catalog.load_table("embeddings.transcripts")

# What did we know about AAPL as of Jan 1, 2024?
df = emb_table.scan(
    row_filter=And(
        EqualTo("ticker", "AAPL"),
        LessThanOrEqual("knowledge_ts", datetime(2024, 1, 1).isoformat()),
    ),
    selected_fields=("ticker", "fiscal_period", "chunk_idx", "embedding"),
).to_pandas()

import numpy as np
E = np.stack(df["embedding"].values)
print(f"Loaded {len(E)} embeddings with shape {E.shape}")
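With the embedding matrix stacked, nearest-neighbor lookup is a couple of numpy operations. A minimal sketch, using a random matrix as a stand-in for the loaded embeddings (the real embedding dimension depends on the table schema):

```python
import numpy as np

rng = np.random.default_rng(0)
E = rng.normal(size=(100, 8))     # stand-in for the stacked embedding matrix
q = E[0]                          # query vector: the first chunk

# Cosine similarity of the query against every row
En = E / np.linalg.norm(E, axis=1, keepdims=True)
qn = q / np.linalg.norm(q)
sims = En @ qn

top5 = np.argsort(sims)[::-1][:5]  # indices of the 5 most similar chunks
print(top5, sims[top5])
```

The query vector itself scores a cosine similarity of 1.0, so it always ranks first; in practice the query would be an embedding of new text produced by the same model.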

Available Tables

All 7 VectorFin data tables — bitemporal (effective_ts + knowledge_ts), append-only, nightly updates.
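Because the tables are append-only, the same (ticker, effective_ts) key can appear under several knowledge_ts revisions, and a point-in-time view keeps only the newest revision known as of the query date. A minimal pandas sketch of that dedup pattern, with synthetic rows (column names follow the bitemporal schema above):

```python
import pandas as pd

rows = pd.DataFrame({
    "ticker":       ["AAPL", "AAPL", "AAPL"],
    "effective_ts": ["2024-01-05", "2024-01-05", "2024-01-05"],
    "knowledge_ts": ["2024-01-06", "2024-02-01", "2024-03-01"],
    "score":        [0.55, 0.58, 0.60],  # later revisions restate the same day
})

as_of = "2024-02-15"

# Keep only what was known as of the query date,
# then the newest revision per (ticker, effective_ts) key
pit = (
    rows[rows["knowledge_ts"] <= as_of]
    .sort_values("knowledge_ts")
    .groupby(["ticker", "effective_ts"], as_index=False)
    .tail(1)
)
print(pit)
```

ISO-8601 timestamps compare correctly as strings, which keeps the sketch simple; with real data you would typically apply the knowledge_ts filter in the Iceberg row_filter, as in step 4, so the pruning happens before rows reach pandas.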

The snippets below assume EqualTo and GreaterThan are imported from pyiceberg.expressions.

embeddings.transcripts: Earnings call chunk embeddings
python
catalog.load_table("embeddings.transcripts").scan(row_filter=EqualTo("ticker", "AAPL")).to_pandas()
embeddings.filings: SEC filing section embeddings
python
catalog.load_table("embeddings.filings").scan(row_filter=EqualTo("filing_type", "10-K")).to_pandas()
signals.whystock_score: Composite quant score
python
catalog.load_table("signals.whystock_score").scan().to_pandas().sort_values("score", ascending=False)
signals.regime: Market regime classification
python
catalog.load_table("signals.regime").scan(row_filter=EqualTo("ticker", "NVDA")).to_pandas()
signals.volatility: GARCH volatility forecasts
python
catalog.load_table("signals.volatility").scan().to_pandas()
signals.sentiment_drift: Earnings sentiment drift
python
catalog.load_table("signals.sentiment_drift").scan().to_pandas()
signals.anomaly: Anomaly detection scores
python
catalog.load_table("signals.anomaly").scan(row_filter=GreaterThan("anomaly_score", 0.8)).to_pandas()
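Signals from different tables share the (ticker, date) key, so two scan results can be combined with an ordinary pandas merge. A minimal sketch, with synthetic frames standing in for scans of signals.whystock_score and signals.regime:

```python
import pandas as pd

scores = pd.DataFrame({
    "ticker": ["AAPL", "NVDA"],
    "date":   ["2024-01-03", "2024-01-03"],
    "score":  [0.64, 0.81],
})
regimes = pd.DataFrame({
    "ticker": ["AAPL", "NVDA"],
    "date":   ["2024-01-03", "2024-01-03"],
    "regime": ["neutral", "bull"],
})

# Inner join on the shared key
joined = scores.merge(regimes, on=["ticker", "date"], how="inner")
print(joined)
```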

Start querying in 5 minutes

Sign up for VectorFin and get immediate API access.