An open table format that brings ACID transactions, schema evolution, and time travel to large-scale analytical datasets stored in cloud object storage.
In Plain English
Object storage (AWS S3, Google Cloud Storage, Azure Blob) is cheap, durable, and virtually unlimited. Traditional data warehouses are expensive, proprietary, and lock you into a single vendor. Apache Iceberg was invented to give you the best of both worlds: store your data as ordinary files in cheap cloud storage, but access it with all the power and reliability of a database — transactions, schema changes, time travel, multi-engine access.
Without Iceberg, storing analytical data in object storage is a mess. Files are just files — there's no concept of a "transaction" when multiple jobs are writing simultaneously, no way to roll back a bad write, no efficient way to query only the rows you need without reading entire files. You end up with the "data swamp" problem: terabytes of files that are expensive and fragile to manage.
Iceberg solves this by adding a metadata layer on top of the files. The metadata layer tracks exactly which files make up the current "table," manages commits atomically (so concurrent writers cannot corrupt the table; conflicting commits are detected rather than silently interleaved), records schema history, and maintains a log of all previous table snapshots. This last feature enables time travel: query what the table looked like yesterday, last week, or last year — just by pointing at an earlier snapshot.
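The snapshot mechanism can be illustrated with a toy model (plain Python, not Iceberg's actual on-disk format): each commit records a new immutable list of data files and swaps a single "current" pointer, so readers can always reconstruct any historical table state.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)
class Snapshot:
    """One immutable table state: the exact set of data files."""
    snapshot_id: int
    committed_at: datetime
    data_files: tuple  # file paths frozen at commit time

@dataclass
class ToyTable:
    snapshots: list = field(default_factory=list)
    current: Snapshot = None  # the single pointer a commit swaps

    def commit(self, data_files):
        snap = Snapshot(
            snapshot_id=len(self.snapshots) + 1,
            committed_at=datetime.now(timezone.utc),
            data_files=tuple(data_files),
        )
        self.snapshots.append(snap)   # history is never rewritten
        self.current = snap           # pointer swap = the commit
        return snap

    def as_of(self, ts):
        """Time travel: latest snapshot committed at or before ts."""
        eligible = [s for s in self.snapshots if s.committed_at <= ts]
        return max(eligible, key=lambda s: s.committed_at, default=None)

table = ToyTable()
table.commit(["part-000.parquet"])
mid = datetime.now(timezone.utc)
table.commit(["part-000.parquet", "part-001.parquet"])

print(table.current.data_files)     # newest state: two files
print(table.as_of(mid).data_files)  # time travel: one file
```

Because old snapshots are never mutated, "query yesterday's table" is just "resolve yesterday's file list" — no data is copied or restored.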
For financial data, time travel is not just a nice feature — it's essential. Backtesting requires knowing exactly what data was available on any given historical date. Iceberg's snapshot history, combined with VectorFin's bitemporal timestamps, gives you complete auditability of exactly what you knew and when you knew it.
Technical Definition
An Iceberg table consists of three layers:
Data files: Parquet (or ORC, Avro) files stored in object storage. Organized by partition values but not necessarily by directory structure — Iceberg tracks assignments explicitly.
Manifest files: Avro files listing a subset of data files with per-file statistics (row count, min/max values per column, null counts). Enable partition pruning and file-level statistics-based filtering.
Manifest list (snapshot): Lists all manifest files for a given table snapshot. Each commit produces a new snapshot atomically by creating a new manifest list.
Catalog: The entry point — maps table names to their current metadata file location. Catalogs include Hive Metastore, AWS Glue, and REST catalogs such as Apache Polaris.
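The read path through these layers can be sketched as a lookup chain (illustrative names and dicts, not Iceberg's JSON/Avro formats): the catalog resolves a table name to metadata, the snapshot's manifest list points at manifests, and the per-file min/max statistics let a query planner skip files that cannot match.

```python
# Hypothetical, simplified metadata objects -- real Iceberg stores these
# as JSON metadata files and Avro manifests in object storage.
catalog = {"signals.volatility": "metadata-v3"}

metadata = {"metadata-v3": {"manifest_list": ["manifest-a", "manifest-b"]}}

manifests = {
    "manifest-a": [  # per-data-file column stats enable pruning
        {"path": "part-000.parquet", "ticker_min": "AAPL", "ticker_max": "MSFT"},
        {"path": "part-001.parquet", "ticker_min": "NFLX", "ticker_max": "TSLA"},
    ],
    "manifest-b": [
        {"path": "part-002.parquet", "ticker_min": "UBER", "ticker_max": "ZM"},
    ],
}

def plan_scan(table_name, ticker):
    """Resolve catalog -> metadata -> manifest list -> manifests,
    pruning files whose [min, max] range cannot contain the ticker."""
    meta = metadata[catalog[table_name]]
    files = []
    for manifest in meta["manifest_list"]:
        for f in manifests[manifest]:
            if f["ticker_min"] <= ticker <= f["ticker_max"]:
                files.append(f["path"])
    return files

print(plan_scan("signals.volatility", "NVDA"))  # only part-001 survives pruning
```

This is why Iceberg queries can touch a tiny fraction of a petabyte table: most files are eliminated from the plan before a single byte of Parquet is read.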
Key capabilities:
- ACID transactions: Optimistic concurrency control via atomic metadata swaps
- Schema evolution: Add, rename, reorder, or drop columns without rewriting data files
- Partition evolution: Change partitioning strategy without breaking existing queries
- Time travel: SELECT * FROM table TIMESTAMP AS OF '2024-01-01'
- Row-level deletes: Merge-on-read or copy-on-write delete strategies
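The ACID bullet can be made concrete with a toy compare-and-swap commit loop (illustrative only, not Iceberg's implementation): a writer remembers which metadata version it started from, and its commit succeeds only if the catalog pointer has not moved in the meantime; otherwise it re-reads and retries.

```python
import threading

class ToyCatalog:
    """Single source of truth: the current metadata pointer."""
    def __init__(self):
        self._lock = threading.Lock()
        self.current = "metadata-v1"

    def swap(self, expected, new):
        """Atomic compare-and-swap; the only synchronized step."""
        with self._lock:
            if self.current != expected:
                return False  # someone else committed first
            self.current = new
            return True

def commit(catalog, writer):
    for _attempt in range(5):
        base = catalog.current                  # snapshot we build on
        new = f"metadata-by-{writer}-from-{base}"
        if catalog.swap(base, new):             # try to publish
            return new
        # conflict: loop re-reads the new base and retries
        # (real Iceberg also re-validates that the changes still apply)
    raise RuntimeError("too many commit conflicts")

cat = ToyCatalog()
a = commit(cat, "jobA")
b = commit(cat, "jobB")
print(cat.current)  # jobB's commit, built on top of jobA's
```

Because the only contended operation is one pointer swap, writers never block readers, and a failed commit leaves no partial state behind — the losing writer simply retries against the new snapshot.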
How VectorFin Uses This
All VectorFin data is stored as Apache Iceberg tables on GCS at:
gs://vectorfinancials-data/warehouse/
embeddings/transcripts/
embeddings/filings/
signals/whystock_score/
signals/regime/
signals/volatility/
signals/sentiment_drift/
signals/anomaly/

The Apache Polaris REST catalog at catalog.vectorfinancials.com serves as the Iceberg catalog. Pro customers point Snowflake, Databricks, or BigQuery directly at this catalog to query VectorFin data using their existing SQL tools without any ETL.
Every table is append-only and bitemporal. The Iceberg snapshot history provides a third layer of temporal auditability on top of the bitemporal row timestamps.
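A point-in-time read over bitemporal rows can be sketched with stdlib sqlite3 (toy schema; the knowledge_ts column name follows the convention used above): filtering on knowledge_ts reproduces exactly what was known as of a past date, even after a later restatement row has been appended.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE volatility (
        ticker TEXT, date TEXT, garch_vol_21d REAL,
        knowledge_ts TEXT   -- when this row became known
    )
""")
# Append-only: the 2024-09-15 restatement does not overwrite the original
conn.executemany(
    "INSERT INTO volatility VALUES (?, ?, ?, ?)",
    [
        ("NVDA", "2024-09-01", 0.42, "2024-09-02"),
        ("NVDA", "2024-09-01", 0.45, "2024-09-15"),  # restated later
    ],
)

def as_known_on(knowledge_date):
    """Latest value for (NVDA, 2024-09-01) known as of knowledge_date."""
    row = conn.execute("""
        SELECT garch_vol_21d FROM volatility
        WHERE ticker = 'NVDA' AND date = '2024-09-01'
          AND knowledge_ts <= ?
        ORDER BY knowledge_ts DESC LIMIT 1
    """, (knowledge_date,)).fetchone()
    return row[0] if row else None

print(as_known_on("2024-09-05"))  # 0.42 -- pre-restatement value
print(as_known_on("2024-10-01"))  # 0.45 -- includes the restatement
```

A backtest pinned to knowledge_ts <= '2024-09-05' sees 0.42, exactly as a live strategy would have on that day; Iceberg's snapshot history adds the same guarantee one level up, at the granularity of whole table commits.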
Code Example
import duckdb

# Connect to VectorFin's Iceberg catalog via DuckDB (Pro/Enterprise plan)
conn = duckdb.connect()

# Install and load the Iceberg extension
conn.execute("INSTALL iceberg; LOAD iceberg;")

# Register the API key as a bearer-token secret, then attach the
# Polaris REST catalog as a DuckDB database named vectorfinancials
conn.execute("""
    CREATE SECRET vectorfin_secret (
        TYPE iceberg,
        TOKEN 'your_api_key_here'
    );
""")
conn.execute("""
    ATTACH 'vectorfinancials' AS vectorfinancials (
        TYPE iceberg,
        SECRET vectorfin_secret,
        ENDPOINT 'https://catalog.vectorfinancials.com'
    );
""")
# Query the signals table directly with SQL
df = conn.execute("""
SELECT
ticker,
date,
garch_vol_21d,
knowledge_ts
FROM vectorfinancials.signals.volatility
WHERE ticker = 'NVDA'
AND date >= '2024-01-01'
AND knowledge_ts <= '2024-10-01' -- point-in-time safety
ORDER BY date
""").df()
print(df.head(10))
print(f"Rows returned: {len(df)}")
# Time travel: see what the table looked like on 2024-09-01.
# Time-travel SQL is engine-specific (Spark SQL: TIMESTAMP AS OF,
# Flink: FOR SYSTEM_TIME AS OF). With DuckDB's iceberg extension,
# pin a snapshot by scanning the table location directly:
df_historical = conn.execute("""
    SELECT * FROM iceberg_scan(
        'gs://vectorfinancials-data/warehouse/signals/volatility',
        snapshot_from_timestamp => TIMESTAMP '2024-09-01 00:00:00')
    WHERE ticker = 'NVDA' AND date = '2024-09-01'
""").df()
Put Apache Iceberg to work in your pipeline
Access AI-ready financial data — embeddings, signals, Iceberg tables.