An open table format that brings ACID transactions, schema evolution, and time travel to large-scale analytical datasets stored in cloud object storage.
In Plain English
Object storage (AWS S3, Google Cloud Storage, Azure Blob) is cheap, durable, and virtually unlimited. Traditional data warehouses are expensive, proprietary, and lock you into a single vendor. Apache Iceberg was invented to give you the best of both worlds: store your data as ordinary files in cheap cloud storage, but access it with all the power and reliability of a database — transactions, schema changes, time travel, multi-engine access.
Without Iceberg, storing analytical data in object storage is a mess. Files are just files — there's no concept of a "transaction" when multiple jobs are writing simultaneously, no way to roll back a bad write, no efficient way to query only the rows you need without reading entire files. You end up with the "data swamp" problem: terabytes of files that are expensive and fragile to manage.
Iceberg solves this by adding a metadata layer on top of the files. The metadata layer tracks exactly which files make up the current "table," manages commits atomically (so concurrent writers cannot corrupt the table; conflicting commits are detected rather than silently interleaved), records schema history, and maintains a log of all previous table snapshots. This last feature enables time travel: query what the table looked like yesterday, last week, or last year — just by pointing at an earlier snapshot.
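The snapshot mechanism can be illustrated with a toy model (plain Python, not Iceberg's actual on-disk format): each commit records a new immutable list of data files and swaps a single "current" pointer, so readers can always reconstruct any historical table state.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)
class Snapshot:
    """One immutable table state: the exact set of data files."""
    snapshot_id: int
    committed_at: datetime
    data_files: tuple  # file paths frozen at commit time

@dataclass
class ToyTable:
    snapshots: list = field(default_factory=list)
    current: Snapshot = None  # the single pointer a commit swaps

    def commit(self, data_files):
        snap = Snapshot(
            snapshot_id=len(self.snapshots) + 1,
            committed_at=datetime.now(timezone.utc),
            data_files=tuple(data_files),
        )
        self.snapshots.append(snap)   # history is never rewritten
        self.current = snap           # pointer swap = the commit
        return snap

    def as_of(self, ts):
        """Time travel: latest snapshot committed at or before ts."""
        eligible = [s for s in self.snapshots if s.committed_at <= ts]
        return max(eligible, key=lambda s: s.committed_at, default=None)

table = ToyTable()
table.commit(["part-000.parquet"])
mid = datetime.now(timezone.utc)
table.commit(["part-000.parquet", "part-001.parquet"])

print(table.current.data_files)     # newest state: two files
print(table.as_of(mid).data_files)  # time travel: one file
```

Because old snapshots are never mutated, "query yesterday's table" is just "resolve yesterday's file list" — no data is copied or restored.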
For financial data, time travel is not just a nice feature — it's essential. Backtesting requires knowing exactly what data was available on any given historical date. Iceberg's snapshot history, combined with VectorFin's bitemporal timestamps, gives you complete auditability of exactly what you knew and when you knew it.
Technical Definition
An Iceberg table consists of three layers:
Data files: Parquet (or ORC, Avro) files stored in object storage. Organized by partition values but not necessarily by directory structure — Iceberg tracks assignments explicitly.
Manifest files: Avro files listing a subset of data files with per-file statistics (row count, min/max values per column, null counts). Enable partition pruning and file-level statistics-based filtering.
Manifest list (snapshot): Lists all manifest files for a given table snapshot. Each commit produces a new snapshot atomically by creating a new manifest list.
Catalog: The entry point — maps table names to their current metadata file location. Catalogs include Hive Metastore, AWS Glue, and REST catalogs such as Apache Polaris.
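The read path through these layers can be sketched as a lookup chain (illustrative names and dicts, not Iceberg's JSON/Avro formats): the catalog resolves a table name to metadata, the snapshot's manifest list points at manifests, and the per-file min/max statistics let a query planner skip files that cannot match.

```python
# Hypothetical, simplified metadata objects -- real Iceberg stores these
# as JSON metadata files and Avro manifests in object storage.
catalog = {"signals.volatility": "metadata-v3"}

metadata = {"metadata-v3": {"manifest_list": ["manifest-a", "manifest-b"]}}

manifests = {
    "manifest-a": [  # per-data-file column stats enable pruning
        {"path": "part-000.parquet", "ticker_min": "AAPL", "ticker_max": "MSFT"},
        {"path": "part-001.parquet", "ticker_min": "NFLX", "ticker_max": "TSLA"},
    ],
    "manifest-b": [
        {"path": "part-002.parquet", "ticker_min": "UBER", "ticker_max": "ZM"},
    ],
}

def plan_scan(table_name, ticker):
    """Resolve catalog -> metadata -> manifest list -> manifests,
    pruning files whose [min, max] range cannot contain the ticker."""
    meta = metadata[catalog[table_name]]
    files = []
    for manifest in meta["manifest_list"]:
        for f in manifests[manifest]:
            if f["ticker_min"] <= ticker <= f["ticker_max"]:
                files.append(f["path"])
    return files

print(plan_scan("signals.volatility", "NVDA"))  # only part-001 survives pruning
```

This is why Iceberg queries can touch a tiny fraction of a petabyte table: most files are eliminated from the plan before a single byte of Parquet is read.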
Key capabilities:
- ACID transactions: Optimistic concurrency control via atomic metadata swaps
- Schema evolution: Add, rename, reorder, or drop columns without rewriting data files
- Partition evolution: Change partitioning strategy without breaking existing queries
- Time travel: SELECT * FROM table TIMESTAMP AS OF '2024-01-01'
- Row-level deletes: Merge-on-read or copy-on-write delete strategies
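The ACID bullet can be made concrete with a toy compare-and-swap commit loop (illustrative only, not Iceberg's implementation): a writer remembers which metadata version it started from, and its commit succeeds only if the catalog pointer has not moved in the meantime; otherwise it re-reads and retries.

```python
import threading

class ToyCatalog:
    """Single source of truth: the current metadata pointer."""
    def __init__(self):
        self._lock = threading.Lock()
        self.current = "metadata-v1"

    def swap(self, expected, new):
        """Atomic compare-and-swap; the only synchronized step."""
        with self._lock:
            if self.current != expected:
                return False  # someone else committed first
            self.current = new
            return True

def commit(catalog, writer):
    for _attempt in range(5):
        base = catalog.current                  # snapshot we build on
        new = f"metadata-by-{writer}-from-{base}"
        if catalog.swap(base, new):             # try to publish
            return new
        # conflict: loop re-reads the new base and retries
        # (real Iceberg also re-validates that the changes still apply)
    raise RuntimeError("too many commit conflicts")

cat = ToyCatalog()
a = commit(cat, "jobA")
b = commit(cat, "jobB")
print(cat.current)  # jobB's commit, built on top of jobA's
```

Because the only contended operation is one pointer swap, writers never block readers, and a failed commit leaves no partial state behind — the losing writer simply retries against the new snapshot.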
How VectorFin Uses This
All VectorFin data is stored as Apache Iceberg tables on GCS at:
gs://vectorfinancials-data/warehouse/
embeddings/transcripts/
embeddings/filings/
signals/whystock_score/
signals/regime/
signals/volatility/
signals/sentiment_drift/
signals/anomaly/

The Apache Polaris REST catalog at catalog.vectorfinancials.com serves as the Iceberg catalog. Pro customers point Snowflake, Databricks, or BigQuery directly at this catalog to query VectorFin data using their existing SQL tools without any ETL.
Every table is append-only and bitemporal. The Iceberg snapshot history provides a third layer of temporal auditability on top of the bitemporal row timestamps.
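A point-in-time read over bitemporal rows can be sketched with stdlib sqlite3 (toy schema; the knowledge_ts column name follows the convention used above): filtering on knowledge_ts reproduces exactly what was known as of a past date, even after a later restatement row has been appended.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE volatility (
        ticker TEXT, date TEXT, garch_vol_21d REAL,
        knowledge_ts TEXT   -- when this row became known
    )
""")
# Append-only: the 2024-09-15 restatement does not overwrite the original
conn.executemany(
    "INSERT INTO volatility VALUES (?, ?, ?, ?)",
    [
        ("NVDA", "2024-09-01", 0.42, "2024-09-02"),
        ("NVDA", "2024-09-01", 0.45, "2024-09-15"),  # restated later
    ],
)

def as_known_on(knowledge_date):
    """Latest value for (NVDA, 2024-09-01) known as of knowledge_date."""
    row = conn.execute("""
        SELECT garch_vol_21d FROM volatility
        WHERE ticker = 'NVDA' AND date = '2024-09-01'
          AND knowledge_ts <= ?
        ORDER BY knowledge_ts DESC LIMIT 1
    """, (knowledge_date,)).fetchone()
    return row[0] if row else None

print(as_known_on("2024-09-05"))  # 0.42 -- pre-restatement value
print(as_known_on("2024-10-01"))  # 0.45 -- includes the restatement
```

A backtest pinned to knowledge_ts <= '2024-09-05' sees 0.42, exactly as a live strategy would have on that day; Iceberg's snapshot history adds the same guarantee one level up, at the granularity of whole table commits.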
Code Example
import duckdb

# Connect to VectorFin's Iceberg catalog via DuckDB (Pro/Enterprise plan)
conn = duckdb.connect()

# Install and load the Iceberg extension
conn.execute("INSTALL iceberg; LOAD iceberg;")

# Register the API key as a bearer-token secret, then attach the
# Polaris REST catalog as a DuckDB database named vectorfinancials
conn.execute("""
    CREATE SECRET vectorfin_secret (
        TYPE iceberg,
        TOKEN 'your_api_key_here'
    );
""")
conn.execute("""
    ATTACH 'vectorfinancials' AS vectorfinancials (
        TYPE iceberg,
        SECRET vectorfin_secret,
        ENDPOINT 'https://catalog.vectorfinancials.com'
    );
""")
# Query the signals table directly with SQL
df = conn.execute("""
SELECT
ticker,
date,
garch_vol_21d,
knowledge_ts
FROM vectorfinancials.signals.volatility
WHERE ticker = 'NVDA'
AND date >= '2024-01-01'
AND knowledge_ts <= '2024-10-01' -- point-in-time safety
ORDER BY date
""").df()
print(df.head(10))
print(f"Rows returned: {len(df)}")
# Time travel: see what the table looked like on 2024-09-01.
# Time-travel SQL is engine-specific (Spark SQL: TIMESTAMP AS OF,
# Flink: FOR SYSTEM_TIME AS OF). With DuckDB's iceberg extension,
# pin a snapshot by scanning the table location directly:
df_historical = conn.execute("""
    SELECT * FROM iceberg_scan(
        'gs://vectorfinancials-data/warehouse/signals/volatility',
        snapshot_from_timestamp => TIMESTAMP '2024-09-01 00:00:00')
    WHERE ticker = 'NVDA' AND date = '2024-09-01'
""").df()
Put Apache Iceberg to work in your pipeline
Access AI-ready financial data — embeddings, signals, Iceberg tables.