VectorFin/Glossary/Data Lakehouse
Data Engineering

What is a Data Lakehouse?

An architecture combining the cheap, scalable storage of a data lake with the ACID transactions, governance, and SQL capabilities of a data warehouse.

In Plain English

For decades, organizations had to choose between two extremes for storing analytical data. Data warehouses (like Snowflake, Redshift, BigQuery) offered excellent performance, ACID transactions, and strong governance — but were expensive, proprietary, and required copying data in. Data lakes (raw files in S3 or GCS) were cheap and flexible — but were slow, lacked transactions, and turned into data swamps as schemas drifted and files accumulated.

The data lakehouse is an attempt to get the best of both worlds: store data as open-format files in cheap cloud storage (like a data lake), but manage it with a metadata layer that provides the warehouse features (transactions, schema enforcement, time travel) on top.

The key insight is architectural separation: the storage layer (object storage) and the compute layer (DuckDB, Spark, BigQuery, Snowflake) are decoupled. You store your data once as Parquet files on GCS. Then multiple different engines can read the same files — BigQuery can query it, Snowflake can query it, DuckDB on your laptop can query it — without any data movement. The metadata layer (Apache Iceberg, Delta Lake, Apache Hudi) coordinates transactions and schema across all these engines.

The practical benefit for data consumers is enormous: no more ETL pipelines to copy data between systems, no more vendor lock-in, no more paying twice for the same data. You own the data in open formats on your own storage, and every analytics tool speaks the same language.

Technical Definition

The lakehouse architecture stack:

Storage Layer: Object storage (GCS, S3, ADLS) with Parquet/ORC/Avro files. Virtually unlimited scale, ~$0.02/GB/month. No compute costs for storage.

Table Format Layer: Open table format (Apache Iceberg, Delta Lake, Apache Hudi) adds:

  • Atomic commits (ACID transactions over multiple files)
  • Metadata files (manifest lists, manifests) enabling partition pruning and statistics-based optimization
  • Schema registry and evolution support
  • Snapshot history enabling time travel
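The mechanics behind atomic commits and time travel can be shown with a toy, stdlib-only sketch — this is an illustration of the pattern, not Iceberg's actual implementation: each commit writes a new immutable metadata file, then atomically swaps a pointer to it, so readers always see a complete snapshot and old snapshots stay readable.

```python
import json
import os
import tempfile

root = tempfile.mkdtemp()
pointer = os.path.join(root, "current_metadata")  # the table's "current" pointer


def commit(snapshot_id, files):
    """Write an immutable metadata file, then atomically swap the pointer."""
    meta_path = os.path.join(root, f"v{snapshot_id}.metadata.json")
    with open(meta_path, "w") as f:
        json.dump({"snapshot_id": snapshot_id, "data_files": files}, f)
    tmp = pointer + ".tmp"
    with open(tmp, "w") as f:
        f.write(meta_path)
    os.replace(tmp, pointer)  # atomic swap: readers see old or new, never half


def read(snapshot_id=None):
    """Read the current snapshot, or an older one (time travel)."""
    if snapshot_id is None:
        with open(pointer) as f:
            meta_path = f.read()
    else:
        meta_path = os.path.join(root, f"v{snapshot_id}.metadata.json")
    with open(meta_path) as f:
        return json.load(f)["data_files"]


commit(1, ["part-000.parquet"])
commit(2, ["part-000.parquet", "part-001.parquet"])  # append = new snapshot

print(read())               # current snapshot: two files
print(read(snapshot_id=1))  # time travel: snapshot 1 unchanged
```

Real table formats add manifests, column statistics, and a catalog-mediated pointer swap, but the core trick is the same: data files are immutable, and a transaction is just publishing a new metadata tree.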

Catalog Layer: Metadata service that maps table names to their current metadata file locations. REST catalog (Apache Polaris), Hive Metastore, AWS Glue, Unity Catalog.


Compute Layer: Any engine that speaks the table format's protocol. DuckDB, Apache Spark, Trino, BigQuery (via BigLake), Snowflake (via Polaris catalog), Databricks, Athena.

Decoupled storage and compute: you pay for compute only when querying, and you pay for storage only at the object storage rate. This is the fundamental economic advantage over traditional warehouses, which bundle the two.
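Back-of-envelope arithmetic makes the decoupling concrete. The rate below is the approximate figure from the storage-layer description; it is illustrative, not a price quote.

```python
# Illustrative arithmetic only; the rate is an assumption, not a quote.
STORAGE_PER_GB_MONTH = 0.02  # object storage, ~$/GB/month


def monthly_storage_cost(size_gb: float) -> float:
    """Storage accrues at the object-storage rate, queried or not."""
    return size_gb * STORAGE_PER_GB_MONTH


# 2 TB of Parquet sitting in object storage:
size_gb = 2 * 1024
print(f"storage: ${monthly_storage_cost(size_gb):.2f}/month")

# Compute is billed separately and only while queries run; with a bundled
# warehouse, keeping the same data queryable typically means paying for a
# running cluster or credits even at zero query volume.
```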

How VectorFin Uses This

VectorFin is built entirely on the lakehouse architecture:

  • Storage: GCS (gs://vectorfinancials-data/warehouse/) holds all Parquet data files
  • Table format: Apache Iceberg manages all tables — embeddings, signals, and derived tables
  • Catalog: Apache Polaris at catalog.vectorfinancials.com serves the Iceberg REST catalog protocol
  • Compute: VectorFin's API uses DuckDB; Pro customers connect their own Snowflake, BigQuery, or Databricks

This means Pro customers literally point their existing Snowflake environment at VectorFin's Polaris catalog and query VectorFin data using their own Snowflake credits, in their own environment, with their own governance controls. No data movement, no API intermediary for bulk queries.

The GCS bucket policy grants read-only access to authorized Polaris principals. VectorFin's data never leaves GCS — it's read directly by the customer's compute engine.

Code Example

# BigQuery approach: query VectorFin via BigLake external Iceberg table (Pro plan)
# After VectorFin's onboarding creates the external table in your BQ dataset:

from google.cloud import bigquery

client = bigquery.Client(project="your-gcp-project")

query = """
    SELECT
        v.ticker,
        v.date,
        v.score AS whystock_score,
        v.garch_vol_21d,
        -- Join with your own portfolio data
        p.position_size
    FROM `your_project.vectorfin_signals.whystock_score` AS v
    JOIN `your_project.your_dataset.portfolio` AS p
        ON v.ticker = p.ticker
    WHERE v.date = '2024-10-01'
      AND v.knowledge_ts <= '2024-10-02 00:00:00 UTC'
      AND v.score > 0.6
    ORDER BY v.score DESC
"""

result = client.query(query).to_dataframe()
print("Portfolio tickers with VectorFin signals:")
print(result.to_string(index=False))

# DuckDB approach: same data, different compute engine (no data movement)
import duckdb
conn = duckdb.connect()
conn.install_extension("iceberg")
conn.load_extension("iceberg")
# ... (see apache-iceberg.mdx for DuckDB Polaris setup)

Put Data Lakehouse to work in your pipeline

Access AI-ready financial data — embeddings, signals, Iceberg tables.