An in-process analytical SQL engine that runs directly in your Python or R session, capable of querying Parquet and Iceberg files on GCS at high speed without a server.
In Plain English
Every analyst has opened a 2-gigabyte CSV in Excel and watched their laptop grind to a halt. Databases like PostgreSQL and MySQL are designed for transactional workloads — handling thousands of small row lookups per second — and they're awkward for the large analytical scans that data analysts actually do. Setting up Spark or a cloud data warehouse is massive overkill for a one-off analysis.
DuckDB fills the gap. It's an analytical SQL engine that runs as a library inside your Python, R, or Java process — no server to install, no credentials to configure, no cluster to spin up. You install it like any Python package and start querying. import duckdb; duckdb.query("SELECT ...").df() — that's the entire setup.
The performance is remarkable. DuckDB uses vectorized execution (processing data in batches of thousands of rows to maximize CPU cache efficiency) and can scan hundreds of millions of rows per second on a laptop. It reads Parquet files directly — no loading step required — and it understands the Apache Iceberg table format. You can run a GROUP BY over a 50-million-row Parquet file faster in DuckDB on your laptop than in a traditional row-oriented database on a powerful server.
For financial data work, DuckDB is the perfect middle layer. It's fast enough for interactive analysis, handles files too large for pandas, and can query data directly from GCS or S3 without downloading it first. VectorFin's own API backend uses DuckDB to serve queries against the Iceberg tables.
Technical Definition
DuckDB is a column-oriented embedded OLAP database. Its key architectural features:
- Vectorized execution: Operators process fixed-size vectors (2048 rows by default) at a time, maximizing SIMD instruction utilization and CPU cache efficiency; queries are parallelized across cores with morsel-driven scheduling.
- Push-based query engine: Operators push data to their consumers rather than pulling it on demand, enabling efficient pipelining and minimizing intermediate materialization.
- In-process: Linked as a library, sharing memory with the host process. No IPC, no serialization overhead. Open an in-memory database with duckdb.connect(':memory:') or a persistent one with duckdb.connect('analysis.db').
- Multi-format reader: Native readers for Parquet (with predicate pushdown), CSV, JSON, and Arrow. Iceberg support via the iceberg extension; direct GCS/S3 reads via the httpfs extension.
- SQL compliance: DuckDB implements a large subset of SQL:2011, including window functions, recursive CTEs, lateral joins, and UNNEST for array/struct types.
- Python integration: conn.execute("SELECT ...").df() returns a pandas DataFrame directly. Arrow-native: conn.execute("SELECT ...").arrow() returns a PyArrow Table with zero copy.
How VectorFin Uses This
VectorFin's REST API layer uses DuckDB as the query engine against the Iceberg tables on GCS. When you call GET /v1/signals/volatility/AAPL, the API translates this to a DuckDB query against the Iceberg REST catalog, leveraging partition pruning (date-partitioned tables) and predicate pushdown (Parquet statistics) to read only the relevant file segments.
Pro customers can run DuckDB directly against VectorFin's Polaris catalog, bypassing the REST API for bulk analytical queries:
import duckdb
conn = duckdb.connect()
conn.execute("INSTALL iceberg; LOAD iceberg; INSTALL httpfs; LOAD httpfs;")
# Configure GCS and Polaris credentials (provided during Pro onboarding)
# Then query any VectorFin table directly with full SQL
This enables SQL joins between VectorFin signals and your own data, arbitrary aggregations, and full historical queries — without any per-row API call overhead.
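A hedged sketch of what such a direct query could look like, using the iceberg extension's iceberg_scan table function (the warehouse path below is a placeholder — real table locations and credentials come from Pro onboarding):

```python
import duckdb

conn = duckdb.connect()
conn.execute("INSTALL iceberg; LOAD iceberg; INSTALL httpfs; LOAD httpfs;")

# GCS HMAC credentials, configured as in the Code Example section
conn.execute("SET s3_endpoint = 'storage.googleapis.com';")
conn.execute("SET s3_access_key_id = 'your_hmac_access_key';")
conn.execute("SET s3_secret_access_key = 'your_hmac_secret_key';")

# iceberg_scan reads an Iceberg table from its warehouse location
df = conn.execute("""
    SELECT ticker, avg(score) AS avg_score
    FROM iceberg_scan('s3://vectorfin-warehouse/signals/whystock_score')
    GROUP BY ticker
""").df()
```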
Code Example
import duckdb
import pandas as pd
# In-memory DuckDB: query VectorFin Parquet delivery files (Starter plan)
conn = duckdb.connect()
conn.execute("INSTALL httpfs; LOAD httpfs; INSTALL iceberg; LOAD iceberg;")
# Configure GCS access
conn.execute("""
SET s3_region = 'us-central1';
SET s3_endpoint = 'storage.googleapis.com';
SET s3_access_key_id = 'your_hmac_access_key';
SET s3_secret_access_key = 'your_hmac_secret_key';
""")
# Query signals across an entire year — DuckDB scans only relevant partitions
df = conn.execute("""
SELECT
ticker,
date,
score,
quantile_cont(score, 0.75)  -- DuckDB's quantile aggregate, usable as a window function
OVER (PARTITION BY strftime(date, '%Y-%m')) AS monthly_75th_pct
FROM read_parquet(
's3://vectorfinancials-delivery/your_org_id/signals/whystock_score/**/*.parquet',
hive_partitioning = true
)
WHERE date >= '2024-01-01'
AND date <= '2024-12-31'
ORDER BY date DESC, score DESC
""").df()
print(f"Rows processed: {len(df):,}")
print(df.head(10).to_string(index=False))
# Performance: DuckDB processes 50M+ rows/second on a laptop
# This query on a year of data (~100M rows) typically runs in under 10 seconds