A columnar binary file format optimized for analytical workloads, offering efficient compression and predicate pushdown for large datasets.
In Plain English
Imagine a spreadsheet with a million rows and 50 columns. A row-oriented format stores all the data for row 1 together, then all the data for row 2, and so on. This is efficient when you need to read entire rows — like a web application fetching all fields for a user record. But analytical queries almost never work that way. "What's the average GARCH volatility for NVDA in Q3?" only touches two columns out of 50 — reading all 50 for every row wastes enormous amounts of time and storage.
Parquet stores data column by column instead. All the values for the "ticker" column are together. All values for "garch_vol_21d" are together. When you need only those two columns for a query, Parquet reads only those two columns from disk — everything else is physically skipped. For typical analytical workloads that touch 5-10% of columns, this makes queries 10-20x faster.
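The I/O arithmetic behind that speedup is easy to check. A minimal sketch with illustrative numbers (1,000,000 rows, 50 columns, 8 bytes per value — not measurements from any real file):

```python
# Bytes scanned for a 2-column query over a 1,000,000-row, 50-column
# table at 8 bytes per value (illustrative numbers only).
ROWS, COLS, BYTES_PER_VALUE = 1_000_000, 50, 8

# Row-oriented layout: every query must read entire rows.
row_oriented_bytes = ROWS * COLS * BYTES_PER_VALUE

# Columnar layout: only the two referenced columns are scanned.
columnar_bytes = ROWS * 2 * BYTES_PER_VALUE

speedup = row_oriented_bytes / columnar_bytes
print(f"row-oriented scan: {row_oriented_bytes / 1e6:.0f} MB")
print(f"columnar scan:     {columnar_bytes / 1e6:.0f} MB")
print(f"I/O reduction:     {speedup:.0f}x")  # 25x for 2 of 50 columns
```

Touching 2 of 50 columns gives a 25x reduction in bytes read; the 10-20x figure above reflects real queries that also pay fixed per-file costs.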
Columnar storage also compresses dramatically better. A column of ticker symbols ("AAPL", "AAPL", "AAPL", "NVDA", "NVDA", ...) is highly repetitive within a column and compresses extremely well with dictionary encoding. A column of GARCH volatility values, while not as repetitive, benefits from delta encoding and Snappy or Zstd compression. Real-world Parquet files typically compress to 10-25% of the equivalent CSV size.
Parquet is the foundation of the modern data stack. Apache Spark, DuckDB, BigQuery, Snowflake, and virtually every data platform can read and write Parquet natively. When you download a VectorFin Starter plan dataset, you're getting Parquet files — a standard format that plugs into any pipeline without transformation.
Technical Definition
A Parquet file contains:
- Row groups: Horizontal partitions of the data (e.g., 128MB each). Multiple row groups per file.
- Column chunks: Within a row group, each column's data stored contiguously.
- Pages: Column chunks are divided into pages (data pages, dictionary pages, index pages). Default page size: 1MB.
- File footer: Schema definition, key-value metadata, row group statistics, column statistics (min/max/null counts per column chunk).
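The hierarchy above can be sketched as nested Python data — field names here are illustrative, not the actual Parquet Thrift metadata:

```python
# Minimal sketch of the physical layout: file -> row groups ->
# column chunks -> pages, with statistics in the footer.
parquet_file = {
    "row_groups": [
        {   # one horizontal slice of the table, e.g. ~128MB
            "column_chunks": {
                "ticker": {"pages": ["dictionary_page", "data_page"]},
                "garch_vol_21d": {"pages": ["data_page", "data_page"]},
            }
        },
    ],
    "footer": {
        "schema": ["ticker: BYTE_ARRAY", "garch_vol_21d: DOUBLE"],
        "column_stats": {
            "garch_vol_21d": {"min": 0.12, "max": 0.87, "nulls": 0},
        },
    },
}
# Readers parse the footer first, then fetch only the chunks they need.
print(len(parquet_file["row_groups"]), "row group(s)")
```

Because the footer sits at the end of the file, a reader can fetch it with a single ranged read and plan the rest of its I/O before touching any data pages.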
Encoding types: plain, dictionary (RLE/bit-packed), delta binary packed, byte stream split.
Compression codecs: Snappy (fast, moderate compression), Zstd (slower, better compression), GZIP, LZ4.
Predicate pushdown: Query engines use the footer statistics to skip entire row groups or pages that cannot contain matching rows. WHERE date >= '2024-01-01' on a date-sorted file skips every row group with max_date < 2024-01-01 — zero bytes read for those groups.
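The skipping logic reduces to a min/max comparison per row group. A sketch with invented row-group statistics (ISO date strings compare correctly as plain strings):

```python
# Sketch of predicate pushdown: skip row groups whose footer statistics
# prove no row can match WHERE date >= '2024-01-01'.
row_groups = [
    {"id": 0, "min_date": "2023-07-01", "max_date": "2023-09-30"},
    {"id": 1, "min_date": "2023-10-01", "max_date": "2023-12-31"},
    {"id": 2, "min_date": "2024-01-01", "max_date": "2024-03-31"},
]

predicate_low = "2024-01-01"
# A group is skipped when its max is strictly below the lower bound:
# zero bytes of its column data are ever read.
to_scan = [g["id"] for g in row_groups if g["max_date"] >= predicate_low]
skipped = [g["id"] for g in row_groups if g["max_date"] < predicate_low]
print(f"scan row groups {to_scan}, skip {skipped}")
```

On a date-sorted file this prunes most of the data before any column chunk is fetched, which is why sort order matters as much as the format itself.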
Parquet vs ORC vs Arrow: all three are columnar. ORC is Hive-native, with a stripe-based layout and slightly different encodings. Arrow is an in-memory format built for zero-copy interchange between processes, not for persistent storage. Parquet is the universal standard for persistent analytical storage.
How VectorFin Uses This
All VectorFin Iceberg tables are backed by Parquet files on GCS. The embedding tables are particularly storage-intensive: each 768-float embedding vector is stored as a FIXED_LEN_BYTE_ARRAY or FLOAT array in Parquet, consuming approximately 3KB per chunk row.
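The ~3KB figure follows directly from the vector shape:

```python
# Back-of-envelope check of the ~3KB-per-row figure above:
# a 768-dimensional embedding at 4 bytes per float32 component.
DIMS, BYTES_PER_FLOAT32 = 768, 4
bytes_per_row = DIMS * BYTES_PER_FLOAT32
print(f"{bytes_per_row} bytes per embedding row")  # 3072 bytes ≈ 3KB
```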
Starter plan customers receive GCS push delivery of raw Parquet files, enabling direct integration with any pipeline without going through the REST API:
gs://vectorfinancials-delivery/{org_id}/signals/whystock_score/date=2024-10-01/*.parquet
gs://vectorfinancials-delivery/{org_id}/embeddings/transcripts/period=2024-Q4/*.parquet
The files are partitioned by date or period, so tools like DuckDB can apply partition pruning for efficient selective reads.
Code Example
import duckdb
import pandas as pd
# DuckDB reading VectorFin Parquet delivery files directly from GCS (Starter plan)
conn = duckdb.connect()
# Configure GCS credentials
conn.execute("""
CREATE SECRET gcs_creds (
TYPE GCS,
KEY_ID 'your_service_account_key_id',
SECRET 'your_service_account_key_secret'
);
""")
# Efficient columnar scan with predicate pushdown
# Only reads the partitions and columns needed
df = conn.execute("""
SELECT
ticker,
date,
score,
knowledge_ts
FROM read_parquet(
'gs://vectorfinancials-delivery/your_org_id/signals/whystock_score/date=2024-*/*.parquet',
hive_partitioning = true
)
WHERE date >= '2024-10-01'
AND score > 0.7
ORDER BY score DESC
LIMIT 50
""").df()
print("High-scoring tickers in Q4 2024:")
print(df.to_string(index=False))
# File size efficiency: check the compression ratio from the footer metadata
stats = conn.execute("""
SELECT
file_name,
SUM(total_compressed_size) / 1e6 AS size_mb,
SUM(total_uncompressed_size) / 1e6 AS uncompressed_mb,
SUM(total_uncompressed_size) / SUM(total_compressed_size) AS compression_ratio
FROM parquet_metadata('gs://vectorfinancials-delivery/your_org_id/signals/whystock_score/date=2024-10-01/part-0.parquet')
GROUP BY file_name
""").df()
print(stats.to_string(index=False))