VectorFin/Glossary/Apache Polaris (Iceberg Catalog)
Data Engineering

What is Apache Polaris (Iceberg Catalog)?

An open-source implementation of the Iceberg REST Catalog specification that enables multiple compute engines to read and write the same Iceberg tables through a standard API.

In Plain English

Apache Iceberg tables don't manage themselves. Someone has to keep track of which files make up each table, where those files live, what the current schema is, and what the current snapshot is. That's the catalog's job — it's the table of contents for your data lake.

Apache Polaris is an open-source catalog that implements the Iceberg REST Catalog specification: a standard HTTP API that any Iceberg-compatible engine can call to discover and interact with tables. Think of it as a data registry or a DNS server for your tables. Snowflake asks Polaris, "Where is the signals.whystock_score table?" Polaris responds with the metadata file location. Snowflake then reads the data directly from GCS.
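
That "DNS lookup" is a loadTable call: the catalog returns a metadata-location pointing at the table's current metadata file, plus the metadata itself. A minimal sketch of the response shape, with a made-up GCS path and IDs for illustration:

```python
# Abbreviated loadTable response, per the Iceberg REST spec; the GCS
# path, snapshot ID, and schema ID below are made up for illustration.
load_table_response = {
    "metadata-location": (
        "gs://vectorfin-lake/signals/whystock_score/"
        "metadata/00042-abc.metadata.json"
    ),
    "metadata": {
        "current-snapshot-id": 633422,
        "current-schema-id": 1,
    },
}

# The engine takes this pointer and reads the metadata file (and the
# data files it references) directly from object storage.
pointer = load_table_response["metadata-location"]
```

The catalog never serves the data itself; it only hands out pointers and coordinates commits, which is why one lightweight service can front many engines.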

Before standards like the Iceberg REST Catalog, each compute engine had its own catalog format. Your Spark cluster used Hive Metastore. Your Databricks cluster used Unity Catalog. Snowflake used its own proprietary table registry. Getting all three to read the same underlying data required complex ETL pipelines and constant synchronization.

Polaris eliminates this. One catalog, one source of truth, multiple engines. You create a table in Polaris, point it at a GCS location, and immediately both Snowflake and Databricks can query it as a native table in their respective SQL environments — with full ACID consistency, schema enforcement, and time travel.

The "open" in open-source matters here. Snowflake donated Polaris to the Apache Software Foundation in 2024, ensuring it remains vendor-neutral and that no single cloud provider controls the standard.

Technical Definition

The Iceberg REST Catalog API spec defines HTTP endpoints, including:

GET  /v1/namespaces                    # list namespaces
GET  /v1/namespaces/{ns}/tables        # list tables in namespace
GET  /v1/namespaces/{ns}/tables/{tbl}  # load table metadata
POST /v1/namespaces/{ns}/tables        # create table
POST /v1/namespaces/{ns}/tables/{tbl}  # commit table update
DELETE /v1/namespaces/{ns}/tables/{tbl} # drop table

Authentication: OAuth2 token exchange or static credentials. Scopes: CATALOG, PRINCIPAL_ROLE:ALL.
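
The token exchange can be sketched as follows, assuming the spec-defined POST /v1/oauth/tokens endpoint with a client_credentials grant; the principal name and secret are placeholders:

```python
import urllib.parse

# Token endpoint defined by the Iceberg REST spec (implemented by Polaris).
TOKEN_ENDPOINT = "/v1/oauth/tokens"

def build_token_request(client_id: str, client_secret: str,
                        scope: str = "PRINCIPAL_ROLE:ALL") -> str:
    """Form-encoded body for an OAuth2 client_credentials grant."""
    return urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": scope,
    })

body = build_token_request("my-principal", "my-secret")
# POST this body to {POLARIS_URL}/v1/oauth/tokens with
# Content-Type: application/x-www-form-urlencoded; the JSON response
# carries "access_token", used as the Bearer token on subsequent calls.
```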

Key concepts:

  • Warehouse: a GCS/S3 location prefix that Polaris manages
  • Namespace: logical grouping of tables (analogous to a database schema)
  • Principal role: access control — which clients can read/write which catalogs and namespaces
  • Vended credentials: Polaris generates short-lived GCS credentials for the compute engine to directly access data files, without the engine needing long-term storage credentials
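
Credential vending rides on a spec-defined request header: the engine sets X-Iceberg-Access-Delegation on its loadTable call, and the catalog returns temporary storage credentials in the response config. A sketch of building such a request (token value is a placeholder):

```python
def load_table_request(base_url: str, namespace: str, table: str, token: str):
    """Build the URL and headers for a loadTable call that also asks the
    catalog to vend short-lived storage credentials."""
    url = f"{base_url}/v1/namespaces/{namespace}/tables/{table}"
    headers = {
        "Authorization": f"Bearer {token}",
        # Spec-defined header: the response "config" then carries temporary
        # storage credentials scoped to this table's location.
        "X-Iceberg-Access-Delegation": "vended-credentials",
    }
    return url, headers

url, headers = load_table_request(
    "https://catalog.vectorfinancials.com", "signals", "whystock_score", "tok")
```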

Polaris vs alternatives: AWS Glue (AWS-specific), Hive Metastore (JVM-heavy), Nessie (version control-centric), Unity Catalog (Databricks-specific). Polaris is a fully open-source, vendor-neutral implementation of the Iceberg REST Catalog specification.

How VectorFin Uses This

VectorFin runs Apache Polaris on Cloud Run at catalog.vectorfinancials.com. This catalog serves as the authoritative registry for all VectorFin Iceberg tables.

Pro plan customers are provisioned as principals in the Polaris catalog with read-only access to the vectorfinancials warehouse. They then configure their Snowflake, Databricks, or BigQuery environment to use the catalog:

Snowflake external volume + catalog integration:

CREATE CATALOG INTEGRATION vectorfin_catalog
    CATALOG_SOURCE = ICEBERG_REST
    TABLE_FORMAT = ICEBERG
    CATALOG_NAMESPACE = 'signals'
    REST_CONFIG = (
        CATALOG_URI = 'https://catalog.vectorfinancials.com'
    )
    REST_AUTHENTICATION = (
        TYPE = BEARER
        BEARER_TOKEN = 'your_api_key'
    )
    ENABLED = TRUE;

CREATE ICEBERG TABLE signals_whystock_score
    EXTERNAL_VOLUME = 'vectorfin_volume'  -- your external volume, created separately
    CATALOG = 'vectorfin_catalog'
    CATALOG_TABLE_NAME = 'whystock_score';

Code Example

import requests

POLARIS_URL = "https://catalog.vectorfinancials.com"
API_KEY = "vf_your_api_key_here"

headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json",
}

# List available namespaces
resp = requests.get(f"{POLARIS_URL}/v1/namespaces", headers=headers, timeout=30)
resp.raise_for_status()
print("Available namespaces:")
for ns in resp.json().get("namespaces", []):
    print(f"  {'.'.join(ns)}")

# List tables in the signals namespace
resp = requests.get(
    f"{POLARIS_URL}/v1/namespaces/signals/tables",
    headers=headers,
    timeout=30,
)
resp.raise_for_status()
print("\nSignal tables:")
for t in resp.json().get("identifiers", []):
    # "namespace" is a list of name parts in the REST spec
    print(f"  {'.'.join(t['namespace'])}.{t['name']}")

# Load table metadata (shows current snapshot, schema, partition spec)
resp = requests.get(
    f"{POLARIS_URL}/v1/namespaces/signals/tables/whystock_score",
    headers=headers,
    timeout=30,
)
resp.raise_for_status()
metadata = resp.json()["metadata"]
print(f"\nwhystock_score current snapshot ID: {metadata['current-snapshot-id']}")
# The current schema is the one whose schema-id matches current-schema-id
schema = next(s for s in metadata["schemas"]
              if s["schema-id"] == metadata["current-schema-id"])
print(f"Schema columns: {[f['name'] for f in schema['fields']]}")

Put Apache Polaris (Iceberg Catalog) to work in your pipeline

Access AI-ready financial data — embeddings, signals, Iceberg tables.