VectorFin/Glossary/Apache Polaris (Iceberg Catalog)
Data Engineering

What is Apache Polaris (Iceberg Catalog)?

An open-source implementation of the Iceberg REST Catalog specification that enables multiple compute engines to read and write the same Iceberg tables through a standard API.

In Plain English

Apache Iceberg tables don't manage themselves. Someone has to keep track of which files make up each table, where those files live, what the current schema is, and what the current snapshot is. That's the catalog's job — it's the table of contents for your data lake.

Apache Polaris is an open-source catalog that implements the Iceberg REST Catalog specification: a standard HTTP API that any Iceberg-compatible engine can call to discover and interact with tables. Think of it as a data registry or a DNS server for your tables. Snowflake asks Polaris, "Where is the signals.whystock_score table?" Polaris responds with the metadata file location. Snowflake then reads the data directly from GCS.
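
That "DNS lookup" is a loadTable call: the catalog returns a metadata-location pointing at the table's current metadata file, plus the metadata itself. A minimal sketch of the response shape, with a made-up GCS path and IDs for illustration:

```python
# Abbreviated loadTable response, per the Iceberg REST spec; the GCS
# path, snapshot ID, and schema ID below are made up for illustration.
load_table_response = {
    "metadata-location": (
        "gs://vectorfin-lake/signals/whystock_score/"
        "metadata/00042-abc.metadata.json"
    ),
    "metadata": {
        "current-snapshot-id": 633422,
        "current-schema-id": 1,
    },
}

# The engine takes this pointer and reads the metadata file (and the
# data files it references) directly from object storage.
pointer = load_table_response["metadata-location"]
```

The catalog never serves the data itself; it only hands out pointers and coordinates commits, which is why one lightweight service can front many engines.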

Before standards like the Iceberg REST Catalog, each compute engine had its own catalog format. Your Spark cluster used Hive Metastore. Your Databricks cluster used Unity Catalog. Snowflake used its own proprietary table registry. Getting all three to read the same underlying data required complex ETL pipelines and constant synchronization.

Polaris eliminates this. One catalog, one source of truth, multiple engines. You create a table in Polaris, point it at a GCS location, and immediately both Snowflake and Databricks can query it as a native table in their respective SQL environments — with full ACID consistency, schema enforcement, and time travel.

The "open" in open-source matters here. Snowflake donated Polaris to the Apache Software Foundation in 2024, ensuring it remains vendor-neutral and that no single cloud provider controls the standard.

Technical Definition

The Iceberg REST Catalog API spec defines HTTP endpoints, including:

GET  /v1/namespaces                    # list namespaces
GET  /v1/namespaces/{ns}/tables        # list tables in namespace
GET  /v1/namespaces/{ns}/tables/{tbl}  # load table metadata
POST /v1/namespaces/{ns}/tables        # create table
POST /v1/namespaces/{ns}/tables/{tbl}  # commit table update
DELETE /v1/namespaces/{ns}/tables/{tbl} # drop table

Authentication: OAuth2 token exchange or static credentials. Scopes: CATALOG, PRINCIPAL_ROLE:ALL.
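
The token exchange can be sketched as follows, assuming the spec-defined POST /v1/oauth/tokens endpoint with a client_credentials grant; the principal name and secret are placeholders:

```python
import urllib.parse

# Token endpoint defined by the Iceberg REST spec (implemented by Polaris).
TOKEN_ENDPOINT = "/v1/oauth/tokens"

def build_token_request(client_id: str, client_secret: str,
                        scope: str = "PRINCIPAL_ROLE:ALL") -> str:
    """Form-encoded body for an OAuth2 client_credentials grant."""
    return urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": scope,
    })

body = build_token_request("my-principal", "my-secret")
# POST this body to {POLARIS_URL}/v1/oauth/tokens with
# Content-Type: application/x-www-form-urlencoded; the JSON response
# carries "access_token", used as the Bearer token on subsequent calls.
```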

Key concepts:

  • Warehouse: a GCS/S3 location prefix that Polaris manages
  • Namespace: logical grouping of tables (analogous to a database schema)
  • Principal role: access control — which clients can read/write which catalogs and namespaces
  • Vended credentials: Polaris generates short-lived GCS credentials for the compute engine to directly access data files, without the engine needing long-term storage credentials
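
Credential vending rides on a spec-defined request header: the engine sets X-Iceberg-Access-Delegation on its loadTable call, and the catalog returns temporary storage credentials in the response config. A sketch of building such a request (token value is a placeholder):

```python
def load_table_request(base_url: str, namespace: str, table: str, token: str):
    """Build the URL and headers for a loadTable call that also asks the
    catalog to vend short-lived storage credentials."""
    url = f"{base_url}/v1/namespaces/{namespace}/tables/{table}"
    headers = {
        "Authorization": f"Bearer {token}",
        # Spec-defined header: the response "config" then carries temporary
        # storage credentials scoped to this table's location.
        "X-Iceberg-Access-Delegation": "vended-credentials",
    }
    return url, headers

url, headers = load_table_request(
    "https://catalog.vectorfinancials.com", "signals", "whystock_score", "tok")
```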

Polaris vs alternatives: AWS Glue (AWS-specific), Hive Metastore (JVM-heavy), Nessie (version control-centric), Unity Catalog (Databricks-specific). Polaris is a fully open-source, vendor-neutral implementation of the Iceberg REST Catalog specification.

How VectorFin Uses This

VectorFin runs Apache Polaris on Cloud Run at catalog.vectorfinancials.com. This catalog serves as the authoritative registry for all VectorFin Iceberg tables.

Pro plan customers are provisioned as principals in the Polaris catalog with read-only access to the vectorfinancials warehouse. They then configure their Snowflake, Databricks, or BigQuery environment to use the catalog:

Snowflake external volume + catalog integration:

CREATE CATALOG INTEGRATION vectorfin_catalog
    CATALOG_SOURCE = ICEBERG_REST
    TABLE_FORMAT = ICEBERG
    CATALOG_NAMESPACE = 'signals'
    REST_CONFIG = (
        CATALOG_URI = 'https://catalog.vectorfinancials.com'
    )
    REST_AUTHENTICATION = (
        TYPE = BEARER
        BEARER_TOKEN = 'your_api_key'
    )
    ENABLED = TRUE;

CREATE ICEBERG TABLE signals_whystock_score
    EXTERNAL_VOLUME = 'vectorfin_volume'  -- your external volume, created separately
    CATALOG = 'vectorfin_catalog'
    CATALOG_TABLE_NAME = 'whystock_score';

Code Example

import requests

POLARIS_URL = "https://catalog.vectorfinancials.com"
API_KEY = "vf_your_api_key_here"

headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json",
}

# List available namespaces
resp = requests.get(f"{POLARIS_URL}/v1/namespaces", headers=headers, timeout=30)
resp.raise_for_status()
print("Available namespaces:")
for ns in resp.json().get("namespaces", []):
    print(f"  {'.'.join(ns)}")

# List tables in the signals namespace
resp = requests.get(
    f"{POLARIS_URL}/v1/namespaces/signals/tables",
    headers=headers,
    timeout=30,
)
resp.raise_for_status()
print("\nSignal tables:")
for t in resp.json().get("identifiers", []):
    # "namespace" is a list of name parts in the REST spec
    print(f"  {'.'.join(t['namespace'])}.{t['name']}")

# Load table metadata (shows current snapshot, schema, partition spec)
resp = requests.get(
    f"{POLARIS_URL}/v1/namespaces/signals/tables/whystock_score",
    headers=headers,
    timeout=30,
)
resp.raise_for_status()
metadata = resp.json()["metadata"]
print(f"\nwhystock_score current snapshot ID: {metadata['current-snapshot-id']}")
# The current schema is the one whose schema-id matches current-schema-id
schema = next(s for s in metadata["schemas"]
              if s["schema-id"] == metadata["current-schema-id"])
print(f"Schema columns: {[f['name'] for f in schema['fields']]}")

Put Apache Polaris (Iceberg Catalog) to work in your pipeline

Access AI-ready financial data — embeddings, signals, Iceberg tables.