VectorFin/Glossary/Point-in-Time Query
Data Engineering

What is Point-in-Time Query?

A database query that retrieves the data exactly as it existed at a specific historical moment, used to reconstruct past states without contamination from later information.

In Plain English

A point-in-time query asks: "What did our database know on a specific past date?" It reconstructs the exact state of the data as it existed at that moment — not today's view of history, but the genuinely historical state.

This distinction matters enormously in financial analysis. Today's database might contain restated earnings, corrected company filings, and signals computed from data that was later revised. If you're backtesting a strategy as of January 2022, you need January 2022's data — the numbers that were actually available to investors on that date, including all their errors and incompleteness.

Imagine you're an archaeologist versus a historian. The historian can use everything we know today to reconstruct the past. The archaeologist must work only from evidence that existed at a given point in time. Financial backtesting requires the archaeologist's approach. Using the historian's approach — current knowledge of past events — generates artificially strong backtest results that collapse in live trading.

Point-in-time queries are implemented in two main ways. The simpler way is a filter on an ingestion (knowledge) timestamp: WHERE knowledge_ts <= backtest_date returns only the records that had been loaded into the database by the backtest date. The more sophisticated way uses table snapshots (Iceberg time travel): a query such as SELECT * FROM table TIMESTAMP AS OF '2022-01-01' (Spark SQL syntax) runs against the snapshot of the table that existed at that timestamp.

For the cleanest backtests, you want both filters: knowledge_ts <= as_of to exclude anything published after the backtest date (including restatements), and effective_ts to scope the query to the business-relevant date range.
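As a concrete sketch of the two-filter approach, here is a minimal pure-Python version over a toy bitemporal record set. The field names knowledge_ts and effective_ts match the definitions in this article; the point_in_time helper and the sample values are illustrative, not part of any API:

```python
from datetime import date

# Toy bitemporal table: AAPL's quarterly EPS was first reported as 1.40
# on 2022-01-28, then restated to 1.46 on 2022-03-01.
records = [
    {"ticker": "AAPL", "eps": 1.40,
     "effective_ts": date(2021, 12, 31), "knowledge_ts": date(2022, 1, 28)},
    {"ticker": "AAPL", "eps": 1.46,
     "effective_ts": date(2021, 12, 31), "knowledge_ts": date(2022, 3, 1)},
]

def point_in_time(records, t_obs, t_eff):
    """Latest record per (ticker, effective_ts) known by t_obs and effective by t_eff."""
    visible = [r for r in records
               if r["knowledge_ts"] <= t_obs and r["effective_ts"] <= t_eff]
    latest = {}
    for r in sorted(visible, key=lambda r: r["knowledge_ts"]):
        latest[(r["ticker"], r["effective_ts"])] = r  # later knowledge wins
    return list(latest.values())

# A backtest observing on 2022-02-15 sees the original 1.40;
# one observing on 2022-04-01 sees the restated 1.46.
as_of_feb = point_in_time(records, date(2022, 2, 15), date(2022, 1, 1))
as_of_apr = point_in_time(records, date(2022, 4, 1), date(2022, 1, 1))
```

The key move is that restatements never overwrite rows; they append new rows with a later knowledge_ts, so earlier observation dates remain reconstructable.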

Technical Definition

A point-in-time query over a bitemporal table T reconstructs the state of T as observed at time t_obs for facts effective at time t_eff:

SELECT *
FROM T
WHERE knowledge_ts <= t_obs     -- only data known at t_obs
  AND effective_ts <= t_eff     -- only facts effective by t_eff
-- For slowly-changing dimensions, additionally:
  AND (end_knowledge_ts IS NULL OR end_knowledge_ts > t_obs)

Iceberg time travel (Spark SQL syntax shown; Trino and Flink use slightly different forms):

-- By timestamp (reads the snapshot current as of this time)
SELECT * FROM table TIMESTAMP AS OF '2022-01-01 00:00:00'

-- By snapshot ID
SELECT * FROM table VERSION AS OF 1234567890

-- Incremental processing between two snapshots is done through read
-- options (e.g. start-snapshot-id / end-snapshot-id in Spark) rather
-- than a dedicated SQL clause.
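Under the hood, a timestamp-based time-travel query resolves to the latest snapshot committed at or before the requested time. A minimal sketch of that resolution logic, with a made-up snapshot history:

```python
from datetime import datetime

# Hypothetical snapshot history: (snapshot_id, commit_time)
snapshots = [
    (1001, datetime(2021, 12, 30, 9, 0)),
    (1002, datetime(2021, 12, 31, 18, 0)),
    (1003, datetime(2022, 1, 2, 6, 0)),
]

def snapshot_as_of(snapshots, ts):
    """Return the ID of the latest snapshot committed at or before ts."""
    eligible = [(sid, t) for sid, t in snapshots if t <= ts]
    if not eligible:
        raise ValueError(f"no snapshot exists at or before {ts}")
    return max(eligible, key=lambda s: s[1])[0]

# A query as of 2022-01-01 reads snapshot 1002, not 1003.
sid = snapshot_as_of(snapshots, datetime(2022, 1, 1))
```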

Point-in-time join (the hard part): joining two bitemporal tables where both must be as-of the same time. Example: joining a signals table (knowledge_ts, effective_ts) to a fundamentals table (knowledge_ts, fiscal_period) requires careful AS-OF matching to avoid look-ahead in either table.
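One way to implement such an AS-OF join in pandas is merge_asof, which matches each signal row to the most recent fundamentals row whose knowledge_ts does not exceed the signal's own. The table contents below are invented for illustration:

```python
import pandas as pd

signals = pd.DataFrame({
    "knowledge_ts": pd.to_datetime(["2022-01-10", "2022-02-10"]),
    "ticker": ["AAPL", "AAPL"],
    "score": [0.61, 0.58],
})
fundamentals = pd.DataFrame({
    "knowledge_ts": pd.to_datetime(["2022-01-05", "2022-02-01"]),
    "ticker": ["AAPL", "AAPL"],
    "eps": [1.40, 1.46],  # restated value published 2022-02-01
})

# Both frames must be sorted on the "on" key for merge_asof.
joined = pd.merge_asof(
    signals.sort_values("knowledge_ts"),
    fundamentals.sort_values("knowledge_ts"),
    on="knowledge_ts",
    by="ticker",
    direction="backward",  # only fundamentals already known at each signal's time
)
```

direction="backward" is what prevents look-ahead: the January signal joins to the original 1.40 EPS, and only the February signal sees the restated 1.46.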

How VectorFin Uses This

VectorFin's API implements point-in-time queries as a first-class feature via the as_of parameter:

GET https://api.vectorfinancials.com/v1/signals/whystock-score/AAPL
    ?date=2024-10-01         # effective date
    &as_of=2024-10-15        # knowledge_ts cutoff

Without as_of, the API returns the latest available data for the effective date. With as_of, it returns only data that had been ingested by the as_of timestamp — correctly simulating what would have been available on that date.

For bulk historical queries, the Iceberg tables support full time travel. Pro customers can reconstruct the complete state of any signal table at any point in the past using Apache Iceberg's snapshot history.

Code Example

import requests
import pandas as pd
from datetime import date, timedelta

API_BASE = "https://api.vectorfinancials.com"
API_KEY = "vf_your_api_key_here"

def backtest_signals(
    tickers: list[str],
    start_date: str,
    end_date: str,
    rebalance_frequency_days: int = 21,
) -> pd.DataFrame:
    """
    Collect point-in-time signal history for a universe of tickers.
    Correctly uses as_of parameter to prevent look-ahead bias.
    """
    current = date.fromisoformat(start_date)
    end = date.fromisoformat(end_date)
    results = []

    while current <= end:
        trade_date = current.isoformat()

        for ticker in tickers:
            resp = requests.get(
                f"{API_BASE}/v1/signals/whystock-score/{ticker}",
                params={
                    "date": trade_date,
                    "as_of": trade_date,  # point-in-time: only data known by trade_date
                },
                headers={"X-API-Key": API_KEY},
            )
            if not resp.ok:
                continue
            data = resp.json()
            if data.get("score") is None:
                continue  # no point-in-time value for this date
            results.append({
                "trade_date": trade_date,
                "ticker": ticker,
                "score": data["score"],
                "knowledge_ts": data["knowledge_ts"],
            })

        current += timedelta(days=rebalance_frequency_days)

    return pd.DataFrame(results)

# Run a point-in-time backtest over Q4 2024
signals_history = backtest_signals(
    tickers=["AAPL", "MSFT", "NVDA", "GOOGL", "META"],
    start_date="2024-10-01",
    end_date="2024-12-31",
    rebalance_frequency_days=21,  # roughly monthly (21-day step)
)

print(f"Collected {len(signals_history)} point-in-time observations")
print(signals_history.to_string(index=False))

Put Point-in-Time Query to work in your pipeline

Access AI-ready financial data — embeddings, signals, Iceberg tables.