A data modeling pattern that tracks two independent time axes — when something happened in the real world and when your system first learned about it — enabling rigorous point-in-time queries.
In Plain English
Most databases record what is true right now. Update a row and the old value is gone — overwritten. This is fine for many applications, but it's catastrophic for financial data analysis and backtesting. If you're trying to reconstruct what you knew about a company on October 1, 2022, you need more than just today's data. You need the data as it existed on that specific date, including all the revisions and corrections that have happened since.
Bitemporal data solves this by tracking two independent time dimensions for every piece of information. The first is called "effective time" — when did this fact become true in the real world? An earnings report that covers Q3 2024 has an effective date of the quarter end (September 30, 2024). The second is called "transaction time" or "knowledge time" — when did your data system first record this fact? The earnings report might be filed on October 25, 2024 and ingested by your system on October 26, 2024. That's the knowledge time.
Why do both matter? Consider a company that restates its earnings six months after the original release. A bitemporal database records both versions: the original numbers with their original knowledge time, and the restated numbers with a new knowledge time. When you backtest a strategy as of October 2024, you use only data with knowledge_ts ≤ October 2024 — you don't accidentally use the restated numbers that your system didn't know about yet.
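The restatement scenario above can be sketched in a few lines of plain Python. The numbers, dates, and the `as_of` helper below are illustrative assumptions, not real AAPL figures or a real schema; the point is that both versions coexist in one append-only collection and the knowledge-time filter selects between them:

```python
from datetime import date

# Hypothetical example: an EPS figure originally reported, then restated.
# Both versions live in the same append-only table; only knowledge_ts differs.
rows = [
    {"metric": "eps", "value": 1.64,
     "effective_ts": date(2024, 9, 30), "knowledge_ts": date(2024, 10, 26)},
    {"metric": "eps", "value": 1.21,   # restatement, learned months later
     "effective_ts": date(2024, 9, 30), "knowledge_ts": date(2025, 4, 15)},
]

def as_of(rows, t):
    """Per effective period, return the latest value known on or before t."""
    known = [r for r in rows if r["knowledge_ts"] <= t]
    latest = {}
    for r in sorted(known, key=lambda r: r["knowledge_ts"]):
        latest[r["effective_ts"]] = r  # later knowledge overwrites earlier
    return latest

# A backtest dated October 2024 sees only the original number...
print(as_of(rows, date(2024, 10, 31))[date(2024, 9, 30)]["value"])  # -> 1.64
# ...while a query run after the restatement sees the corrected figure.
print(as_of(rows, date(2025, 6, 1))[date(2024, 9, 30)]["value"])    # -> 1.21
```

The backtest query never needs to know that a restatement exists; filtering on `knowledge_ts` alone reconstructs the historical view.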
This pattern is especially important in finance because of look-ahead bias: the insidious error of using information in a backtest that wasn't actually available at the time of the simulated trade. Survivorship bias (using only companies that survived) is one form. Using restated financials as if they were the original reports is another. Bitemporal modeling eliminates this class of error at the data layer.
Technical Definition
A bitemporal table has two mandatory timestamp columns (in addition to any domain attributes):
- effective_ts (valid time / business time): when the fact became true in the real world
- knowledge_ts (transaction time / system time): when the database recorded the fact
A bitemporal query "as of time t" retrieves facts where knowledge_ts <= t — the universe of information that was available at time t.
For an earnings call embedding, the bitemporal tuple is:
- effective_ts = fiscal period end date (e.g., 2024-09-30)
- knowledge_ts = ingestion timestamp (e.g., 2024-10-26T03:15:00Z)
SQL point-in-time query pattern:
SELECT *
FROM embeddings_transcripts
WHERE ticker = 'AAPL'
AND knowledge_ts <= '2024-10-15 00:00:00' -- only data known by this date
ORDER BY effective_ts DESC

The table is append-only: records are never updated; new records are inserted with the current knowledge_ts instead. This preserves a full audit history and keeps the data immutable for compliance purposes.
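The append-only discipline can be demonstrated end to end with the standard-library sqlite3 module. The table and column names below are illustrative, not VectorFin's actual schema; a correction arrives as a new INSERT with a fresh knowledge_ts, never as an UPDATE:

```python
import sqlite3

# Minimal append-only bitemporal table (illustrative schema).
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE facts (
        ticker TEXT, effective_ts TEXT, knowledge_ts TEXT, value REAL
    )
""")
# A correction is a new row with the current knowledge_ts -- never an UPDATE.
con.executemany("INSERT INTO facts VALUES (?, ?, ?, ?)", [
    ("AAPL", "2024-09-30", "2024-10-26T03:15:00Z", 1.64),  # original ingest
    ("AAPL", "2024-09-30", "2025-04-15T02:00:00Z", 1.21),  # later restatement
])

def value_as_of(con, ticker, effective_ts, t):
    """Latest value for a period, using only rows known by time t."""
    row = con.execute("""
        SELECT value FROM facts
        WHERE ticker = ? AND effective_ts = ? AND knowledge_ts <= ?
        ORDER BY knowledge_ts DESC LIMIT 1
    """, (ticker, effective_ts, t)).fetchone()
    return row[0] if row else None

print(value_as_of(con, "AAPL", "2024-09-30", "2024-11-01"))  # -> 1.64
print(value_as_of(con, "AAPL", "2024-09-30", "2025-06-01"))  # -> 1.21
```

Because no row is ever mutated, the query for any past knowledge time always returns exactly what the system knew then, which is what auditors and backtests both need.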
How VectorFin Uses This
Every table in VectorFin's data warehouse is bitemporal. The schema for every signal table includes both effective_ts and knowledge_ts:
-- Signals table structure
ticker VARCHAR
date DATE -- the business date (effective)
score DOUBLE
components STRUCT(...)
effective_ts TIMESTAMP -- = date at market close (e.g., 2024-10-01 21:00:00 UTC)
knowledge_ts TIMESTAMP -- = when signals_writer job ran (~02:00 UTC next day)

This gap between effective_ts and knowledge_ts — typically 5-8 hours for VectorFin signals — represents the processing pipeline latency and is critical for realistic backtesting: your strategy cannot trade at the close on the signal's effective date using a signal that was not computed until ~02:00 UTC the next morning.
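The latency rule can be made concrete with the example timestamps from the schema comments above. This is a sketch, assuming the 21:00 UTC close and ~02:00 UTC writer run described in the schema; the key invariant is that a simulated order must never be timestamped before knowledge_ts:

```python
from datetime import datetime, timezone

# Timestamps taken from the schema comments above (assumed for illustration).
effective_ts = datetime(2024, 10, 1, 21, 0, tzinfo=timezone.utc)  # market close
knowledge_ts = datetime(2024, 10, 2, 2, 0, tzinfo=timezone.utc)   # writer job ran

# Pipeline lag: how long after the fact the system actually knew the signal.
lag_hours = (knowledge_ts - effective_ts).total_seconds() / 3600
print(f"pipeline lag: {lag_hours:.0f}h")  # -> pipeline lag: 5h

# A realistic backtest may act no earlier than knowledge_ts.
order_ts = max(effective_ts, knowledge_ts)
assert order_ts >= knowledge_ts
```

A backtest that instead timestamps orders at effective_ts quietly trades on information that arrived hours later, which is exactly the look-ahead bias this pattern exists to prevent.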
GET https://api.vectorfinancials.com/v1/signals/whystock-score/AAPL?date=2024-10-01

The API applies the knowledge_ts filter automatically unless you explicitly request raw bitemporal data.
Code Example
import duckdb
import pandas as pd
conn = duckdb.connect()
conn.execute("INSTALL iceberg; LOAD iceberg;")
# Point-in-time query: what signals were available for a backtest on 2024-10-15?
# CRITICAL: use knowledge_ts <= backtest_date, not effective_ts
backtest_date = "2024-10-15"
df = conn.execute(f"""
SELECT
ticker,
date AS effective_date,
score,
knowledge_ts,
DATEDIFF('hour', effective_ts, knowledge_ts) AS pipeline_lag_hours
FROM vectorfinancials.signals.whystock_score
WHERE knowledge_ts <= TIMESTAMP '{backtest_date} 23:59:59'
AND date >= DATE '{backtest_date}' - INTERVAL 7 DAYS
AND date <= DATE '{backtest_date}'
ORDER BY date DESC, score DESC
LIMIT 20
""").df()
print(f"Signals available for backtest as of {backtest_date}:")
print(df.to_string(index=False))
# Verify: no rows have knowledge_ts after the backtest cutoff (look-ahead check).
# Compare against an explicit end-of-day Timestamp: comparing against the bare
# date string would falsely flag rows ingested during the day on 2024-10-15.
cutoff = pd.Timestamp(f"{backtest_date} 23:59:59")
future_data = df[df["knowledge_ts"] > cutoff]
assert len(future_data) == 0, f"Look-ahead bias detected! {len(future_data)} rows"
print("\nLook-ahead bias check: PASSED")