Data Engineering

What is Knowledge Timestamp?

The system ingestion time, when your data pipeline first recorded a fact, used to prevent look-ahead bias in backtesting by filtering to only data available at a given historical date.

In Plain English

Imagine you're backtesting a stock strategy set in October 2022. You want to simulate: "If I had run this strategy on October 15, 2022, what would I have done?" To answer correctly, you need to use only data your system would have known about on October 15, 2022. Not data from November 2022. Not restated financial data from 2023. Not signals computed in November 2022.

Knowledge timestamp (also called transaction time or ingestion time) is how you enforce this constraint. Every record in a bitemporal database has a knowledge_ts, the moment when that record was written into the database. To simulate "what did I know on October 15, 2022?", you filter WHERE knowledge_ts <= '2022-10-15 23:59:59'. Any record added to the database after that timestamp is excluded from your query.

This sounds simple but it eliminates an enormous class of backtesting errors. The most pernicious is using data as if it were available before it actually was. A 10-K filing might have an effective date of December 31, 2022 (fiscal year end) but wasn't filed with the SEC until March 2023 and wasn't ingested by your system until March 5, 2023. Using it in a January 2023 backtest would be cheating, you would have been acting on information you couldn't have had.

Every strategy that has looked "too good" in backtesting should be audited for knowledge timestamp integrity. The look-ahead bias that comes from ignoring ingestion timing is one of the most common reasons why backtested returns don't survive contact with live markets.

Technical Definition

In SQL:2011 terminology, knowledge timestamp = "system time" or "transaction time." It is managed by the database automatically in fully temporal systems, or manually by the application in most practical implementations.

For an append-only temporal table:

sql

-- Every INSERT sets knowledge_ts to the current time
INSERT INTO signals (ticker, date, score, effective_ts, knowledge_ts)
VALUES ('AAPL', '2024-10-01', 0.72, '2024-10-01T21:00:00Z', NOW());

The critical invariant: knowledge_ts is immutable once written. You can never update it to backdate when you "knew" something.

Point-in-time reconstruction:

sql

-- Reconstruct the world as seen on 2022-10-15
SELECT *
FROM signals
WHERE knowledge_ts <= TIMESTAMP '2022-10-15 23:59:59'
  -- Get the most recent version of each (ticker, date) pair
  AND (ticker, date, knowledge_ts) IN (
    SELECT ticker, date, MAX(knowledge_ts)
    FROM signals
    WHERE knowledge_ts <= TIMESTAMP '2022-10-15 23:59:59'
    GROUP BY ticker, date
  )

Knowledge timestamp lag = knowledge_ts - effective_ts. For VectorFin signals, this is typically 5-10 hours (overnight pipeline). For earnings call embeddings, it's 1-3 days (filing ingestion delay). For Piotroski F-Score, it's up to 45 days (10-Q filing deadline after quarter end).

How VectorFin Uses This

VectorFin sets knowledge_ts to the UTC timestamp when the signals_writer or transcript_embedder Cloud Run Job finishes. This is deterministic and auditable, you can verify exactly when each record was written.

The pipeline runs weekly; on each run day the pipeline executes at approximately 02:00 UTC. For a signal with effective_ts = 2024-10-01T21:00:00Z (market close), the knowledge_ts will be approximately 2024-10-02T02:30:00Z, a 5.5-hour lag. For a live trading strategy that executes at 09:30 UTC on 2024-10-02, the signal is available before market open. For a strategy executing at 09:30 ET (13:30 UTC) on 2024-10-01 (same day), the signal is not yet available.

text

GET https://api.vectorfinancials.com/v1/signals/AAPL
    ?date_from=2024-10-01
    &date_to=2024-10-01
# each row carries knowledge_ts, filter rows where knowledge_ts <= your as-of time

Code Example

python

import requests
import pandas as pd
from datetime import date, timedelta

API_BASE = "https://api.vectorfinancials.com"
API_KEY = "vf_your_api_key_here"

# Backtest runner with correct knowledge_ts enforcement
class BacktestRunner:
    def __init__(self, api_key: str):
        self.headers = {"X-API-Key": api_key}

    def get_universe_signals(self, tickers: list[str], trade_date: str) -> pd.DataFrame:
        """
        Get signals available for trading on trade_date.
        Uses knowledge_ts <= trade_date to prevent look-ahead bias.
        """
        cutoff_ts = pd.Timestamp(trade_date + "T23:59:59Z")
        results = []
        for ticker in tickers:
            resp = requests.get(
                f"{API_BASE}/v1/signals/{ticker}",
                params={
                    "date_from": trade_date,
                    "date_to": trade_date,
                },
                headers=self.headers,
            )
            if resp.ok:
                for row in resp.json().get("signals", []):
                    # key: enforce knowledge_ts constraint client-side
                    if row.get("score") is None:
                        continue
                    if pd.to_datetime(row["knowledge_ts"]) > cutoff_ts:
                        continue
                    results.append({
                        "ticker": ticker,
                        "score": row["score"],
                        "effective_ts": row["effective_ts"],
                        "knowledge_ts": row["knowledge_ts"],
                    })

        df = pd.DataFrame(results)
        # Verify no future knowledge leaked in
        if not df.empty:
            max_kt = pd.to_datetime(df["knowledge_ts"]).max()
            cutoff = pd.Timestamp(trade_date + "T23:59:59Z")
            assert max_kt <= cutoff, f"Look-ahead bias: knowledge_ts {max_kt} > {cutoff}"
        return df

runner = BacktestRunner(API_KEY)
signals = runner.get_universe_signals(
    ["AAPL", "MSFT", "NVDA", "GOOGL", "META"],
    "2024-10-01"
)
print("Knowledge-timestamp-safe signals for 2024-10-01:")
print(signals.to_string(index=False))

External References

Put Knowledge Timestamp to work in your pipeline

Pull AI-ready embeddings and signals as Iceberg tables or over the REST API.

Get API Access