NLP for Finance

What is a 10-K Filing?

The comprehensive annual report that public companies file with the SEC, containing audited financials, management discussion, risk factors, and business overview.

In Plain English

Once a year, every public company in the United States files a 10-K with the Securities and Exchange Commission. This document is the most comprehensive window into a company that exists: a report that often runs 100 to 300 pages or more, covering everything from audited financial statements to descriptions of competitive threats to the compensation of the top executives (the compensation detail is frequently incorporated by reference from the proxy statement). Required by federal law, subject to antifraud liability, and carrying an independent auditor's report on the financial statements, it's the foundation of fundamental analysis.

The 10-K is organized into standardized sections that make cross-company comparison possible. Item 1 describes the business. Item 1A covers risk factors, a detailed (sometimes terrifyingly candid) list of everything that could go wrong. Item 7 is Management's Discussion and Analysis, where management explains the year's results in narrative form. Item 8 contains the audited financial statements and their footnotes; the items after it cover controls, governance, and exhibits.

Risk factors deserve special attention. Under SEC guidance, companies must disclose "material" risks — those that a reasonable investor would consider important. Companies have strong legal incentives to be thorough: under-disclosing a risk that later materializes can trigger securities litigation. The result is that risk factors in 10-Ks contain some of the most candid corporate disclosure anywhere. The challenge is that they're also dense, repetitive, and lengthy.

For NLP systems, the 10-K is a goldmine. The MD&A section in particular contains narrative language about financial performance, strategic direction, and management's own assessment of risks and opportunities. Changes in this language from year to year carry signal — when risk factor language becomes more specific and urgent, or when MD&A language shifts from optimistic to hedged, something is changing.
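One crude way to make the "optimistic to hedged" shift concrete is to compare how often hedging words appear in two years of MD&A text. The sketch below is illustrative only: the `HEDGE_TERMS` list and the two sample sentences are hypothetical, not a validated lexicon or real filing text.

```python
import re

# Illustrative hedging vocabulary; a hypothetical list, not a validated lexicon.
HEDGE_TERMS = {"may", "might", "could", "uncertain", "uncertainty",
               "approximately", "possibly"}

def hedge_rate(text):
    """Return the share of word tokens that appear in HEDGE_TERMS."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return sum(t in HEDGE_TERMS for t in tokens) / len(tokens)

# Made-up sentences standing in for two consecutive years of MD&A text.
mda_prior = "Revenue grew strongly and we expect continued growth."
mda_current = "Revenue may decline and demand could remain uncertain."

shift = hedge_rate(mda_current) - hedge_rate(mda_prior)
print(f"Hedging rate shift: {shift:+.3f}")
```

A positive shift means the newer text hedges more. Word counts like this miss context and negation, which is one reason embedding-based comparisons are often preferred.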

Technical Definition

10-K filing requirements (Regulation S-K):

Part I, Item 1: Business description
Part I, Item 1A: Risk Factors
Part I, Item 1B: Unresolved Staff Comments
Part II, Item 7: Management's Discussion and Analysis (MD&A)
Part II, Item 7A: Quantitative and Qualitative Disclosures About Market Risk
Part II, Item 8: Financial Statements (audited by an independent registered public accounting firm)
Part II, Item 9A: Controls and Procedures
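Because the item headers are standardized, a plain-text 10-K can be split into sections with a regular expression. The following is a minimal sketch, assuming the filing has already been converted to plain text and that each header starts a line in the form "Item 1A."; real filings vary in formatting, and `split_items` is a hypothetical helper, not part of any library.

```python
import re

# Matches headers such as "Item 1A. Risk Factors" at the start of a line.
ITEM_HEADER = re.compile(r"^\s*item\s+(\d{1,2}[a-c]?)\s*[.:]",
                         re.IGNORECASE | re.MULTILINE)

def split_items(text):
    """Map item labels (e.g. "1A") to the text between that header and the next."""
    matches = list(ITEM_HEADER.finditer(text))
    sections = {}
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        label = m.group(1).upper()
        body = text[m.end():end].strip()  # still includes the header title
        # Keep the longest span per label: a table of contents repeats
        # each header with almost no body text.
        if len(body) > len(sections.get(label, "")):
            sections[label] = body
    return sections

sample = """Item 1. Business
We make widgets.
Item 1A. Risk Factors
Demand may fall.
Item 7. Management's Discussion and Analysis
Revenue rose 5%.
"""
sections = split_items(sample)
print(sorted(sections))
```

Keeping the longest span per label is a heuristic for skipping table-of-contents entries; it will not handle every layout.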

Filing deadlines: Large accelerated filers (public float of $700M or more): 60 days after fiscal year end. Accelerated filers (public float of $75M to under $700M): 75 days. Non-accelerated filers: 90 days.
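The deadline rule can be sketched as a small lookup. This is a simplification under stated assumptions: it classifies filers by public float alone (the SEC's tests have additional conditions), and it returns the nominal date without rolling weekends or holidays to the next business day. The function names are hypothetical.

```python
from datetime import date, timedelta

# Days after fiscal year end, per the deadlines stated above.
DEADLINE_DAYS = {"large_accelerated": 60, "accelerated": 75, "non_accelerated": 90}

def filer_category(public_float_usd):
    """Classify by public float only (a simplification of the SEC tests)."""
    if public_float_usd >= 700_000_000:
        return "large_accelerated"
    if public_float_usd >= 75_000_000:
        return "accelerated"
    return "non_accelerated"

def filing_deadline(fiscal_year_end, category):
    """Nominal deadline; does not roll weekends or holidays forward."""
    return fiscal_year_end + timedelta(days=DEADLINE_DAYS[category])

fye = date(2024, 12, 31)
for float_usd in (2_000_000_000, 300_000_000, 50_000_000):
    cat = filer_category(float_usd)
    print(cat, filing_deadline(fye, cat).isoformat())
```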

XBRL tagging: The SEC phased in XBRL tagging of financial data in 10-Ks starting in 2009, and Inline XBRL (iXBRL) starting in 2019 under rules adopted in 2018. Tagging enables machine-readable extraction of structured financial data. The narrative sections (MD&A, risk factors) remain unstructured text.
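To illustrate what "machine-readable" means here, the sketch below pulls annual values for one tagged concept out of a dictionary shaped like the JSON that SEC's XBRL company-facts endpoint returns. The field names (`facts`, `us-gaap`, `units`, `end`, `val`, `form`, `filed`) are assumed from that format, `annual_values` is a hypothetical helper, and the numbers in `sample_facts` are made up.

```python
def annual_values(company_facts, concept, unit="USD"):
    """Return {period_end: value} for a us-gaap concept as reported in 10-K filings."""
    entries = company_facts["facts"]["us-gaap"][concept]["units"][unit]
    by_period_end = {}
    # Process in filing order so the most recently filed value wins when a
    # later 10-K repeats (or restates) a prior period as a comparative.
    for e in sorted(entries, key=lambda e: e["filed"]):
        if e.get("form") == "10-K":
            by_period_end[e["end"]] = e["val"]
    return by_period_end

# Made-up data in the assumed shape; not real company figures.
sample_facts = {"facts": {"us-gaap": {"Revenues": {"units": {"USD": [
    {"end": "2022-12-31", "val": 100, "form": "10-K", "filed": "2023-02-15"},
    {"end": "2022-12-31", "val": 101, "form": "10-K", "filed": "2024-02-14"},
    {"end": "2023-12-31", "val": 120, "form": "10-K", "filed": "2024-02-14"},
    {"end": "2023-09-30", "val": 30, "form": "10-Q", "filed": "2023-11-01"},
]}}}}}

print(annual_values(sample_facts, "Revenues"))
```

Keying on the period end date rather than the filing's fiscal year avoids mixing current-year values with the prior-year comparatives that each 10-K also reports.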

SEC EDGAR provides full-text search at efts.sec.gov/LATEST/search-index?q=.... VectorFin ingests 10-K text via EDGAR's submissions API: https://data.sec.gov/submissions/CIK{CIK}.json, where {CIK} is the company's Central Index Key zero-padded to 10 digits.
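A minimal sketch of working with the submissions endpoint follows. It assumes the response keeps recent filings as parallel arrays under `filings.recent` (`form`, `filingDate`, `accessionNumber`) and that the URL uses a 10-digit zero-padded CIK. The helper names and the accession numbers in the sample are hypothetical, and this is not a description of VectorFin's actual ingestion code.

```python
import json
import urllib.request

def submissions_url(cik):
    """Build the submissions URL with the CIK zero-padded to 10 digits."""
    return f"https://data.sec.gov/submissions/CIK{int(cik):010d}.json"

def fetch_submissions(cik, user_agent):
    """Download the submissions JSON; SEC asks callers to send a descriptive User-Agent."""
    req = urllib.request.Request(submissions_url(cik),
                                 headers={"User-Agent": user_agent})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)

def annual_reports(submissions):
    """Pick the 10-K rows out of the parallel arrays of recent filings."""
    recent = submissions["filings"]["recent"]
    rows = zip(recent["form"], recent["filingDate"], recent["accessionNumber"])
    return [{"form": f, "filed_date": d, "accession": a}
            for f, d, a in rows if f == "10-K"]

# Offline sample in the assumed shape; accession numbers are placeholders.
sample = {"filings": {"recent": {
    "form": ["10-Q", "10-K", "8-K"],
    "filingDate": ["2024-08-01", "2024-02-10", "2024-01-05"],
    "accessionNumber": ["0000000000-24-000003", "0000000000-24-000002",
                        "0000000000-24-000001"],
}}}

print(submissions_url(320193))
print(annual_reports(sample))
```

The submissions JSON lists filings rather than containing their text, so the documents themselves would still need to be fetched separately using the accession number.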

How VectorFin Uses This

VectorFin indexes the MD&A and Risk Factors sections of 10-K filings as embeddings, stored in:

gs://vectorfinancials-data/warehouse/embeddings/filings/

Each chunk row includes: ticker, filing_type (10-K), section (mda | risk_factors | business), filed_date, knowledge_ts, embedding.

The bitemporal design is especially important for 10-K filings: the effective timestamp is the filed_date (when the company submitted to the SEC), and the knowledge_ts is when VectorFin ingested it, typically 1-3 days later. For backtests using 10-K data, filter by knowledge_ts <= backtest_date so the backtest only sees filings that were already in the dataset on that date, which avoids look-ahead bias.
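The point-in-time filter described above can be sketched in a few lines. The rows here are plain dictionaries with made-up dates and a placeholder ticker, and `visible_as_of` is a hypothetical helper; in practice knowledge_ts is likely a full timestamp rather than a date.

```python
from datetime import date

def visible_as_of(rows, backtest_date):
    """Keep only rows whose knowledge_ts is on or before the backtest date."""
    return [r for r in rows if r["knowledge_ts"] <= backtest_date]

# Made-up rows: the second filing was filed before the backtest date
# but ingested after it, so a point-in-time backtest must not see it.
rows = [
    {"ticker": "XYZ", "filed_date": date(2023, 2, 10), "knowledge_ts": date(2023, 2, 12)},
    {"ticker": "XYZ", "filed_date": date(2024, 2, 9), "knowledge_ts": date(2024, 2, 12)},
]

visible = visible_as_of(rows, date(2024, 2, 10))
print([r["filed_date"].isoformat() for r in visible])
```

Filtering on filed_date instead would have let the 2024 filing through two days before it was actually available in the dataset.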

curl -H "X-API-Key: $VF_API_KEY" \
  "https://api.vectorfinancials.com/v1/embeddings/AAPL?source=filing&filing_type=10-K&section=mda&period=2024-Q4"

Code Example

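import os

import numpy as np
import requests

API_BASE = "https://api.vectorfinancials.com"
# Read the key from the environment (as in the curl example), with a placeholder fallback
API_KEY = os.environ.get("VF_API_KEY", "vf_your_api_key_here")

# Compare 10-K risk factors year-over-year to detect new or escalating risks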

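def get_filing_embeddings(ticker, filing_type, section, period):
    """Fetch the embedding chunks for one filing section and period."""
    resp = requests.get(
        f"{API_BASE}/v1/embeddings/{ticker}",
        params={"source": "filing", "filing_type": filing_type,
                "section": section, "period": period},
        headers={"X-API-Key": API_KEY},
        timeout=30,  # fail instead of hanging on a stalled connection
    )
    resp.raise_for_status()
    return resp.json()["chunks"]

def mean_embedding(chunks):
    """Average the chunk vectors into a single section-level vector."""
    if not chunks:
        raise ValueError("no chunks returned for this section and period")
    return np.mean([c["embedding"] for c in chunks], axis=0)

def cosine_sim(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))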

# Compare risk factor language from 2023 to 2024 annual report
risk_2023 = get_filing_embeddings("MSFT", "10-K", "risk_factors", "2023-Q4")
risk_2024 = get_filing_embeddings("MSFT", "10-K", "risk_factors", "2024-Q4")

similarity = cosine_sim(mean_embedding(risk_2023), mean_embedding(risk_2024))
print(f"MSFT 10-K risk factors 2023→2024 similarity: {similarity:.4f}")
print(f"Risk language drift: {1 - similarity:.4f}")
# Lower similarity = more changed risk language = potentially new material risks
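Comparing mean embeddings gives one number for the whole section, which can hide a single new risk paragraph among many unchanged ones. A possible refinement, sketched here as an assumption rather than as VectorFin's method, is to score each current-year chunk by its best match among prior-year chunks. The example uses tiny made-up vectors and pure Python for clarity; with real embeddings the same logic would normally be written with NumPy.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def novelty_scores(prior_vecs, current_vecs):
    """For each current chunk, 1 minus its best cosine match among prior chunks.

    Requires at least one prior chunk.
    """
    return [1.0 - max(cosine(c, p) for p in prior_vecs) for c in current_vecs]

# Toy vectors: the first current chunk repeats a prior chunk,
# the second points in a direction no prior chunk covers.
prior = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
current = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

scores = novelty_scores(prior, current)
print([round(s, 3) for s in scores])
```

Chunks with scores near 1 have no close counterpart in the prior year and are candidates for newly added risk language; the threshold for "new" is a judgment call.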
