The process of splitting text into discrete units (tokens) that a language model can process, typically using subword algorithms like BPE or WordPiece.
In Plain English
Neural language models cannot read raw text the way humans do. They need numbers. Tokenization is the process of converting a string of text into a sequence of integers that the model can process — breaking "revenue grew 12% YoY" into pieces like ["revenue", " grew", " 12", "%", " Y", "o", "Y"] and then mapping each piece to a number in a fixed vocabulary.
The tricky part is choosing how to split the text. Splitting on whitespace (one token per word) is simple but breaks down on rare words, compound words, and domain-specific jargon like "EBITDA" or "CAPM." Splitting into individual characters is universal but makes sequences absurdly long. Modern tokenizers use a middle path called subword tokenization: common words get their own token, rare words are split into meaningful fragments.
For example, "tokenization" might become ["token", "ization"] — two subword pieces that also appear in other contexts, so the model can leverage shared knowledge about what "token" means and what "-ization" as a suffix implies.
Byte-Pair Encoding (BPE) builds a vocabulary by iteratively merging the most frequent character pairs in a training corpus. WordPiece (used in BERT) does something similar but maximizes the probability of the training data. The result is a vocabulary of 30,000-50,000 subword pieces that efficiently cover most text.
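The merge loop at the heart of BPE can be sketched in a few lines of Python. This is a toy illustration of the training procedure described above, not a production tokenizer — real implementations work at the byte level and handle pre-tokenization, but the core idea is the same: count adjacent pairs, merge the most frequent one, repeat.

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Toy BPE trainer: learn merge rules from a list of words."""
    # Represent each word as a tuple of symbols (single characters to start).
    vocab = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        # Apply the merge everywhere it occurs.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

corpus = ["token", "tokens", "tokenization", "organization"]
merges = train_bpe(corpus, num_merges=5)
print(merges)
```

On this tiny corpus the first merges build up "token" character by character — which is exactly why "tokenization" ends up splitting as ["token", ...] with reusable pieces.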
Financial text poses specific tokenization challenges: ticker symbols, SEC form numbers (10-K, 8-K), accounting terms, and numeric formats like "$12.4B" or "2024-Q3" may be split in ways that obscure their meaning. This is part of why finance-specific models tend to outperform general-purpose ones.
Technical Definition
Given a text string s, a tokenizer applies a learned vocabulary V of size |V| to produce a sequence of token IDs t = [t₁, t₂, ..., tₙ] where each tᵢ ∈ {0, 1, ..., |V|-1}.
BPE tokenization starts with individual bytes/characters and iteratively merges the most frequent adjacent pair until a target vocabulary size is reached. The resulting vocabulary covers high-frequency substrings. Special tokens are added: [CLS] for sequence classification, [SEP] for segment boundaries, [PAD] for batch padding, [MASK] for masked language modeling.
The context window is the model's maximum sequence length in tokens. For most BERT-class models it is 512 tokens; for GPT-4-class models, 128K tokens; for Gemini embedding models, 2048 tokens. In English text, a token is approximately 0.75 words or 4 characters.
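The ~4-characters-per-token rule of thumb above yields a quick pre-flight check before sending text to a model. A minimal sketch — the helper names are ours, and the heuristic is approximate (real token counts depend on the tokenizer):

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token-count estimate from the ~4 chars/token heuristic."""
    return max(1, round(len(text) / chars_per_token))

def fits_context(text: str, max_tokens: int = 512) -> bool:
    """Heuristic check against a model's context window."""
    return estimate_tokens(text) <= max_tokens

sample = "revenue grew 12% YoY"
print(estimate_tokens(sample))  # 20 chars -> ~5 tokens
```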
Chunking strategy in production systems must account for tokenization: a "512-word chunk" may be 600-750 tokens. VectorFin processes earnings calls at the token level, splitting at 512 tokens with 64-token overlap to preserve context at chunk boundaries.
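The 512-token / 64-token-overlap scheme described above can be sketched as a sliding window over a token-ID sequence (the tokenizer itself is abstracted away here; this is an illustration, not VectorFin's actual implementation):

```python
def chunk_tokens(token_ids, chunk_size=512, overlap=64):
    """Split a token-ID sequence into fixed-size windows with overlap."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    stride = chunk_size - overlap  # how far the window advances each step
    chunks = []
    for start in range(0, len(token_ids), stride):
        chunks.append(token_ids[start:start + chunk_size])
        if start + chunk_size >= len(token_ids):
            break  # last window reached the end of the document
    return chunks

tokens = list(range(1200))  # stand-in for a tokenized filing
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])  # [512, 512, 304]
```

Note that each window after the first starts 64 tokens before the previous one ended, so a sentence cut at a boundary still appears whole in the next chunk.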
How VectorFin Uses This
VectorFin's embedding pipeline tokenizes every earnings call and SEC filing before passing chunks to gemini-embedding-2-preview. The pipeline:
1. Cleans raw HTML/XBRL from SEC EDGAR filings
2. Normalizes whitespace and removes boilerplate headers/footers
3. Splits the text into 512-token chunks with 64-token overlap
4. Embeds each chunk independently
The 64-token overlap is a deliberate design choice: when a sentence spans a chunk boundary, at least 64 tokens of prior context are included in the next chunk's attention window, reducing information loss at boundaries.
Ticker symbols receive special handling — "AAPL" would otherwise tokenize to ["A", "AP", "L"], losing the identity of the symbol. The pipeline normalizes tickers to a canonical form before tokenization.
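One way to implement that normalization is to wrap known tickers in a sentinel before tokenization. A minimal sketch — the ticker set, sentinel format, and function name are all illustrative assumptions, not VectorFin's actual canonical form:

```python
import re

# Hypothetical ticker set; a real pipeline would load the full symbol list.
KNOWN_TICKERS = {"AAPL", "GOOGL", "MSFT"}

def normalize_tickers(text: str) -> str:
    """Wrap known tickers in a sentinel so they survive tokenization intact."""
    def repl(match):
        word = match.group(0)
        return f"<TICKER:{word}>" if word in KNOWN_TICKERS else word
    return re.sub(r"\b[A-Z]{1,5}\b", repl, text)

print(normalize_tickers("AAPL and GOOGL beat estimates; YOY unchanged"))
```

In practice the sentinel would also be registered as a special token in the model's vocabulary so it maps to a single ID rather than being split again.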
The chunk_idx field in the API response maps directly to the tokenized chunk position, making it easy to reconstruct the original document order or fetch adjacent chunks for context.
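Fetching adjacent chunks via chunk_idx can be sketched like this. The chunk_idx and text field names come from the API response described above; the record shapes here are hypothetical minimal stand-ins:

```python
def get_context_window(chunks, idx, radius=1):
    """Return the chunk at idx plus its neighbors, in document order."""
    by_idx = {c["chunk_idx"]: c for c in chunks}
    window = []
    for i in range(idx - radius, idx + radius + 1):
        if i in by_idx:
            window.append(by_idx[i])
    return window

# Hypothetical minimal chunk records mirroring the API response shape.
chunks = [
    {"chunk_idx": 2, "text": "liquidity remains strong..."},
    {"chunk_idx": 0, "text": "revenue grew 12%..."},
    {"chunk_idx": 1, "text": "operating margin expanded..."},
]
window = get_context_window(chunks, idx=1)
print([c["chunk_idx"] for c in window])  # [0, 1, 2]
```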
Code Example
import requests
API_BASE = "https://api.vectorfinancials.com"
API_KEY = "vf_your_api_key_here"
# Fetch all chunks for a 10-K MD&A section — each chunk = 512-token window
resp = requests.get(
    f"{API_BASE}/v1/embeddings/GOOGL",
    params={
        "source": "filing",
        "filing_type": "10-K",
        "section": "mda",
        "period": "2024-Q4",
    },
    headers={"X-API-Key": API_KEY},
)
resp.raise_for_status()  # fail fast on auth or quota errors
data = resp.json()
chunks = data["chunks"]
print(f"Filing split into {len(chunks)} chunks (512 tokens each, 64-token overlap)")
print(f"Approximate word count: {len(chunks) * 384}") # ~0.75 words/token
# Show the first two chunks — overlap region is at the seam
print("\n--- Chunk 0 (end) ---")
print(chunks[0]["text"][-200:])
print("\n--- Chunk 1 (start) ---")
print(chunks[1]["text"][:200])
# The last ~64 tokens of chunk 0 should appear at the start of chunk 1