VectorFin/Glossary/Attention Mechanism
ML & AI

What is Attention Mechanism?

The core operation in transformer models that lets each token weigh how much it should focus on every other token in the sequence.

In Plain English

Before the attention mechanism, neural networks processed sequences (words, sentences) one step at a time, like reading left to right and keeping only a short memory. By the time you reached the end of a long document, the model had largely forgotten the beginning. This made understanding complex, long-range dependencies — "the CEO mentioned earlier that margins would recover, so what does this sentence about pricing mean?" — very difficult.

The attention mechanism solves this by letting every word in a sentence simultaneously ask: "Which other words in this passage are most relevant to understanding me?" Each word generates three vectors: a query (what I'm looking for), a key (what I offer to others), and a value (the information I carry). The model computes a relevance score between every query-key pair, normalizes them into weights, and produces a new representation as a weighted sum of all value vectors.
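The query/key/value flow above can be sketched numerically. This is a toy illustration, not VectorFin code: three tokens with hand-picked 2-dimensional vectors, scored, normalized with softmax, and combined into weighted sums of values.

```python
import numpy as np

# Toy example: three "tokens", each with a 2-dimensional
# query (what I'm looking for), key (what I offer), and
# value (the information I carry).
queries = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
keys    = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
values  = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])

# Relevance score between every query-key pair
scores = queries @ keys.T                      # shape (3, 3)

# Normalize each row of scores into weights with softmax
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

# Each token's new representation: weighted sum of all value vectors
output = weights @ values

print(weights.round(3))   # rows sum to 1
print(output.round(3))
```

Each row of `weights` is one token's attention distribution over the whole sequence; a high entry means that token leans heavily on the corresponding position's value vector.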

Imagine a CFO saying: "Revenue grew despite — as we discussed in the macro section — significant headwinds in EMEA." To understand what "headwinds" refers to, you need to connect it back to the "macro section" mention several clauses earlier. Attention does this by directly computing a high relevance score between "headwinds" and "macro," regardless of distance.

This ability to capture long-range dependencies without recurrence is what made transformers so powerful. Unlike RNNs, every token can attend to every other token in a single pass. The model learns which token relationships are important through training.

Technical Definition

Scaled dot-product attention takes three matrices as input: queries Q ∈ ℝ^{n×d_k}, keys K ∈ ℝ^{m×d_k}, and values V ∈ ℝ^{m×d_v}:

Attention(Q, K, V) = softmax(QK^T / √d_k) × V

The division by √d_k keeps the dot products from growing large in magnitude as d_k increases; without it, the softmax saturates and gradients become vanishingly small. The softmax produces a probability distribution over positions, and the output is a weighted sum of values.
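The formula above translates directly into a few lines of NumPy. This is a minimal sketch for illustration (random inputs, no masking or batching):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)             # (n, m) relevance scores
    # numerically stable softmax over key positions
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                          # (n, d_v) weighted sums

rng = np.random.default_rng(0)
n, m, d_k, d_v = 4, 6, 8, 5
Q = rng.standard_normal((n, d_k))
K = rng.standard_normal((m, d_k))
V = rng.standard_normal((m, d_v))

out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 5)
```

Note that each output row depends on all m value vectors at once, which is exactly the "every token attends to every other token in a single pass" property discussed above.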

Multi-head attention runs h attention heads in parallel, each learning different relationship patterns (e.g., syntactic, semantic, coreference). The outputs are concatenated and linearly projected:

MultiHead(Q, K, V) = Concat(head₁, ..., head_h) × W^O

where each head_i = Attention(QW_i^Q, KW_i^K, VW_i^V) with learned projection matrices.
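A sketch of multi-head self-attention with learned projections, here filled with random matrices purely to show the shapes (this is not the actual model code; the head/dimension split is the standard d_head = d_model / h convention):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def multi_head_self_attention(X, params):
    """Self-attention: Q, K, V all come from the same sequence X.
    params holds per-head (W_q, W_k, W_v) projections and a shared W_o."""
    heads = [attention(X @ Wq, X @ Wk, X @ Wv)
             for Wq, Wk, Wv in params["heads"]]
    # Concat(head_1, ..., head_h) @ W^O
    return np.concatenate(heads, axis=-1) @ params["W_o"]

rng = np.random.default_rng(1)
n, d_model, h = 5, 16, 4
d_head = d_model // h
X = rng.standard_normal((n, d_model))
params = {
    "heads": [(rng.standard_normal((d_model, d_head)),
               rng.standard_normal((d_model, d_head)),
               rng.standard_normal((d_model, d_head)))
              for _ in range(h)],
    "W_o": rng.standard_normal((d_model, d_model)),
}

out = multi_head_self_attention(X, params)
print(out.shape)  # (5, 16)
```

Because each head has its own projections, each can specialize in a different relationship pattern while operating on the same input sequence.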

Self-attention is the special case where Q, K, V all come from the same sequence — each token attends to every token in that sequence, including itself. Cross-attention (in encoder-decoder models) has the decoder attending to encoder outputs.

How VectorFin Uses This

VectorFin uses Gemini-family embedding models, which are built on transformer architectures with multi-head self-attention. When a 512-token earnings call chunk is processed, self-attention allows the model to capture long-range relationships within the passage — connecting a reference to "the guidance we provided last quarter" back to the specific figures mentioned earlier in the same chunk.

This long-range understanding is why transformer-based embeddings dramatically outperform older bag-of-words approaches on financial text. Earnings calls are dense with cross-references, conditional statements, and management hedging language where meaning depends heavily on context that may appear many sentences away.

The 64-token overlap between adjacent chunks in VectorFin's chunking strategy is partly motivated by attention: if a sentence's meaning depends on context from the previous chunk, the overlap ensures that context is present in both chunks' attention windows.
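The sliding-window-with-overlap idea can be sketched as below. This is a hypothetical illustration of the 512-token / 64-token-overlap scheme described above, not VectorFin's actual chunker; integer token IDs stand in for a real tokenized transcript.

```python
def chunk_tokens(tokens, size=512, overlap=64):
    """Split a token sequence into windows of `size` tokens,
    where adjacent windows share `overlap` tokens."""
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
    return chunks

tokens = list(range(1200))  # stand-in for a tokenized earnings call
chunks = chunk_tokens(tokens)

print([len(c) for c in chunks])
# Adjacent chunks share their boundary tokens, so context that
# straddles a chunk boundary appears in both attention windows:
print(chunks[0][-64:] == chunks[1][:64])  # True
```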

Code Example

# Demonstrating how attention context affects embedding quality
# by comparing chunk representations with and without overlap

import requests
import numpy as np

API_BASE = "https://api.vectorfinancials.com"
API_KEY = "vf_your_api_key_here"

# Fetch all chunks for a filing — each chunk is an attention window
resp = requests.get(
    f"{API_BASE}/v1/embeddings/MSFT",
    params={"period": "2024-Q4", "source": "earnings_call"},
    headers={"X-API-Key": API_KEY},
)
resp.raise_for_status()
chunks = resp.json()["chunks"]

print(f"Total chunks: {len(chunks)}")
print(f"Embedding dimension: {len(chunks[0]['embedding'])}")

# Adjacent chunks should have high cosine similarity due to overlap
def cosine(a, b):
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

for i in range(min(3, len(chunks) - 1)):
    sim = cosine(chunks[i]["embedding"], chunks[i+1]["embedding"])
    print(f"Chunk {i} → Chunk {i+1} similarity: {sim:.4f}")

Put Attention Mechanism to work in your pipeline

Access AI-ready financial data — embeddings, signals, Iceberg tables.