NLP for Finance

What is Lexical-Semantic Divergence?

The gap between how much a text changed in words versus in meaning, used to tell a substantive disclosure shift apart from a boilerplate reformat.

In Plain English

Two filings can differ in words without differing in meaning, and that difference is the whole problem with naive filing-diff tools. A legal team that reformats a risk-factor section, swaps an HTML template, or rephrases a sentence to mean the same thing produces a large word-level change and zero new information. If your screen reacts to word changes alone, it spends most of its attention on noise.

Lexical-semantic divergence is the fix. It compares two readings of the same change: the lexical change (how much the wording moved) and the semantic change (how much the meaning moved, measured with vector embeddings). Subtract the semantic change from the lexical change and you get a single number that tells you which kind of change you are looking at.

When the divergence is high, the words moved much more than the meaning: a reshuffle. When the divergence is near zero and both changes are large, the words and the meaning moved together: a real disclosure shift. That is the case worth a human read, and the case the Zillow example illustrates.

Technical Definition

For a given section, let cosine be the term-frequency (TF) cosine similarity and cosine_embedding the embedding cosine similarity, each computed against the prior-year version of the same section. Then:

lex_sem_divergence = (1 − cosine) − (1 − cosine_embedding) = cosine_embedding − cosine

A large positive value means lexical change exceeds semantic change (likely formatting). A value near zero with both changes high means a substantive shift. VectorFin sets format_switch_suspected = true when lexical change is high but semantic change is low.
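
The definition above can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not VectorFin's implementation: the whitespace tokenizer, the helper names tf_cosine and vector_cosine, and the idea of passing embeddings in as plain lists of floats are all hypothetical. Only the final formula comes from the definition above.

```python
import math
from collections import Counter


def tf_cosine(text_a, text_b):
    """Cosine similarity between raw term-frequency vectors of two texts.

    Assumption: lowercase + whitespace tokenization. A production
    tokenizer would likely differ.
    """
    tf_a = Counter(text_a.lower().split())
    tf_b = Counter(text_b.lower().split())
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def vector_cosine(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def lex_sem_divergence(cosine, cosine_embedding):
    """Lexical change minus semantic change, as defined above."""
    return (1 - cosine) - (1 - cosine_embedding)


# Wording moved a lot, meaning barely moved: large positive divergence.
reshuffle = lex_sem_divergence(cosine=0.60, cosine_embedding=0.95)

# Wording and meaning moved together: divergence near zero.
shift = lex_sem_divergence(cosine=0.55, cosine_embedding=0.58)
```

With these made-up similarity values, the first case gives a divergence of about 0.35 and the second about 0.03, even though the lexical change is similar in both.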

How VectorFin Uses This

lex_sem_divergence is a field on every scored section of a FilingChangeRecord from the Filing Change Signal, alongside cosine, cosine_embedding, and format_switch_suspected. It is the metric that turns a raw filing diff into a usable signal.
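
As a rough sketch of how these fields might feed a review queue, the snippet below skips sections flagged as format switches and ranks the rest by semantic change. The record layout is an assumption made for illustration: the "sections" and "name" keys, the function name sections_for_review, and the min_change cutoff are hypothetical. The per-section field names are the ones listed above, plus parse_status.

```python
def sections_for_review(record, min_change=0.10):
    """Return sections where wording and meaning both moved, ranked by semantic change.

    Assumptions: `record` is a dict with a "sections" list of per-section
    dicts, and `min_change` is an illustrative cutoff, not a documented one.
    """
    picked = []
    for sec in record["sections"]:
        if sec.get("parse_status", "ok") != "ok":
            continue  # unscored section, nothing to rank
        if sec["format_switch_suspected"]:
            continue  # words moved, meaning did not
        lexical = 1 - sec["cosine"]
        semantic = 1 - sec["cosine_embedding"]
        if lexical > min_change and semantic > min_change:
            picked.append(sec)
    # Lowest embedding similarity first, i.e. largest semantic change first.
    return sorted(picked, key=lambda s: s["cosine_embedding"])


# Made-up record for illustration only.
record = {
    "sections": [
        {"name": "Risk Factors", "parse_status": "ok", "cosine": 0.60,
         "cosine_embedding": 0.95, "format_switch_suspected": True},
        {"name": "MD&A", "parse_status": "ok", "cosine": 0.50,
         "cosine_embedding": 0.62, "format_switch_suspected": False},
        {"name": "Business", "parse_status": "ok", "cosine": 0.97,
         "cosine_embedding": 0.99, "format_switch_suspected": False},
        {"name": "Legal Proceedings", "parse_status": "ok", "cosine": 0.70,
         "cosine_embedding": 0.80, "format_switch_suspected": False},
    ]
}

review_queue = [sec["name"] for sec in sections_for_review(record)]
```

Sorting by cosine_embedding rather than by lex_sem_divergence is a deliberate choice here: a substantive shift has divergence near zero, so the divergence alone does not say how large the shift was.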

Code Example

python
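def classify(sec):
    """Label one scored section dict by the kind of year-over-year change."""
    # Sections that failed to parse carry no usable scores.
    if sec["parse_status"] != "ok":
        return "no-score"
    # Words moved but meaning did not: treat as a reformat, not as news.
    if sec["format_switch_suspected"]:
        return "format reshuffle (down-weight)"
    lexical = 1 - sec["cosine"]             # word-level change
    semantic = 1 - sec["cosine_embedding"]  # meaning-level change
    # Words and meaning moved together: the case worth a human read.
    if lexical > 0.15 and semantic > 0.10:
        return "substantive disclosure shift"
    return "minor change"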

Put Lexical-Semantic Divergence to work in your pipeline

Pull AI-ready embeddings and signals as Iceberg tables or over the REST API.
