VectorFin/Glossary/Fine-Tuning
ML & AI

What is Fine-Tuning?

Adapting a pretrained model to a specific domain or task by continuing training on a targeted dataset, improving accuracy without training from scratch.

In Plain English

Think of a pretrained language model as a highly educated generalist — someone who has read billions of pages of text and developed a deep intuition for language. Fine-tuning is like giving that generalist a focused internship in a specific field. You do not start their education from scratch; you build on what they already know and sharpen their expertise in the area that matters to you.

For financial applications, a general-purpose language model knows words like "revenue," "margin," and "guidance," but it has absorbed those words from a vast mixture of sources. Fine-tuning on earnings call transcripts, 10-K filings, and analyst reports trains the model to recognize the specific patterns, euphemisms, and code words that financial professionals use — "we see some lumpiness in the quarter" as a softened admission of weak results, for instance.

Fine-tuning requires far less data and compute than pretraining. You start from a model whose billions of parameters are already trained and continue training on your domain data; in parameter-efficient variants, you freeze most of those parameters and update only a small subset. Techniques like LoRA (Low-Rank Adaptation) make this especially efficient by training small low-rank matrices rather than full weight updates, dramatically reducing the number of trainable parameters.

The result is a model that speaks the language of your domain, produces more accurate embeddings for domain-specific concepts, and makes fewer errors on domain-specific classification tasks.

Technical Definition

Fine-tuning involves continuing gradient descent on a pretrained model M_θ using a domain-specific dataset D_ft with a task-appropriate loss function. For embedding models, the typical objective is contrastive loss:

L = -log[ exp(sim(e_q, e_p) / τ) / Σⱼ exp(sim(e_q, e_j) / τ) ]

where e_q is a query embedding, e_p is a positive (relevant) document embedding, and the sum in the denominator runs over the positive plus all negative examples (typically the other documents in the batch). The temperature τ controls the sharpness of the similarity distribution.
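On toy vectors, this loss can be computed directly. The NumPy sketch below is illustrative (not VectorFin code) and assumes L2-normalized embeddings, so a dot product equals the cosine similarity sim(·,·):

```python
import numpy as np

def contrastive_loss(e_q, e_p, e_negs, tau=0.05):
    """InfoNCE-style contrastive loss for one (query, positive) pair.
    Embeddings are assumed L2-normalized, so dot product = sim(.,.)."""
    logits = np.concatenate([[e_q @ e_p], e_negs @ e_q]) / tau
    # negative log-softmax of the positive logit, computed stably via log-sum-exp
    m = logits.max()
    return -(logits[0] - (m + np.log(np.exp(logits - m).sum())))

rng = np.random.default_rng(0)
unit = lambda v: v / np.linalg.norm(v)
q = unit(rng.normal(size=8))
p = unit(q + 0.1 * rng.normal(size=8))               # close to the query
negs = np.stack([unit(rng.normal(size=8)) for _ in range(4)])
print(f"loss (relevant positive): {contrastive_loss(q, p, negs):.4f}")
print(f"loss (random positive):   {contrastive_loss(q, negs[0], negs[1:]):.4f}")
```

A relevant positive drives the loss toward zero; a random one leaves it much higher, which is exactly the gradient signal that pulls query and positive embeddings together during fine-tuning.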

Parameter-efficient fine-tuning (PEFT) methods, particularly LoRA, decompose weight update ΔW as ΔW = BA where B ∈ ℝ^(d×r) and A ∈ ℝ^(r×k), with rank r ≪ min(d,k). This reduces trainable parameters by 10–100× while matching full fine-tuning performance on most tasks.
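For a single d×k weight matrix, the savings are easy to count out. This NumPy sketch uses hypothetical sizes (d = k = 768, rank r = 8) to show both the parameter reduction and the LoRA forward pass:

```python
import numpy as np

d, k, r = 768, 768, 8                  # hypothetical layer sizes and LoRA rank
full_params = d * k                    # full fine-tuning: update all of W
lora_params = d * r + r * k            # LoRA: update only B and A
print(f"full: {full_params:,}  lora: {lora_params:,}  "
      f"reduction: {full_params / lora_params:.0f}x")

rng = np.random.default_rng(0)
W = rng.normal(size=(d, k))            # frozen pretrained weight
B = np.zeros((d, r))                   # B starts at zero, so initially dW = BA = 0
A = rng.normal(size=(r, k)) * 0.01
x = rng.normal(size=d)
h = x @ (W + B @ A)                    # forward pass; equals x @ W until B is trained
```

Initializing B to zero is the standard LoRA trick: the adapted model starts out exactly equal to the pretrained one, and the update ΔW = BA grows only as training proceeds.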

Catastrophic forgetting — loss of general knowledge during fine-tuning — is mitigated by low learning rates (1e-5 to 1e-4), early stopping, and mixing domain data with general-purpose samples.
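The data-mixing mitigation can be as simple as a batch sampler that draws a fixed fraction of each batch from the domain corpus. A hypothetical sketch (the 80/20 ratio is illustrative, not a recommendation):

```python
import random

def mixed_batches(domain, general, mix_ratio=0.8, batch_size=16, seed=0):
    """Yield training batches blending domain-specific and general-purpose
    samples -- a common guard against catastrophic forgetting."""
    rng = random.Random(seed)
    n_domain = int(batch_size * mix_ratio)
    while True:
        batch = rng.sample(domain, n_domain) + rng.sample(general, batch_size - n_domain)
        rng.shuffle(batch)
        yield batch

domain = [("filing", i) for i in range(100)]
general = [("general", i) for i in range(100)]
batch = next(mixed_batches(domain, general))
print(sum(1 for src, _ in batch if src == "filing"), "of", len(batch), "from the domain corpus")
```

Keeping a slice of general-purpose text in every batch means the gradient never pushes exclusively toward the domain distribution, which is what erodes general knowledge in the first place.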

How VectorFin Uses This

VectorFin's embedding pipeline uses gemini-embedding-2-preview, which has been trained on diverse text including financial documents. The embeddings it produces capture domain-specific financial semantics out of the box, meaning customers benefit from implicit fine-tuning on financial language without managing their own training infrastructure.

For customers who need custom fine-tuned embeddings — for example, a hedge fund that wants embeddings calibrated to their proprietary taxonomy of risk factors — VectorFin's Pro and Enterprise tiers provide raw Iceberg access to transcript and filing text, which can serve as fine-tuning data:

import duckdb

# Pull document chunks from the Iceberg warehouse as raw fine-tuning material
conn = duckdb.connect()
conn.execute("INSTALL iceberg; LOAD iceberg;")

pairs = conn.execute("""
    SELECT ticker, fiscal_period, chunk_idx, chunk_text
    FROM iceberg_scan('gs://vectorfinancials-data/warehouse/embeddings/transcripts/')
    WHERE knowledge_ts >= '2023-01-01'
    ORDER BY ticker, fiscal_period, chunk_idx
""").fetchdf()

print(f"Fine-tuning corpus: {len(pairs):,} document chunks")
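The query above returns individual chunks, not labeled training pairs. One simple heuristic (illustrative, not part of the VectorFin API) treats adjacent chunks of the same transcript as (anchor, positive) pairs, since neighboring passages usually share topical context:

```python
import pandas as pd

# Stand-in for the `pairs` DataFrame returned by the Iceberg query above
chunks = pd.DataFrame({
    "ticker":        ["AAPL", "AAPL", "AAPL", "MSFT", "MSFT"],
    "fiscal_period": ["2024Q1"] * 3 + ["2024Q1"] * 2,
    "chunk_idx":     [0, 1, 2, 0, 1],
    "chunk_text":    ["a", "b", "c", "d", "e"],
})

# Pair each chunk with its successor within the same transcript
chunks = chunks.sort_values(["ticker", "fiscal_period", "chunk_idx"])
chunks["positive"] = chunks.groupby(["ticker", "fiscal_period"])["chunk_text"].shift(-1)
train_pairs = chunks.dropna(subset=["positive"])
print(list(zip(train_pairs["chunk_text"], train_pairs["positive"])))
# [('a', 'b'), ('b', 'c'), ('d', 'e')]
```

With in-batch negatives (as in the loss below), these pairs alone are enough to start contrastive fine-tuning; explicit hard negatives can be added later.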

Code Example

# Example: fine-tune a sentence transformer on financial text pairs
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Load base model
model = SentenceTransformer("all-MiniLM-L6-v2")

# Training triplets: (anchor, positive, hard negative)
# These would come from VectorFin Iceberg tables or your own labels
train_examples = [
    InputExample(texts=[
        "Revenue growth exceeded expectations driven by strong cloud demand.",
        "Cloud segment outperformed guidance with accelerating top-line momentum.",
        "Supply chain disruptions reduced gross margins by 200 basis points.",
    ]),
    InputExample(texts=[
        "We expect headwinds from FX to moderate in the second half.",
        "Currency tailwinds are expected to become neutral by Q3.",
        "The company increased its quarterly dividend by 15 percent.",
    ]),
    # ... thousands more pairs from the VectorFin corpus
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    warmup_steps=100,
    output_path="./financial-embedding-model",
)
print("Fine-tuned model saved.")

Put Fine-Tuning to work in your pipeline

Access AI-ready financial data — embeddings, signals, Iceberg tables.