Adapting a pretrained model to a specific domain or task by continuing training on a targeted dataset, improving accuracy without training from scratch.
In Plain English
Think of a pretrained language model as a highly educated generalist — someone who has read billions of pages of text and developed a deep intuition for language. Fine-tuning is like giving that generalist a focused internship in a specific field. You do not start their education from scratch; you build on what they already know and sharpen their expertise in the area that matters to you.
For financial applications, a general-purpose language model knows words like "revenue," "margin," and "guidance," but it has absorbed those words from a vast mixture of sources. Fine-tuning on earnings call transcripts, 10-K filings, and analyst reports trains the model to recognize the specific patterns, euphemisms, and code words that financial professionals use — "we see some lumpiness in the quarter" as a softened admission of weak results, for instance.
Fine-tuning requires far less data and compute than pretraining. You take a model with billions of parameters already trained, freeze most of them, and update only a smaller set with your domain data. Techniques like LoRA (Low-Rank Adaptation) make this even more efficient by training low-rank matrices rather than full weight updates, dramatically reducing the number of trainable parameters.
The result is a model that speaks the language of your domain, produces more accurate embeddings for domain-specific concepts, and makes fewer errors on domain-specific classification tasks.
Technical Definition
Fine-tuning involves continuing gradient descent on a pretrained model M_θ using a domain-specific dataset D_ft with a task-appropriate loss function. For embedding models, the typical objective is contrastive loss:
L = -log[ exp(sim(e_q, e_p) / τ) / Σⱼ exp(sim(e_q, e_j) / τ) ]

where e_q is a query embedding, e_p is a positive (relevant) document embedding, and the denominator sums over the positive and all negative document embeddings e_j. The temperature τ controls the sharpness of the distribution.
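The objective above can be sketched in a few lines of NumPy. This is a minimal single-query version for illustration; the function name and argument layout are assumptions, not any library's API:

```python
import numpy as np

def info_nce_loss(e_q, e_docs, positive_idx=0, tau=0.05):
    """Contrastive (InfoNCE) loss for one query.

    e_q          : (d,) query embedding
    e_docs       : (n, d) document embeddings; row `positive_idx` is the
                   positive, the remaining rows are negatives
    tau          : temperature controlling the sharpness of the softmax
    """
    # Cosine similarity between the query and every document
    e_q = e_q / np.linalg.norm(e_q)
    e_docs = e_docs / np.linalg.norm(e_docs, axis=1, keepdims=True)
    sims = e_docs @ e_q                        # (n,)
    logits = sims / tau
    logits = logits - logits.max()             # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[positive_idx])
```

The loss shrinks as the positive document's similarity grows relative to the negatives, which is exactly the pressure that pulls relevant query/document pairs together in embedding space.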
Parameter-efficient fine-tuning (PEFT) methods, particularly LoRA, decompose weight update ΔW as ΔW = BA where B ∈ ℝ^(d×r) and A ∈ ℝ^(r×k), with rank r ≪ min(d,k). This reduces trainable parameters by 10–100× while matching full fine-tuning performance on most tasks.
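To make that parameter arithmetic concrete, here is the count for a single weight matrix; the dimensions and rank are illustrative values, not a recommendation:

```python
# Trainable-parameter count: full weight update vs. LoRA decomposition ΔW = BA
d, k, r = 4096, 4096, 8          # layer dims and LoRA rank (illustrative)

full_params = d * k              # full fine-tuning updates the whole matrix
lora_params = d * r + r * k      # B ∈ R^(d×r) plus A ∈ R^(r×k)

print(f"full: {full_params:,}  lora: {lora_params:,}  "
      f"reduction: {full_params / lora_params:.0f}x")
```

At rank 8 on a 4096×4096 matrix, LoRA trains 65,536 parameters instead of 16,777,216, a 256× reduction, which is why it fits comfortably on a single GPU.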
Catastrophic forgetting — loss of general knowledge during fine-tuning — is mitigated by low learning rates (1e-5 to 1e-4), early stopping, and mixing domain data with general-purpose samples.
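The data-mixing mitigation can be sketched as a batch sampler that replays general-purpose examples alongside the domain corpus. The function name and the 80/20 split are illustrative choices, not a prescribed recipe:

```python
import random

def mixed_batch(domain, general, batch_size=16, domain_frac=0.8, seed=0):
    """Sample one training batch mixing domain data with general data.

    Replaying a fraction of general-purpose examples during fine-tuning is
    a simple guard against catastrophic forgetting: the model keeps seeing
    the distribution it was pretrained on.
    """
    rng = random.Random(seed)
    n_domain = round(batch_size * domain_frac)
    batch = (rng.sample(domain, n_domain)
             + rng.sample(general, batch_size - n_domain))
    rng.shuffle(batch)               # avoid ordering the sources
    return batch
```

In practice the mixing ratio is a hyperparameter: heavier domain weighting adapts faster but forgets more.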
How VectorFin Uses This
VectorFin's embedding pipeline uses gemini-embedding-2-preview, which has been trained on diverse text including financial documents. The embeddings it produces capture domain-specific financial semantics out of the box, meaning customers benefit from implicit fine-tuning on financial language without managing their own training infrastructure.
For customers who need custom fine-tuned embeddings — for example, a hedge fund that wants embeddings calibrated to their proprietary taxonomy of risk factors — VectorFin's Pro and Enterprise tiers provide raw Iceberg access to transcript and filing text, which can serve as fine-tuning data:
import duckdb
# Pull training pairs from the Iceberg warehouse for fine-tuning
conn = duckdb.connect()
conn.execute("INSTALL httpfs; LOAD httpfs;")  # gs:// paths are read via httpfs
conn.execute("INSTALL iceberg; LOAD iceberg;")
pairs = conn.execute("""
SELECT ticker, fiscal_period, chunk_idx, chunk_text
FROM iceberg_scan('gs://vectorfinancials-data/warehouse/embeddings/transcripts/')
WHERE knowledge_ts >= '2023-01-01'
ORDER BY ticker, fiscal_period, chunk_idx
""").fetchdf()
print(f"Fine-tuning corpus: {len(pairs):,} document chunks")

Code Example
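One way to turn the pulled chunks into contrastive training pairs is to treat adjacent chunks of the same transcript as weak positives, with in-batch examples serving as negatives. The pairing heuristic below is an assumption for illustration, not part of the VectorFin API:

```python
def adjacent_chunk_pairs(rows):
    """Build (anchor, positive) training pairs from ordered transcript chunks.

    Adjacent chunks of the same document discuss the same topic far more
    often than random chunks, so they make cheap weak positives.
    `rows` are (ticker, fiscal_period, chunk_idx, chunk_text) tuples,
    already sorted by ticker, period, and chunk index.
    """
    pairs = []
    for prev, cur in zip(rows, rows[1:]):
        same_doc = prev[0] == cur[0] and prev[1] == cur[1]
        if same_doc and cur[2] == prev[2] + 1:
            pairs.append((prev[3], cur[3]))
    return pairs
```

Hand-labeled or query-click pairs give stronger supervision when available; adjacency is simply the cheapest signal already present in the corpus.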
# Example: fine-tune a sentence transformer on financial text pairs
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader
# Load base model
model = SentenceTransformer("all-MiniLM-L6-v2")
# Training pairs: (anchor, positive, negative)
# These would come from VectorFin Iceberg tables or your own labels
train_examples = [
InputExample(texts=[
"Revenue growth exceeded expectations driven by strong cloud demand.",
"Cloud segment outperformed guidance with accelerating top-line momentum.",
"Supply chain disruptions reduced gross margins by 200 basis points.",
]),
InputExample(texts=[
"We expect headwinds from FX to moderate in the second half.",
"Currency headwinds are expected to become neutral by Q3.",
"The company increased its quarterly dividend by 15 percent.",
]),
# ... thousands more pairs from the VectorFin corpus
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.MultipleNegativesRankingLoss(model)
model.fit(
train_objectives=[(train_dataloader, train_loss)],
epochs=3,
warmup_steps=100,
output_path="./financial-embedding-model",
)
print("Fine-tuned model saved.")
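After training, a held-out triplet check is a quick way to confirm the model moved in the right direction. The helper below is illustrative and not part of sentence-transformers; in practice the three arrays would come from `model.encode(...)` on held-out anchors, positives, and negatives:

```python
import numpy as np

def triplet_accuracy(anchors, positives, negatives):
    """Fraction of triplets where the anchor is closer (by cosine
    similarity) to its positive than to its negative. Each argument is an
    (n, d) array of embeddings; toy vectors stand in for model output here.
    """
    def norm(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)
    a, p, n = norm(anchors), norm(positives), norm(negatives)
    sim_pos = (a * p).sum(axis=1)    # row-wise cosine similarity
    sim_neg = (a * n).sum(axis=1)
    return float((sim_pos > sim_neg).mean())
```

A well fine-tuned financial embedding model should push this accuracy well above the base model's score on the same held-out triplets.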