End-to-End Recipes
The following recipes chain together the decisions from the other guide pages into complete, copy-pasteable pipelines. Each recipe shows both the CLI (YAML) and the programmatic (Python) approach.
Tip
Not sure which recipe to use? Start with the What Do You Want to Do? guide to identify your workflow, then return here for the full pipeline.
Recipe 1: DPR Dense Retrieval Pipeline
Goal: Fine-tune a simple dense bi-encoder, index a collection, search, then re-rank with a cross-encoder.
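At inference time, a DPR-style bi-encoder reduces each query and each document to a single vector, so retrieval is nearest-neighbour search under the dot product. A minimal sketch of the scoring model (plain PyTorch for illustration, not the Lightning IR API):

import torch

query_embedding = torch.randn(768)         # one vector per query (e.g. the CLS embedding)
doc_embeddings = torch.randn(1000, 768)    # one vector per document, pre-computed offline
scores = doc_embeddings @ query_embedding  # relevance = dot product
top_k = scores.topk(100).indices           # retrieval = take the top-k scores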
Step 1 — Fine-tune the DPR model:
lightning-ir fit --config recipe-dpr-fit.yaml
recipe-dpr-fit.yaml
trainer:
max_steps: 100_000
precision: bf16-mixed
accumulate_grad_batches: 4
gradient_clip_val: 1
model:
class_path: lightning_ir.BiEncoderModule
init_args:
model_name_or_path: bert-base-uncased
config:
class_path: lightning_ir.models.DprConfig
loss_functions:
- lightning_ir.RankNet
data:
class_path: lightning_ir.LightningIRDataModule
init_args:
train_dataset:
class_path: lightning_ir.TupleDataset
init_args:
tuples_dataset: msmarco-passage/train/triples-small
train_batch_size: 32
optimizer:
class_path: torch.optim.AdamW
init_args:
lr: 1e-5
recipe_dpr_fit.py
from torch.optim import AdamW
from lightning_ir import (
BiEncoderModule, LightningIRDataModule,
LightningIRTrainer, RankNet, TupleDataset,
)
from lightning_ir.models import DprConfig
module = BiEncoderModule(
model_name_or_path="bert-base-uncased",
config=DprConfig(),
loss_functions=[RankNet()],
)
module.set_optimizer(AdamW, lr=1e-5)
data_module = LightningIRDataModule(
train_dataset=TupleDataset("msmarco-passage/train/triples-small"),
train_batch_size=32,
)
trainer = LightningIRTrainer(
max_steps=100_000,
precision="bf16-mixed",
accumulate_grad_batches=4,
gradient_clip_val=1,
)
trainer.fit(module, data_module)
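Step 2 below loads the fine-tuned weights from disk (./my-dpr-checkpoint). When training programmatically, one way to export them in the format model_name_or_path expects is the Hugging Face save_pretrained interface; this is a sketch and assumes the module exposes its underlying model and tokenizer as module.model and module.tokenizer:

# export weights + tokenizer so the later steps can load them by path
# (save_pretrained is assumed from the Hugging Face model interface)
module.model.save_pretrained("./my-dpr-checkpoint")
module.tokenizer.save_pretrained("./my-dpr-checkpoint")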
Step 2 — Index the collection:
lightning-ir index --config recipe-dpr-index.yaml
recipe-dpr-index.yaml
trainer:
  logger: false
  enable_checkpointing: false
callbacks:
- class_path: lightning_ir.IndexCallback
init_args:
index_dir: ./msmarco-passage-index
index_config:
class_path: lightning_ir.FaissFlatIndexConfig
model:
class_path: lightning_ir.BiEncoderModule
init_args:
model_name_or_path: ./my-dpr-checkpoint # or a Model Zoo model
data:
class_path: lightning_ir.LightningIRDataModule
init_args:
inference_datasets:
- class_path: lightning_ir.DocDataset
init_args:
doc_dataset: msmarco-passage
inference_batch_size: 256
recipe_dpr_index.py
from lightning_ir import (
BiEncoderModule, DocDataset, IndexCallback,
LightningIRDataModule, LightningIRTrainer,
FaissFlatIndexConfig,
)
module = BiEncoderModule(model_name_or_path="./my-dpr-checkpoint")
data_module = LightningIRDataModule(
inference_datasets=[DocDataset("msmarco-passage")],
inference_batch_size=256,
)
callback = IndexCallback(
index_dir="./msmarco-passage-index",
index_config=FaissFlatIndexConfig(),
)
trainer = LightningIRTrainer(
callbacks=[callback], logger=False, enable_checkpointing=False
)
trainer.index(module, data_module)
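Before running a full search, it can be worth sanity-checking the checkpoint on a toy example. This sketch assumes the score convenience helper from the Lightning IR quickstart, which encodes a query and a list of documents and returns their relevance scores:

output = module.score(
    "what is dense passage retrieval?",
    [
        "DPR encodes queries and passages into dense vectors.",
        "The weather will be sunny tomorrow.",
    ],
)
print(output.scores)  # the on-topic passage should score higher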
Step 3 — Search:
lightning-ir search --config recipe-dpr-search.yaml
recipe-dpr-search.yaml
trainer:
  logger: false
  enable_checkpointing: false
callbacks:
- class_path: lightning_ir.SearchCallback
init_args:
index_dir: ./msmarco-passage-index
search_config:
class_path: lightning_ir.FaissSearchConfig
init_args:
k: 100
save_dir: ./runs
model:
class_path: lightning_ir.BiEncoderModule
init_args:
model_name_or_path: ./my-dpr-checkpoint
evaluation_metrics:
- nDCG@10
data:
class_path: lightning_ir.LightningIRDataModule
init_args:
inference_datasets:
- class_path: lightning_ir.QueryDataset
init_args:
query_dataset: msmarco-passage/trec-dl-2019/judged
inference_batch_size: 4
recipe_dpr_search.py
from lightning_ir import (
BiEncoderModule, QueryDataset, SearchCallback,
LightningIRDataModule, LightningIRTrainer,
FaissSearchConfig,
)
module = BiEncoderModule(
model_name_or_path="./my-dpr-checkpoint",
evaluation_metrics=["nDCG@10"],
)
data_module = LightningIRDataModule(
inference_datasets=[
QueryDataset("msmarco-passage/trec-dl-2019/judged"),
],
inference_batch_size=4,
)
callback = SearchCallback(
index_dir="./msmarco-passage-index",
search_config=FaissSearchConfig(k=100),
save_dir="./runs",
)
trainer = LightningIRTrainer(
callbacks=[callback], logger=False, enable_checkpointing=False
)
trainer.search(module, data_module)
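The SearchCallback writes one TREC-style run file per dataset into save_dir; Step 4 below reads it back through RunDataset. To inspect it directly (a sketch; TREC run files are whitespace-separated with the columns query id, Q0, doc id, rank, score, and run tag):

import pandas as pd

run = pd.read_csv(
    "./runs/msmarco-passage-trec-dl-2019-judged.run",
    sep=r"\s+",
    names=["query_id", "q0", "doc_id", "rank", "score", "tag"],
)
print(run.groupby("query_id").head(3))  # top-3 retrieved documents per query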
Step 4 — Re-rank with a cross-encoder:
lightning-ir re_rank --config recipe-dpr-rerank.yaml
recipe-dpr-rerank.yaml
trainer:
  logger: false
  enable_checkpointing: false
callbacks:
- class_path: lightning_ir.ReRankCallback
init_args:
save_dir: ./re-ranked-runs
model:
class_path: lightning_ir.CrossEncoderModule
init_args:
model_name_or_path: webis/monoelectra-base
evaluation_metrics:
- nDCG@10
data:
class_path: lightning_ir.LightningIRDataModule
init_args:
inference_datasets:
- class_path: lightning_ir.RunDataset
init_args:
run_path_or_id: ./runs/msmarco-passage-trec-dl-2019-judged.run
inference_batch_size: 4
recipe_dpr_rerank.py
from lightning_ir import (
CrossEncoderModule, RunDataset, ReRankCallback,
LightningIRDataModule, LightningIRTrainer,
)
module = CrossEncoderModule(
model_name_or_path="webis/monoelectra-base",
evaluation_metrics=["nDCG@10"],
)
data_module = LightningIRDataModule(
inference_datasets=[
RunDataset("./runs/msmarco-passage-trec-dl-2019-judged.run"),
],
inference_batch_size=4,
)
callback = ReRankCallback(save_dir="./re-ranked-runs")
trainer = LightningIRTrainer(
callbacks=[callback], logger=False, enable_checkpointing=False
)
trainer.re_rank(module, data_module)
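The module already reports nDCG@10 while re-ranking. To re-evaluate a saved run offline, one option is the third-party ir-measures and ir-datasets packages; this sketch assumes the ReRankCallback names its output file after the dataset, as the search step does:

import ir_datasets
import ir_measures
from ir_measures import nDCG

qrels = ir_datasets.load("msmarco-passage/trec-dl-2019/judged").qrels_iter()
run = ir_measures.read_trec_run(
    "./re-ranked-runs/msmarco-passage-trec-dl-2019-judged.run"
)
print(ir_measures.calc_aggregate([nDCG @ 10], qrels, run))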
Recipe 2: SPLADE Sparse Retrieval Pipeline
Goal: Train a SPLADE model with proper regularization, build a sparse index, and retrieve.
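Step 1 pairs a contrastive loss with FLOPS regularization, which penalizes dense activation patterns so the learned lexical vectors stay sparse. A minimal sketch of the penalty, assuming the standard formulation from the SPLADE papers (squared mean activation per vocabulary term, summed over the vocabulary):

import torch

def flops_penalty(embeddings: torch.Tensor) -> torch.Tensor:
    # embeddings: (batch_size, vocab_size) sparse lexical representations;
    # squaring the per-term batch mean drives rarely useful terms to zero
    return (embeddings.abs().mean(dim=0) ** 2).sum()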
Step 1 — Fine-tune SPLADE with FLOPS regularization:
lightning-ir fit --config recipe-splade-fit.yaml
recipe-splade-fit.yaml
trainer:
max_steps: 100_000
precision: bf16-mixed
callbacks:
- class_path: lightning_ir.GenericConstantSchedulerWithLinearWarmup
init_args:
keys:
- loss_functions.1.query_weight
- loss_functions.1.doc_weight
num_warmup_steps: 20_000
num_delay_steps: 50_000
model:
class_path: lightning_ir.BiEncoderModule
init_args:
model_name_or_path: bert-base-uncased
config:
class_path: lightning_ir.models.SpladeConfig
loss_functions:
- lightning_ir.InBatchCrossEntropy
- class_path: lightning_ir.FLOPSRegularization
init_args:
query_weight: 0.0008
doc_weight: 0.0006
data:
class_path: lightning_ir.LightningIRDataModule
init_args:
train_dataset:
class_path: lightning_ir.TupleDataset
init_args:
tuples_dataset: msmarco-passage/train/triples-small
train_batch_size: 32
optimizer:
class_path: torch.optim.AdamW
init_args:
lr: 1e-5
recipe_splade_fit.py
from torch.optim import AdamW
from lightning_ir import (
BiEncoderModule, LightningIRDataModule, LightningIRTrainer,
TupleDataset, InBatchCrossEntropy, FLOPSRegularization,
GenericConstantSchedulerWithLinearWarmup,
)
from lightning_ir.models import SpladeConfig
module = BiEncoderModule(
model_name_or_path="bert-base-uncased",
config=SpladeConfig(),
loss_functions=[
InBatchCrossEntropy(),
FLOPSRegularization(query_weight=0.0008, doc_weight=0.0006),
],
)
module.set_optimizer(AdamW, lr=1e-5)
data_module = LightningIRDataModule(
train_dataset=TupleDataset("msmarco-passage/train/triples-small"),
train_batch_size=32,
)
scheduler = GenericConstantSchedulerWithLinearWarmup(
keys=[
"loss_functions.1.query_weight",
"loss_functions.1.doc_weight",
],
num_warmup_steps=20_000,
num_delay_steps=50_000,
)
trainer = LightningIRTrainer(
max_steps=100_000,
precision="bf16-mixed",
callbacks=[scheduler],
)
trainer.fit(module, data_module)
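The scheduler keys address attributes by path: loss_functions.1.query_weight is the query_weight of the second entry in loss_functions, i.e. the FLOPS regularization. A rough sketch of the intended schedule, assuming the callback holds each weight at zero for num_delay_steps and then ramps it linearly to its configured value over num_warmup_steps:

def scheduled_weight(step: int, target: float,
                     num_warmup_steps: int = 20_000,
                     num_delay_steps: int = 50_000) -> float:
    # zero during the delay phase, linear ramp during warmup, constant afterwards
    if step < num_delay_steps:
        return 0.0
    progress = min(1.0, (step - num_delay_steps) / num_warmup_steps)
    return target * progress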
Step 2 — Build a sparse index:
lightning-ir index --config recipe-splade-index.yaml
recipe-splade-index.yaml
trainer:
  logger: false
  enable_checkpointing: false
callbacks:
- class_path: lightning_ir.IndexCallback
init_args:
index_dir: ./splade-index
index_config:
class_path: lightning_ir.TorchSparseIndexConfig
model:
class_path: lightning_ir.BiEncoderModule
init_args:
model_name_or_path: ./my-splade-checkpoint
data:
class_path: lightning_ir.LightningIRDataModule
init_args:
inference_datasets:
- class_path: lightning_ir.DocDataset
init_args:
doc_dataset: msmarco-passage
inference_batch_size: 256
recipe_splade_index.py
from lightning_ir import (
BiEncoderModule, DocDataset, IndexCallback,
LightningIRDataModule, LightningIRTrainer,
TorchSparseIndexConfig,
)
module = BiEncoderModule(model_name_or_path="./my-splade-checkpoint")
data_module = LightningIRDataModule(
inference_datasets=[DocDataset("msmarco-passage")],
inference_batch_size=256,
)
callback = IndexCallback(
index_dir="./splade-index",
index_config=TorchSparseIndexConfig(),
)
trainer = LightningIRTrainer(
callbacks=[callback], logger=False, enable_checkpointing=False
)
trainer.index(module, data_module)
Step 3 — Sparse search:
lightning-ir search --config recipe-splade-search.yaml
recipe-splade-search.yaml
trainer:
  logger: false
  enable_checkpointing: false
callbacks:
- class_path: lightning_ir.SearchCallback
init_args:
index_dir: ./splade-index
search_config:
class_path: lightning_ir.TorchSparseSearchConfig
init_args:
k: 100
save_dir: ./runs
model:
class_path: lightning_ir.BiEncoderModule
init_args:
model_name_or_path: ./my-splade-checkpoint
evaluation_metrics:
- nDCG@10
data:
class_path: lightning_ir.LightningIRDataModule
init_args:
inference_datasets:
- class_path: lightning_ir.QueryDataset
init_args:
query_dataset: msmarco-passage/trec-dl-2019/judged
inference_batch_size: 4
recipe_splade_search.py
from lightning_ir import (
BiEncoderModule, QueryDataset, SearchCallback,
LightningIRDataModule, LightningIRTrainer,
TorchSparseSearchConfig,
)
module = BiEncoderModule(
model_name_or_path="./my-splade-checkpoint",
evaluation_metrics=["nDCG@10"],
)
data_module = LightningIRDataModule(
inference_datasets=[
QueryDataset("msmarco-passage/trec-dl-2019/judged"),
],
inference_batch_size=4,
)
callback = SearchCallback(
index_dir="./splade-index",
search_config=TorchSparseSearchConfig(k=100),
save_dir="./runs",
)
trainer = LightningIRTrainer(
callbacks=[callback], logger=False, enable_checkpointing=False
)
trainer.search(module, data_module)
Recipe 3: ColBERT Multi-Vector Pipeline
Goal: Fine-tune a ColBERT model, build a PLAID index for fast retrieval, and search.
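Unlike DPR's single dot product, ColBERT keeps one embedding per token and scores with late interaction: each query token is matched against its best document token, and the maxima are summed. A minimal sketch (plain PyTorch for illustration, not the Lightning IR API) using the dimensions from the config below:

import torch

query_embeddings = torch.randn(32, 128)  # query_length x embedding_dim
doc_embeddings = torch.randn(256, 128)   # doc_length x embedding_dim
token_sims = query_embeddings @ doc_embeddings.T  # all token-pair dot products
score = token_sims.max(dim=1).values.sum()        # MaxSim: best doc token per query token, summed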
Step 1 — Fine-tune ColBERT:
lightning-ir fit --config recipe-colbert-fit.yaml
recipe-colbert-fit.yaml
trainer:
max_steps: 100_000
precision: bf16-mixed
accumulate_grad_batches: 4
gradient_clip_val: 1
model:
class_path: lightning_ir.BiEncoderModule
init_args:
model_name_or_path: bert-base-uncased
config:
class_path: lightning_ir.models.ColConfig
init_args:
similarity_function: dot
query_aggregation_function: sum
query_expansion: true
query_length: 32
doc_length: 256
normalization_strategy: l2
embedding_dim: 128
projection: linear_no_bias
add_marker_tokens: true
loss_functions:
- lightning_ir.RankNet
data:
class_path: lightning_ir.LightningIRDataModule
init_args:
train_dataset:
class_path: lightning_ir.TupleDataset
init_args:
tuples_dataset: msmarco-passage/train/triples-small
train_batch_size: 32
optimizer:
class_path: torch.optim.AdamW
init_args:
lr: 1e-5
recipe_colbert_fit.py
from torch.optim import AdamW
from lightning_ir import (
BiEncoderModule, LightningIRDataModule,
LightningIRTrainer, RankNet, TupleDataset,
)
from lightning_ir.models import ColConfig
module = BiEncoderModule(
model_name_or_path="bert-base-uncased",
config=ColConfig(
similarity_function="dot",
query_aggregation_function="sum",
query_expansion=True,
query_length=32,
doc_length=256,
normalization_strategy="l2",
embedding_dim=128,
projection="linear_no_bias",
add_marker_tokens=True,
),
loss_functions=[RankNet()],
)
module.set_optimizer(AdamW, lr=1e-5)
data_module = LightningIRDataModule(
train_dataset=TupleDataset("msmarco-passage/train/triples-small"),
train_batch_size=32,
)
trainer = LightningIRTrainer(
max_steps=100_000,
precision="bf16-mixed",
accumulate_grad_batches=4,
gradient_clip_val=1,
)
trainer.fit(module, data_module)
Step 2 — Build a PLAID index:
lightning-ir index --config recipe-colbert-index.yaml
recipe-colbert-index.yaml
trainer:
  logger: false
  enable_checkpointing: false
callbacks:
- class_path: lightning_ir.IndexCallback
init_args:
index_dir: ./colbert-index
index_config:
class_path: lightning_ir.PlaidIndexConfig
model:
class_path: lightning_ir.BiEncoderModule
init_args:
model_name_or_path: ./my-colbert-checkpoint
data:
class_path: lightning_ir.LightningIRDataModule
init_args:
inference_datasets:
- class_path: lightning_ir.DocDataset
init_args:
doc_dataset: msmarco-passage
inference_batch_size: 256
recipe_colbert_index.py
from lightning_ir import (
BiEncoderModule, DocDataset, IndexCallback,
LightningIRDataModule, LightningIRTrainer,
PlaidIndexConfig,
)
module = BiEncoderModule(model_name_or_path="./my-colbert-checkpoint")
data_module = LightningIRDataModule(
inference_datasets=[DocDataset("msmarco-passage")],
inference_batch_size=256,
)
callback = IndexCallback(
index_dir="./colbert-index",
index_config=PlaidIndexConfig(),
)
trainer = LightningIRTrainer(
callbacks=[callback], logger=False, enable_checkpointing=False
)
trainer.index(module, data_module)
Step 3 — Search with PLAID:
lightning-ir search --config recipe-colbert-search.yaml
recipe-colbert-search.yaml
trainer:
  logger: false
  enable_checkpointing: false
callbacks:
- class_path: lightning_ir.SearchCallback
init_args:
index_dir: ./colbert-index
search_config:
class_path: lightning_ir.PlaidSearchConfig
init_args:
k: 100
save_dir: ./runs
model:
class_path: lightning_ir.BiEncoderModule
init_args:
model_name_or_path: ./my-colbert-checkpoint
evaluation_metrics:
- nDCG@10
data:
class_path: lightning_ir.LightningIRDataModule
init_args:
inference_datasets:
- class_path: lightning_ir.QueryDataset
init_args:
query_dataset: msmarco-passage/trec-dl-2019/judged
inference_batch_size: 4
recipe_colbert_search.py
from lightning_ir import (
BiEncoderModule, QueryDataset, SearchCallback,
LightningIRDataModule, LightningIRTrainer,
PlaidSearchConfig,
)
module = BiEncoderModule(
model_name_or_path="./my-colbert-checkpoint",
evaluation_metrics=["nDCG@10"],
)
data_module = LightningIRDataModule(
inference_datasets=[
QueryDataset("msmarco-passage/trec-dl-2019/judged"),
],
inference_batch_size=4,
)
callback = SearchCallback(
index_dir="./colbert-index",
search_config=PlaidSearchConfig(k=100),
save_dir="./runs",
)
trainer = LightningIRTrainer(
callbacks=[callback], logger=False, enable_checkpointing=False
)
trainer.search(module, data_module)
Quick Reference: Compatibility
Use this table as a cheat sheet when composing configurations.
| Model Config | Module | Compatible Index | Compatible Search | Supported Workflows |
|---|---|---|---|---|
| DprConfig | BiEncoderModule | FaissFlatIndexConfig | FaissSearchConfig | fit, index, search, re_rank |
| SpladeConfig | BiEncoderModule | TorchSparseIndexConfig | TorchSparseSearchConfig | fit, index, search, re_rank |
| ColConfig | BiEncoderModule | PlaidIndexConfig | PlaidSearchConfig | fit, index, search, re_rank |
| MonoConfig | CrossEncoderModule | — | — | fit, re_rank |
| SetEncoderConfig | CrossEncoderModule | — | — | fit, re_rank |