LightningIRModule
- class lightning_ir.base.module.LightningIRModule(model_name_or_path: str | None = None, config: LightningIRConfig | None = None, model: LightningIRModel | None = None, BackboneModel: Type[PreTrainedModel] | None = None, loss_functions: Sequence[LossFunction | Tuple[LossFunction, float]] | None = None, evaluation_metrics: Sequence[str] | None = None, model_kwargs: Mapping[str, Any] | None = None)[source]
Bases: LightningModule
LightningIRModule base class. It derives from a LightningModule. LightningIRModules contain a LightningIRModel and a LightningIRTokenizer and implement the training, validation, and testing steps for the model. Derived classes must implement the forward method for the model.
- __init__(model_name_or_path: str | None = None, config: LightningIRConfig | None = None, model: LightningIRModel | None = None, BackboneModel: Type[PreTrainedModel] | None = None, loss_functions: Sequence[LossFunction | Tuple[LossFunction, float]] | None = None, evaluation_metrics: Sequence[str] | None = None, model_kwargs: Mapping[str, Any] | None = None)[source]
Initializes the LightningIRModule.
- Parameters:
model_name_or_path (str | None) – Name or path of backbone model or fine-tuned Lightning IR model. Defaults to None.
config (LightningIRConfig | None) – LightningIRConfig to apply when loading from backbone model. Defaults to None.
model (LightningIRModel | None) – Already instantiated Lightning IR model. Defaults to None.
BackboneModel (Type[PreTrainedModel] | None) – Huggingface PreTrainedModel class to use as backbone instead of the default AutoModel. Defaults to None.
loss_functions (Sequence[LossFunction | Tuple[LossFunction, float]] | None) – Loss functions to apply during fine-tuning. An optional loss weight can be provided per loss function. Defaults to None.
evaluation_metrics (Sequence[str] | None) – Metrics corresponding to ir-measures measure strings to apply during validation or testing. Defaults to None.
model_kwargs (Mapping[str, Any] | None) – Additional keyword arguments to pass to from_pretrained when loading a model. Defaults to None.
- Raises:
ValueError – If both model and model_name_or_path are provided.
ValueError – If neither model nor model_name_or_path are provided.
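For example, a module can be constructed programmatically from a backbone checkpoint. A minimal sketch, assuming the BiEncoderModule subclass shipped with Lightning IR and standard ir-measures measure strings:
>>> from lightning_ir import BiEncoderModule
>>> # Load a Hugging Face backbone and register metrics to compute
>>> # during validation and testing.
>>> module = BiEncoderModule(
...     model_name_or_path="bert-base-uncased",
...     evaluation_metrics=["nDCG@10", "MRR@10"],
... )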
Methods
__init__([model_name_or_path, config, ...]) – Initializes the LightningIRModule.
configure_optimizers() – Configures the optimizer for fine-tuning.
forward(batch) – Handles the forward pass of the model.
get_dataset(dataloader_idx) – Gets the dataset instance from the dataloader index.
get_dataset_id(dataset) – Gets the dataset id for logging.
on_save_checkpoint(checkpoint) – Saves the model and tokenizer to the trainer's log directory.
on_test_end() – Prints the accumulated metrics for each dataloader.
on_test_start() – Called at the beginning of testing.
on_train_start() – Called at the beginning of training after sanity check.
on_validation_end() – Prints the validation results for each dataloader.
on_validation_start() – Called at the beginning of validation.
prepare_input(queries, docs, num_docs) – Tokenizes queries and documents and returns the tokenized BatchEncoding.
save_pretrained(save_path) – Saves the model and tokenizer to the save path.
score(queries, docs) – Computes relevance scores for queries and documents.
set_optimizer(optimizer, **optimizer_kwargs) – Sets the optimizer for the model.
test_step(batch, batch_idx[, dataloader_idx]) – Handles the testing step for the model.
training_step(batch, batch_idx) – Handles the training step for the model.
validate(output, batch) – Validates the model output with the evaluation metrics and loss functions.
validate_loss(output, batch) – Validates the model output with the loss functions.
validate_metrics(output, batch) – Validates the model output with the evaluation metrics.
validation_step(batch, batch_idx[, ...]) – Handles the validation step for the model.
Attributes
training
- configure_optimizers() Optimizer[source]
Configures the optimizer for fine-tuning. This method is ignored when using the CLI. When using Lightning IR programmatically, the optimizer must be set using set_optimizer().
- Returns:
The optimizer set for the model.
- Return type:
torch.optim.Optimizer
- Raises:
ValueError – If the optimizer has not been set; call set_optimizer() first.
- forward(batch: TrainBatch | RankBatch | SearchBatch) LightningIROutput[source]
Handles the forward pass of the model.
- Parameters:
batch (TrainBatch | RankBatch | SearchBatch) – Batch of training or ranking data.
- Returns:
Model output.
- Return type:
LightningIROutput
- Raises:
NotImplementedError – Must be implemented by derived class.
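A hedged sketch of what a derived class provides; the forward body and the batch attribute names are illustrative only, the contract being to consume a batch and return a LightningIROutput:
>>> from lightning_ir import LightningIRModule, LightningIROutput
>>> class MyModule(LightningIRModule):
...     def forward(self, batch):
...         # Tokenize queries and documents; the model call depends on the
...         # concrete LightningIRModel wrapped by the module.
...         encodings = self.prepare_input(batch.queries, batch.docs, None)
...         scores = ...  # compute one relevance score per query-document pair
...         return LightningIROutput(scores=scores)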
- get_dataset(dataloader_idx: int) IRDataset | None[source]
Gets the dataset instance from the dataloader index. Returns None if no dataset is found.
- Parameters:
dataloader_idx (int) – Index of the dataloader.
- Returns:
Inference dataset or None if no dataset is found.
- Return type:
IRDataset | None
- get_dataset_id(dataset: IRDataset) str[source]
Gets the dataset id from the dataset instance for logging.
- Parameters:
dataset (IRDataset) – Dataset instance.
- Returns:
Path to run file, ir-datasets dataset id, or dataloader index.
- Return type:
str
- on_save_checkpoint(checkpoint: Dict[str, Any]) None[source]
Saves the model and tokenizer to the trainer’s log directory.
- on_test_end() None[source]
Prints the accumulated metrics for each dataloader.
- on_test_start() None[source]
Called at the beginning of testing.
- on_train_start() None[source]
Called at the beginning of training after sanity check.
- on_validation_end() None[source]
Prints the validation results for each dataloader.
- on_validation_start() None[source]
Called at the beginning of validation.
- prepare_input(queries: Sequence[str] | None, docs: Sequence[str] | None, num_docs: Sequence[int] | int | None) Dict[str, BatchEncoding][source]
Tokenizes queries and documents and returns the tokenized BatchEncoding.
- Parameters:
queries (Sequence[str] | None) – Queries to tokenize.
docs (Sequence[str] | None) – Documents to tokenize.
num_docs (Sequence[int] | int | None) – Number of documents per query. If None, num_docs is inferred as len(docs) // len(queries). Defaults to None.
- Returns:
Tokenized queries and documents, format depends on the tokenizer.
- Return type:
Dict[str, BatchEncoding]
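Usage sketch, assuming an instantiated module; one query with three documents, where num_docs may also be left as None to be inferred:
>>> encodings = module.prepare_input(
...     queries=["what is information retrieval"],
...     docs=["IR finds relevant documents.", "Another passage.", "A third one."],
...     num_docs=[3],
... )
>>> # The returned dict's keys and encodings depend on the tokenizer.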
- save_pretrained(save_path: str | Path) None[source]
Saves the model and tokenizer to the save path.
- Parameters:
save_path (str | Path) – Path to save the model and tokenizer.
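For example:
>>> module.save_pretrained("models/my-finetuned-model")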
- score(queries: Sequence[str] | str, docs: Sequence[Sequence[str]] | Sequence[str]) LightningIROutput[source]
Computes relevance scores for queries and documents.
- Parameters:
queries (Sequence[str] | str) – Queries to score.
docs (Sequence[Sequence[str]] | Sequence[str]) – Documents to score.
- Returns:
Model output containing the scores.
- Return type:
LightningIROutput
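Usage sketch, scoring one query against two documents; reading output.scores assumes the scores field of LightningIROutput:
>>> output = module.score(
...     "what is information retrieval",
...     ["IR finds relevant documents.", "Bananas are yellow."],
... )
>>> output.scores  # one relevance score per document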
- set_optimizer(optimizer: Type[Optimizer], **optimizer_kwargs: Dict[str, Any]) LightningIRModule[source]
Sets the optimizer for the model. Necessary for fine-tuning when not using the CLI.
- Parameters:
optimizer (Type[torch.optim.Optimizer]) – Torch optimizer class.
optimizer_kwargs (Dict[str, Any]) – Arguments to initialize the optimizer.
- Returns:
Self with the optimizer set.
- Return type:
LightningIRModule
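For example, since set_optimizer returns the module itself, it can be chained:
>>> from torch.optim import AdamW
>>> module = module.set_optimizer(AdamW, lr=1e-5)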
- test_step(batch: TrainBatch | RankBatch, batch_idx: int, dataloader_idx: int = 0) LightningIROutput[source]
Handles the testing step for the model. Passes the batch to the validation step.
- Parameters:
batch (TrainBatch | RankBatch) – Batch of testing data.
batch_idx (int) – Index of the batch.
dataloader_idx (int) – Index of the dataloader. Defaults to 0.
- Returns:
Model output.
- Return type:
LightningIROutput
- training_step(batch: TrainBatch, batch_idx: int) Dict[str, Any][source]
Handles the training step for the model.
- Parameters:
batch (TrainBatch) – Batch of training data.
batch_idx (int) – Index of the batch.
- Returns:
Sum of the losses weighted by the loss weights.
- Return type:
torch.Tensor
- Raises:
ValueError – If no loss functions are set.
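Loss functions are supplied at construction time; a hedged sketch, assuming RankNet is one of the LossFunction implementations in lightning_ir.loss and using the Tuple[LossFunction, float] form to weight it:
>>> from lightning_ir.loss import RankNet  # assumed module path
>>> module = BiEncoderModule(
...     model_name_or_path="bert-base-uncased",
...     loss_functions=[(RankNet(), 1.0)],
... )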
- validate(output: LightningIROutput, batch: TrainBatch | RankBatch | SearchBatch) Dict[str, float][source]
Validates the model output with the evaluation metrics and loss functions.
- Parameters:
output (LightningIROutput) – Model output.
batch (TrainBatch | RankBatch | SearchBatch) – Batch of validation or testing data.
- Returns:
Dictionary of evaluation metrics.
- Return type:
Dict[str, float]
- validate_loss(output: LightningIROutput, batch: TrainBatch | RankBatch | SearchBatch) Dict[str, float][source]
Validates the model output with the loss functions.
- Parameters:
output (LightningIROutput) – Model output.
batch (TrainBatch | RankBatch | SearchBatch) – Batch of validation or testing data.
- Returns:
Dictionary of evaluation metrics.
- Return type:
Dict[str, float]
- validate_metrics(output: LightningIROutput, batch: TrainBatch | RankBatch | SearchBatch) Dict[str, float][source]
Validates the model output with the evaluation metrics.
- Parameters:
output (LightningIROutput) – Model output.
batch (TrainBatch | RankBatch | SearchBatch) – Batch of validation or testing data.
- Returns:
Dictionary of evaluation metrics.
- Return type:
Dict[str, float]
- Raises:
ValueError – If query_ids or doc_ids are not set in the batch.
- validation_step(batch: TrainBatch | RankBatch | SearchBatch, batch_idx: int, dataloader_idx: int = 0) LightningIROutput[source]
Handles the validation step for the model.
- Parameters:
batch (TrainBatch | RankBatch | SearchBatch) – Batch of validation or testing data.
batch_idx (int) – Index of the batch.
dataloader_idx (int) – Index of the dataloader. Defaults to 0.
- Returns:
Model output.
- Return type:
LightningIROutput
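In practice, the validation and test steps are driven by a trainer. A hedged sketch, assuming Lightning IR's LightningIRTrainer and an already configured datamodule:
>>> from lightning_ir import LightningIRTrainer
>>> trainer = LightningIRTrainer(devices=1)
>>> # datamodule is assumed to be a configured LightningIRDataModule
>>> trainer.validate(module, datamodule=datamodule)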