Which Dataset Format to Use?
Lightning IR provides four dataset classes: TupleDataset, RunDataset, DocDataset, and QueryDataset. Which one to use depends on your workflow and the shape of your data.
What are you trying to do?
│
├── Fine-tune a model (fit)
│   │
│   ├── Have query + positive + negative triples?
│   │   └── TupleDataset
│   │       (uses an ir_datasets ID, e.g. "msmarco-passage/train/triples-small")
│   │
│   └── Have a run file with ranked docs and teacher scores?
│       └── RunDataset (targets: score, sampling_strategy: random)
│
├── Index documents (index)
│   └── DocDataset
│       (uses an ir_datasets ID, e.g. "msmarco-passage")
│
├── Search / retrieve (search)
│   └── QueryDataset
│       (uses an ir_datasets ID, e.g. "msmarco-passage/trec-dl-2019/judged")
│
└── Re-rank (re_rank)
    └── RunDataset
        (path to a TREC-format run file or an ir_datasets ID)
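The decision tree above can also be read as a simple lookup. The sketch below is purely illustrative: `choose_dataset` and its `has_triples` argument are hypothetical names, not part of Lightning IR, and the function only returns the name of the class the tree recommends.

```python
def choose_dataset(workflow: str, has_triples: bool = True) -> str:
    """Return the name of the dataset class recommended by the decision tree.

    `workflow` is one of "fit", "index", "search", or "re_rank".
    `has_triples` only matters for "fit": True means query/positive/negative
    triples are available, False means a run file with teacher scores is used.
    """
    if workflow == "fit":
        return "TupleDataset" if has_triples else "RunDataset"
    recommendations = {
        "index": "DocDataset",
        "search": "QueryDataset",
        "re_rank": "RunDataset",
    }
    if workflow not in recommendations:
        raise ValueError(f"unknown workflow: {workflow!r}")
    return recommendations[workflow]


# Hypothetical usage: distillation from a run file vs. plain indexing.
fit_choice = choose_dataset("fit", has_triples=False)
index_choice = choose_dataset("index")
```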
Dataset Class Reference
| Dataset | Workflow | Description |
|---|---|---|
| TupleDataset | fit | Iterates over (query, doc₁, doc₂, …) tuples with relevance targets. Backed by ir_datasets. |
| RunDataset | fit, re_rank | Loads a ranked list of documents per query from a TREC-format run file or an ir_datasets ID. Key parameters: depth, sample_size, sampling_strategy, targets. |
| DocDataset | index | Iterates over all documents in a collection. Backed by ir_datasets. |
| QueryDataset | search | Iterates over queries in a dataset split. Backed by ir_datasets. |
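For readers unfamiliar with the TREC run format that RunDataset accepts, each line holds six whitespace-separated fields: query ID, the literal `Q0`, document ID, rank, score, and a run tag. The following is a minimal standalone sketch of reading such a file. It is not how Lightning IR loads runs; `parse_trec_run` is a hypothetical helper, and the IDs and scores are made up.

```python
from collections import defaultdict


def parse_trec_run(text: str) -> dict[str, list[tuple[str, int, float]]]:
    """Parse TREC run lines of the form `query_id Q0 doc_id rank score run_tag`."""
    run = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        query_id, _, doc_id, rank, score, _ = line.split()
        run[query_id].append((doc_id, int(rank), float(score)))
    # Sort each query's documents by rank so position 0 is the top hit.
    for docs in run.values():
        docs.sort(key=lambda doc: doc[1])
    return dict(run)


# Made-up example run with two queries.
example = """\
q1 Q0 d3 2 11.5 bm25
q1 Q0 d7 1 12.0 bm25
q2 Q0 d1 1 9.25 bm25
"""
run = parse_trec_run(example)
```

The result maps each query ID to its documents in rank order, which is the "ranked list of documents per query" shape described in the table.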
Tip
When using RunDataset for training (knowledge distillation), set
sampling_strategy: random so the model sees diverse negatives, and
targets: score to use the teacher’s relevance scores.
When using RunDataset for re-ranking (inference), set
sampling_strategy: top and increase depth / sample_size
to cover the full candidate list.
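To make the difference between the two sampling strategies concrete, here is a small standalone sketch. It does not reproduce Lightning IR's implementation: `sample_docs` is a hypothetical function whose parameter names mirror those in the tip, and treating `-1` as "no limit" is an assumption made for this illustration.

```python
import random


def sample_docs(ranked_doc_ids, depth=-1, sample_size=-1,
                sampling_strategy="top", seed=None):
    """Pick documents from one query's ranked list (illustration only)."""
    # Truncate to the top `depth` documents; -1 keeps the whole list (assumed).
    candidates = list(ranked_doc_ids) if depth < 0 else list(ranked_doc_ids[:depth])
    # -1 means "use every remaining candidate" (assumed).
    k = len(candidates) if sample_size < 0 else min(sample_size, len(candidates))
    if sampling_strategy == "top":
        # Inference-style: keep the k highest-ranked documents in order.
        return candidates[:k]
    if sampling_strategy == "random":
        # Training-style: draw k distinct documents from the truncated list.
        return random.Random(seed).sample(candidates, k)
    raise ValueError(f"unsupported sampling_strategy: {sampling_strategy!r}")


ranking = [f"d{i}" for i in range(1, 11)]  # d1 is ranked first
top_docs = sample_docs(ranking, depth=5, sample_size=3)
random_docs = sample_docs(ranking, depth=5, sample_size=3,
                          sampling_strategy="random", seed=0)
```

With `top`, the same highest-ranked documents are returned every time, which suits re-ranking. With `random`, each draw can return a different subset of the top `depth` documents, which is what exposes the model to more varied negatives during training.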
Next steps:
Which Loss Function to Use? — Choose a loss function for training
End-to-End Recipes — See complete end-to-end pipelines