Which Dataset Format to Use?

Lightning IR provides four dataset classes — TupleDataset, RunDataset, DocDataset, and QueryDataset — which one to use depends on your workflow and the shape of your data.

What are you trying to do?
│
├── Fine-tune a model (fit)
│   │
│   ├── Have query + positive + negative triples?
│   │   └── TupleDataset
│   │       (uses an ir_datasets ID, e.g. "msmarco-passage/train/triples-small")
│   │
│   └── Have a run file with ranked docs and teacher scores?
│       └── RunDataset (targets: score, sampling_strategy: random)
│
├── Index documents (index)
│   └── DocDataset
│       (uses an ir_datasets ID, e.g. "msmarco-passage")
│
├── Search / retrieve (search)
│   └── QueryDataset
│       (uses an ir_datasets ID, e.g. "msmarco-passage/trec-dl-2019/judged")
│
└── Re-rank (re_rank)
    └── RunDataset
        (path to a TREC-format run file or an ir_datasets ID)

Dataset Class Reference

Dataset	Workflow	Description
`TupleDataset`	`fit`	Iterates over (query, doc₁, doc₂, …) tuples with relevance targets. Backed by ir_datasets.
`RunDataset`	`fit`, `re_rank`	Loads a ranked list of documents per query from a TREC-format run file or an ir_datasets ID. Key parameters: `depth` (max rank to load), `sample_size` (docs per query), `sampling_strategy` (`top` or `random`), `targets` (`relevance` or `score`).
`DocDataset`	`index`	Iterates over all documents in a collection. Backed by ir_datasets.
`QueryDataset`	`search`	Iterates over queries in a dataset split. Backed by ir_datasets.

Tip

When using RunDataset for training (knowledge distillation), set sampling_strategy: random so the model sees diverse negatives, and targets: score to use the teacher’s relevance scores.

When using RunDataset for re-ranking (inference), set sampling_strategy: top and increase depth / sample_size to cover the full candidate list.

Next steps:

Which Loss Function to Use? — Choose a loss function for training
End-to-End Recipes — See complete end-to-end pipelines