RunDataset
- class lightning_ir.data.dataset.RunDataset(run_path_or_id: Path | str, depth: int = -1, sample_size: int = -1, sampling_strategy: Literal['single_relevant', 'top', 'random', 'log_random', 'top_and_random'] = 'top', targets: Literal['relevance', 'subtopic_relevance', 'rank', 'score'] | None = None, normalize_targets: bool = False, add_docs_not_in_ranking: bool = False)[source]
Bases: IRDataset, Dataset
- __init__(run_path_or_id: Path | str, depth: int = -1, sample_size: int = -1, sampling_strategy: Literal['single_relevant', 'top', 'random', 'log_random', 'top_and_random'] = 'top', targets: Literal['relevance', 'subtopic_relevance', 'rank', 'score'] | None = None, normalize_targets: bool = False, add_docs_not_in_ranking: bool = False) None[source]
Dataset containing a list of queries with a ranked list of documents per query. Subsets of the ranked list can be sampled using different sampling strategies.
- Parameters:
run_path_or_id (Path | str) – Path to a run file or valid ir_datasets id.
depth (int) – Depth at which to cut off the ranking. If -1 the full ranking is kept. Defaults to -1.
sample_size (int) – The number of documents to sample per query. Defaults to -1.
sampling_strategy (Literal["single_relevant", "top", "random", "log_random", "top_and_random"]) – The sampling strategy to use when sampling documents. Defaults to “top”.
targets (Literal["relevance", "subtopic_relevance", "rank", "score"] | None) – The data type to use as targets for a model during fine-tuning. If “relevance” the relevance judgements are parsed from qrels. Defaults to None.
normalize_targets (bool) – Whether to normalize the targets between 0 and 1. Defaults to False.
add_docs_not_in_ranking (bool) – Whether to add relevant documents that appear in the qrels but not in the ranking to the sample. Defaults to False.
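The sampling and target-normalization options above can be illustrated with a small plain-Python sketch. This is illustrative only, not lightning_ir's implementation; the helper names and the exact weighting used for “log_random” are assumptions.

```python
import math
import random
from typing import List


def sample_ranking(doc_ids: List[str], sample_size: int, strategy: str, seed: int = 42) -> List[str]:
    """Illustrative sketch of the sampling strategies; not lightning_ir's code."""
    rng = random.Random(seed)
    if strategy == "top":
        # Keep the sample_size highest-ranked documents.
        return doc_ids[:sample_size]
    if strategy == "random":
        # Sample uniformly without replacement from the full ranking.
        return rng.sample(doc_ids, min(sample_size, len(doc_ids)))
    if strategy == "top_and_random":
        # Half the sample from the top of the ranking, the rest at random.
        half = sample_size // 2
        top, rest = doc_ids[:half], doc_ids[half:]
        return top + rng.sample(rest, min(sample_size - half, len(rest)))
    if strategy == "log_random":
        # Bias sampling towards higher ranks via log-scaled weights (assumed weighting).
        weights = [1.0 / math.log2(rank + 2) for rank in range(len(doc_ids))]
        return rng.choices(doc_ids, weights=weights, k=sample_size)
    raise ValueError(f"unknown strategy: {strategy}")


def min_max_normalize(targets: List[float]) -> List[float]:
    """Min-max normalize targets to [0, 1], as normalize_targets=True suggests."""
    lo, hi = min(targets), max(targets)
    if hi == lo:
        return [0.0 for _ in targets]
    return [(t - lo) / (hi - lo) for t in targets]


docs = [f"d{i}" for i in range(10)]
print(sample_ranking(docs, 3, "top"))       # ['d0', 'd1', 'd2']
print(min_max_normalize([2.0, 4.0, 6.0]))   # [0.0, 0.5, 1.0]
```

“single_relevant” (sampling one relevant document per query, as used for pointwise fine-tuning) additionally requires the qrels and is omitted from the sketch.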
Methods
- __init__(run_path_or_id[, depth, ...]) – Dataset containing a list of queries with a ranked list of documents per query.
- prepare_data() – Downloads docs, queries, scoreddocs, and qrels using ir_datasets if needed and available.
Attributes
- qrels – The qrels in the dataset.
- property DASHED_DATASET_MAP: Dict[str, str]
Map of dataset names with dashes to dataset names with slashes.
- Returns:
Dataset map.
- Return type:
Dict[str, str]
- property docs: Docstore | Dict[str, GenericDoc]
Documents in the dataset.
- Returns:
Documents.
- Return type:
ir_datasets.indices.Docstore | Dict[str, GenericDoc]
- Raises:
ValueError – If no documents are found in the dataset.
- property docs_dataset_id: str
ID of the dataset containing the documents.
- Returns:
Document dataset id.
- Return type:
str
- property ir_dataset: Dataset | None
Instance of ir_datasets.Dataset.
- Returns:
Instance of ir_datasets.Dataset or None if the dataset is not found.
- Return type:
ir_datasets.Dataset | None
- prepare_constituent(constituent: Literal['qrels', 'queries', 'docs', 'scoreddocs', 'docpairs']) None
Downloads the constituent of the dataset using ir_datasets if needed.
- Parameters:
constituent (Literal["qrels", "queries", "docs", "scoreddocs", "docpairs"]) – Constituent to download.
- prepare_data() None[source]
Downloads docs, queries, scoreddocs, and qrels using ir_datasets if needed and available.
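When run_path_or_id points to a local file, a common input is the six-column TREC run format. The following sketch shows how such a file maps onto a per-query ranking and how a depth cutoff applies; it assumes the TREC format and is not lightning_ir's parser.

```python
from collections import defaultdict
from io import StringIO

# One TREC run line per scored document: query_id Q0 doc_id rank score run_name
run_text = """\
q1 Q0 doc3 1 12.5 mysys
q1 Q0 doc7 2 11.0 mysys
q1 Q0 doc1 3 9.8 mysys
q2 Q0 doc2 1 8.1 mysys
"""


def read_run(fh, depth: int = -1):
    """Parse a TREC-format run into {query_id: [(doc_id, rank, score), ...]}."""
    run = defaultdict(list)
    for line in fh:
        query_id, _, doc_id, rank, score, _ = line.split()
        rank = int(rank)
        if depth != -1 and rank > depth:
            continue  # cut off the ranking at the requested depth
        run[query_id].append((doc_id, rank, float(score)))
    return dict(run)


run = read_run(StringIO(run_text), depth=2)
print(run["q1"])  # [('doc3', 1, 12.5), ('doc7', 2, 11.0)]
```

With depth=-1 (the default) the full ranking is kept, mirroring the depth parameter above.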