RunDataset

class lightning_ir.data.dataset.RunDataset(run_path_or_id: Path | str, depth: int = -1, sample_size: int = -1, sampling_strategy: 'single_relevant' | 'top' | 'random' | 'log_random' | 'top_and_random' = 'top', targets: 'relevance' | 'subtopic_relevance' | 'rank' | 'score' | None = None, normalize_targets: bool = False, add_docs_not_in_ranking: bool = False)[source]

Bases: IRDataset, Dataset

__init__(run_path_or_id: Path | str, depth: int = -1, sample_size: int = -1, sampling_strategy: 'single_relevant' | 'top' | 'random' | 'log_random' | 'top_and_random' = 'top', targets: 'relevance' | 'subtopic_relevance' | 'rank' | 'score' | None = None, normalize_targets: bool = False, add_docs_not_in_ranking: bool = False) → None[source]

Dataset containing a list of queries with a ranked list of documents per query. Subsets of the ranked list can be sampled using different sampling strategies.

Parameters:
  • run_path_or_id (Path | str) – Path to a run file or valid ir_datasets id.

  • depth (int) – Depth at which to cut off the ranking. If -1, the full ranking is kept. Defaults to -1.

  • sample_size (int) – The number of documents to sample per query. Defaults to -1.

  • sampling_strategy (Literal["single_relevant", "top", "random", "log_random", "top_and_random"]) – The strategy used to sample documents. Defaults to “top”.

  • targets (Literal["relevance", "subtopic_relevance", "rank", "score"] | None) – The data type to use as targets for a model during fine-tuning. If “relevance”, the relevance judgements are parsed from the qrels. Defaults to None.

  • normalize_targets (bool) – Whether to normalize the targets between 0 and 1. Defaults to False.

  • add_docs_not_in_ranking (bool) – Whether to add relevant documents that are in the qrels but not in the ranking to a sample. Defaults to False.
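
The interplay of depth, sample_size, and sampling_strategy can be illustrated with a small standalone sketch. It does not use lightning_ir and is not the library's implementation: the helper sample_ranking is hypothetical, it covers only the "top" and "random" strategies, and it assumes that "top" keeps the highest-ranked documents and that a sample_size of -1 keeps every document that survives the depth cutoff.

```python
import random


def sample_ranking(doc_ids, depth=-1, sample_size=-1, strategy="top", seed=0):
    """Hypothetical helper: cut a ranked list at `depth`, then sample from it.

    `doc_ids` is assumed to be ordered from best to worst rank.
    """
    ranked = list(doc_ids)
    # Depth cutoff: -1 keeps the full ranking.
    if depth != -1:
        ranked = ranked[:depth]
    # Assumption: -1 (or a sample size >= the list length) keeps everything.
    if sample_size == -1 or sample_size >= len(ranked):
        return ranked
    if strategy == "top":
        # Assumption: "top" keeps the highest-ranked documents.
        return ranked[:sample_size]
    if strategy == "random":
        # Uniform sample without replacement from the truncated ranking.
        return random.Random(seed).sample(ranked, sample_size)
    raise ValueError(f"strategy {strategy!r} is not covered by this sketch")


run = ["d1", "d2", "d3", "d4", "d5"]
top_two = sample_ranking(run, depth=3, sample_size=2, strategy="top")
random_two = sample_ranking(run, depth=4, sample_size=2, strategy="random")
print(top_two)
print(random_two)
```

In this sketch the depth cutoff is applied before sampling, so a random sample can only contain documents ranked within the cutoff. Whether RunDataset applies the two steps in the same order is an assumption here.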

Methods

__init__(run_path_or_id[, depth, ...])

Dataset containing a list of queries with a ranked list of documents per query.

prepare_data()

Downloads docs, queries, scoreddocs, and qrels using ir_datasets if needed and available.

Attributes

qrels

The qrels in the dataset.

prepare_data() → None[source]

Downloads docs, queries, scoreddocs, and qrels using ir_datasets if needed and available.

property qrels: DataFrame | None

The qrels in the dataset. If the dataset does not contain qrels, the qrels are None.

Returns:

The qrels, or None if the dataset does not contain qrels.

Return type:

pd.DataFrame | None
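
Because qrels may be None, calling code should check for that case before using the value. The sketch below shows one way to do so; the helper num_judgements is hypothetical, and a plain list stands in for the DataFrame so the example runs without pandas (len() on a DataFrame likewise returns its number of rows).

```python
def num_judgements(qrels):
    """Hypothetical helper: number of qrel rows, or 0 if there are no qrels."""
    if qrels is None:
        # Dataset has no relevance judgements.
        return 0
    return len(qrels)


# A list of (query_id, doc_id, relevance) tuples stands in for a DataFrame.
toy_qrels = [("q1", "d1", 1), ("q1", "d7", 0)]
print(num_judgements(toy_qrels))
print(num_judgements(None))
```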