RunDataset
- class lightning_ir.data.dataset.RunDataset(run_path_or_id: Path | str, depth: int = -1, sample_size: int = -1, sampling_strategy: Literal['single_relevant', 'top', 'random', 'log_random', 'top_and_random'] = 'top', targets: Literal['relevance', 'subtopic_relevance', 'rank', 'score'] | None = None, normalize_targets: bool = False, add_docs_not_in_ranking: bool = False)[source]
Bases: IRDataset, Dataset
- __init__(run_path_or_id: Path | str, depth: int = -1, sample_size: int = -1, sampling_strategy: Literal['single_relevant', 'top', 'random', 'log_random', 'top_and_random'] = 'top', targets: Literal['relevance', 'subtopic_relevance', 'rank', 'score'] | None = None, normalize_targets: bool = False, add_docs_not_in_ranking: bool = False) None [source]
Dataset containing a list of queries with a ranked list of documents per query. Subsets of the ranked list can be sampled using different sampling strategies.
- Parameters:
run_path_or_id (Path | str) – Path to a run file or valid ir_datasets id
depth (int, optional) – Depth at which to cut off the ranking. If -1 the full ranking is kept, defaults to -1
sample_size (int, optional) – The number of documents to sample per query, defaults to -1
sampling_strategy (Literal['single_relevant', 'top', 'random', 'log_random', 'top_and_random'], optional) – The sampling strategy to use when sampling documents, defaults to “top”
targets (Literal['relevance', 'subtopic_relevance', 'rank', 'score'] | None, optional) – The data type to use as targets for a model during fine-tuning. If “relevance”, the relevance judgements are parsed from the qrels, defaults to None
normalize_targets (bool, optional) – Whether to normalize the targets between 0 and 1, defaults to False
add_docs_not_in_ranking (bool, optional) – Whether to add relevant documents that appear in the qrels but not in the ranking to a sample, defaults to False
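The sampling strategies listed above can be illustrated with a minimal sketch. This is not the library's implementation; the function name, the weighting used for `log_random`, and the exact split in `top_and_random` are illustrative assumptions:

```python
import math
import random


def sample_ranking(doc_ids, relevances, sample_size, strategy, seed=42):
    """Illustrative sketch of the sampling strategies (not the library's code).

    doc_ids are assumed to be ordered by rank (best first); relevances holds
    the qrels judgement for each document.
    """
    rng = random.Random(seed)
    if strategy == "top":
        # keep the top-ranked documents
        return doc_ids[:sample_size]
    if strategy == "random":
        # uniform sample without replacement
        return rng.sample(doc_ids, sample_size)
    if strategy == "log_random":
        # bias the sample towards the top of the ranking; here rank r is
        # weighted by 1 / log2(r + 2), sampled with replacement for simplicity
        weights = [1 / math.log2(rank + 2) for rank in range(len(doc_ids))]
        return rng.choices(doc_ids, weights=weights, k=sample_size)
    if strategy == "top_and_random":
        # half from the top of the ranking, the rest sampled from the remainder
        half = sample_size // 2
        return doc_ids[:half] + rng.sample(doc_ids[half:], sample_size - half)
    if strategy == "single_relevant":
        # one judged-relevant document plus sampled non-relevant ones
        relevant = [d for d, rel in zip(doc_ids, relevances) if rel > 0]
        non_relevant = [d for d, rel in zip(doc_ids, relevances) if rel <= 0]
        return relevant[:1] + rng.sample(non_relevant, sample_size - 1)
    raise ValueError(f"unknown strategy: {strategy}")
```

With `depth` set, the ranking would first be cut off at that depth before any of these strategies is applied.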
Methods
__init__(run_path_or_id[, depth, ...]) – Dataset containing a list of queries with a ranked list of documents per query.
prepare_data() – Downloads docs, queries, scoreddocs, and qrels using ir_datasets if needed and available.
Attributes
qrels – The qrels in the dataset.
- property DASHED_DATASET_MAP: Dict[str, str]
Map of dataset names with dashes to dataset names with slashes.
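ir_datasets ids use slashes as separators, but the names between the slashes also contain dashes, so a plain character replacement cannot recover the original id; an explicit map is needed. A minimal illustration (the map entry and the `resolve` helper are hypothetical, not the library's API):

```python
# Hypothetical illustration of a dashed-to-slashed dataset name map.
# Because segment names themselves contain dashes (e.g. "trec-dl-2019"),
# a plain dashed_id.replace("-", "/") would be ambiguous.
DASHED_DATASET_MAP = {
    "msmarco-passage-trec-dl-2019-judged": "msmarco-passage/trec-dl-2019/judged",
}


def resolve(dataset_id: str) -> str:
    # fall back to the id itself if it is not a dashed alias
    return DASHED_DATASET_MAP.get(dataset_id, dataset_id)
```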
- Returns:
Dataset map
- Return type:
Dict[str, str]
- property docs: Docstore | Dict[str, GenericDoc]
Documents in the dataset.
- Raises:
ValueError – If no documents are found in the dataset
- Returns:
Documents
- Return type:
ir_datasets.indices.Docstore | Dict[str, GenericDoc]
- property docs_dataset_id: str
ID of the dataset containing the documents.
- Returns:
Document dataset id
- Return type:
str
- property ir_dataset: Dataset | None
The underlying ir_datasets.Dataset instance, or None if the dataset is not available via ir_datasets.
- Returns:
ir_datasets dataset
- Return type:
ir_datasets.Dataset | None
- prepare_constituent(constituent: Literal['qrels', 'queries', 'docs', 'scoreddocs', 'docpairs']) None
Downloads the constituent of the dataset using ir_datasets if needed.
- Parameters:
constituent (Literal["qrels", "queries", "docs", "scoreddocs", "docpairs"]) – Constituent to download
- prepare_data() None [source]
Downloads docs, queries, scoreddocs, and qrels using ir_datasets if needed and available.