DocDataset
- class lightning_ir.data.dataset.DocDataset(doc_dataset: str, num_docs: int | None = None, text_fields: Sequence[str] | None = None)[source]
Bases: IRDataset, _DataParallelIterableDataset
- __init__(doc_dataset: str, num_docs: int | None = None, text_fields: Sequence[str] | None = None) → None [source]
Dataset containing documents.
- Parameters:
doc_dataset (str) – Path to a file containing documents or a valid ir_datasets id
num_docs (int | None, optional) – Number of documents in the dataset. If None, an attempt is made to infer the number of documents, defaults to None
text_fields (Sequence[str] | None, optional) – Fields to parse the document text from, defaults to None
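For orientation, a minimal construction sketch; the "msmarco-passage" id below is only an illustrative ir_datasets id, not one prescribed by this class.

```python
from lightning_ir.data.dataset import DocDataset

# Construct from an ir_datasets id ("msmarco-passage" is only an example);
# a path to a local file containing documents would work as well.
dataset = DocDataset("msmarco-passage", num_docs=None, text_fields=None)
```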
Methods
- __init__(doc_dataset[, num_docs, text_fields]) – Dataset containing documents.
- prepare_data() – Downloads documents using ir_datasets if needed.
Attributes
- property DASHED_DATASET_MAP: Dict[str, str]
Map of dataset names with dashes to dataset names with slashes.
- Returns:
Dataset map
- Return type:
Dict[str, str]
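A hedged lookup sketch: per the docstring, keys use dashes and values use the slash-separated ir_datasets ids; the specific key shown is hypothetical and may not exist in the map.

```python
from lightning_ir.data.dataset import DocDataset

dataset = DocDataset("msmarco-passage")  # illustrative id
# Hypothetical dashed key; returns the corresponding slashed dataset id if present.
slashed_id = dataset.DASHED_DATASET_MAP.get("msmarco-passage-dev")
```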
- property docs: Docstore | Dict[str, GenericDoc]
Documents in the dataset.
- Raises:
ValueError – If no documents are found in the dataset
- Returns:
Documents
- Return type:
ir_datasets.indices.Docstore | Dict[str, GenericDoc]
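A sketch of a document lookup through the docs property; the document id is a placeholder, and .get() is used because both an ir_datasets Docstore and a plain dict support it.

```python
from lightning_ir.data.dataset import DocDataset

dataset = DocDataset("msmarco-passage")  # illustrative id
dataset.prepare_data()  # make sure the documents are downloaded

# "0" is a placeholder doc_id; behavior for missing ids depends on whether a
# Docstore or a dict backs the property.
doc = dataset.docs.get("0")
print(doc.doc_id, doc.text)
```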
- property docs_dataset_id: str
ID of the dataset containing the documents.
- Returns:
Document dataset id
- Return type:
str
- property ir_dataset: Dataset | None
Instance of ir_datasets.Dataset.
- Returns:
ir_datasets dataset
- Return type:
ir_datasets.Dataset | None
- prepare_constituent(constituent: Literal['qrels', 'queries', 'docs', 'scoreddocs', 'docpairs']) → None
Downloads the constituent of the dataset using ir_datasets if needed.
- Parameters:
constituent (Literal["qrels", "queries", "docs", "scoreddocs", "docpairs"]) – Constituent to download
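As a sketch, only the "docs" constituent is relevant for a document dataset; per the docstring, the download is skipped when the data is already available.

```python
from lightning_ir.data.dataset import DocDataset

dataset = DocDataset("msmarco-passage")  # illustrative id
# Download just the documents; does nothing if they are already cached locally.
dataset.prepare_constituent("docs")
```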
- prepare_data() → None [source]
Downloads documents using ir_datasets if needed.
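A usage sketch, assuming the common pattern of downloading once up front and then wrapping the iterable dataset in a standard PyTorch DataLoader; the DataLoader arguments are placeholders, not library defaults.

```python
from torch.utils.data import DataLoader

from lightning_ir.data.dataset import DocDataset

dataset = DocDataset("msmarco-passage")  # illustrative id
dataset.prepare_data()  # download documents once, before any workers start

# DocDataset is an iterable dataset, so it can be consumed by a DataLoader;
# batch_size=None yields individual samples without collation (sketch only).
loader = DataLoader(dataset, batch_size=None)
```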