DocDataset

class lightning_ir.data.dataset.DocDataset(doc_dataset: str, num_docs: int | None = None, text_fields: Sequence[str] | None = None)

Bases: IRDataset, _DataParallelIterableDataset

__init__(doc_dataset: str, num_docs: int | None = None, text_fields: Sequence[str] | None = None) → None

Dataset containing documents.

Parameters:
  • doc_dataset (str) – Path to a file containing documents, or a valid ir_datasets id.

  • num_docs (int | None, optional) – Number of documents in the dataset. If None, an attempt is made to infer the number of documents. Defaults to None.

  • text_fields (Sequence[str] | None, optional) – Fields to parse the document text from. Defaults to None.
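
As a minimal sketch of how these parameters fit together, the helper below forwards them to the constructor unchanged. The helper name build_doc_dataset is hypothetical and not part of lightning_ir; the import is deferred so the snippet can be loaded without the library installed.

```python
from typing import Optional, Sequence


def build_doc_dataset(
    doc_dataset: str,
    num_docs: Optional[int] = None,
    text_fields: Optional[Sequence[str]] = None,
):
    """Hypothetical helper: forward the documented arguments to DocDataset."""
    # Deferred import so this module loads even when lightning_ir is absent.
    from lightning_ir.data.dataset import DocDataset

    # num_docs=None leaves the document count to be inferred, as documented above.
    return DocDataset(doc_dataset, num_docs=num_docs, text_fields=text_fields)
```

For example, build_doc_dataset("msmarco-passage", text_fields=["text"]) would ask for only the "text" field and leave num_docs to be inferred. Both the id "msmarco-passage" and the field name "text" are assumptions about ir_datasets, not values taken from this page.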

Methods

__init__(doc_dataset[, num_docs, text_fields]) – Dataset containing documents.

prepare_data() – Downloads documents using ir_datasets if needed.

prepare_data() → None

Downloads documents using ir_datasets if needed.
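
A rough sketch of calling prepare_data() explicitly before using the dataset is shown here. The helper name load_prepared_docs is hypothetical, and whether an explicit call is needed in a given training or indexing setup is an assumption rather than something this page states.

```python
def load_prepared_docs(doc_dataset: str):
    """Hypothetical helper: build a DocDataset and ensure its documents are available."""
    # Deferred import so this module loads even when lightning_ir is absent.
    from lightning_ir.data.dataset import DocDataset

    dataset = DocDataset(doc_dataset)
    # May trigger a download through ir_datasets the first time it runs.
    dataset.prepare_data()
    return dataset
```

Because the first call may download the collection, it can take a while and needs network access.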