CrossEncoderTokenizer

class lightning_ir.cross_encoder.cross_encoder_tokenizer.CrossEncoderTokenizer(*args, query_length: int | None = 32, doc_length: int | None = 512, scoring_strategy: str | None = None, **kwargs)

Bases: LightningIRTokenizer

__init__(*args, query_length: int | None = 32, doc_length: int | None = 512, scoring_strategy: str | None = None, **kwargs)

LightningIRTokenizer for cross-encoder models. Encodes queries and documents jointly and truncates them to the configured maximum lengths (query_length and doc_length).

Parameters:
  • query_length (int | None) – Maximum number of tokens per query. If None, queries are not truncated. Defaults to 32.

  • doc_length (int | None) – Maximum number of tokens per document. If None, documents are not truncated. Defaults to 512.
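The effect of the two length limits can be illustrated with a minimal sketch. It is not the library's implementation: the helper truncate is hypothetical, and whitespace splitting stands in for the model's subword tokenization.

```python
from typing import List, Optional


def truncate(tokens: List[str], max_length: Optional[int]) -> List[str]:
    """Keep at most max_length tokens; None disables truncation."""
    if max_length is None:
        return list(tokens)
    return tokens[:max_length]


# Whitespace tokens are an assumption made for illustration only.
query_tokens = "what is a cross encoder".split()
doc_tokens = "a cross encoder scores a query and a document jointly".split()

short_query = truncate(query_tokens, 3)   # limit of 3 tokens
full_doc = truncate(doc_tokens, None)     # None: no truncation
```

Passing None leaves the sequence untouched, which mirrors the documented behavior of query_length=None and doc_length=None.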

Methods

__init__(*args[, query_length, doc_length, ...])

LightningIRTokenizer for cross-encoder models.

tokenize([queries, docs, num_docs])

Tokenizes queries and documents into a single sequence of tokens.

Attributes

config_class

Configuration class for the tokenizer.

alias of CrossEncoderConfig

tokenize(queries: str | Sequence[str] | None = None, docs: str | Sequence[str] | None = None, num_docs: Sequence[int] | int | None = None, **kwargs) → Dict[str, BatchEncoding]

Tokenizes queries and documents into a single sequence of tokens.

Parameters:
  • queries (str | Sequence[str] | None) – Queries to tokenize. Defaults to None.

  • docs (str | Sequence[str] | None) – Documents to tokenize. Defaults to None.

  • num_docs (Sequence[int] | int | None) – Number of documents passed per query. If a sequence of integers, it must contain one value per query, so len(num_docs) equals the number of queries and sum(num_docs) equals the number of documents. If an integer, every query is assumed to have that many documents. If None, the number of documents per query is inferred by dividing the number of documents by the number of queries. Defaults to None.
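The three forms of num_docs can be sketched in plain Python. This is an illustration under assumptions, not the library's code: resolve_num_docs and pair_queries_with_docs are hypothetical helpers, and raising ValueError on mismatched or non-divisible counts is an assumption about how such inputs might be handled.

```python
from typing import List, Optional, Sequence, Tuple, Union


def resolve_num_docs(
    num_queries: int,
    total_docs: int,
    num_docs: Optional[Union[int, Sequence[int]]] = None,
) -> List[int]:
    """Return one document count per query (hypothetical helper)."""
    if num_docs is None:
        # Infer an equal split; assumes the documents divide evenly.
        if num_queries == 0 or total_docs % num_queries != 0:
            raise ValueError("cannot infer num_docs from the input sizes")
        return [total_docs // num_queries] * num_queries
    if isinstance(num_docs, int):
        # Same number of documents for every query.
        counts = [num_docs] * num_queries
    else:
        counts = list(num_docs)
    if len(counts) != num_queries or sum(counts) != total_docs:
        raise ValueError("num_docs does not match the queries and documents")
    return counts


def pair_queries_with_docs(
    queries: Sequence[str],
    docs: Sequence[str],
    num_docs: Optional[Union[int, Sequence[int]]] = None,
) -> List[Tuple[str, str]]:
    """Repeat each query once per document assigned to it."""
    counts = resolve_num_docs(len(queries), len(docs), num_docs)
    expanded = [q for q, n in zip(queries, counts) for _ in range(n)]
    return list(zip(expanded, docs))


pairs = pair_queries_with_docs(["q1", "q2"], ["d1", "d2", "d3"], num_docs=[1, 2])
```

Here the first query is paired with one document and the second with two, so the flat list of documents is consumed in order.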

Returns:

Tokenized query-document sequence.

Return type:

Dict[str, BatchEncoding]

Raises:
  • ValueError – If either queries or docs is None.

  • ValueError – If queries and docs are not both lists or both strings.
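The two error conditions above amount to a pair of input checks. The following is a minimal sketch of such checks, not the library's source: check_inputs and raises_value_error are hypothetical names, and the error messages are made up for illustration.

```python
from typing import Optional, Sequence, Union

TextInput = Optional[Union[str, Sequence[str]]]


def check_inputs(queries: TextInput, docs: TextInput) -> None:
    """Raise ValueError for the two documented failure cases (hypothetical helper)."""
    if queries is None or docs is None:
        raise ValueError("Both queries and docs must be provided.")
    # A single string pairs with a single string; a batch pairs with a batch.
    if isinstance(queries, str) != isinstance(docs, str):
        raise ValueError("queries and docs must both be strings or both be lists.")


def raises_value_error(queries: TextInput, docs: TextInput) -> bool:
    """Report whether check_inputs rejects the given inputs."""
    try:
        check_inputs(queries, docs)
    except ValueError:
        return True
    return False
```

For example, raises_value_error("q", ["d"]) is True because one argument is a string and the other a list, while raises_value_error(["q"], ["d1", "d2"]) is False.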