CrossEncoderTokenizer
- class lightning_ir.cross_encoder.cross_encoder_tokenizer.CrossEncoderTokenizer(*args, query_length: int = 32, doc_length: int = 512, **kwargs)[source]
Bases: LightningIRTokenizer
- __init__(*args, query_length: int = 32, doc_length: int = 512, **kwargs)[source]
LightningIRTokenizer for cross-encoder models. Encodes queries and documents jointly and ensures that the input sequences are of the correct length.
- Parameters:
query_length (int, optional) – Maximum number of tokens per query, defaults to 32
doc_length (int, optional) – Maximum number of tokens per document, defaults to 512
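For example, a minimal loading sketch (the checkpoint name is illustrative; the keyword arguments are the __init__ parameters above, assumed to be forwarded through from_pretrained):

from lightning_ir.cross_encoder.cross_encoder_tokenizer import CrossEncoderTokenizer

# Wrap a BERT backbone tokenizer with the cross-encoder mixin, limiting
# queries to 16 tokens and documents to 256 tokens.
tokenizer = CrossEncoderTokenizer.from_pretrained(
    "bert-base-uncased", query_length=16, doc_length=256
)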
Methods
- __init__(*args[, query_length, doc_length]) – LightningIRTokenizer for cross-encoder models.
- tokenize([queries, docs, num_docs]) – Tokenizes queries and documents into a single sequence of tokens.
Attributes
- config_class
Configuration class for the tokenizer. Alias of CrossEncoderConfig.
- classmethod from_pretrained(model_name_or_path: str, *args, **kwargs) Self
Loads a pretrained tokenizer. Wraps the transformers.PreTrainedTokenizer.from_pretrained method to return a derived LightningIRTokenizer class. See LightningIRTokenizerClassFactory for more details.
>>> # Loading using model class and backbone checkpoint
>>> type(BiEncoderTokenizer.from_pretrained("bert-base-uncased"))
<class 'lightning_ir.base.class_factory.BiEncoderBertTokenizerFast'>
>>> # Loading using base class and backbone checkpoint
>>> type(LightningIRTokenizer.from_pretrained("bert-base-uncased", config=BiEncoderConfig()))
<class 'lightning_ir.base.class_factory.BiEncoderBertTokenizerFast'>
- Parameters:
model_name_or_path (str) – Name or path of the pretrained tokenizer
- Raises:
ValueError – If called on the abstract class LightningIRTokenizer and no config is passed
- Returns:
A derived LightningIRTokenizer consisting of a backbone tokenizer and a LightningIRTokenizer mixin
- Return type:
Self
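By analogy with the bi-encoder examples above, a hedged sketch for the cross-encoder case (assuming the classes are re-exported at the package top level; the exact name of the dynamically created class follows the factory naming pattern shown above and is not guaranteed here):

from lightning_ir import CrossEncoderConfig, LightningIRTokenizer

# Loading via the abstract base class requires a config so the class factory
# knows which tokenizer mixin to apply to the backbone tokenizer.
tokenizer = LightningIRTokenizer.from_pretrained(
    "bert-base-uncased", config=CrossEncoderConfig()
)
print(type(tokenizer))  # expected: a derived cross-encoder tokenizer class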
- tokenize(queries: str | Sequence[str] | None = None, docs: str | Sequence[str] | None = None, num_docs: Sequence[int] | int | None = None, **kwargs) Dict[str, BatchEncoding] [source]
Tokenizes queries and documents into a single sequence of tokens.
- Parameters:
queries (str | Sequence[str] | None, optional) – Queries to tokenize, defaults to None
docs (str | Sequence[str] | None, optional) – Documents to tokenize, defaults to None
num_docs (Sequence[int] | int | None, optional) – Specifies how many documents are passed per query. If a sequence of integers, len(num_docs) should equal the number of queries and sum(num_docs) should equal the number of documents, i.e., the sequence contains one value per query specifying the number of documents for that query. If an integer, assumes an equal number of documents per query. If None, tries to infer the number of documents per query by dividing the number of documents by the number of queries, defaults to None (see the usage sketch below)
- Returns:
Tokenized query-document sequence
- Return type:
Dict[str, BatchEncoding]
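A usage sketch (the checkpoint name and data are illustrative; padding and return_tensors are assumed to be forwarded to the backbone tokenizer via **kwargs, and the keys of the returned mapping are not asserted here):

from lightning_ir.cross_encoder.cross_encoder_tokenizer import CrossEncoderTokenizer

tokenizer = CrossEncoderTokenizer.from_pretrained(
    "bert-base-uncased", query_length=32, doc_length=512
)

queries = ["what is information retrieval"]
docs = [
    "Information retrieval is the task of finding relevant documents.",
    "Cats are small, domesticated, carnivorous mammals.",
]

# One query paired with two documents; num_docs holds one entry per query.
output = tokenizer.tokenize(
    queries, docs, num_docs=[2], padding=True, return_tensors="pt"
)
for name, batch_encoding in output.items():
    # Each value is a transformers.BatchEncoding containing one joint
    # query-document sequence per pair (two sequences here).
    print(name, batch_encoding.input_ids.shape)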