SetEncoderTokenizer
- class lightning_ir.models.set_encoder.SetEncoderTokenizer(*args, query_length: int = 32, doc_length: int = 512, add_extra_token: bool = False, **kwargs)[source]
Bases: CrossEncoderTokenizer
- __init__(*args, query_length: int = 32, doc_length: int = 512, add_extra_token: bool = False, **kwargs)[source]
Initializes a SetEncoder tokenizer.
- Parameters:
query_length (int) – Maximum query length. Defaults to 32.
doc_length (int) – Maximum document length. Defaults to 512.
add_extra_token (bool) – Whether to add an extra interaction token. Defaults to False.
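For illustration, a minimal loading sketch. The keyword arguments mirror the parameters above; passing them through from_pretrained assumes the standard transformers behavior of forwarding extra keyword arguments to the tokenizer's __init__, and the checkpoint name is just an example.

from lightning_ir.models.set_encoder import SetEncoderTokenizer

# Load the tokenizer from a backbone checkpoint, overriding the
# truncation lengths and enabling the extra interaction token.
tokenizer = SetEncoderTokenizer.from_pretrained(
    "bert-base-uncased",
    query_length=32,
    doc_length=256,
    add_extra_token=True,
)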
Methods
__init__(*args[, query_length, doc_length, ...]) – Initializes a SetEncoder tokenizer.
tokenize([queries, docs, num_docs]) – Tokenizes queries and documents into a single sequence of tokens.
Attributes
interaction_token_id
- config_class
Configuration class for the tokenizer.
alias of SetEncoderConfig
- classmethod from_pretrained(model_name_or_path: str, *args, **kwargs) → Self
Loads a pretrained tokenizer. Wraps the transformers.PreTrainedTokenizer.from_pretrained method to return a derived LightningIRTokenizer class. See LightningIRTokenizerClassFactory for more details.
>>> # Loading using model class and backbone checkpoint
>>> type(BiEncoderTokenizer.from_pretrained("bert-base-uncased"))
<class 'lightning_ir.base.class_factory.BiEncoderBertTokenizerFast'>
>>> # Loading using base class and backbone checkpoint
>>> type(LightningIRTokenizer.from_pretrained("bert-base-uncased", config=BiEncoderConfig()))
<class 'lightning_ir.base.class_factory.BiEncoderBertTokenizerFast'>
- Parameters:
model_name_or_path (str) – Name or path of the pretrained tokenizer.
- Returns:
A derived LightningIRTokenizer consisting of a backbone tokenizer and a LightningIRTokenizer mixin.
- Return type:
Self
- Raises:
ValueError – If called on the abstract class LightningIRTokenizer and no config is passed.
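Analogously to the example above, a SetEncoder tokenizer can be loaded either via the model class or via the base class with a SetEncoderConfig. This is a sketch: the import paths follow the class paths shown on this page and are not confirmed here.

from lightning_ir import LightningIRTokenizer
from lightning_ir.models.set_encoder import SetEncoderConfig, SetEncoderTokenizer

# Via the model class: the configuration type is taken from config_class.
tokenizer = SetEncoderTokenizer.from_pretrained("bert-base-uncased")

# Via the abstract base class: the config argument selects the derived class.
tokenizer = LightningIRTokenizer.from_pretrained(
    "bert-base-uncased", config=SetEncoderConfig()
)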
- tokenize(queries: str | Sequence[str] | None = None, docs: str | Sequence[str] | None = None, num_docs: Sequence[int] | int | None = None, **kwargs) → Dict[str, BatchEncoding][source]
Tokenizes queries and documents into a single sequence of tokens.
- Parameters:
queries (str | Sequence[str] | None) – Queries to tokenize. Defaults to None.
docs (str | Sequence[str] | None) – Documents to tokenize. Defaults to None.
num_docs (Sequence[int] | int | None) – Specifies how many documents are passed per query. If a sequence of integers, len(num_docs) should equal the number of queries and sum(num_docs) the number of documents, i.e., the sequence contains one value per query specifying the number of documents for that query. If an integer, assumes an equal number of documents per query. If None, infers the number of documents per query by dividing the total number of documents by the number of queries. Defaults to None. See the sketch after this method's Raises entry for an example.
- Returns:
Tokenized query-document sequence.
- Return type:
Dict[str, BatchEncoding]
- Raises:
ValueError – If both queries and docs are None.
ValueError – If queries and docs are not both lists or both strings.
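To illustrate the num_docs parameter, a small sketch, assuming a tokenizer loaded as above; the queries and document texts are made up.

queries = ["what is information retrieval", "define cross-encoder"]
docs = [
    "Information retrieval finds relevant documents for a query.",  # doc 1 for query 1
    "IR systems rank documents by estimated relevance.",  # doc 2 for query 1
    "A cross-encoder jointly encodes a query-document pair.",  # doc 1 for query 2
]

# len(num_docs) equals the number of queries and sum(num_docs) the number
# of documents: the first query is paired with two documents, the second
# with one. Each query-document pair becomes one input sequence.
encodings = tokenizer.tokenize(queries, docs, num_docs=[2, 1])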