LightningIRTokenizer
- class lightning_ir.base.tokenizer.LightningIRTokenizer(*args, query_length: int | None = 32, doc_length: int | None = 512, **kwargs)[source]
Bases: PreTrainedTokenizerBase
Base class for Lightning IR tokenizers. Derived classes implement the tokenize method for handling query and document tokenization. It acts as a mixin for a transformers.PreTrainedTokenizer backbone tokenizer.
- __init__(*args, query_length: int | None = 32, doc_length: int | None = 512, **kwargs)[source]
Initializes the tokenizer.
- Parameters:
query_length (int | None) – Maximum number of tokens per query. If None, queries are not truncated. Defaults to 32.
doc_length (int | None) – Maximum number of tokens per document. If None, documents are not truncated. Defaults to 512.
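A minimal usage sketch of these parameters. LightningIRTokenizer itself is abstract, so a concrete subclass such as BiEncoderTokenizer is assumed here; the truncation behaviour follows from the query_length and doc_length parameters above:

>>> # Sketch: load a concrete subclass with custom truncation lengths
>>> from lightning_ir import BiEncoderTokenizer
>>> tokenizer = BiEncoderTokenizer.from_pretrained(
...     "bert-base-uncased", query_length=16, doc_length=256
... )
>>> # Queries longer than 16 tokens and documents longer than 256 tokens are truncated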
Methods
__init__(*args[, query_length, doc_length]) – Initializes the tokenizer.
from_pretrained(model_name_or_path, *args, ...) – Loads a pretrained tokenizer.
tokenize([queries, docs]) – Tokenizes queries and documents.
Attributes
- config_class
Configuration class for the tokenizer. Alias of LightningIRConfig.
- classmethod from_pretrained(model_name_or_path: str, *args, **kwargs) → Self[source]
Loads a pretrained tokenizer. Wraps the transformers.PreTrainedTokenizer.from_pretrained method to return a derived LightningIRTokenizer class. See LightningIRTokenizerClassFactory for more details.

>>> # Loading using model class and backbone checkpoint
>>> type(BiEncoderTokenizer.from_pretrained("bert-base-uncased"))
<class 'lightning_ir.base.class_factory.BiEncoderBertTokenizerFast'>
>>> # Loading using base class and backbone checkpoint
>>> type(LightningIRTokenizer.from_pretrained("bert-base-uncased", config=BiEncoderConfig()))
<class 'lightning_ir.base.class_factory.BiEncoderBertTokenizerFast'>

- Parameters:
model_name_or_path (str) – Name or path of the pretrained tokenizer.
- Returns:
A derived LightningIRTokenizer consisting of a backbone tokenizer and a LightningIRTokenizer mixin.
- Return type:
Self
- Raises:
ValueError – If called on the abstract class LightningIRTokenizer and no config is passed.
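To illustrate the ValueError above, a hedged sketch: a backbone checkpoint such as "bert-base-uncased" carries no Lightning IR configuration, so the abstract class cannot infer which derived tokenizer to build unless a config is passed explicitly:

>>> # Abstract class + backbone checkpoint without a config raises ValueError
>>> from lightning_ir import LightningIRTokenizer
>>> try:
...     LightningIRTokenizer.from_pretrained("bert-base-uncased")
... except ValueError:
...     print("pass a config, e.g. config=BiEncoderConfig()")
pass a config, e.g. config=BiEncoderConfig()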
- tokenize(queries: str | Sequence[str] | None = None, docs: str | Sequence[str] | None = None, **kwargs) → Dict[str, BatchEncoding][source]
Tokenizes queries and documents.
- Parameters:
queries (str | Sequence[str] | None) – Queries to tokenize. Defaults to None.
docs (str | Sequence[str] | None) – Documents to tokenize. Defaults to None.
- Returns:
Dictionary containing tokenized queries and documents.
- Return type:
Dict[str, BatchEncoding]
- Raises:
NotImplementedError – Must be implemented by the derived class.
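Because tokenize must be provided by derived classes, the following is a minimal sketch of what an implementation could look like. The encoding key names ("query_encoding", "doc_encoding") and the use of self.query_length / self.doc_length are illustrative assumptions, not a guaranteed contract; see the concrete subclasses (e.g., BiEncoderTokenizer) for the actual behaviour.

# Hypothetical sketch only: in practice the class factory combines the
# LightningIRTokenizer mixin with a backbone tokenizer for you.
from __future__ import annotations

from typing import Dict, Sequence

from transformers import BatchEncoding

from lightning_ir import LightningIRTokenizer


class MySparseTokenizer(LightningIRTokenizer):
    def tokenize(
        self,
        queries: str | Sequence[str] | None = None,
        docs: str | Sequence[str] | None = None,
        **kwargs,
    ) -> Dict[str, BatchEncoding]:
        encodings: Dict[str, BatchEncoding] = {}
        if queries is not None:
            # Truncate queries to the configured query_length (assumed attribute)
            encodings["query_encoding"] = self(
                queries, truncation=True, max_length=self.query_length, **kwargs
            )
        if docs is not None:
            # Truncate documents to the configured doc_length (assumed attribute)
            encodings["doc_encoding"] = self(
                docs, truncation=True, max_length=self.doc_length, **kwargs
            )
        return encodings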