ColTokenizer
- class lightning_ir.models.col.ColTokenizer(*args, query_length: int = 32, doc_length: int = 512, add_marker_tokens: bool = False, query_expansion: bool = False, attend_to_query_expanded_tokens: bool = False, doc_expansion: bool = False, attend_to_doc_expanded_tokens: bool = False, **kwargs)[source]
Bases: BiEncoderTokenizer

LightningIRTokenizer for Col models.

- __init__(*args, query_length: int = 32, doc_length: int = 512, add_marker_tokens: bool = False, query_expansion: bool = False, attend_to_query_expanded_tokens: bool = False, doc_expansion: bool = False, attend_to_doc_expanded_tokens: bool = False, **kwargs)[source]
Initializes a Col model’s tokenizer. Encodes queries and documents separately. Optionally adds marker tokens to encoded input sequences and expands queries and documents with mask tokens.
- Parameters:
query_length (int, optional) – Maximum query length in number of tokens, defaults to 32
doc_length (int, optional) – Maximum document length in number of tokens, defaults to 512
add_marker_tokens (bool, optional) – Whether to add marker tokens to the query and document input sequences, defaults to False
query_expansion (bool, optional) – Whether to expand queries with mask tokens, defaults to False
attend_to_query_expanded_tokens (bool, optional) – Whether to allow non-expanded query tokens to attend to mask-expanded query tokens, defaults to False
doc_expansion (bool, optional) – Whether to expand documents with mask tokens, defaults to False
attend_to_doc_expanded_tokens (bool, optional) – Whether to allow non-expanded document tokens to attend to mask-expanded document tokens, defaults to False
- Raises:
ValueError – If add_marker_tokens is True and an unsupported tokenizer is used
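A minimal usage sketch (the bert-base-uncased backbone and the particular flag combination are illustrative assumptions, not requirements):

>>> from lightning_ir.models.col import ColTokenizer
>>> tokenizer = ColTokenizer.from_pretrained(
...     "bert-base-uncased",
...     query_length=32,         # queries are padded/truncated to 32 tokens
...     doc_length=256,          # documents are truncated to 256 tokens
...     add_marker_tokens=True,  # prepend a marker token to queries and documents
...     query_expansion=True,    # pad queries with mask tokens up to query_length
... )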
Methods
- __init__(*args[, query_length, doc_length, ...]): Initializes a Col model's tokenizer.
- tokenize_input_sequence(text, input_type, ...): Tokenizes an input sequence.
Attributes
- config_class
Configuration class for the tokenizer.
alias of ColConfig
- property doc_token_id: int | None
The token id of the document token if marker tokens are added.
- Returns:
Token id of the document token
- Return type:
int | None
- classmethod from_pretrained(model_name_or_path: str, *args, **kwargs) → Self
Loads a pretrained tokenizer. Wraps the transformers.PreTrainedTokenizer.from_pretrained method to return a derived LightningIRTokenizer class. See LightningIRTokenizerClassFactory for more details.

>>> # Loading using model class and backbone checkpoint
>>> type(BiEncoderTokenizer.from_pretrained("bert-base-uncased"))
<class 'lightning_ir.base.class_factory.BiEncoderBertTokenizerFast'>
>>> # Loading using base class and backbone checkpoint
>>> type(LightningIRTokenizer.from_pretrained("bert-base-uncased", config=BiEncoderConfig()))
<class 'lightning_ir.base.class_factory.BiEncoderBertTokenizerFast'>
- Parameters:
model_name_or_path (str) – Name or path of the pretrained tokenizer
- Raises:
ValueError – If called on the abstract class LightningIRTokenizer and no config is passed
- Returns:
A derived LightningIRTokenizer consisting of a backbone tokenizer and a LightningIRTokenizer mixin
- Return type:
Self
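For Col models the same mechanism applies; a sketch, assuming the public colbert-ir/colbertv2.0 checkpoint is available and usable as a Col backbone:

>>> from lightning_ir.models.col import ColTokenizer
>>> tokenizer = ColTokenizer.from_pretrained("colbert-ir/colbertv2.0")
>>> # tokenizer settings (query_length, marker tokens, ...) are taken from
>>> # the checkpoint's configuration when one is present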
- property query_token_id: int | None
The token id of the query token if marker tokens are added.
- Returns:
Token id of the query token
- Return type:
int | None
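Both marker token ids are populated only when the tokenizer was created with add_marker_tokens=True; a sketch, assuming a bert-base-uncased backbone:

>>> tokenizer = ColTokenizer.from_pretrained("bert-base-uncased", add_marker_tokens=True)
>>> tokenizer.query_token_id is not None and tokenizer.doc_token_id is not None
True
>>> ColTokenizer.from_pretrained("bert-base-uncased").query_token_id is None
True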
- tokenize(queries: str | Sequence[str] | None = None, docs: str | Sequence[str] | None = None, **kwargs) → Dict[str, BatchEncoding]
Tokenizes queries and documents.
- Parameters:
queries (str | Sequence[str] | None, optional) – Queries to tokenize, defaults to None
docs (str | Sequence[str] | None, optional) – Documents to tokenize, defaults to None
- Returns:
Dictionary of tokenized queries and documents
- Return type:
Dict[str, BatchEncoding]
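A sketch of tokenizing a query and two documents in one call (tokenizer loaded as above); iterating over the returned dictionary avoids hard-coding its key names:

>>> encodings = tokenizer.tokenize(
...     queries="what is information retrieval?",
...     docs=[
...         "Information retrieval is about finding relevant documents.",
...         "Lightning IR builds on PyTorch Lightning.",
...     ],
... )
>>> for name, encoding in encodings.items():
...     print(name, len(encoding.input_ids))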
- tokenize_doc(docs: Sequence[str] | str, *args, **kwargs) → BatchEncoding
Tokenizes input documents.
- Parameters:
docs (Sequence[str] | str) – Document or documents to tokenize
- Returns:
Tokenized documents
- Return type:
BatchEncoding
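Documents can also be tokenized on their own; a sketch (truncation to doc_length and any document expansion follow the settings chosen at construction time):

>>> doc_encoding = tokenizer.tokenize_doc(
...     ["Information retrieval is about finding relevant documents."]
... )
>>> len(doc_encoding.input_ids)  # one encoded sequence per input document
1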
- tokenize_input_sequence(text: Sequence[str] | str, input_type: Literal['query', 'doc'], *args, **kwargs) → BatchEncoding[source]
Tokenizes an input sequence. This method is used to tokenize both queries and documents.
- Parameters:
text (Sequence[str] | str) – Single string or multiple strings to tokenize
input_type (Literal['query', 'doc']) – Whether to apply the query or the document tokenization settings
- Returns:
Tokenized input sequences
- Return type:
BatchEncoding
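Since query and document tokenization both route through this method, input_type selects which length, marker, and expansion settings apply; a sketch:

>>> query_encoding = tokenizer.tokenize_input_sequence(
...     "what is information retrieval?", input_type="query"
... )
>>> doc_encoding = tokenizer.tokenize_input_sequence(
...     ["Information retrieval is about finding relevant documents."],
...     input_type="doc",
... )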