ColTokenizer
- class lightning_ir.models.col.ColTokenizer(*args, query_length: int = 32, doc_length: int = 512, add_marker_tokens: bool = False, query_expansion: bool = False, attend_to_query_expanded_tokens: bool = False, doc_expansion: bool = False, attend_to_doc_expanded_tokens: bool = False, **kwargs)[source]
Bases: BiEncoderTokenizer
LightningIRTokenizer for Col models.
- __init__(*args, query_length: int = 32, doc_length: int = 512, add_marker_tokens: bool = False, query_expansion: bool = False, attend_to_query_expanded_tokens: bool = False, doc_expansion: bool = False, attend_to_doc_expanded_tokens: bool = False, **kwargs)[source]
Initializes a Col model’s tokenizer. Encodes queries and documents separately. Optionally adds marker tokens to encoded input sequences and expands queries and documents with mask tokens.
- Parameters:
query_length (int) – Maximum query length in number of tokens. Defaults to 32.
doc_length (int) – Maximum document length in number of tokens. Defaults to 512.
add_marker_tokens (bool) – Whether to add extra marker tokens [Q] / [D] to queries / documents. Defaults to False.
query_expansion (bool) – Whether to expand queries with mask tokens. Defaults to False.
attend_to_query_expanded_tokens (bool) – Whether to allow query tokens to attend to mask-expanded query tokens. Defaults to False.
doc_expansion (bool) – Whether to expand documents with mask tokens. Defaults to False.
attend_to_doc_expanded_tokens (bool) – Whether to allow document tokens to attend to mask-expanded document tokens. Defaults to False.
- Raises:
ValueError – If add_marker_tokens is True and an unsupported tokenizer is used.
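For example, a minimal usage sketch (the checkpoint name is illustrative, and it assumes the keyword arguments are forwarded to __init__, as is usual for transformers tokenizers):
>>> # a minimal sketch; "bert-base-uncased" is an illustrative backbone
>>> from lightning_ir.models.col import ColTokenizer
>>> tokenizer = ColTokenizer.from_pretrained(
...     "bert-base-uncased",
...     query_length=32,
...     doc_length=512,
...     add_marker_tokens=True,  # prepend [Q] / [D] markers to queries / documents
... )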
Methods
- __init__(*args[, query_length, doc_length, ...]) – Initializes a Col model's tokenizer.
- tokenize_input_sequence(text, input_type, ...) – Tokenizes an input sequence.
Attributes
- config_class
Configuration class for the tokenizer.
alias of ColConfig
- property doc_token_id: int | None
The token id of the document token if marker tokens are added.
- Returns:
Token id of the document token if added, otherwise None.
- classmethod from_pretrained(model_name_or_path: str, *args, **kwargs) → Self
Loads a pretrained tokenizer. Wraps the transformers.PreTrainedTokenizer.from_pretrained method to return a derived LightningIRTokenizer class. See LightningIRTokenizerClassFactory for more details.
>>> # Loading using model class and backbone checkpoint
>>> type(BiEncoderTokenizer.from_pretrained("bert-base-uncased"))
<class 'lightning_ir.base.class_factory.BiEncoderBertTokenizerFast'>
>>> # Loading using base class and backbone checkpoint
>>> type(LightningIRTokenizer.from_pretrained("bert-base-uncased", config=BiEncoderConfig()))
<class 'lightning_ir.base.class_factory.BiEncoderBertTokenizerFast'>
- Parameters:
model_name_or_path (str) – Name or path of the pretrained tokenizer.
- Returns:
A derived LightningIRTokenizer consisting of a backbone tokenizer and a LightningIRTokenizer mixin.
- Return type:
Self
- Raises:
ValueError – If called on the abstract class LightningIRTokenizer and no config is passed.
- property query_token_id: int | None
The token id of the query token if marker tokens are added.
- Returns:
Token id of the query token if added, otherwise None.
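A short sketch covering both marker-token properties (the checkpoint name is illustrative; the exact ids depend on the backbone vocabulary):
>>> from lightning_ir.models.col import ColTokenizer
>>> tokenizer = ColTokenizer.from_pretrained("bert-base-uncased", add_marker_tokens=True)
>>> tokenizer.query_token_id  # id of the [Q] marker; exact value depends on the vocabulary
>>> tokenizer.doc_token_id  # id of the [D] marker
>>> # without add_marker_tokens, both properties return None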
- tokenize(queries: str | Sequence[str] | None = None, docs: str | Sequence[str] | None = None, **kwargs) → Dict[str, BatchEncoding]
Tokenizes queries and documents.
- Parameters:
queries (str | Sequence[str] | None) – Queries to tokenize. Defaults to None.
docs (str | Sequence[str] | None) – Documents to tokenize. Defaults to None.
- Returns:
Dictionary containing tokenized queries and documents.
- Return type:
Dict[str, BatchEncoding]
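For example (a sketch; the key names query_encoding and doc_encoding are an assumption carried over from the bi-encoder tokenizer, and the checkpoint name is illustrative):
>>> from lightning_ir.models.col import ColTokenizer
>>> tokenizer = ColTokenizer.from_pretrained("bert-base-uncased")
>>> encodings = tokenizer.tokenize(
...     queries="what is information retrieval?",
...     docs=["Information retrieval finds relevant documents.", "A second document."],
...     return_tensors="pt",
... )
>>> encodings.keys()  # assumed keys: 'query_encoding' and 'doc_encoding'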
- tokenize_doc(docs: Sequence[str] | str, *args, **kwargs) → BatchEncoding
Tokenizes input documents.
- Parameters:
docs (Sequence[str] | str) – Document or documents to tokenize.
- Returns:
Tokenized documents.
- Return type:
BatchEncoding
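A document-only sketch (it assumes extra keyword arguments such as return_tensors are passed through to the backbone tokenizer; documents are truncated to doc_length):
>>> from lightning_ir.models.col import ColTokenizer
>>> tokenizer = ColTokenizer.from_pretrained("bert-base-uncased", doc_length=512)
>>> doc_encoding = tokenizer.tokenize_doc(
...     ["Information retrieval finds relevant documents."],
...     return_tensors="pt",
... )
>>> doc_encoding.input_ids.shape  # (num_docs, sequence_length)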
- tokenize_input_sequence(text: Sequence[str] | str, input_type: Literal['query', 'doc'], *args, **kwargs) → BatchEncoding[source]
Tokenizes an input sequence. This method is used to tokenize both queries and documents.
- Parameters:
text (Sequence[str] | str) – Input text to tokenize.
input_type (Literal["query", "doc"]) – Type of input, either “query” or “doc”.
- Returns:
Tokenized input sequences.
- Return type:
BatchEncoding
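For example (a sketch; input_type selects between the query and document settings (maximum length, expansion, marker token) documented for __init__, and the checkpoint name is illustrative):
>>> from lightning_ir.models.col import ColTokenizer
>>> tokenizer = ColTokenizer.from_pretrained("bert-base-uncased")
>>> query_encoding = tokenizer.tokenize_input_sequence(
...     "what is information retrieval?", input_type="query"
... )
>>> doc_encoding = tokenizer.tokenize_input_sequence(
...     ["Information retrieval finds relevant documents."], input_type="doc"
... )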