ColTokenizer

class lightning_ir.models.bi_encoders.col.ColTokenizer(*args, query_length: int | None = 32, doc_length: int | None = 512, add_marker_tokens: bool = False, query_expansion: bool = False, attend_to_query_expanded_tokens: bool = False, doc_expansion: bool = False, attend_to_doc_expanded_tokens: bool = False, **kwargs)[source]

Bases: BiEncoderTokenizer

LightningIRTokenizer for Col models.

__init__(*args, query_length: int | None = 32, doc_length: int | None = 512, add_marker_tokens: bool = False, query_expansion: bool = False, attend_to_query_expanded_tokens: bool = False, doc_expansion: bool = False, attend_to_doc_expanded_tokens: bool = False, **kwargs)[source]

Initializes a Col model’s tokenizer. Encodes queries and documents separately. Optionally adds marker tokens to encoded input sequences and expands queries and documents with mask tokens.

Parameters:
  • query_length (int | None) – Maximum number of tokens per query. If None, queries are not truncated. Defaults to 32.

  • doc_length (int | None) – Maximum number of tokens per document. If None, documents are not truncated. Defaults to 512.

  • add_marker_tokens (bool) – Whether to add extra marker tokens [Q] / [D] to queries / documents. Defaults to False.

  • query_expansion (bool) – Whether to expand queries with mask tokens. Defaults to False.

  • attend_to_query_expanded_tokens (bool) – Whether to allow query tokens to attend to mask-expanded query tokens. Defaults to False.

  • doc_expansion (bool) – Whether to expand documents with mask tokens. Defaults to False.

  • attend_to_doc_expanded_tokens (bool) – Whether to allow document tokens to attend to mask-expanded document tokens. Defaults to False.

Raises:

ValueError – If add_marker_tokens is True and an unsupported tokenizer is used.
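A minimal usage sketch (not taken from the official docs): loading the tokenizer via the HuggingFace-style from_pretrained entry point that Lightning IR tokenizers inherit. The checkpoint name and the query_expansion setting are illustrative assumptions; any checkpoint whose tokenizer defines a mask token should work.

    from lightning_ir.models.bi_encoders.col import ColTokenizer

    # Illustrative checkpoint; assumes a BERT-style tokenizer with a mask token.
    tokenizer = ColTokenizer.from_pretrained(
        "bert-base-uncased",
        query_length=32,
        doc_length=512,
        query_expansion=True,  # pad queries to query_length with mask tokens
    )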

Methods

__init__(*args[, query_length, doc_length, ...])
    Initializes a Col model's tokenizer.

tokenize_input_sequence(text, input_type, ...)
    Tokenizes an input sequence.

Attributes

config_class
    Configuration class for the tokenizer. Alias of ColConfig.
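As a quick illustration (assuming ColConfig is importable from the same module), the alias can be verified directly:

    from lightning_ir.models.bi_encoders.col import ColConfig, ColTokenizer

    # config_class is an alias of ColConfig, not a separate class.
    assert ColTokenizer.config_class is ColConfig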

tokenize_input_sequence(text: Sequence[str] | str, input_type: Literal['query', 'doc'], *args, **kwargs) → BatchEncoding[source]

Tokenizes an input sequence. This method is used to tokenize both queries and documents.

Parameters:
  • text (Sequence[str] | str) – Input text to tokenize.

  • input_type (Literal["query", "doc"]) – Type of input, either “query” or “doc”.

Returns:

Tokenized input sequences.

Return type:

BatchEncoding
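A hedged sketch of tokenizing queries and documents separately; the checkpoint name and example texts are made up, and the exact token ids depend on the underlying checkpoint:

    from lightning_ir.models.bi_encoders.col import ColTokenizer

    # Illustrative checkpoint name; see the sketch above.
    tokenizer = ColTokenizer.from_pretrained("bert-base-uncased")

    # Queries and documents are encoded with their own length limits.
    query_encoding = tokenizer.tokenize_input_sequence(
        "what is a col model?", input_type="query"
    )
    doc_encoding = tokenizer.tokenize_input_sequence(
        [
            "Col models compute one embedding per token.",
            "Scores aggregate token-level similarities.",
        ],
        input_type="doc",
    )

    # The returned BatchEncoding supports dict-style access to the encoded fields.
    print(query_encoding["input_ids"])
    print(doc_encoding["input_ids"])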