BiEncoderTokenizer

class lightning_ir.bi_encoder.bi_encoder_tokenizer.BiEncoderTokenizer(*args, query_length: int = 32, doc_length: int = 512, add_marker_tokens: bool = False, **kwargs)[source]

Bases: LightningIRTokenizer

__init__(*args, query_length: int = 32, doc_length: int = 512, add_marker_tokens: bool = False, **kwargs)[source]

LightningIRTokenizer for bi-encoder models. Encodes queries and documents separately. Optionally adds marker tokens to the encoded input sequences.

Parameters:
  • query_length (int, optional) – Maximum query length in number of tokens, defaults to 32

  • doc_length (int, optional) – Maximum document length in number of tokens, defaults to 512

  • add_marker_tokens (bool, optional) – Whether to add marker tokens to the query and document input sequences, defaults to False

Raises:

ValueError – If add_marker_tokens is True and an unsupported tokenizer is used
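
A minimal usage sketch (assuming BiEncoderTokenizer is re-exported at the package root; bert-base-uncased serves as an example backbone checkpoint):

>>> from lightning_ir import BiEncoderTokenizer
>>> tokenizer = BiEncoderTokenizer.from_pretrained(
...     "bert-base-uncased", query_length=32, doc_length=512, add_marker_tokens=True
... )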

Methods

__init__(*args[, query_length, doc_length, ...])

LightningIRTokenizer for bi-encoder models.

tokenize([queries, docs])

Tokenizes queries and documents.

tokenize_doc(docs, *args, **kwargs)

Tokenizes input documents.

tokenize_input_sequence(text, input_type, ...)

Tokenizes an input sequence.

tokenize_query(queries, *args, **kwargs)

Tokenizes input queries.

Attributes

DOC_TOKEN

Token to mark a document sequence.

QUERY_TOKEN

Token to mark a query sequence.

doc_token_id

The token id of the document token if marker tokens are added.

query_token_id

The token id of the query token if marker tokens are added.

DOC_TOKEN: str = '[DOC]'

Token to mark a document sequence.

QUERY_TOKEN: str = '[QUE]'

Token to mark a query sequence.
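
With add_marker_tokens=True the markers are presumably registered as additional special tokens in the backbone vocabulary, so their ids can be recovered directly (a sketch, continuing the instantiation above):

>>> tokenizer.query_token_id == tokenizer.convert_tokens_to_ids(BiEncoderTokenizer.QUERY_TOKEN)
True
>>> tokenizer.doc_token_id == tokenizer.convert_tokens_to_ids(BiEncoderTokenizer.DOC_TOKEN)
True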

config_class

Configuration class for the tokenizer.

alias of BiEncoderConfig

property doc_token_id: int | None

The token id of the document token if marker tokens are added.

Returns:

Token id of the document token if marker tokens are added, otherwise None

Return type:

int | None

classmethod from_pretrained(model_name_or_path: str, *args, **kwargs) Self

Loads a pretrained tokenizer. Wraps the transformers.PreTrainedTokenizer.from_pretrained method to return a derived LightningIRTokenizer class. See LightningIRTokenizerClassFactory for more details.

>>> # Loading using model class and backbone checkpoint
>>> type(BiEncoderTokenizer.from_pretrained("bert-base-uncased"))
<class 'lightning_ir.base.class_factory.BiEncoderBertTokenizerFast'>
>>> # Loading using base class and backbone checkpoint
>>> type(LightningIRTokenizer.from_pretrained("bert-base-uncased", config=BiEncoderConfig()))
<class 'lightning_ir.base.class_factory.BiEncoderBertTokenizerFast'>
Parameters:

model_name_or_path (str) – Name or path of the pretrained tokenizer

Raises:

ValueError – If called on the abstract class LightningIRTokenizer and no config is passed

Returns:

A derived LightningIRTokenizer consisting of a backbone tokenizer and a LightningIRTokenizer mixin

Return type:

LightningIRTokenizer

property query_token_id: int | None

The token id of the query token if marker tokens are added.

Returns:

Token id of the query token if marker tokens are added, otherwise None

Return type:

int | None

tokenize(queries: str | Sequence[str] | None = None, docs: str | Sequence[str] | None = None, **kwargs) Dict[str, BatchEncoding][source]

Tokenizes queries and documents.

Parameters:
  • queries (str | Sequence[str] | None, optional) – Queries to tokenize, defaults to None

  • docs (str | Sequence[str] | None, optional) – Documents to tokenize, defaults to None

Returns:

Dictionary of tokenized queries and documents

Return type:

Dict[str, BatchEncoding]
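
A sketch of tokenizing a query and a document in one call, continuing the sketch above; the key names of the returned dictionary are an assumption:

>>> encodings = tokenizer.tokenize(
...     queries="what is a bi-encoder?",
...     docs="A bi-encoder encodes queries and documents separately.",
... )
>>> sorted(encodings)  # assumed key names
['doc_encoding', 'query_encoding']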

tokenize_doc(docs: Sequence[str] | str, *args, **kwargs) BatchEncoding[source]

Tokenizes input documents.

Parameters:

docs (Sequence[str] | str) – Document or documents to tokenize

Returns:

Tokenized documents

Return type:

BatchEncoding
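
A sketch of tokenizing a batch of documents, continuing the sketch above; truncation to doc_length is assumed to be applied internally:

>>> doc_encoding = tokenizer.tokenize_doc(
...     ["Lightning IR builds on PyTorch Lightning.", "Bi-encoders embed queries and documents independently."]
... )
>>> len(doc_encoding["input_ids"])  # one encoded sequence per input document
2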

tokenize_input_sequence(text: Sequence[str] | str, input_type: Literal['query', 'doc'], *args, **kwargs) BatchEncoding[source]

Tokenizes an input sequence. This method is used to tokenize both queries and documents.

Parameters:
  • text (Sequence[str] | str) – Single string or multiple strings to tokenize

  • input_type (Literal['query', 'doc']) – Whether to tokenize the input as a query or a document

Returns:

Tokenized input sequences

Return type:

BatchEncoding
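
tokenize_query and tokenize_doc presumably delegate to this method; calling it directly requires passing the input type explicitly (a sketch, continuing from above):

>>> query_encoding = tokenizer.tokenize_input_sequence("what is a bi-encoder?", "query")
>>> doc_encoding = tokenizer.tokenize_input_sequence(["a first document", "a second document"], "doc")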

tokenize_query(queries: Sequence[str] | str, *args, **kwargs) BatchEncoding[source]

Tokenizes input queries.

Parameters:

queries (Sequence[str] | str) – Query or queries to tokenize

Returns:

Tokenized queries

Return type:

BatchEncoding
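
A sketch checking that the query marker appears in an encoded query (assumes add_marker_tokens=True as in the instantiation above; the marker's exact position depends on the backbone's special-token handling):

>>> query_encoding = tokenizer.tokenize_query(["what is a bi-encoder?"])
>>> tokens = tokenizer.convert_ids_to_tokens(query_encoding["input_ids"][0])
>>> BiEncoderTokenizer.QUERY_TOKEN in tokens
True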