BiEncoderTokenizer

class lightning_ir.bi_encoder.bi_encoder_tokenizer.BiEncoderTokenizer(*args, query_length: int | None = 32, doc_length: int | None = 512, add_marker_tokens: bool = False, **kwargs)[source]

Bases: LightningIRTokenizer

__init__(*args, query_length: int | None = 32, doc_length: int | None = 512, add_marker_tokens: bool = False, **kwargs)[source]

LightningIRTokenizer for bi-encoder models. Encodes queries and documents separately. Optionally adds marker tokens to the encoded input sequences.

Parameters:
  • query_length (int | None) – Maximum number of tokens per query. If None, queries are not truncated. Defaults to 32.

  • doc_length (int | None) – Maximum number of tokens per document. If None, documents are not truncated. Defaults to 512.

  • add_marker_tokens (bool) – Whether to add marker tokens to the query and document input sequences. Defaults to False.

Raises:

ValueError – If add_marker_tokens is True and an unsupported tokenizer is used.
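
Example (a minimal sketch; the checkpoint name is an assumption, and loading via from_pretrained mirrors the Hugging Face tokenizer API):

>>> from lightning_ir.bi_encoder.bi_encoder_tokenizer import BiEncoderTokenizer
>>> tokenizer = BiEncoderTokenizer.from_pretrained(
...     "bert-base-uncased", query_length=32, doc_length=512, add_marker_tokens=True
... )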

Methods

__init__(*args[, query_length, doc_length, ...])

LightningIRTokenizer for bi-encoder models.

tokenize([queries, docs])

Tokenizes queries and documents.

tokenize_doc(docs, *args, **kwargs)

Tokenizes input documents.

tokenize_input_sequence(text, input_type, ...)

Tokenizes an input sequence.

tokenize_query(queries, *args, **kwargs)

Tokenizes input queries.

Attributes

DOC_TOKEN

Token to mark a document sequence.

QUERY_TOKEN

Token to mark a query sequence.

doc_token_id

The token id of the document token if marker tokens are added.

query_token_id

The token id of the query token if marker tokens are added.

DOC_TOKEN: str = '[DOC]'

Token to mark a document sequence.

QUERY_TOKEN: str = '[QUE]'

Token to mark a query sequence.

config_class

Configuration class for the tokenizer.

alias of BiEncoderConfig

property doc_token_id: int | None

The token id of the document token if marker tokens are added.

Returns:

Token id of the document token if added, otherwise None.

property query_token_id: int | None

The token id of the query token if marker tokens are added.

Returns:

Token id of the query token if added, otherwise None.
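
Example (illustrative sketch; the checkpoint name and the use of from_pretrained are assumptions carried over from the Hugging Face tokenizer API):

>>> from lightning_ir.bi_encoder.bi_encoder_tokenizer import BiEncoderTokenizer
>>> tokenizer = BiEncoderTokenizer.from_pretrained("bert-base-uncased", add_marker_tokens=True)
>>> tokenizer.QUERY_TOKEN, tokenizer.DOC_TOKEN
('[QUE]', '[DOC]')
>>> tokenizer.query_token_id is not None and tokenizer.doc_token_id is not None
True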

tokenize(queries: str | Sequence[str] | None = None, docs: str | Sequence[str] | None = None, **kwargs) Dict[str, BatchEncoding][source]

Tokenizes queries and documents.

Parameters:
  • queries (str | Sequence[str] | None) – Queries to tokenize. Defaults to None.

  • docs (str | Sequence[str] | None) – Documents to tokenize. Defaults to None.

Returns:

Dictionary containing tokenized queries and documents.

Return type:

Dict[str, BatchEncoding]
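
Example (a minimal sketch; the checkpoint name is an assumption, and the dictionary keys are not spelled out here because they depend on which inputs are provided):

>>> from lightning_ir.bi_encoder.bi_encoder_tokenizer import BiEncoderTokenizer
>>> tokenizer = BiEncoderTokenizer.from_pretrained("bert-base-uncased")
>>> encodings = tokenizer.tokenize(
...     queries=["what is information retrieval?"],
...     docs=["Information retrieval is the task of finding relevant documents."],
... )
>>> isinstance(encodings, dict)
True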

tokenize_doc(docs: Sequence[str] | str, *args, **kwargs) BatchEncoding[source]

Tokenizes input documents.

Parameters:

docs (Sequence[str] | str) – Document or documents to tokenize.

Returns:

Tokenized documents.

Return type:

BatchEncoding
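
Example (illustrative; the checkpoint name is an assumption):

>>> from lightning_ir.bi_encoder.bi_encoder_tokenizer import BiEncoderTokenizer
>>> tokenizer = BiEncoderTokenizer.from_pretrained("bert-base-uncased", doc_length=512)
>>> doc_encoding = tokenizer.tokenize_doc(
...     ["Information retrieval finds relevant documents for a query."]
... )
>>> # documents longer than doc_length are truncated to doc_length tokens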

tokenize_input_sequence(text: Sequence[str] | str, input_type: 'query' | 'doc', *args, **kwargs) BatchEncoding[source]

Tokenizes an input sequence. This method is used to tokenize both queries and documents.

Parameters:
  • text (Sequence[str] | str) – Input text to tokenize.

  • input_type (Literal["query", "doc"]) – Type of input, either “query” or “doc”.

Returns:

Tokenized input sequences.

Return type:

BatchEncoding
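
Example (illustrative; the checkpoint name is an assumption; input_type is passed positionally, following the signature above):

>>> from lightning_ir.bi_encoder.bi_encoder_tokenizer import BiEncoderTokenizer
>>> tokenizer = BiEncoderTokenizer.from_pretrained("bert-base-uncased")
>>> query_encoding = tokenizer.tokenize_input_sequence(["what is information retrieval?"], "query")
>>> doc_encoding = tokenizer.tokenize_input_sequence(["Information retrieval finds relevant documents."], "doc")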

tokenize_query(queries: Sequence[str] | str, *args, **kwargs) BatchEncoding[source]

Tokenizes input queries.

Parameters:

queries (Sequence[str] | str) – Query or queries to tokenize.

Returns:

Tokenized queries.

Return type:

BatchEncoding
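
Example (illustrative; the checkpoint name is an assumption):

>>> from lightning_ir.bi_encoder.bi_encoder_tokenizer import BiEncoderTokenizer
>>> tokenizer = BiEncoderTokenizer.from_pretrained("bert-base-uncased", query_length=32)
>>> query_encoding = tokenizer.tokenize_query("what is information retrieval?")
>>> # queries longer than query_length are truncated to query_length tokens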