SpladeTokenizer

class lightning_ir.models.bi_encoders.splade.SpladeTokenizer(*args, query_length: int | None = 32, doc_length: int | None = 512, add_marker_tokens: bool = False, query_weighting: 'contextualized' | 'static' = 'contextualized', doc_weighting: 'contextualized' | 'static' = 'contextualized', **kwargs)[source]

Bases: BiEncoderTokenizer

Tokenizer class for SPLADE models.

__init__(*args, query_length: int | None = 32, doc_length: int | None = 512, add_marker_tokens: bool = False, query_weighting: 'contextualized' | 'static' = 'contextualized', doc_weighting: 'contextualized' | 'static' = 'contextualized', **kwargs)[source]

LightningIRTokenizer for bi-encoder models. Encodes queries and documents separately. Optionally adds marker tokens to the encoded input sequences.

Parameters:
  • query_length (int | None) – Maximum number of tokens per query. If None, does not truncate. Defaults to 32.

  • doc_length (int | None) – Maximum number of tokens per document. If None, does not truncate. Defaults to 512.

  • add_marker_tokens (bool) – Whether to add marker tokens to the query and document input sequences. Defaults to False.

Raises:

ValueError – If add_marker_tokens is True and an unsupported tokenizer is used.
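
Example

A minimal usage sketch. The checkpoint name is only an illustrative assumption, and from_pretrained together with its keyword arguments is inherited from the LightningIRTokenizer / Hugging Face tokenizer API:

    from lightning_ir.models.bi_encoders.splade import SpladeTokenizer

    # Assumption: "naver/splade-v3" stands in for any compatible checkpoint.
    tokenizer = SpladeTokenizer.from_pretrained(
        "naver/splade-v3",
        query_length=32,        # truncate queries to 32 tokens
        doc_length=512,         # truncate documents to 512 tokens
        add_marker_tokens=False,
    )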

Methods

__init__(*args[, query_length, doc_length, ...])

LightningIRTokenizer for bi-encoder models.

tokenize_input_sequence(text, input_type, ...)

Tokenizes an input sequence.

Attributes

config_class

Configuration class for a SPLADE model.

alias of SpladeConfig

tokenize_input_sequence(text: Sequence[str] | str, input_type: 'query' | 'doc', *args, **kwargs) BatchEncoding[source]

Tokenizes an input sequence. This method is used to tokenize both queries and documents.

Parameters:
  • text (Sequence[str] | str) – Input text to tokenize.

  • input_type (Literal["query", "doc"]) – Type of input, either “query” or “doc”.

Returns:

Tokenized input sequences.

Return type:

BatchEncoding
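
A minimal sketch of tokenizing a query and a batch of documents with this method, assuming a tokenizer instance created as in the example above; the returned BatchEncoding follows the Hugging Face API:

    # Queries and documents are tokenized separately, each with its own length limit.
    query_encoding = tokenizer.tokenize_input_sequence(
        "what is neural sparse retrieval?", input_type="query"
    )
    doc_encoding = tokenizer.tokenize_input_sequence(
        ["SPLADE is a sparse neural retrieval model.", "A second document."],
        input_type="doc",
    )
    print(query_encoding.input_ids)  # token ids, truncated to query_length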