BiEncoderTokenizer
- class lightning_ir.bi_encoder.bi_encoder_tokenizer.BiEncoderTokenizer(*args, query_length: int | None = 32, doc_length: int | None = 512, add_marker_tokens: bool = False, **kwargs)[source]
Bases: LightningIRTokenizer
- __init__(*args, query_length: int | None = 32, doc_length: int | None = 512, add_marker_tokens: bool = False, **kwargs)[source]
LightningIRTokenizer for bi-encoder models. Encodes queries and documents separately. Optionally adds marker tokens to the encoded input sequences.
- Parameters:
query_length (int | None) – Maximum number of tokens per query. If None, does not truncate. Defaults to 32.
doc_length (int | None) – Maximum number of tokens per document. If None, does not truncate. Defaults to 512.
add_marker_tokens (bool) – Whether to add marker tokens to the query and document input sequences. Defaults to False.
- Raises:
ValueError – If add_marker_tokens is True and an unsupported tokenizer is used.
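Example (a minimal sketch; the checkpoint name is illustrative, and any model id or local path accepted by from_pretrained should work):

    from lightning_ir.bi_encoder.bi_encoder_tokenizer import BiEncoderTokenizer

    # Load the tokenizer from a Hugging Face checkpoint (illustrative model id)
    # and enable marker tokens for queries and documents.
    tokenizer = BiEncoderTokenizer.from_pretrained(
        "bert-base-uncased",
        query_length=32,
        doc_length=512,
        add_marker_tokens=True,
    )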
Methods
__init__(*args[, query_length, doc_length, ...]): LightningIRTokenizer for bi-encoder models.
tokenize([queries, docs]): Tokenizes queries and documents.
tokenize_doc(docs, *args, **kwargs): Tokenizes input documents.
tokenize_input_sequence(text, input_type, ...): Tokenizes an input sequence.
tokenize_query(queries, *args, **kwargs): Tokenizes input queries.
Attributes
DOC_TOKEN: Token to mark a document sequence.
QUERY_TOKEN: Token to mark a query sequence.
doc_token_id: The token id of the document token if marker tokens are added.
query_token_id: The token id of the query token if marker tokens are added.
- config_class
Configuration class for the tokenizer.
alias of BiEncoderConfig
- property doc_token_id: int | None
The token id of the document token if marker tokens are added.
- Returns:
Token id of the document token if added, otherwise None.
- property query_token_id: int | None
The token id of the query token if marker tokens are added.
- Returns:
Token id of the query token if added, otherwise None.
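Example (a sketch assuming a tokenizer created with add_marker_tokens=True; the checkpoint name is illustrative):

    from lightning_ir.bi_encoder.bi_encoder_tokenizer import BiEncoderTokenizer

    tokenizer = BiEncoderTokenizer.from_pretrained(
        "bert-base-uncased", add_marker_tokens=True
    )
    # Both properties return ints here; they would return None for a
    # tokenizer constructed with add_marker_tokens=False.
    print(tokenizer.query_token_id, tokenizer.doc_token_id)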
- tokenize(queries: str | Sequence[str] | None = None, docs: str | Sequence[str] | None = None, **kwargs) Dict[str, BatchEncoding][source]
Tokenizes queries and documents.
- Parameters:
queries (str | Sequence[str] | None) – Queries to tokenize. Defaults to None.
docs (str | Sequence[str] | None) – Documents to tokenize. Defaults to None.
- Returns:
Dictionary containing tokenized queries and documents.
- Return type:
Dict[str, BatchEncoding]
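Example (a sketch; the keys of the returned dictionary are not specified above, so iterate over them rather than hard-coding names):

    from lightning_ir.bi_encoder.bi_encoder_tokenizer import BiEncoderTokenizer

    tokenizer = BiEncoderTokenizer.from_pretrained("bert-base-uncased")
    encodings = tokenizer.tokenize(
        queries=["what is information retrieval?"],
        docs=["Information retrieval finds relevant documents for a query."],
    )
    # One BatchEncoding per input kind.
    for name, encoding in encodings.items():
        print(name, encoding["input_ids"])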
- tokenize_doc(docs: Sequence[str] | str, *args, **kwargs) BatchEncoding[source]
Tokenizes input documents.
- Parameters:
docs (Sequence[str] | str) – Document or documents to tokenize.
- Returns:
Tokenized documents.
- Return type:
BatchEncoding
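Example (a sketch with an illustrative checkpoint):

    from lightning_ir.bi_encoder.bi_encoder_tokenizer import BiEncoderTokenizer

    tokenizer = BiEncoderTokenizer.from_pretrained("bert-base-uncased")
    # Documents longer than doc_length (512 tokens by default) are truncated.
    doc_encoding = tokenizer.tokenize_doc(
        ["Information retrieval finds relevant documents for a query."]
    )
    print(doc_encoding["input_ids"])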
- tokenize_input_sequence(text: Sequence[str] | str, input_type: 'query' | 'doc', *args, **kwargs) BatchEncoding[source]
Tokenizes an input sequence. This method is used to tokenize both queries and documents.
- Parameters:
text (Sequence[str] | str) – Input text to tokenize.
input_type (Literal["query", "doc"]) – Type of input, either “query” or “doc”.
- Returns:
Tokenized input sequences.
- Return type:
BatchEncoding
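Example (a sketch; per the description above, this is the shared path behind query and document tokenization, with input_type selecting the length limit and, if enabled, the marker token):

    from lightning_ir.bi_encoder.bi_encoder_tokenizer import BiEncoderTokenizer

    tokenizer = BiEncoderTokenizer.from_pretrained("bert-base-uncased")
    # Tokenize the same text once as a query and once as a document.
    query_encoding = tokenizer.tokenize_input_sequence("neural ranking", "query")
    doc_encoding = tokenizer.tokenize_input_sequence("neural ranking", "doc")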
- tokenize_query(queries: Sequence[str] | str, *args, **kwargs) BatchEncoding[source]
Tokenizes input queries.
- Parameters:
queries (Sequence[str] | str) – Query or queries to tokenize.
- Returns:
Tokenized queries.
- Return type:
BatchEncoding
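Example (a sketch with an illustrative checkpoint):

    from lightning_ir.bi_encoder.bi_encoder_tokenizer import BiEncoderTokenizer

    tokenizer = BiEncoderTokenizer.from_pretrained("bert-base-uncased")
    # Queries longer than query_length (32 tokens by default) are truncated.
    query_encoding = tokenizer.tokenize_query("what is information retrieval?")
    print(query_encoding["input_ids"])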