ColConfig

class lightning_ir.models.bi_encoders.col.ColConfig(query_length: int | None = 32, doc_length: int | None = 512, similarity_function: 'cosine' | 'dot' = 'dot', normalization_strategy: 'l2' | None = None, add_marker_tokens: bool = False, query_mask_scoring_tokens: Sequence[str] | 'punctuation' | None = None, doc_mask_scoring_tokens: Sequence[str] | 'punctuation' | None = None, query_aggregation_function: 'sum' | 'mean' | 'max' = 'sum', doc_aggregation_function: 'sum' | 'mean' | 'max' = 'max', embedding_dim: int = 128, projection: 'linear' | 'linear_no_bias' = 'linear', query_expansion: bool = False, attend_to_query_expanded_tokens: bool = False, doc_expansion: bool = False, attend_to_doc_expanded_tokens: bool = False, k_train: int | None = None, **kwargs)

Bases: MultiVectorBiEncoderConfig

Configuration class for a Col model.

__init__(query_length: int | None = 32, doc_length: int | None = 512, similarity_function: 'cosine' | 'dot' = 'dot', normalization_strategy: 'l2' | None = None, add_marker_tokens: bool = False, query_mask_scoring_tokens: Sequence[str] | 'punctuation' | None = None, doc_mask_scoring_tokens: Sequence[str] | 'punctuation' | None = None, query_aggregation_function: 'sum' | 'mean' | 'max' = 'sum', doc_aggregation_function: 'sum' | 'mean' | 'max' = 'max', embedding_dim: int = 128, projection: 'linear' | 'linear_no_bias' = 'linear', query_expansion: bool = False, attend_to_query_expanded_tokens: bool = False, doc_expansion: bool = False, attend_to_doc_expanded_tokens: bool = False, k_train: int | None = None, **kwargs)

A Col model encodes queries and documents separately and computes a late-interaction score between the query and document token embeddings. The aggregation behavior of the late-interaction function is controlled by the query_aggregation_function and doc_aggregation_function arguments. Token embeddings are projected down to embedding_dim dimensions by a linear layer. Queries and documents can optionally be expanded with mask tokens, and a set of tokens can optionally be ignored during scoring.
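The late-interaction scoring described above can be illustrated with a minimal, library-independent sketch. It is not the lightning_ir implementation: the helper names (late_interaction_score, dot, AGGREGATIONS) are hypothetical, embeddings are plain Python lists, and dot-product similarity is assumed. Only the two aggregation arguments mirror the documented defaults.

```python
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]

# Hypothetical lookup mirroring the documented 'sum' | 'mean' | 'max' options.
AGGREGATIONS: Dict[str, Callable[[List[float]], float]] = {
    "sum": sum,
    "mean": lambda xs: sum(xs) / len(xs),
    "max": max,
}


def dot(u: Vector, v: Vector) -> float:
    """Dot-product similarity between two token embeddings."""
    return sum(a * b for a, b in zip(u, v))


def late_interaction_score(
    query_embeddings: Sequence[Vector],
    doc_embeddings: Sequence[Vector],
    query_aggregation_function: str = "sum",
    doc_aggregation_function: str = "max",
) -> float:
    """Aggregate token-level similarities over document tokens, then query tokens."""
    aggregate_doc = AGGREGATIONS[doc_aggregation_function]
    aggregate_query = AGGREGATIONS[query_aggregation_function]
    # One value per query token: similarities to all document tokens, aggregated.
    per_query_token = [
        aggregate_doc([dot(q, d) for d in doc_embeddings])
        for q in query_embeddings
    ]
    # Collapse the per-query-token values into a single relevance score.
    return aggregate_query(per_query_token)


# Toy embeddings: two query tokens and three document tokens in two dimensions.
query = [[1.0, 0.0], [0.0, 1.0]]
doc = [[0.5, 0.5], [1.0, 0.0], [0.0, 2.0]]
score = late_interaction_score(query, doc)
```

With the default arguments, each query token keeps its best-matching document token and the per-token maxima are summed. Changing either aggregation argument changes only the corresponding reduction step.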

Parameters:

query_length (int | None) – Maximum number of tokens per query. If None, queries are not truncated. Defaults to 32.
doc_length (int | None) – Maximum number of tokens per document. If None, documents are not truncated. Defaults to 512.
similarity_function (Literal["cosine", "dot"]) – Similarity function to compute scores between query and document embeddings. Defaults to “dot”.
normalization_strategy (Literal['l2'] | None) – Whether and how to normalize query and document embeddings. Defaults to None.
add_marker_tokens (bool) – Whether to add extra marker tokens [Q] / [D] to queries / documents. Defaults to False.
query_mask_scoring_tokens (Sequence[str] | Literal["punctuation"] | None) – Whether and which query tokens to ignore during scoring. Defaults to None.
doc_mask_scoring_tokens (Sequence[str] | Literal["punctuation"] | None) – Whether and which document tokens to ignore during scoring. Defaults to None.
query_aggregation_function (Literal["sum", "mean", "max"]) – How to aggregate similarity scores over query tokens. Defaults to “sum”.
doc_aggregation_function (Literal["sum", "mean", "max"]) – How to aggregate similarity scores over document tokens. Defaults to “max”.
embedding_dim (int) – The output embedding dimension. Defaults to 128.
projection (Literal["linear", "linear_no_bias"]) – How to project the output embeddings. If set to “linear_no_bias”, the projection layer has no bias term. Defaults to “linear”.
query_expansion (bool) – Whether to expand queries with mask tokens. Defaults to False.
attend_to_query_expanded_tokens (bool) – Whether to allow query tokens to attend to the mask tokens added by query expansion. Defaults to False.
doc_expansion (bool) – Whether to expand documents with mask tokens. Defaults to False.
attend_to_doc_expanded_tokens (bool) – Whether to allow document tokens to attend to the mask tokens added by document expansion. Defaults to False.
k_train (int | None) – Whether to use XTR’s in-batch token retrieval during training and how many top-k document tokens to use. Defaults to None.
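The effect of query_mask_scoring_tokens and doc_mask_scoring_tokens can be pictured with a small sketch. The helper scoring_mask is hypothetical and is not part of lightning_ir. It also assumes that “punctuation” corresponds to the characters in Python's string.punctuation, which this page does not confirm.

```python
import string
from typing import List, Optional, Sequence, Union


def scoring_mask(
    tokens: Sequence[str],
    mask_scoring_tokens: Optional[Union[Sequence[str], str]] = None,
) -> List[bool]:
    """Return True for every token that should take part in scoring."""
    if mask_scoring_tokens is None:
        # Nothing is ignored: all tokens contribute to the score.
        return [True] * len(tokens)
    if mask_scoring_tokens == "punctuation":
        # Assumption: punctuation means the ASCII punctuation characters.
        ignored = set(string.punctuation)
    else:
        # An explicit sequence of token strings to ignore.
        ignored = set(mask_scoring_tokens)
    return [token not in ignored for token in tokens]


doc_tokens = ["late", "interaction", ",", "explained", "."]
mask = scoring_mask(doc_tokens, "punctuation")
```

Tokens whose mask entry is False would be left out of the similarity aggregation, so they neither add to a sum nor win a max.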

Methods

__init__([query_length, doc_length, ...])

A Col model encodes queries and documents separately and computes a late interaction score between the query and document embeddings.

Attributes

model_type

Model type for a Col model.

model_type: str = 'col'

Model type for a Col model.