ColConfig
- class lightning_ir.models.col.ColConfig(query_length: int = 32, doc_length: int = 512, similarity_function: Literal['cosine', 'dot'] = 'dot', normalize: bool = False, add_marker_tokens: bool = False, query_mask_scoring_tokens: Sequence[str] | Literal['punctuation'] | None = None, doc_mask_scoring_tokens: Sequence[str] | Literal['punctuation'] | None = None, query_aggregation_function: Literal['sum', 'mean', 'max', 'harmonic_mean'] = 'sum', doc_aggregation_function: Literal['sum', 'mean', 'max', 'harmonic_mean'] = 'max', embedding_dim: int = 128, projection: Literal['linear', 'linear_no_bias'] = 'linear', query_expansion: bool = False, attend_to_query_expanded_tokens: bool = False, doc_expansion: bool = False, attend_to_doc_expanded_tokens: bool = False, **kwargs)
Bases: MultiVectorBiEncoderConfig
Configuration class for a Col model.
- __init__(query_length: int = 32, doc_length: int = 512, similarity_function: Literal['cosine', 'dot'] = 'dot', normalize: bool = False, add_marker_tokens: bool = False, query_mask_scoring_tokens: Sequence[str] | Literal['punctuation'] | None = None, doc_mask_scoring_tokens: Sequence[str] | Literal['punctuation'] | None = None, query_aggregation_function: Literal['sum', 'mean', 'max', 'harmonic_mean'] = 'sum', doc_aggregation_function: Literal['sum', 'mean', 'max', 'harmonic_mean'] = 'max', embedding_dim: int = 128, projection: Literal['linear', 'linear_no_bias'] = 'linear', query_expansion: bool = False, attend_to_query_expanded_tokens: bool = False, doc_expansion: bool = False, attend_to_doc_expanded_tokens: bool = False, **kwargs)
A Col model encodes queries and documents separately and computes a late-interaction score between the query and document token embeddings. The aggregation behavior of the late-interaction function is controlled by the query_aggregation_function and doc_aggregation_function arguments. Token embeddings are down-projected to embedding_dim using a linear layer. Queries and documents can optionally be expanded with mask tokens, and a set of tokens can optionally be ignored during scoring.
- Parameters:
query_length (int, optional) – Maximum query length, defaults to 32
doc_length (int, optional) – Maximum document length, defaults to 512
similarity_function (Literal['cosine', 'dot'], optional) – Similarity function to compute scores between query and document embeddings, defaults to “dot”
normalize (bool, optional) – Whether to normalize query and document embeddings, defaults to False
add_marker_tokens (bool, optional) – Whether to add extra marker tokens [Q] / [D] to queries / documents, defaults to False
query_mask_scoring_tokens (Sequence[str] | Literal['punctuation'] | None, optional) – Whether and which query tokens to ignore during scoring, defaults to None
doc_mask_scoring_tokens (Sequence[str] | Literal['punctuation'] | None, optional) – Whether and which document tokens to ignore during scoring, defaults to None
query_aggregation_function (Literal['sum', 'mean', 'max', 'harmonic_mean'], optional) – How to aggregate similarity scores over query tokens, defaults to “sum”
doc_aggregation_function (Literal['sum', 'mean', 'max', 'harmonic_mean'], optional) – How to aggregate similarity scores over document tokens, defaults to “max”
embedding_dim (int, optional) – The output embedding dimension, defaults to 128
projection (Literal['linear', 'linear_no_bias'], optional) – How to project the output embeddings, defaults to “linear”
query_expansion (bool, optional) – Whether to expand queries with mask tokens, defaults to False
attend_to_query_expanded_tokens (bool, optional) – Whether to allow query tokens to attend to mask tokens, defaults to False
doc_expansion (bool, optional) – Whether to expand documents with mask tokens, defaults to False
attend_to_doc_expanded_tokens (bool, optional) – Whether to allow document tokens to attend to mask tokens, defaults to False
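The interplay of similarity_function, doc_aggregation_function, and query_aggregation_function can be illustrated with a minimal pure-Python sketch. This is not the library's implementation: the function late_interaction_score, the doc_scoring_mask argument, and the toy embeddings are hypothetical names introduced here for illustration, and harmonic_mean is left out because its exact definition is not given in this reference.

```python
import math
from typing import Optional, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine(a: Vector, b: Vector) -> float:
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / norm if norm else 0.0


SIMILARITY = {"dot": dot, "cosine": cosine}
AGGREGATE = {
    "sum": sum,
    "mean": lambda xs: sum(xs) / len(xs),
    "max": max,
}


def late_interaction_score(
    query_embeddings: Sequence[Vector],
    doc_embeddings: Sequence[Vector],
    similarity_function: str = "dot",
    query_aggregation_function: str = "sum",
    doc_aggregation_function: str = "max",
    doc_scoring_mask: Optional[Sequence[bool]] = None,  # hypothetical: False = ignore token
) -> float:
    """Illustrative sketch of a late-interaction score, not the library code."""
    sim = SIMILARITY[similarity_function]
    if doc_scoring_mask is None:
        doc_scoring_mask = [True] * len(doc_embeddings)
    # Drop document tokens that should be ignored during scoring.
    kept = [d for d, keep in zip(doc_embeddings, doc_scoring_mask) if keep]
    # First aggregate over document tokens for every query token ...
    per_query_token = [
        AGGREGATE[doc_aggregation_function]([sim(q, d) for d in kept])
        for q in query_embeddings
    ]
    # ... then aggregate the per-query-token values into one score.
    return AGGREGATE[query_aggregation_function](per_query_token)


# Toy token embeddings (2 query tokens, 3 document tokens).
query = [[1.0, 0.0], [0.0, 1.0]]
doc = [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]]

score = late_interaction_score(query, doc)
masked_score = late_interaction_score(query, doc, doc_scoring_mask=[True, True, False])
cosine_score = late_interaction_score(query, doc, similarity_function="cosine")
```

With the defaults listed above (dot similarity, max over document tokens, sum over query tokens), each query token is matched to its best document token and the matches are summed. Masking the third document token removes it from that maximum.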
Methods
__init__([query_length, doc_length, ...]) – A Col model encodes queries and documents separately and computes a late interaction score between the query and document embeddings.
Attributes
Model type for a Col model.
- backbone_model_type: str | None = None
Backbone model type for the configuration. Set by LightningIRModelClassFactory().
- classmethod from_pretrained(pretrained_model_name_or_path: str | Path, *args, **kwargs) LightningIRConfig
Loads the configuration from a pretrained model. Wraps the transformers.PretrainedConfig.from_pretrained method.
- Parameters:
pretrained_model_name_or_path (str | Path) – Pretrained model name or path
- Raises:
ValueError – If pretrained_model_name_or_path is not a Lightning IR model and no LightningIRConfig is passed
- Returns:
Derived LightningIRConfig class
- Return type:
LightningIRConfig
- get_tokenizer_kwargs(Tokenizer: Type[LightningIRTokenizer]) Dict[str, Any]
Returns the keyword arguments for the tokenizer. This method is used to pass the configuration parameters to the tokenizer.
- Parameters:
Tokenizer (Type[LightningIRTokenizer]) – Class of the tokenizer to be used
- Returns:
Keyword arguments for the tokenizer
- Return type:
Dict[str, Any]
- to_dict() Dict[str, Any]
Overrides the transformers.PretrainedConfig.to_dict method to include the added arguments and the backbone model type.
- Returns:
Configuration dictionary
- Return type:
Dict[str, Any]
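The override pattern described for to_dict can be sketched with stand-in classes. Everything here is hypothetical: BaseConfig and SketchColConfig are illustrative names, not the transformers or Lightning IR classes, and only three of the added arguments are shown.

```python
from typing import Any, Dict, Optional


class BaseConfig:
    """Hypothetical stand-in for a generic base configuration class."""

    def __init__(self, **kwargs: Any) -> None:
        self.extra = dict(kwargs)

    def to_dict(self) -> Dict[str, Any]:
        return dict(self.extra)


class SketchColConfig(BaseConfig):
    """Hypothetical subclass that adds its own arguments to the dictionary."""

    backbone_model_type: Optional[str] = None
    ADDED_ARGS = ("query_length", "doc_length", "embedding_dim")

    def __init__(
        self,
        query_length: int = 32,
        doc_length: int = 512,
        embedding_dim: int = 128,
        **kwargs: Any,
    ) -> None:
        super().__init__(**kwargs)
        self.query_length = query_length
        self.doc_length = doc_length
        self.embedding_dim = embedding_dim

    def to_dict(self) -> Dict[str, Any]:
        # Start from the base dictionary, then append the added arguments
        # and the backbone model type.
        output = super().to_dict()
        for name in self.ADDED_ARGS:
            output[name] = getattr(self, name)
        output["backbone_model_type"] = self.backbone_model_type
        return output


config = SketchColConfig(embedding_dim=64, hidden_size=768)
config_dict = config.to_dict()
```

In this sketch the serialized dictionary carries both the base fields and the added ones, so a configuration rebuilt from it would keep the same scoring-related settings.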