ColConfig

class lightning_ir.models.col.ColConfig(query_length: int = 32, doc_length: int = 512, similarity_function: Literal['cosine', 'dot'] = 'dot', normalize: bool = False, add_marker_tokens: bool = False, query_mask_scoring_tokens: Sequence[str] | Literal['punctuation'] | None = None, doc_mask_scoring_tokens: Sequence[str] | Literal['punctuation'] | None = None, query_aggregation_function: Literal['sum', 'mean', 'max', 'harmonic_mean'] = 'sum', doc_aggregation_function: Literal['sum', 'mean', 'max', 'harmonic_mean'] = 'max', embedding_dim: int = 128, projection: Literal['linear', 'linear_no_bias'] = 'linear', query_expansion: bool = False, attend_to_query_expanded_tokens: bool = False, doc_expansion: bool = False, attend_to_doc_expanded_tokens: bool = False, **kwargs)

Bases: MultiVectorBiEncoderConfig

Configuration class for a Col model.

__init__(query_length: int = 32, doc_length: int = 512, similarity_function: Literal['cosine', 'dot'] = 'dot', normalize: bool = False, add_marker_tokens: bool = False, query_mask_scoring_tokens: Sequence[str] | Literal['punctuation'] | None = None, doc_mask_scoring_tokens: Sequence[str] | Literal['punctuation'] | None = None, query_aggregation_function: Literal['sum', 'mean', 'max', 'harmonic_mean'] = 'sum', doc_aggregation_function: Literal['sum', 'mean', 'max', 'harmonic_mean'] = 'max', embedding_dim: int = 128, projection: Literal['linear', 'linear_no_bias'] = 'linear', query_expansion: bool = False, attend_to_query_expanded_tokens: bool = False, doc_expansion: bool = False, attend_to_doc_expanded_tokens: bool = False, **kwargs)

A Col model encodes queries and documents separately and computes a late interaction score between the query and document token embeddings. The aggregation behavior of the late-interaction function can be parameterized with the query_aggregation_function and doc_aggregation_function arguments. The dimensionality of the token embeddings is down-projected using a linear layer. Queries and documents can optionally be expanded with mask tokens, and a set of tokens can optionally be ignored during scoring. A construction sketch follows the parameter list below.

Parameters:
  • query_length (int) – Maximum query length in number of tokens. Defaults to 32.

  • doc_length (int) – Maximum document length in number of tokens. Defaults to 512.

  • similarity_function (Literal["cosine", "dot"]) – Similarity function to compute scores between query and document embeddings. Defaults to “dot”.

  • normalize (bool) – Whether to normalize query and document embeddings. Defaults to False.

  • add_marker_tokens (bool) – Whether to add extra marker tokens [Q] / [D] to queries / documents. Defaults to False.

  • query_mask_scoring_tokens (Sequence[str] | Literal["punctuation"] | None) – Which query tokens, if any, to ignore during scoring; pass an explicit sequence of tokens or "punctuation". Defaults to None.

  • doc_mask_scoring_tokens (Sequence[str] | Literal["punctuation"] | None) – Which document tokens, if any, to ignore during scoring; pass an explicit sequence of tokens or "punctuation". Defaults to None.

  • query_aggregation_function (Literal["sum", "mean", "max", "harmonic_mean"]) – How to aggregate similarity scores over query tokens. Defaults to “sum”.

  • doc_aggregation_function (Literal["sum", "mean", "max", "harmonic_mean"]) – How to aggregate similarity scores over document tokens. Defaults to “max”.

  • embedding_dim (int) – The output embedding dimension. Defaults to 128.

  • projection (Literal["linear", "linear_no_bias"]) – Whether and how to project the output embeddings. Defaults to “linear”. If set to “linear_no_bias”, the projection layer will not have a bias term.

  • query_expansion (bool) – Whether to expand queries with mask tokens. Defaults to False.

  • attend_to_query_expanded_tokens (bool) – Whether to allow query tokens to attend to mask-expanded query tokens. Defaults to False.

  • doc_expansion (bool) – Whether to expand documents with mask tokens. Defaults to False.

  • attend_to_doc_expanded_tokens (bool) – Whether to allow document tokens to attend to mask-expanded document tokens. Defaults to False.
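
A minimal construction sketch using only the arguments documented above; the ColBERT-style values (cosine similarity with normalization, marker tokens, query expansion, punctuation masking) are illustrative choices, not library defaults.

    from lightning_ir.models.col import ColConfig

    # Sketch: a ColBERT-style configuration. All values are illustrative
    # overrides of the documented defaults.
    config = ColConfig(
        query_length=32,
        doc_length=256,
        similarity_function="cosine",
        normalize=True,                         # cosine pairs naturally with normalized embeddings
        add_marker_tokens=True,                 # prepend [Q] / [D] marker tokens
        doc_mask_scoring_tokens="punctuation",  # ignore punctuation tokens on the document side
        query_expansion=True,                   # pad queries to query_length with mask tokens
        embedding_dim=128,                      # down-project token embeddings to 128 dimensions
    )
    print(config.embedding_dim)  # 128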

Methods

__init__([query_length, doc_length, ...])

A Col model encodes queries and documents separately and computes a late interaction score between the query and document embeddings.

Attributes

model_type

Model type for a Col model.

backbone_model_type: str | None = None

Backbone model type for the configuration. Set by LightningIRModelClassFactory().

classmethod from_pretrained(pretrained_model_name_or_path: str | Path, *args, **kwargs) → LightningIRConfig

Loads the configuration from a pretrained model. Wraps transformers.PretrainedConfig.from_pretrained.

Parameters:

pretrained_model_name_or_path (str | Path) – Pretrained model name or path.

Returns:

Derived LightningIRConfig class.

Return type:

LightningIRConfig

Raises:

ValueError – If pretrained_model_name_or_path is not a Lightning IR model and no LightningIRConfig is passed.
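
A short usage sketch; the checkpoint path below is a placeholder, not a real model id.

    from lightning_ir.models.col import ColConfig

    # "path/to/col-checkpoint" is a placeholder for a local directory or
    # Hugging Face Hub id of a Lightning IR Col model.
    config = ColConfig.from_pretrained("path/to/col-checkpoint")
    print(config.model_type)  # "col"

Passing a checkpoint that is not a Lightning IR model without also supplying a LightningIRConfig raises the ValueError described above.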

get_tokenizer_kwargs(Tokenizer: Type[LightningIRTokenizer]) → Dict[str, Any]

Returns the keyword arguments for the tokenizer. This method is used to pass the configuration parameters to the tokenizer.

Parameters:

Tokenizer (Type[LightningIRTokenizer]) – Class of the tokenizer to be used.

Returns:

Keyword arguments for the tokenizer.

Return type:

Dict[str, Any]
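
A sketch of passing a tokenizer class; it assumes BiEncoderTokenizer (a LightningIRTokenizer subclass) is importable from the top-level lightning_ir package.

    from lightning_ir import BiEncoderTokenizer  # assumed top-level export
    from lightning_ir.models.col import ColConfig

    config = ColConfig(query_length=32, doc_length=512, add_marker_tokens=True)
    kwargs = config.get_tokenizer_kwargs(BiEncoderTokenizer)
    # kwargs should now hold the config values the tokenizer accepts,
    # e.g. query_length, doc_length, and add_marker_tokens.
    print(kwargs)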

model_type: str = 'col'

Model type for a Col model.

to_dict() → Dict[str, Any]

Overrides the transformers.PretrainedConfig.to_dict method to include the added arguments and the backbone model type.

Returns:

Configuration dictionary.

Return type:

Dict[str, Any]
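
A round-trip sketch using only documented attributes:

    from lightning_ir.models.col import ColConfig

    config = ColConfig(embedding_dim=64)
    d = config.to_dict()
    print(d["embedding_dim"])            # 64
    print(d["model_type"])               # "col"
    print(d.get("backbone_model_type"))  # None until set by the model class factory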

to_diff_dict() → dict[str, Any]

Removes all attributes from the configuration that correspond to the default config attributes for better readability, while always retaining the config attribute from the class. Serializes to a Python dictionary.

Returns:

Dictionary of all the attributes that make up this configuration instance.

Return type:

dict[str, Any]
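
A sketch contrasting to_diff_dict with to_dict; the membership checks assume the documented defaults (embedding_dim=128, doc_length=512).

    from lightning_ir.models.col import ColConfig

    config = ColConfig(embedding_dim=64)  # differs from the default of 128
    diff = config.to_diff_dict()
    print("embedding_dim" in diff)  # True: overridden, so it is kept
    print("doc_length" in diff)     # False: still the default 512, so it is dropped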