LightningIRTokenizerClassFactory

class lightning_ir.base.class_factory.LightningIRTokenizerClassFactory(MixinConfig: Type[LightningIRConfig])

Bases: LightningIRClassFactory

Class factory for creating derived LightningIRTokenizer classes from HuggingFace tokenizer classes.

Methods

from_backbone_class(BackboneClass)

Creates a derived LightningIRTokenizer from a transformers.PreTrainedTokenizerBase backbone tokenizer.

from_backbone_classes(BackboneClasses[, ...])

Creates derived slow and fast LightningIRTokenizers from a tuple of backbone HuggingFace tokenizer classes.

from_pretrained(model_name_or_path, *args[, ...])

Loads a derived LightningIRTokenizer from a pretrained HuggingFace tokenizer.

get_backbone_config(model_name_or_path)

Grabs the tokenizer configuration class from a checkpoint of a pretrained HuggingFace tokenizer.

get_backbone_model_type(model_name_or_path, ...)

Grabs the model type from a checkpoint of a pretrained HuggingFace tokenizer.

from_backbone_class(BackboneClass: Type[PreTrainedTokenizerBase]) → Type[LightningIRTokenizer]

Creates a derived LightningIRTokenizer from a transformers.PreTrainedTokenizerBase backbone tokenizer. If the backbone tokenizer is already a LightningIRTokenizer, it is returned as is.

Parameters:

BackboneClass (Type[PreTrainedTokenizerBase]) – Backbone tokenizer class.

Returns:

Derived LightningIRTokenizer.

Return type:

Type[LightningIRTokenizer]
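
The derivation step can be pictured as dynamic subclassing. The following is a minimal sketch, not the library's implementation: IRTokenizerMixin, BackboneTokenizer, and derive_tokenizer_class are hypothetical stand-ins, and the naming scheme for the derived class is an assumption. It shows a backbone class being combined with a mixin, and an already-derived class being returned as is.

```python
from typing import List, Type


class IRTokenizerMixin:
    """Hypothetical stand-in for LightningIRTokenizer."""


class BackboneTokenizer:
    """Hypothetical stand-in for a HuggingFace tokenizer class."""

    def tokenize(self, text: str) -> List[str]:
        return text.lower().split()


def derive_tokenizer_class(backbone_class: Type, mixin: Type = IRTokenizerMixin) -> Type:
    # A class that already derives from the mixin is returned unchanged.
    if issubclass(backbone_class, mixin):
        return backbone_class
    # Otherwise build a new class inheriting from both the mixin and the backbone.
    # The "Derived" prefix is an assumed naming scheme for illustration only.
    name = f"Derived{backbone_class.__name__}"
    return type(name, (mixin, backbone_class), {})


Derived = derive_tokenizer_class(BackboneTokenizer)
```

Because the mixin comes first in the base tuple, any method it defines would take precedence over the backbone's in the method resolution order, while backbone behavior such as tokenize remains available on the derived class.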

from_backbone_classes(BackboneClasses: Tuple[Type[PreTrainedTokenizerBase] | None, Type[PreTrainedTokenizerBase] | None], BackboneConfig: Type[PretrainedConfig] | None = None) → Tuple[Type[LightningIRTokenizer] | None, Type[LightningIRTokenizer] | None]

Creates derived slow and fast LightningIRTokenizers from a tuple of backbone HuggingFace tokenizer classes.

Parameters:
  • BackboneClasses (Tuple[Type[PreTrainedTokenizerBase] | None, Type[PreTrainedTokenizerBase] | None]) – Slow and fast backbone tokenizer classes.

  • BackboneConfig (Type[PretrainedConfig] | None, optional) – Backbone configuration class. Defaults to None.

Returns:

Slow and fast derived LightningIRTokenizers.

Return type:

Tuple[Type[LightningIRTokenizer] | None, Type[LightningIRTokenizer] | None]
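
Since both the input and the return type allow None in either position, a missing slow or fast variant presumably passes through as None. The sketch below illustrates that reading under stated assumptions: IRTokenizerMixin, SlowTokenizer, derive, and derive_pair are hypothetical names, and the BackboneConfig argument is left out for brevity.

```python
from typing import Optional, Tuple, Type


class IRTokenizerMixin:
    """Hypothetical stand-in for LightningIRTokenizer."""


class SlowTokenizer:
    """Hypothetical slow backbone tokenizer class."""


def derive(backbone_class: Type) -> Type:
    # Return already-derived classes unchanged, otherwise subclass dynamically.
    if issubclass(backbone_class, IRTokenizerMixin):
        return backbone_class
    return type(f"Derived{backbone_class.__name__}", (IRTokenizerMixin, backbone_class), {})


def derive_pair(
    backbone_classes: Tuple[Optional[Type], Optional[Type]],
) -> Tuple[Optional[Type], Optional[Type]]:
    slow, fast = backbone_classes
    # A missing variant (None) stays None; an available one is derived.
    return (
        derive(slow) if slow is not None else None,
        derive(fast) if fast is not None else None,
    )


# Only a slow backbone class is available in this example.
derived_slow, derived_fast = derive_pair((SlowTokenizer, None))
```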

from_pretrained(model_name_or_path: str | Path, *args, use_fast: bool = True, **kwargs) → Type[LightningIRTokenizer]

Loads a derived LightningIRTokenizer from a pretrained HuggingFace tokenizer.

Parameters:
  • model_name_or_path (str | Path) – Path to the tokenizer or its name.

  • use_fast (bool, optional) – Whether to use the fast tokenizer. Defaults to True.

Returns:

Derived LightningIRTokenizer.

Return type:

Type[LightningIRTokenizer]

Raises:
  • ValueError – If no fast tokenizer is found when use_fast is True.

  • ValueError – If no slow tokenizer is found when use_fast is False.
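
The two error conditions above amount to picking one element of the slow/fast pair and failing if it is missing. A minimal sketch of that selection, assuming the derived classes are already available as a (slow, fast) tuple; select_tokenizer_class, SlowTokenizer, and the error message wording are hypothetical and not taken from the library:

```python
from typing import Optional, Tuple, Type


class SlowTokenizer:
    """Hypothetical derived slow tokenizer class."""


def select_tokenizer_class(
    derived_classes: Tuple[Optional[Type], Optional[Type]], use_fast: bool = True
) -> Type:
    slow, fast = derived_classes
    chosen = fast if use_fast else slow
    if chosen is None:
        # Mirrors the documented behavior: the requested variant must exist.
        kind = "fast" if use_fast else "slow"
        raise ValueError(f"No {kind} tokenizer found.")
    return chosen


# Only a slow tokenizer class is available in this example.
available = (SlowTokenizer, None)
selected = select_tokenizer_class(available, use_fast=False)

try:
    select_tokenizer_class(available, use_fast=True)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```

In this sketch a caller that can accept either variant would catch the ValueError and retry with the other value of use_fast; whether the library offers such a fallback itself is not stated here.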

static get_backbone_config(model_name_or_path: str | Path) → PretrainedConfig

Grabs the tokenizer configuration class from a checkpoint of a pretrained HuggingFace tokenizer.

Parameters:

model_name_or_path (str | Path) – Path to the tokenizer or its name.

Returns:

Configuration class of the backbone tokenizer.

Return type:

PretrainedConfig

static get_backbone_model_type(model_name_or_path: str | Path, *args, **kwargs) → str

Grabs the model type from a checkpoint of a pretrained HuggingFace tokenizer.

Parameters:

model_name_or_path (str | Path) – Path to the tokenizer or its name.

Returns:

Model type of the backbone tokenizer.

Return type:

str
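
HuggingFace checkpoints conventionally record the architecture under the model_type key of config.json. The sketch below reads that key from a local checkpoint directory to show what kind of value is returned. It is an illustration only: read_model_type is a hypothetical helper, it does not resolve hub model names, and the library method may obtain the value through the transformers loading utilities instead.

```python
import json
import tempfile
from pathlib import Path
from typing import Union


def read_model_type(model_name_or_path: Union[str, Path]) -> str:
    # Hypothetical helper: only handles a local checkpoint directory.
    config_path = Path(model_name_or_path) / "config.json"
    with config_path.open(encoding="utf-8") as f:
        config = json.load(f)
    return config["model_type"]


# Build a throwaway checkpoint directory so the sketch runs offline.
with tempfile.TemporaryDirectory() as tmp_dir:
    config_file = Path(tmp_dir) / "config.json"
    config_file.write_text(json.dumps({"model_type": "bert"}), encoding="utf-8")
    model_type = read_model_type(tmp_dir)
```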