HFTrainer


Trains a new Hugging Face Transformers model using the Trainer framework.

Example

The following shows a simple example using this pipeline.

```python
import pandas as pd

from datasets import load_dataset

from txtai.pipeline import HFTrainer

trainer = HFTrainer()

# Pandas DataFrame
df = pd.read_csv("training.csv")
model, tokenizer = trainer("bert-base-uncased", df)

# Hugging Face dataset
ds = load_dataset("glue", "sst2")
model, tokenizer = trainer("bert-base-uncased", ds["train"], columns=("sentence", "label"))

# List of dicts
dt = [{"text": "sentence 1", "label": 0}, {"text": "sentence 2", "label": 1}]

model, tokenizer = trainer("bert-base-uncased", dt)

# Support additional TrainingArguments
model, tokenizer = trainer("bert-base-uncased", dt,
                           learning_rate=3e-5, num_train_epochs=5)
```

All Hugging Face TrainingArguments are supported as keyword arguments to the trainer call.
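
For example, batch size, weight decay and the output directory can be set the same way. The sketch below reuses the DataFrame from the example above; the keyword names are standard Hugging Face TrainingArguments fields, not pipeline-specific options.

```python
# Sketch: any TrainingArguments field can be passed as a keyword argument.
# Assumes the same training.csv DataFrame loaded in the example above.
model, tokenizer = trainer(
    "bert-base-uncased",
    df,
    output_dir="model-output",        # where checkpoints and the final model are written
    per_device_train_batch_size=16,   # batch size per device
    weight_decay=0.01,                # L2 weight decay
    num_train_epochs=3,               # total training epochs
)
```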

See the links below for more detailed examples.

| Notebook | Description |  |
|:---------|:------------|:-|
| Train a text labeler | Build text sequence classification models | Open In Colab |
| Train without labels | Use zero-shot classifiers to train new models | Open In Colab |
| Train a QA model | Build and fine-tune question-answering models | Open In Colab |
| Train a language model from scratch | Build new language models | Open In Colab |

Training tasks

The HFTrainer pipeline builds and/or fine-tunes models for the following training tasks. A short sketch showing how to select a task follows the table.

| Task | Description |
|:-----|:------------|
| language-generation | Causal language model for text generation (e.g. GPT) |
| language-modeling | Masked language model for general tasks (e.g. BERT) |
| question-answering | Extractive question-answering model, typically with the SQuAD dataset |
| sequence-sequence | Sequence-Sequence model (e.g. T5) |
| text-classification | Classify text with a set of labels |
| token-detection | ELECTRA-style pre-training with replaced token detection |
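
The task parameter selects which model head is built on top of the base model. The snippet below is a minimal sketch of choosing a task; the datasets and column names (`source`, `target`) are illustrative assumptions and should be mapped to the fields in your own data.

```python
from txtai.pipeline import HFTrainer

trainer = HFTrainer()

# Masked language modeling over raw text (uses the default "text" column)
texts = [{"text": "txtai is an all-in-one embeddings database"}]
model, tokenizer = trainer("bert-base-uncased", texts, task="language-modeling")

# Sequence-to-sequence fine-tuning, mapping an input column to a target column
pairs = [{"source": "translate English to French: hello", "target": "bonjour"}]
model, tokenizer = trainer("t5-small", pairs, task="sequence-sequence",
                           columns=("source", "target"))
```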

Methods

Python documentation for the pipeline.

Builds a new model using arguments.

Parameters:

| Name | Description | Default |
|:-----|:------------|:--------|
| base | path to base model, accepts Hugging Face model hub id, local path or (model, tokenizer) tuple | required |
| train | training data | required |
| validation | validation data | None |
| columns | tuple of columns to use for text/label, defaults to (text, None, label) | None |
| maxlength | maximum sequence length, defaults to tokenizer.model_max_length | None |
| stride | chunk size for splitting data for QA tasks | 128 |
| task | optional model task or category, determines the model type, defaults to "text-classification" | 'text-classification' |
| prefix | optional source prefix | None |
| metrics | optional function that computes and returns a dict of evaluation metrics | None |
| tokenizers | optional number of concurrent tokenizers, defaults to None | None |
| checkpoint | optional resume from checkpoint flag or path to checkpoint directory, defaults to None | None |
| args | training arguments | {} |

Returns:

| Type | Description |
|:-----|:------------|
| tuple | (model, tokenizer) |
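
The metrics parameter documented above accepts a standard Hugging Face compute_metrics callable, which receives an EvalPrediction-style object with predictions and label_ids. The sketch below is an illustrative assumption, not part of the pipeline API: the accuracy function and the small train/validation lists are placeholders for real data.

```python
import numpy as np

from txtai.pipeline import HFTrainer

# Sketch: evaluation metrics callable in standard Hugging Face Trainer form.
# Receives an EvalPrediction-style object with predictions (logits) and label_ids.
def accuracy(eval_prediction):
    predictions = np.argmax(eval_prediction.predictions, axis=1)
    return {"accuracy": (predictions == eval_prediction.label_ids).mean()}

trainer = HFTrainer()

# Assumed train/validation data with "text"/"label" fields
train = [{"text": "sentence 1", "label": 0}, {"text": "sentence 2", "label": 1}]
validation = [{"text": "sentence 3", "label": 1}]

# metrics is evaluated after training since validation data is provided
model, tokenizer = trainer("bert-base-uncased", train,
                           validation=validation, metrics=accuracy)
```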

Source code in txtai/pipeline/train/hftrainer.py

```python
def __call__(
    self,
    base,
    train,
    validation=None,
    columns=None,
    maxlength=None,
    stride=128,
    task="text-classification",
    prefix=None,
    metrics=None,
    tokenizers=None,
    checkpoint=None,
    **args
):
    """
    Builds a new model using arguments.

    Args:
        base: path to base model, accepts Hugging Face model hub id, local path or (model, tokenizer) tuple
        train: training data
        validation: validation data
        columns: tuple of columns to use for text/label, defaults to (text, None, label)
        maxlength: maximum sequence length, defaults to tokenizer.model_max_length
        stride: chunk size for splitting data for QA tasks
        task: optional model task or category, determines the model type, defaults to "text-classification"
        prefix: optional source prefix
        metrics: optional function that computes and returns a dict of evaluation metrics
        tokenizers: optional number of concurrent tokenizers, defaults to None
        checkpoint: optional resume from checkpoint flag or path to checkpoint directory, defaults to None
        args: training arguments

    Returns:
        (model, tokenizer)
    """

    # Parse TrainingArguments
    args = self.parse(args)

    # Set seed for model reproducibility
    set_seed(args.seed)

    # Load model configuration, tokenizer and max sequence length
    config, tokenizer, maxlength = self.load(base, maxlength)

    # Data collator and list of labels (only for classification models)
    collator, labels = None, None

    # Prepare datasets
    if task == "language-generation":
        # Default tokenizer pad token if it's not set
        tokenizer.pad_token = tokenizer.pad_token if tokenizer.pad_token is not None else tokenizer.eos_token

        process = Texts(tokenizer, columns, maxlength)
        collator = DataCollatorForLanguageModeling(tokenizer, mlm=False, pad_to_multiple_of=8 if args.fp16 else None)
    elif task in ("language-modeling", "token-detection"):
        process = Texts(tokenizer, columns, maxlength)
        collator = DataCollatorForLanguageModeling(tokenizer, pad_to_multiple_of=8 if args.fp16 else None)
    elif task == "question-answering":
        process = Questions(tokenizer, columns, maxlength, stride)
    elif task == "sequence-sequence":
        process = Sequences(tokenizer, columns, maxlength, prefix)
        collator = DataCollatorForSeq2Seq(tokenizer, pad_to_multiple_of=8 if args.fp16 else None)
    else:
        process = Labels(tokenizer, columns, maxlength)
        labels = process.labels(train)

    # Tokenize training and validation data
    train, validation = process(train, validation, os.cpu_count() if tokenizers and isinstance(tokenizers, bool) else tokenizers)

    # Create model to train
    model = self.model(task, base, config, labels, tokenizer)

    # Add model to collator
    if collator:
        collator.model = model

    # Build trainer
    trainer = Trainer(
        model=model,
        tokenizer=tokenizer,
        data_collator=collator,
        args=args,
        train_dataset=train,
        eval_dataset=validation if validation else None,
        compute_metrics=metrics,
    )

    # Run training
    trainer.train(resume_from_checkpoint=checkpoint)

    # Run evaluation
    if validation:
        trainer.evaluate()

    # Save model outputs
    if args.should_save:
        trainer.save_model()
        trainer.save_state()

    # Put model in eval mode to disable weight updates and return (model, tokenizer)
    return (model.eval(), tokenizer)
```