Training configurations

Parameters

Training duration

  • max_epochs: int | None = None
  • max_steps: int | None = None

These are, respectively, the maximum number of epochs (full passes over the dataset) and the maximum number of optimisation steps to train for. If both are set, training stops as soon as either limit is reached.
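As a rough illustration, the stopping rule can be thought of as follows (a minimal sketch; the epoch and step counters are placeholder names maintained by a training loop, not part of the configuration):

```python
def should_stop(epoch: int, step: int,
                max_epochs: int | None, max_steps: int | None) -> bool:
    # Training stops as soon as either limit is reached; a limit left at
    # None never triggers.
    if max_epochs is not None and epoch >= max_epochs:
        return True
    if max_steps is not None and step >= max_steps:
        return True
    return False
```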

Batch size

  • batch_size: int = 64

This is the number of samples in a forward-backward pass. If you use several devices and/or device batches of size greater than $1$, this must be a multiple of device_batch_size*total_devices.
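The constraint amounts to the following check (a sketch only; device_batch_size and total_devices are the assumed names for the per-device batch size and the device count):

```python
def check_batch_size(batch_size: int, device_batch_size: int, total_devices: int) -> None:
    # Each optimisation step must split evenly into per-device micro-batches.
    per_step = device_batch_size * total_devices
    if batch_size % per_step != 0:
        raise ValueError(
            f"batch_size ({batch_size}) must be a multiple of "
            f"device_batch_size*total_devices ({per_step})"
        )
```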

Adam parameters

  • betas: Tuple[float, float] = (0.9, 0.98)
  • epsilon: float = 1e-8
  • learning_rate: float = 1e-4
  • weight_decay: float | None = None

These are, respectively, the $β$ and $ε$ parameters, the base learning rate, and the weight decay rate for the Adam optimizer {cite}kingma2014AdamMethodStochastic. See the PyTorch documentation for more details.
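For reference, these settings map onto the usual PyTorch optimizer arguments roughly as follows (a sketch with a placeholder model; the exact optimizer class used internally is an assumption):

```python
import torch

model = torch.nn.Linear(128, 2)  # placeholder model
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=1e-4,             # learning_rate
    betas=(0.9, 0.98),   # betas
    eps=1e-8,            # epsilon
    weight_decay=0.0,    # weight_decay (assuming None means no weight decay)
)
```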

Gradient clipping

  • gradient_clipping: float | int | None = None

If non-None, this is the maximum allowed gradient norm. Gradients whose norm exceeds this value are rescaled to it, preserving their direction. See the PyTorch documentation for implementation details.
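This is typically implemented with PyTorch's norm-clipping utility; the sketch below shows where it sits in a single optimisation step (placeholder model and data):

```python
import torch

gradient_clipping = 1.0
model = torch.nn.Linear(8, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

loss = model(torch.randn(4, 8)).pow(2).mean()
loss.backward()
# Rescale all gradients so that their global norm does not exceed the limit.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=gradient_clipping)
optimizer.step()
optimizer.zero_grad()
```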

Learning rate schedule

  • lr_decay_steps: int | None = None
  • warmup_steps: int = 0

These control the slanted triangular learning rate schedule (Howard and Ruder, 2018): the learning rate increases linearly from $0$ to learning_rate over warmup_steps steps, then decays linearly back to $0$ over lr_decay_steps steps.

Note that setting lr_decay_steps overrides max_steps.
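The schedule is straightforward to reproduce with PyTorch's LambdaLR scheduler; the sketch below assumes that lr_decay_steps counts the decay phase following the warmup, which is one possible reading of the options above, and uses a placeholder model:

```python
import torch

warmup_steps, lr_decay_steps, learning_rate = 100, 900, 1e-4
model = torch.nn.Linear(8, 1)  # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

def lr_factor(step: int) -> float:
    if step < warmup_steps:
        # Linear ramp from 0 up to the base learning rate.
        return step / max(1, warmup_steps)
    # Linear decay from the base learning rate back down to 0.
    return max(0.0, 1.0 - (step - warmup_steps) / max(1, lr_decay_steps))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_factor)
```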

Bibliography

  • Howard, Jeremy, and Sebastian Ruder. 2018. ‘Universal Language Model Fine-Tuning for Text Classification’. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia.