Training configurations

Parameters

Training duration

  • max_epochs: int | None = None
  • max_steps: int | None = None

These are, respectively, the maximum number of epochs (full passes over the dataset) and of optimisation steps to train for. If both are set, training stops as soon as either limit is reached.
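The stopping rule can be sketched as follows. This is only an illustration: should_stop and its epoch and step counters are hypothetical names, and treating "both unset" as "never stop" is an assumption that the text above does not confirm.

```python
from __future__ import annotations


def should_stop(
    epoch: int,
    step: int,
    max_epochs: int | None = None,
    max_steps: int | None = None,
) -> bool:
    """Return True once either configured limit has been reached.

    `epoch` and `step` count completed epochs and optimisation steps.
    Hypothetical helper, not part of the documented API.
    """
    if max_epochs is not None and epoch >= max_epochs:
        return True
    if max_steps is not None and step >= max_steps:
        return True
    # Assumption: with no limit set, nothing here stops training.
    return False
```

For example, with max_epochs=3 and max_steps=500, training that reaches step 500 during its first epoch would stop there.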

Batch size

  • batch_size: int = 64

This is the number of samples in a forward-backward pass. If you use several devices and/or a device batch size larger than \(1\), this must be a multiple of device_batch_size*total_devices.
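The divisibility constraint can be checked with a few lines of Python. The sketch below assumes that the constraint exists because the full batch is split into equal chunks across devices and accumulation passes; that reading, and the helper name accumulation_steps, are assumptions rather than documented behaviour.

```python
def accumulation_steps(
    batch_size: int, device_batch_size: int, total_devices: int
) -> int:
    """Return how many passes are needed to cover one full batch.

    Hypothetical helper: raises if batch_size is not a multiple of
    device_batch_size * total_devices.
    """
    samples_per_pass = device_batch_size * total_devices
    if batch_size % samples_per_pass != 0:
        raise ValueError(
            f"batch_size={batch_size} is not a multiple of "
            f"device_batch_size*total_devices={samples_per_pass}"
        )
    return batch_size // samples_per_pass
```

With the default batch_size of 64, a device batch size of 8 on 4 devices gives 2 passes per batch, whereas a device batch size of 12 on 2 devices is rejected because 24 does not divide 64.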

Adam parameters

  • betas: Tuple[float, float] = (0.9, 0.98)
  • epsilon: float = 1e-8
  • learning_rate: float = 1e-4
  • weight_decay: float | None = None

These are, respectively, the \(β\) and \(ε\) parameters and the base learning rate of the Adam optimizer (Kingma and Ba, 2015), and the weight decay rate. See the PyTorch documentation for more details.
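To show where each of these parameters enters the update, here is a minimal scalar sketch of one Adam step following Kingma and Ba (2015). It is not the library's implementation: adam_step is a hypothetical name, and applying weight_decay as an L2 term added to the gradient is an assumption (a decoupled variant is also possible).

```python
import math


def adam_step(
    param, grad, m, v, t,
    learning_rate=1e-4, betas=(0.9, 0.98), epsilon=1e-8, weight_decay=None,
):
    """Apply one Adam update to a scalar parameter.

    `m` and `v` are the running first and second moment estimates and
    `t` is the 1-indexed step number. Returns the new (param, m, v).
    """
    beta1, beta2 = betas
    if weight_decay is not None:
        # Assumption: weight decay acts as an L2 penalty on the gradient.
        grad = grad + weight_decay * param
    # Exponential moving averages of the gradient and its square.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    # Bias correction for the zero-initialised averages.
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    # epsilon keeps the division stable when v_hat is close to zero.
    param = param - learning_rate * m_hat / (math.sqrt(v_hat) + epsilon)
    return param, m, v
```

On the very first step the bias-corrected ratio has magnitude close to 1, so the parameter moves by roughly learning_rate regardless of the gradient's scale.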

Gradient clipping

  • gradient_clipping: float | int | None = None

If non-None, this is the maximum allowed gradient norm. Gradients whose norm exceeds it are rescaled to that norm, preserving their direction. See the PyTorch documentation for implementation details.
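The clipping rule itself is short enough to write out. The sketch below works on a plain list of floats and uses the hypothetical name clip_by_norm; in PyTorch the equivalent operation over all parameter gradients is provided by torch.nn.utils.clip_grad_norm_, although whether that exact function is the one used here is an assumption.

```python
import math


def clip_by_norm(gradient, max_norm):
    """Rescale `gradient` so that its Euclidean norm is at most `max_norm`.

    A `max_norm` of None disables clipping. Always returns a new list.
    """
    if max_norm is None:
        return list(gradient)
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm <= max_norm:
        return list(gradient)
    # Scaling every component by the same factor preserves the direction.
    scale = max_norm / norm
    return [g * scale for g in gradient]
```

For instance, the gradient [3.0, 4.0] has norm 5, so clipping it with a maximum norm of 1 scales each component by 1/5.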

Learning rate schedule

  • lr_decay_steps: int | None = None
  • warmup_steps: int = 0

These are the step counts of the slanted triangular learning rate schedule (Howard and Ruder, 2018): the learning rate rises linearly for warmup_steps steps until it reaches learning_rate, then decays linearly to \(0\) over lr_decay_steps steps.

Note that setting lr_decay_steps overrides max_steps.
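The schedule can be written as a function of the step number. This is a sketch under stated assumptions: learning_rate_at is a hypothetical name, steps are counted from 0, lr_decay_steps is taken to start counting once warmup ends, and the rate is held at learning_rate when lr_decay_steps is None.

```python
def learning_rate_at(
    step, learning_rate=1e-4, warmup_steps=0, lr_decay_steps=None
):
    """Return the learning rate in effect at 0-indexed `step`.

    Assumes lr_decay_steps, when set, is a positive integer.
    """
    if step < warmup_steps:
        # Linear warmup from 0 up to the base learning rate.
        return learning_rate * step / warmup_steps
    if lr_decay_steps is None:
        # Assumption: without a decay length, the rate stays constant.
        return learning_rate
    # Linear decay to 0, counted from the end of warmup (assumption).
    remaining = lr_decay_steps - (step - warmup_steps)
    return learning_rate * max(0.0, remaining / lr_decay_steps)
```

With warmup_steps=10 and lr_decay_steps=100, the rate peaks at step 10 and reaches 0 at step 110 under this reading.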

Bibliography

  • Howard, Jeremy, and Sebastian Ruder. 2018. ‘Universal Language Model Fine-Tuning for Text Classification’. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia.
  • Kingma, Diederik P., and Jimmy Ba. 2015. ‘Adam: A Method for Stochastic Optimization’. Proceedings of the 2015 International Conference on Learning Representations, July 7.