Training configurations
Parameters
Training duration
max_epochs: int | None = None
max_steps: int | None = None
Respectively, the maximum number of epochs (full passes over the dataset) and the maximum number of optimisation steps to train for. If both are set, training stops as soon as either limit is reached.
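The stopping rule above can be sketched as a small predicate. This is only an illustration: `should_stop` and its `epoch`/`step` counters are hypothetical names, not part of the documented API, and the sketch assumes both counters record completed epochs and steps.

```python
def should_stop(epoch, step, max_epochs=None, max_steps=None):
    """Return True once either configured limit has been reached.

    `epoch` and `step` are assumed to count *completed* epochs and
    optimisation steps. A limit set to None is ignored.
    """
    epochs_done = max_epochs is not None and epoch >= max_epochs
    steps_done = max_steps is not None and step >= max_steps
    return epochs_done or steps_done


# With both limits set, the step limit is hit first here and ends training.
print(should_stop(epoch=1, step=500, max_epochs=3, max_steps=500))
```

With neither limit set, this predicate never returns True, so some other stopping condition would be needed.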
Batch size
batch_size: int = 64
This is the number of samples in a forward-backward pass. If you use several devices and/or have
device batches of a size bigger than \(1\), this must be a multiple of
device_batch_size*total_devices.
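The divisibility constraint can be checked with a few lines of arithmetic. The sketch below assumes that the full batch is reached by accumulating gradients over several device-level passes, which the text above does not state explicitly. The function name `accumulation_steps` is hypothetical.

```python
def accumulation_steps(batch_size, device_batch_size, total_devices):
    """Number of device-level passes needed to cover one full batch.

    Assumption: each pass processes `device_batch_size` samples on each of
    `total_devices` devices, and passes are accumulated until `batch_size`
    samples have been seen.
    """
    samples_per_pass = device_batch_size * total_devices
    if batch_size % samples_per_pass != 0:
        raise ValueError(
            f"batch_size={batch_size} is not a multiple of "
            f"device_batch_size*total_devices={samples_per_pass}"
        )
    return batch_size // samples_per_pass


# 64 samples, 4 devices, 8 samples per device: two passes per batch.
print(accumulation_steps(batch_size=64, device_batch_size=8, total_devices=4))
```

For instance, a device batch size of 12 on 4 devices gives 48 samples per pass, which does not divide 64, so that combination would be rejected.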
Adam parameters
betas: Tuple[float, float] = (0.9, 0.98)
epsilon: float = 1e-8
learning_rate: float = 1e-4
weight_decay: float | None = None
These are, respectively, the \(β\) and \(ε\) parameters and the base learning rate of the Adam optimizer (Kingma and Ba, 2015), and the weight decay rate. See the PyTorch documentation for more details.
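As a rough guide to how these settings line up with PyTorch, the sketch below maps them onto the keyword arguments of `torch.optim.Adam` (`lr`, `betas`, `eps`, `weight_decay`). The helper `adam_kwargs` is hypothetical, and treating `weight_decay=None` as `0.0` is an assumption. Whether the library uses `Adam` or a decoupled variant such as `AdamW` is not stated here.

```python
def adam_kwargs(
    betas=(0.9, 0.98),
    epsilon=1e-8,
    learning_rate=1e-4,
    weight_decay=None,
):
    """Translate the configuration names into torch.optim.Adam keyword names."""
    return {
        "lr": learning_rate,
        "betas": betas,
        "eps": epsilon,
        # Assumption: an unset weight decay means no weight decay at all.
        "weight_decay": 0.0 if weight_decay is None else weight_decay,
    }


print(adam_kwargs())
```

With PyTorch installed, the result could be passed along as `torch.optim.Adam(model.parameters(), **adam_kwargs())`, where `model` is any `torch.nn.Module`.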
Gradient clipping
gradient_clipping: float | int | None = None
If non-None, this is the maximum allowed gradient norm. Gradients with a larger norm are rescaled
to this norm, preserving their direction. See the PyTorch documentation for implementation details.
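The effect of norm clipping can be illustrated on a plain list of numbers. This is a simplified sketch, not the library's implementation: `clip_by_norm` is a hypothetical helper, and it assumes the L2 norm, which is the default norm type of PyTorch's `torch.nn.utils.clip_grad_norm_`.

```python
import math


def clip_by_norm(gradient, max_norm=None):
    """Rescale `gradient` (a list of floats) so its L2 norm is at most `max_norm`.

    The direction is preserved: every component is multiplied by the same
    factor. With `max_norm=None`, the gradient is returned unchanged.
    """
    if max_norm is None:
        return list(gradient)
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm <= max_norm:
        return list(gradient)
    scale = max_norm / norm
    return [g * scale for g in gradient]


# A gradient of norm 5 clipped to norm 1 keeps its 3:4 component ratio.
print(clip_by_norm([3.0, 4.0], max_norm=1.0))
```

In an actual training loop, PyTorch's `torch.nn.utils.clip_grad_norm_` performs a comparable rescaling, using the norm computed jointly over all parameter gradients.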
Learning rate schedule
lr_decay_steps: int | None = None
warmup_steps: int = 0
These set the lengths of the two phases of the slanted triangular learning rate schedule (Howard
and Ruder, 2018): the learning rate increases linearly for warmup_steps steps up to learning_rate,
then decays linearly to \(0\) over lr_decay_steps steps.
Note that setting lr_decay_steps overrides max_steps.
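The shape of this schedule can be written as a function of the step index. The sketch below rests on two assumptions that the description does not settle: that lr_decay_steps counts steps after the end of warmup, and that the learning rate stays constant after warmup when lr_decay_steps is None. The function name `slanted_triangular_lr` is hypothetical.

```python
def slanted_triangular_lr(step, learning_rate=1e-4, warmup_steps=0, lr_decay_steps=None):
    """Learning rate at a given (0-based) optimisation step.

    Assumptions: the decay phase starts right after warmup and lasts
    `lr_decay_steps` steps; without decay the rate stays at `learning_rate`.
    """
    if step < warmup_steps:
        # Linear increase from 0 up to the base learning rate.
        return learning_rate * step / warmup_steps
    if lr_decay_steps is None:
        return learning_rate
    # Linear decrease from the base learning rate down to 0, never below 0.
    decay_progress = (step - warmup_steps) / lr_decay_steps
    return learning_rate * max(0.0, 1.0 - decay_progress)


# Sample the schedule at a few points: start, mid-warmup, peak, mid-decay, end.
for s in (0, 50, 100, 600, 1100):
    print(s, slanted_triangular_lr(s, warmup_steps=100, lr_decay_steps=1000))
```

Under these assumptions, the rate reaches \(0\) at step warmup_steps + lr_decay_steps, after which further training would have no effect.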
Bibliography
- Howard, Jeremy, and Sebastian Ruder. 2018. ‘Universal Language Model Fine-Tuning for Text Classification’. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia.
- Kingma, Diederik P., and Jimmy Ba. 2015. ‘Adam: A Method for Stochastic Optimization’. Proceedings of the 2015 International Conference on Learning Representations, July 7.