Training configurations
Parameters
Training duration
max_epochs: int | None = None
max_steps: int | None = None
Respectively, the maximum number of epochs (full passes over the dataset) or optimisation steps to train for. If both are set, training stops when the first of the two limits is reached.
Batch size
batch_size: int = 64
This is the number of samples in a forward-backward pass. If you use several devices and/or device batches larger than $1$, this must be a multiple of device_batch_size * total_devices.
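The multiple constraint exists because the effective batch is assembled via gradient accumulation across device-level passes. A minimal sketch of the check and the implied accumulation factor (the helper name is illustrative, not the library's):

```python
def accumulation_steps(batch_size: int,
                       device_batch_size: int,
                       total_devices: int) -> int:
    """Number of forward-backward passes accumulated per optimisation step.

    One pass over all devices processes device_batch_size * total_devices
    samples, so batch_size must be a whole multiple of that.
    """
    per_pass = device_batch_size * total_devices
    if batch_size % per_pass != 0:
        raise ValueError(
            f"batch_size={batch_size} is not a multiple of "
            f"device_batch_size*total_devices={per_pass}"
        )
    return batch_size // per_pass
```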
Adam parameters
betas: Tuple[float, float] = (0.9, 0.98)
epsilon: float = 1e-8
learning_rate: float = 1e-4
weight_decay: float | None = None
These are respectively the $(β_1, β_2)$ and $ε$ parameters and the base learning rate for the Adam optimizer
{cite}`kingma2014AdamMethodStochastic`, and the weight decay rate. See the
PyTorch documentation for more
details.
Gradient clipping
gradient_clipping: float | int | None = None
If not None, this is the maximum allowed gradient norm. Gradients whose norm exceeds it are rescaled to this
norm, preserving their direction. See the
PyTorch documentation
for implementation details.
Learning rate schedule
lr_decay_steps: int | None = None
warmup_steps: int = 0
These are the numbers of steps in the slanted triangular learning rate schedule (Howard and Ruder,
2018): the base learning rate follows an upward linear slope for warmup_steps steps up
to learning_rate, then decays linearly to $0$ over lr_decay_steps steps.
Note that setting lr_decay_steps overrides max_steps.
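The schedule above can be written as a function of the step count. This sketch assumes the decay phase of lr_decay_steps begins immediately after warmup ends; if the library counts lr_decay_steps from step 0 instead, the second branch would shift accordingly:

```python
def slanted_triangular_lr(step: int, learning_rate: float,
                          warmup_steps: int, lr_decay_steps: int) -> float:
    """Learning rate at a given step under the slanted triangular schedule.

    Linear warmup from 0 to learning_rate over warmup_steps,
    then linear decay back to 0 over lr_decay_steps.
    """
    if step < warmup_steps:
        return learning_rate * step / warmup_steps
    decay_step = step - warmup_steps
    if decay_step >= lr_decay_steps:
        return 0.0
    return learning_rate * (1 - decay_step / lr_decay_steps)
```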
Bibliography
- Howard, Jeremy, and Sebastian Ruder. 2018. ‘Universal Language Model Fine-Tuning for Text Classification’. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia.