Replaced token detection

NOTE Since RTD requires training two models, the --pretrained-model and --model-config arguments each take two values, separated by a comma and no space, like --pretrained-model lgrobol/deberta-minuscule_gen,lgrobol/deberta-minuscule_dis. The first value goes to the generator and the second to the discriminator.

The task, proposed by Clark et al. (2019) to train their ELECTRA model, consists of training two antagonistic neural networks in parallel. One, the generator, is trained as a masked language model to fill in masked tokens in a sentence. The other, the discriminator, is trained to detect replaced tokens. The crux of the technique lies in the disparity between the two models: the generator is in general much smaller than the discriminator. The resulting ensemble can be made smaller (in terms of number of parameters) and trained faster than a masked language model of equivalent performance.
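To make the mechanics concrete, here is a minimal sketch of one RTD training step using PyTorch and 🤗 Transformers. It is illustrative only and not our actual implementation: the model names are the toy checkpoints mentioned above, the detector is approximated by a generic two-label token-classification head, and the loss weight α is set to its default of 1.0.

import torch
from transformers import (
    AutoModelForMaskedLM,
    AutoModelForTokenClassification,
    AutoTokenizer,
)

# The two models must share a vocabulary, so a single tokenizer serves both
tokenizer = AutoTokenizer.from_pretrained("lgrobol/deberta-minuscule_gen")
generator = AutoModelForMaskedLM.from_pretrained("lgrobol/deberta-minuscule_gen")
discriminator = AutoModelForTokenClassification.from_pretrained(
    "lgrobol/deberta-minuscule_dis", num_labels=2
)

enc = tokenizer(
    ["An example sentence."],
    return_special_tokens_mask=True,
    return_tensors="pt",
)
input_ids = enc["input_ids"]

# 1. Mask a random subset (mask_ratio) of the non-special tokens
maskable = enc["special_tokens_mask"] == 0
mask = (torch.rand(input_ids.shape) < 0.15) & maskable
masked_ids = input_ids.masked_fill(mask, tokenizer.mask_token_id)

# 2. Train the generator as a regular masked language model
gen_out = generator(
    input_ids=masked_ids,
    attention_mask=enc["attention_mask"],
    labels=input_ids.masked_fill(~mask, -100),
)

# 3. Fill the masked positions with tokens sampled from the generator
#    (no gradient flows from the discriminator back into the generator)
sampled = torch.distributions.Categorical(logits=gen_out.logits.detach()).sample()
corrupted = torch.where(mask, sampled, input_ids)

# 4. Train the discriminator to tell original from replaced tokens; a
#    sampled token that happens to match the original counts as original
disc_labels = (corrupted != input_ids).long()
disc_out = discriminator(
    input_ids=corrupted,
    attention_mask=enc["attention_mask"],
    labels=disc_labels,
)

# 5. Combined objective: α · loss_discriminator + loss_generator
alpha = 1.0
loss = alpha * disc_out.loss + gen_out.loss
loss.backward()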

Our implementation here follows more closely that of He et al. (2021), who extended the original ELECTRA approach to a multilingual setting and larger model sizes, using tricks first presented by He et al. (2020).

A word of warning: success in this task is heavily dependent on subtle hyperparameter choices. We have done our best to select reasonable defaults, but if you insist on using this pretraining task, a comprehensive grid search might be in order (which can defeat its frugality advantages).

Task parameters

discriminator_loss_weight: float = 1.0
embeddings_sharing: Literal["deberta", "electra"] | None = None
mask_ratio: float = 0.15
  • discriminator_loss_weight is the $α$ parameter in the combined loss $α\,\mathrm{loss}_{\mathrm{discriminator}}+\mathrm{loss}_{\mathrm{generator}}$ used to train the models.
  • embeddings_sharing selects how the embedding layers of the two models are shared: as in ELECTRA (Clark et al., 2019), as in DeBERTa v3 (He et al., 2021), or not at all (None, the default); see the sketch after this list.
  • mask_ratio is the proportion of tokens that are masked for the generator.
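
For illustration, here is a hypothetical sketch of the difference between the two sharing schemes, with made-up vocabulary and embedding sizes; the actual implementation differs.

import torch

vocab_size, embedding_dim = 30_000, 128

# "electra" (Clark et al., 2019): both models use the very same embedding
# matrix, so the generator and discriminator losses both update it
shared_embeddings = torch.nn.Embedding(vocab_size, embedding_dim)

# "deberta" (He et al., 2021): gradient-disentangled embedding sharing.
# The discriminator reads the generator's embeddings through a stop
# gradient and only trains a residual delta (initialized to zero), so the
# discriminator loss never flows back into the generator's embeddings
generator_embeddings = torch.nn.Embedding(vocab_size, embedding_dim)
delta = torch.nn.Parameter(torch.zeros(vocab_size, embedding_dim))
discriminator_embedding_weight = generator_embeddings.weight.detach() + delta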

Inputs and outputs

For this task, the train and dev datasets should be raw text, with every line containing a single sample (typically a sentence). They can come either from local text files or from a 🤗 text dataset.
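
The training script takes care of the loading itself, but as an illustration, this is how such a one-sample-per-line dataset can be loaded and inspected with the 🤗 datasets text loader (the file names are placeholders):

from datasets import load_dataset

# Each line of the text files becomes one sample
dataset = load_dataset(
    "text", data_files={"train": "train.txt", "validation": "dev.txt"}
)
print(dataset["train"][0])  # e.g. {'text': 'A single sample sentence.'}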

Bibliography

  • Clark, Kevin, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2019. ‘ELECTRA: Pre-Training Text Encoders as Discriminators Rather Than Generators’. Paper presented at ICLR 2020, Addis Ababa, Ethiopia. Proceedings of the 8th International Conference on Learning Representations.
  • He, Pengcheng, Jianfeng Gao, and Weizhu Chen. 2021. ‘DeBERTaV3: Improving DeBERTa Using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing’. arXiv preprint.
  • He, Pengcheng, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. 2020. ‘DeBERTa: Decoding-Enhanced BERT with Disentangled Attention’. Proceedings of the 2021 International Conference on Learning Representations, Vienna, Austria.