mBART translation
NOTE: given the specific nature of the task, only models and tokenizers of the mBART/M2M100 family are allowed.
This is the task proposed by Lewis et al. (2020), Liu et al. (2020) and Tang et al. (2020) for training text-to-text models. Here we mostly treat it as a translation task, but it could easily be adapted to other tasks such as summarization. It consists of pretraining an encoder-decoder on a self-supervised denoising task, then fine-tuning it on a translation task, which makes it possible to use non-parallel corpora to improve machine translation.
Lewis et al. (2020) experimented with several noise functions, finally settling on text infilling
and sentence shuffling. Since sentence shuffling assumes document-level processing and Zelda
Rose is meant for sentence-level training, we only implement text infilling here, which consists
of masking small spans of tokens with a single <mask> token:
- Original sentence: “The little tabby cat is happy”
- After infilling: “The little <mask> happy”
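The infilling step can be sketched as follows. This is a minimal illustration and not Zelda Rose's actual implementation: the function names, the literal `<mask>` string, the retry limit and the handling of zero-length or adjacent spans are all assumptions made for the example. Span lengths are drawn from a Poisson distribution and masking stops once roughly `change_ratio` of the tokens have been replaced.

```python
import math
import random

MASK = "<mask>"  # assumed mask token; in practice it comes from the tokenizer


def sample_poisson(lam, rng):
    """Draw one sample from Poisson(lam) with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


def infill(tokens, change_ratio=0.3, poisson_lambda=3.0, rng=None):
    """Replace random spans of tokens with a single mask token each."""
    if rng is None:
        rng = random.Random()
    budget = round(len(tokens) * change_ratio)  # how many tokens to mask
    masked = [False] * len(tokens)
    attempts = 0
    while budget > 0 and attempts < 100:  # hypothetical retry limit
        attempts += 1
        length = min(sample_poisson(poisson_lambda, rng), budget)
        if length == 0:
            continue  # this sketch simply skips zero-length spans
        start = rng.randrange(len(tokens) - length + 1)
        if any(masked[start:start + length]):
            continue  # overlaps an earlier span, draw again
        masked[start:start + length] = [True] * length
        budget -= length
    out = []
    for token, is_masked in zip(tokens, masked):
        if not is_masked:
            out.append(token)
        elif not out or out[-1] != MASK:
            out.append(MASK)  # adjacent masked spans collapse into one mask
    return out


if __name__ == "__main__":
    sentence = "The little tabby cat is happy".split()
    print(" ".join(infill(sentence, rng=random.Random(0))))
```

Passing a seeded `random.Random` makes the noise reproducible, which is convenient when inspecting what the model actually sees as input.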
The masked sentence serves as input and the expected output of the model is the original sentence. Since the length of the target cannot easily be deduced from the input, the models used for this task are encoder-decoders, such as the original transformer model (Vaswani et al., 2017).
Translation is, as in Vaswani et al. (2017), also treated as a text-to-text task.
One innovation of Zelda Rose is that the models can also be trained simultaneously on denoising and translation, with a weight hyperparameter that controls each task's contribution to the optimized loss.
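As an illustration of how such a weight can combine the two objectives, here is a minimal sketch. It assumes a convex combination in which the translation loss receives the complementary weight; this page does not state the exact formula, so treat the function below (a hypothetical helper, not part of Zelda Rose's API) as one plausible reading.

```python
def multitask_loss(denoise_loss, translation_loss, denoise_loss_ratio=0.5):
    """Weighted sum of the two task losses (assumed convex combination)."""
    if not 0.0 <= denoise_loss_ratio <= 1.0:
        raise ValueError("denoise_loss_ratio must be between 0 and 1")
    return (
        denoise_loss_ratio * denoise_loss
        + (1.0 - denoise_loss_ratio) * translation_loss
    )
```

Under this assumption, a ratio of 1 trains on denoising only, a ratio of 0 on translation only, and the default of 0.5 weights both tasks equally.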
Task parameters
change_ratio: float = 0.3
denoise_langs: list[str] | None
denoise_loss_ratio: float = 0.5
poisson_lambda: float = 3.0
source_langs: list[str] | None
target_langs: list[str] | None
strict_langs: bool = False
- change_ratio is the proportion of tokens to which we apply some change, either masking or switching.
- denoise_langs, source_langs and target_langs are the codes for the languages in these respective roles. See below for their link with model and data format.
- denoise_loss_ratio is the weight (between $0$ and $1$) given to the denoising loss in the multitask loss.
- poisson_lambda is the $λ$ parameter of the Poisson distribution from which the sizes of the masked spans are drawn.
- strict_langs is a flag controlling whether the language codes are allowed to only partially match between the dataset and the model/tokenizer.
Inputs and outputs
For this task, the train and dev datasets should be in the jsonlines format, each row being a mapping from language codes to the translation in the corresponding language, such as:
{"br": "Me am eus kanet", "fr": "J'ai chanté", "en": "I have sung"}
Alternatively, for compatibility with 🤗 datasets, each row can be an arbitrary mapping that has a
"translation" key associated with a mapping in the previous format:
{"translation": {"br": "Me am eus kanet", "fr": "J'ai chanté", "en": "I have sung"}}
Inputs can come either from local files or from a 🤗 dataset.
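Both row layouts can be normalized to the same mapping when reading a local file. The sketch below uses only the standard library; `read_translations` is a hypothetical helper written for this example, not a function provided by Zelda Rose.

```python
import json


def read_translations(lines):
    """Yield a langcode -> sentence mapping for each non-empty JSON line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        row = json.loads(line)
        nested = row.get("translation")
        # 🤗 datasets style rows nest the mapping under a "translation" key
        yield nested if isinstance(nested, dict) else row


if __name__ == "__main__":
    sample = [
        '{"br": "Me am eus kanet", "fr": "J\'ai chanté", "en": "I have sung"}',
        '{"translation": {"br": "Me am eus kanet", "fr": "J\'ai chanté"}}',
    ]
    for mapping in read_translations(sample):
        print(sorted(mapping))
```

The function accepts any iterable of strings, so an open file handle can be passed directly in place of the in-memory list.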
Bibliography
- Lewis, Mike, Yinhan Liu, Naman Goyal, et al. 2020. ‘BART: Denoising Sequence-to-Sequence Pre-Training for Natural Language Generation, Translation, and Comprehension’. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.
- Liu, Yinhan, Jiatao Gu, Naman Goyal, et al. 2020. ‘Multilingual Denoising Pre-Training for Neural Machine Translation’. Transactions of the Association for Computational Linguistics 8.
- Tang, Yuqing, Chau Tran, Xian Li, et al. 2020. ‘Multilingual Translation with Extensible Multilingual Pretraining and Finetuning’. arXiv preprint.
- Vaswani, Ashish, Noam Shazeer, Niki Parmar, et al. 2017. ‘Attention Is All You Need’. In Advances in Neural Information Processing Systems 30.