Next token prediction
This task, also known as “causal language modelling”[^1], originates in Bengio et al. (2003); it was used for transfer learning by Howard and Ruder (2018) and popularized by Radford et al. (2019) as a way to produce “universal learners”. It consists, given the first $n$ tokens of a sentence, in predicting the $(n+1)$-th. Assuming a word-level tokenization, input/output pairs thus look like this:
{"input": ["Dr.", "Chef", "knew", "exactly", "where", "all", "of", "his"], "output": "feelings"}
In practice, during training, all positions of a given sentence are predicted in the same batch, which improves computational efficiency by reusing the tokens’ hidden representations. This lets us recast the task as a sentence-level token labelling task, where each token in a sentence (except the last) is labelled with the token that follows it:
{
"sentence": ["Dr.", "Chef", "knew", "exactly", "where", "all", "of", "his", "feelings"],
"labels": ["Chef", "knew", "exactly", "where", "all", "of", "his", "feelings", "were"]
}
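The shift between a sentence and its labels can be sketched in a few lines of Python; this is only an illustration using the sentence from the example above, not the library’s actual preprocessing code:

```python
# A minimal sketch of the label shift used for next token prediction.
# Word-level tokens for readability; real models use subword tokenizers.
sentence = ["Dr.", "Chef", "knew", "exactly", "where", "all",
            "of", "his", "feelings", "were"]

# Every token except the last is labelled with the token that follows it.
inputs = sentence[:-1]
labels = sentence[1:]

for token, label in zip(inputs, labels):
    print(f"{token!r} -> {label!r}")
```

Note that both `inputs` and `labels` come from the same sentence, which is why a single pass over the sentence yields a prediction target for every position at once.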
Zelda Rose handles these details itself, so the only thing you need to do as a user is to provide a raw text dataset.
Task parameters
No parameters for this task!
Inputs and outputs
For this task, the train and dev datasets should be raw text, with one sample (typically a sentence) per line. They can come either from a local text file or from a 🤗 text dataset.
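As a small illustration of the expected local file format (the file name and sentences below are placeholders), a dataset is simply one sample per line:

```python
from pathlib import Path

# Hypothetical file name; each line holds exactly one training sample.
Path("train.txt").write_text(
    "Dr. Chef knew exactly where all of his feelings were.\n"
    "Another, much shorter, sentence.\n",
    encoding="utf-8",
)

samples = Path("train.txt").read_text(encoding="utf-8").splitlines()
print(len(samples))  # 2 samples, one per line
```

With the 🤗 datasets library, a file in this format can equivalently be loaded as a text dataset with `load_dataset("text", data_files="train.txt")`.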
Bibliography
- Bengio, Yoshua, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. ‘A Neural Probabilistic Language Model’. The Journal of Machine Learning Research 3.
- Howard, Jeremy, and Sebastian Ruder. 2018. ‘Universal Language Model Fine-Tuning for Text Classification’. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia.
- Radford, Alec, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. ‘Language Models Are Unsupervised Multitask Learners’. Preprint.
[^1]: Historically, a “language model” is a model estimating the likelihood of (possibly partial) sentences. I maintain that a task where you predict the next token in a sentence is more properly “next token prediction”, and that even if, under mild assumptions, a next token predictor can be obtained from a language model and vice versa, they are not properly the same thing. That ship seems to have sailed long ago, but maybe we can still bring it back to port.