Next token prediction
This task, also known as “causal language modelling”[^1], originates in Bengio et al. (2003); it was used for transfer learning by Howard and Ruder (2018) and popularized by Radford et al. (2019) as a way to produce “universal learners”. It consists, given the first $n$ tokens of a sentence, in predicting the $(n+1)$-th. Assuming a word-level tokenization, input/output pairs thus look like this:
{"input": ["Dr.", "Chef", "knew", "exactly", "where", "all", "of", "his"], "output": "feelings"}
In practice, during training, all positions of a given sentence are predicted in the same batch, which improves computational efficiency by reusing the tokens’ hidden representations. This lets us recast the task as a sentence-level token labelling task, where each token in a sentence (except the last) is labelled with the token that follows it:
{
"sentence": ["Dr.", "Chef", "knew", "exactly", "where", "all", "of", "his", "feelings"],
"labels": ["Chef", "knew", "exactly", "where", "all", "of", "his", "feelings", "were"]
}
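The shift between a sentence and its labels can be sketched in a few lines of Python; this is only an illustration using the sentence from the example above, not the library’s actual preprocessing code:

```python
# A minimal sketch of the label shift used for next token prediction.
# Word-level tokens for readability; real models use subword tokenizers.
sentence = ["Dr.", "Chef", "knew", "exactly", "where", "all",
            "of", "his", "feelings", "were"]

# Every token except the last is labelled with the token that follows it.
inputs = sentence[:-1]
labels = sentence[1:]

for token, label in zip(inputs, labels):
    print(f"{token!r} -> {label!r}")
```

Note that both `inputs` and `labels` come from the same sentence, which is why a single pass over the sentence yields a prediction target for every position at once.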
Zelda Rose handles these details itself, so the only thing you need to do as a user is to provide a raw text dataset.
Task parameters
No parameters for this task!
Inputs and outputs
For this task, the train and dev datasets should be raw text, with one sample (typically a sentence) per line. They can come either from a local text file or from a 🤗 text dataset.
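As a small illustration of the expected local file format (the file name and sentences below are placeholders), a dataset is simply one sample per line:

```python
from pathlib import Path

# Hypothetical file name; each line holds exactly one training sample.
Path("train.txt").write_text(
    "Dr. Chef knew exactly where all of his feelings were.\n"
    "Another, much shorter, sentence.\n",
    encoding="utf-8",
)

samples = Path("train.txt").read_text(encoding="utf-8").splitlines()
print(len(samples))  # 2 samples, one per line
```

With the 🤗 datasets library, a file in this format can equivalently be loaded as a text dataset with `load_dataset("text", data_files="train.txt")`.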
Bibliography
- Bengio, Yoshua, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. ‘A Neural Probabilistic Language Model’. The Journal of Machine Learning Research 3.
- Howard, Jeremy, and Sebastian Ruder. 2018. ‘Universal Language Model Fine-Tuning for Text Classification’. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia.
- Radford, Alec, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. ‘Language Models Are Unsupervised Multitask Learners’. Preprint.
[^1]: Historically, a “language model” is a model estimating the likelihood of (possibly partial) sentences. I maintain that a task where you predict the next token in a sentence is more properly “next token prediction”, and that even if, under mild assumptions, a next token predictor can be obtained from a language model and vice versa, they are not properly the same thing. That ship seems to have sailed long ago, but maybe we can still bring it back to port.