FAQ: Pitfalls of Next-Token Prediction

Below, I address some frequently asked questions about our findings in this paper [1].

Q. I gave this problem to my favorite language model and it easily solved it.

A. Our result is not meant to be interpreted that way. Imagine there is a skill we want in a future model but that is not covered by our current generation of models; imagine that your solution is to collect many demonstrations of that skill and teach it to the model using next-token training. Are you guaranteed that the model will learn the skill this way? That is the question we give some insight into.

Q. Next-token training may not solve this task. But wouldn’t reinforcement learning or chain-of-thought come to the rescue?

A. Totally! But it is worth considering the following:

  • Is there any headroom during pretraining? Would a different objective save us a lot of compute/data?

  • Would reinforcement learning teach the model all skills, including ones that don’t already exist within the model? What about situations (like creativity) where rewards are very sparse?

  • Are there tasks that are not chain-of-thought-friendly? How about tasks that require a “spark”?

Q. The failure you show here is for a very specific graph structure. This doesn’t look like it will generalize to natural language. Is this failure at all relevant to practice?

A. You’re right that the striking failure requires the paths to be non-branching in order to encourage Clever Hans cheats; this is not true of language, where you expect a branching structure (a minimal sketch of the path-star structure follows the list below). Let’s step back and get a broader view of things:

  • Language pretraining is not the only setting where next-token learning is used. There are many applications, such as protein modeling, where models are trained with next-token prediction. See, for instance, this paper [2], which reports how models learn shortcuts when trying to imitate human behavior.

  • Eliminating the “striking failure” is a binary target; it isn’t the practitioner’s only concern. They also need to care about data- and compute-efficiency. It is possible that, even if the model can recover from such cheats, the gradients corresponding to those cheats are wasteful. If you look at Fig. 6, you’ll see that the next-token-predicting model is inefficient.
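For readers who want to see what the path-star structure looks like, here is a minimal sketch of how one might generate an instance. The node labels, arm count, and sequence format are my own illustrative choices and may differ from the exact setup in the paper.

```python
import random

def make_path_star(num_arms=5, arm_len=5, seed=0):
    """Build one path-star instance: a central start node with
    `num_arms` non-branching arms, each `arm_len` nodes long.
    Node labels are arbitrary integers so the model cannot rely on order."""
    rng = random.Random(seed)
    num_nodes = 1 + num_arms * arm_len
    nodes = rng.sample(range(100, 1000), num_nodes)
    start, rest = nodes[0], nodes[1:]

    arms = [rest[i * arm_len:(i + 1) * arm_len] for i in range(num_arms)]
    edges = []
    for arm in arms:
        edges.append((start, arm[0]))
        edges.extend(zip(arm, arm[1:]))
    rng.shuffle(edges)  # edge-list order carries no information

    goal_arm = rng.choice(arms)
    goal = goal_arm[-1]
    answer = [start] + goal_arm  # the unique path from start to goal

    # Prefix: shuffled edge list plus the (start, goal) query.
    # Key property: within an arm every node has exactly one successor,
    # so under teacher forcing every answer token *except the first arm
    # node* is trivially predictable from the previous ground-truth token.
    prefix = [t for e in edges for t in e] + [start, goal]
    return prefix, answer

prefix, answer = make_path_star()
print(len(prefix), answer)
```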

Q. It’s not just the topology of the task that makes it special; it’s also that your path lengths are all fixed. What happens if I mix the data with graphs of varying complexities?

A. In that case, we expect the model to solve the problem! See Figure 2 in this excellent paper [3], which shows how Transformers can solve compositional tasks with remarkably small depth; notice that the model there is trained on a dataset of varying hop counts, unlike our path-star task (which is also a compositional task, but with a fixed hop count).
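To make that contrast concrete, here is a hedged sketch of the questioner’s suggestion: mixing path-star instances of varying hop counts into one training set. The sampling scheme and ranges are my own illustration, not the recipe from either paper.

```python
def make_mixed_hop_dataset(num_examples=1000, hop_range=(2, 8), seed=0):
    """Sample path-star instances whose arm length (hop count) varies
    per example, so the training set mixes sub-problems of every depth.
    Reuses make_path_star from the sketch above."""
    rng = random.Random(seed)
    dataset = []
    for _ in range(num_examples):
        arm_len = rng.randint(*hop_range)  # varying number of hops
        dataset.append(make_path_star(num_arms=5, arm_len=arm_len,
                                      seed=rng.getrandbits(32)))
    return dataset
```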

To teach compositional tasks, models need some form of stepwise supervision. Your suggestion, in line with the above paper, is to use a curriculum. Another alternative is to give chain-of-thought supervision (see this paper [4]). These are great and practically important insights; the path-star task’s insight is independent of them: you may be squandering supervision that already exists in your data simply because the form of that supervision is sub-optimal.
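To give one concrete picture of what stepwise supervision could look like on the path-star task, here is a sketch of a chain-of-thought-style target that first walks backwards from the goal (where every step is a local, easy-to-supervise lookup, since each node has a unique predecessor) before stating the forward path. This is my own illustration, not the scheme from the papers cited above.

```python
def make_cot_target(prefix, answer):
    """Build an illustrative chain-of-thought-style target for a
    path-star instance produced by make_path_star above."""
    edges = prefix[:-2]                   # flattened edge list
    start, goal = prefix[-2], prefix[-1]  # the query tokens
    parent = {child: par for par, child in zip(edges[0::2], edges[1::2])}

    # Backward trace: from the goal, each node has exactly one predecessor,
    # so every intermediate step is locally determined by the prefix.
    trace = [goal]
    while trace[-1] != start:
        trace.append(parent[trace[-1]])

    return trace + ["=>"] + answer  # reasoning steps, then the final path
```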