Hacker News

cma last Sunday at 9:10 PM

> The autoregressive models consistently show better loss for the same number of training tokens

I thought bidirectional transformers (non-autoregressive) showed lower loss than autoregressive ones for the same number of training tokens.


Replies

pama last Monday at 2:52 PM

It is the other way around. If the data is causal and presented in causal order, it is impossible to beat the loss of a pure autoregressive model, because the AR factorization expresses the correct probability distribution for the dataset. Language data is mostly causal: words are spoken and written in the context of the words that precede them. Most of the remaining information that diffusion models recover through extreme oversampling of the same data should be obtainable with AR models as well, using fill-in-the-middle or order-reversal strategies, and with significant compute savings during training.
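The optimality claim can be illustrated on a toy causal source: for data generated by a first-order Markov chain, an autoregressive model that uses the true conditionals achieves a cross-entropy equal to the source's entropy rate, which no model, AR or not, can beat in expectation. This is a minimal sketch, not from the thread; the Markov chain and its parameters are invented for illustration.

```python
import math
import random

random.seed(0)

# Toy first-order Markov source over {0, 1}: the "causal" data distribution.
# P[s][t] = Pr(next token = t | current token = s)
P = {0: [0.9, 0.1], 1: [0.3, 0.7]}

def sample(n):
    """Draw a length-n sequence from the Markov chain, starting at state 0."""
    seq, s = [0], 0
    for _ in range(n - 1):
        s = 0 if random.random() < P[s][0] else 1
        seq.append(s)
    return seq

seq = sample(200_000)

# Per-token cross-entropy (nats) of the *true* autoregressive model on the data:
# the AR chain-rule factorization with the correct conditionals.
ar_nll = -sum(math.log(P[a][b]) for a, b in zip(seq, seq[1:])) / (len(seq) - 1)

# Entropy rate of the chain under its stationary distribution pi (pi = pi P,
# here pi = (0.75, 0.25)): the lower bound on expected loss for any model.
pi = (0.75, 0.25)
h = -sum(pi[s] * P[s][t] * math.log(P[s][t]) for s in (0, 1) for t in (0, 1))

print(f"AR cross-entropy: {ar_nll:.4f} nats/token, entropy rate: {h:.4f}")
```

On a long sample the two numbers agree closely, which is the sense in which a pure AR model already "has the correct probability distribution" for causal data: any other objective can only match its loss, not improve on it.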
