Hacker News

spi · yesterday at 2:27 PM

This is a very nice, detailed post! I have a few minor comments, though (some may already be addressed somewhere; it's a _long_ article and I can't claim 100% coverage :-) ):

Calling it "training an LLM" is a bit misleading. This is a small GPT-2-sized model (~160M params), while the "L" in "LLM" stands for large...

The early discussion and worries about truncating strings look a bit odd. The author later realizes they're not even going to use 30% of the total available data anyway, so who cares if, for each given string, only the first 1024 tokens are used? (And even for multi-epoch training, he doesn't discuss the obvious way to avoid throwing away data: instead of always clipping the tail, start from a random offset each epoch, maybe after a punctuation mark or similar - rough sketch below.)
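
To illustrate the random-offset idea (this is my own sketch, not the article's code; the helper name and the "one document = one token list" assumption are mine):

    import random

    def random_window(tokens, max_len=1024, rng=random):
        # tokens: token ids for one document; instead of always keeping the
        # first max_len tokens, keep a fresh random window each epoch so the
        # document tails are not systematically discarded.
        if len(tokens) <= max_len:
            return tokens
        start = rng.randrange(0, len(tokens) - max_len + 1)
        return tokens[start:start + max_len]

Snapping `start` to the nearest sentence boundary would be a small refinement on top of this.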

At this level of simplicity, setting up a validation loop might be an unneeded complication (for the autoregressive pretraining part, not the instruction tuning, of course). Since the model trains for less than one epoch, no data is seen twice (*). One might as well just track the training loss; it's slightly less "clean" because it's evaluated on different data each time, but the sheer amount of data makes up for that. The final plot shows the two curves are similar - train is noisier, of course, but nothing a bit of rolling smoothing couldn't fix.
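
By "rolling smoothing" I mean something as simple as a moving average over the per-step loss (a minimal sketch, not tied to any particular logging setup):

    import numpy as np

    def smooth(losses, window=100):
        # Moving average over a noisy per-step training-loss curve.
        kernel = np.ones(window) / window
        # mode="valid" drops the edges, so the output is window-1 steps shorter.
        return np.convolve(np.asarray(losses, dtype=float), kernel, mode="valid")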

The choice to load all tokenized text into RAM feels odd... it works, and it's possibly slightly faster than loading on the fly, but only if you have enough RAM to "waste". PyTorch loads data in separate worker processes in a non-blocking way, so keeping it on disk and loading on the fly would be safer without hurting runtime. But well, if it fits, it's certainly easier that way (although, as the author remarks, it only works if you can store it as a numpy array or torch tensor of an internally supported dtype like int or float; Python "object" dtypes get replicated per dataloader worker, and OOM is guaranteed).
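
For the on-the-fly version, a memory-mapped file is usually enough. Rough sketch, assuming the tokens were pre-saved as one flat uint16 binary file (the file name, dtype and sequence length here are assumptions, not the article's setup):

    import numpy as np
    import torch
    from torch.utils.data import Dataset

    class MemmapTokenDataset(Dataset):
        def __init__(self, path="tokens.bin", seq_len=1024):
            # np.memmap keeps the corpus on disk; each worker reads only the
            # slices it needs instead of holding everything in RAM.
            self.data = np.memmap(path, dtype=np.uint16, mode="r")
            self.seq_len = seq_len

        def __len__(self):
            return (len(self.data) - 1) // self.seq_len

        def __getitem__(self, idx):
            i = idx * self.seq_len
            chunk = self.data[i : i + self.seq_len + 1].astype(np.int64)
            x = torch.from_numpy(chunk[:-1])  # input tokens
            y = torch.from_numpy(chunk[1:])   # next-token targets
            return x, y

Wrapped in a DataLoader with a few workers, this reads from the page cache and shouldn't be a bottleneck.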

The choice to concatenate everything into one long string is a bit outdated nowadays: it trains with attention between sentences that have nothing to do with each other, which can introduce bias or at least suboptimal results. Nowadays people use masked attention ("document masking"), which is popular enough to be supported by FlashAttention: https://github.com/Dao-AILab/flash-attention/issues/654
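
The idea in plain PyTorch (an illustrative sketch using SDPA with a block-diagonal causal mask built from per-token document ids, not the FlashAttention varlen kernels the linked issue is about):

    import torch
    import torch.nn.functional as F

    def packed_causal_attention(q, k, v, doc_ids):
        # q, k, v: (batch, heads, seq, head_dim); doc_ids: (batch, seq) ints,
        # one id per token marking which document it came from after packing.
        seq = doc_ids.shape[-1]
        same_doc = doc_ids.unsqueeze(-1) == doc_ids.unsqueeze(-2)   # (b, s, s)
        causal = torch.tril(torch.ones(seq, seq, dtype=torch.bool,
                                       device=doc_ids.device))
        # True = may attend: only earlier tokens from the *same* document.
        mask = (same_doc & causal).unsqueeze(1)                      # (b, 1, s, s)
        return F.scaled_dot_product_attention(q, k, v, attn_mask=mask)

You still get the efficiency of packing, but tokens never attend across document boundaries.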

(*) Of course, the data is dirty enough that there _will_ be some duplicated stuff here or there, but the same is true for a random train/validation split. Also, such a small model has very little risk of memorizing anything, even if some data were duplicated.


Replies

BoxOfRain · yesterday at 2:53 PM

> Calling it "training an LLM" is a bit misleading. This is a small GPT-2-sized model (~160M params), while the "L" in "LLM" stands for large...

I've always felt the natural way of referring to smaller LLMs would be Medium Language Models and Small Language Models, but I guess MLM is an inauspicious acronym.
