Hacker News

kouteiheika · today at 6:45 AM · 1 reply

> On the infra side, training a 1.5B model in ~4 hours on 8×H100 is impressive.

It's hard to compare without more details about the training process and the dataset, but is it? Genuine question, because I had the opposite impression. For example, I recently did a full finetuning run on a 3B model, chewing through a 146k-entry dataset (116k of the entries have reasoning traces, so they're not short) in 7 hours on a single RTX 6000.
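
To make the comparison concrete, here's the kind of back-of-envelope normalization I have in mind: tokens processed per GPU-hour. Neither source states its token counts, so the average example length and the article's total token count below are pure placeholders I made up for illustration, not real figures.

```python
# Rough framing: normalize each run to tokens processed per GPU-hour.
# All token counts below are illustrative guesses, not numbers from
# the article or from my run.

def tokens_per_gpu_hour(total_tokens: float, n_gpus: int, hours: float) -> float:
    """Throughput normalized by hardware: tokens / (GPUs * wall-clock hours)."""
    return total_tokens / (n_gpus * hours)

# My finetune: 146k examples on one RTX 6000 for ~7 hours.
# If an average example is ~2k tokens (a guess), that's ~292M tokens total.
mine = tokens_per_gpu_hour(146_000 * 2_000, n_gpus=1, hours=7)

# The article's run: 1.5B model on 8x H100 for ~4 hours; the token count
# isn't stated, so plug in a placeholder to see what it would imply.
theirs = tokens_per_gpu_hour(5e9, n_gpus=8, hours=4)  # 5B tokens: pure placeholder

print(f"my run:    ~{mine:,.0f} tokens per GPU-hour")
print(f"their run: ~{theirs:,.0f} tokens per GPU-hour (given the assumed 5B tokens)")

# A fairer metric would also fold in model size (e.g. the usual 6 * N * D
# training-FLOPs estimate), since 1.5B vs 3B params changes the cost per token.
```

Without the actual token counts this is just a template, which is really my point: wall-clock hours alone don't say much.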


Replies