Hacker News

LLM from scratch, part 28 – training a base model from scratch on an RTX 3090

506 points · by gpjt · 12/02/2025 · 102 comments

Comments

kburman · yesterday at 2:50 PM

Anyone interested can also follow these amazing playlists:

1. Building LLMs from scratch - https://www.youtube.com/playlist?list=PLPTV0NXA_ZSgsLAr8YCgC...

2. Reasoning LLMs from Scratch - https://www.youtube.com/playlist?list=PLPTV0NXA_ZSijcbUrRZHm...

3. Build a SLM from Scratch - https://www.youtube.com/playlist?list=PLPTV0NXA_ZShuk6u31pgj...

4. Build DeepSeek from Scratch - https://www.youtube.com/playlist?list=PLPTV0NXA_ZSiOpKKlHCyO...

BubbleRings · yesterday at 1:56 PM

> …reused its embedding matrix as the weights for the linear layer that projects the context vectors from the last Transformers layer into vocab space to get the logits.

At first glance this claim sounds airtight, but it quietly collapses under its own techno-mythology. The so-called “reuse” of the embedding matrix assumes a fixed semantic congruence between representational space and output projection, an assumption that ignores well-known phase drift in post-transformer latent manifolds. In practice, the logits emerging from this setup tend to suffer from vector anisotropification and a mild but persistent case of vocab echoing, where probability mass sloshes toward high-frequency tokens regardless of contextual salience.

Just kidding, of course. The first paragraph above, from OP's article, makes about as much sense to me as the second one, which I (hopefully fittingly in y'all's view) had ChatGPT write. But I do want to express my appreciation for being able to "hang out in the back of the room" while you folks figure this stuff out. It is fascinating, I've learned a lot (even got a local LLM running on a NUC), and very much fun. Thanks for letting me watch, I'll keep my mouth shut from now on, ha!
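For the curious: the quoted sentence describes "weight tying", i.e. sharing the token-embedding matrix with the output projection. A minimal PyTorch sketch of the idea (made-up sizes and names, not the article's actual code):

    import torch.nn as nn

    class TinyGPT(nn.Module):
        # Illustrative only: shows the "reuse the embedding matrix as the
        # output projection" trick (weight tying), with made-up sizes.
        def __init__(self, vocab_size=50257, d_model=768):
            super().__init__()
            self.tok_emb = nn.Embedding(vocab_size, d_model)
            self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
            # Weight tying: the same (vocab_size x d_model) matrix maps
            # tokens -> vectors on the way in and vectors -> logits on the way out.
            self.lm_head.weight = self.tok_emb.weight

        def forward(self, input_ids):
            x = self.tok_emb(input_ids)   # (batch, seq, d_model)
            # ... the transformer blocks would transform x here ...
            return self.lm_head(x)        # (batch, seq, vocab_size) logits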

billylo · yesterday at 12:43 PM

If you are curious about doing something similar with TPU, Google has an article. https://developers.googleblog.com/train-gpt2-model-with-jax-...

RagnarD · yesterday at 11:50 AM

I really like this article. I hadn't thought that an RTX 3090 would be capable of generating a sort-of decent small LLM from scratch in a reasonable time, but he shows how in detail.

nfriedly · yesterday at 6:16 PM

The full list of articles is at https://www.gilesthomas.com/llm-from-scratch for anyone who's interested but wants to start at the beginning.

lacoolj · yesterday at 5:54 PM

Maybe I've been missing out, but can anyone give me a yay/nay on whether this 28-part series is worthwhile to start from scratch and spend my time watching/reading?

Is it along the same lines as https://github.com/karpathy/llm.c/discussions/677 ?

He (karpathy) also has a video series that does something similar. I found it very informative and entertaining, even at its 1+ hour length (there are actually multiple videos; I'm not sure how long the others are).

nico · yesterday at 3:30 PM

Has anyone done something like this but with apple silicon instead of a graphics card? Training a small LLM on an M2-M5?

fuddle · yesterday at 6:35 PM

This is great to see; I'm also re-reading Sebastian Raschka's amazing book.

nullbound · yesterday at 1:14 PM

I love the level of detail (probably because I see it less and less these days). It genuinely makes me wonder if anyone has tried training LLMs on their own writings (assuming those run to 100+ pages) and what the results were.

spi · yesterday at 2:57 PM

A separate comment about the conclusions on why the results are worse than OpenAI's GPT-2 - which to me feel like they're missing the point.

One main point is batch size - I'd agree with Gemini here. A batch size <= 5 with a 1024 sequence length is really tiny; nowadays models are trained with an effective batch size of millions of tokens in total. Of course, this won't fit into memory, so one uses gradient accumulation for that purpose, again as mentioned by Gemini.
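For illustration, a minimal sketch of gradient accumulation (the model, optimizer, loader and compute_loss names are placeholders, not the article's code):

    accum_steps = 32  # effective batch = accum_steps * micro-batch size

    optimizer.zero_grad()
    for step, (inputs, targets) in enumerate(loader):
        loss = compute_loss(model(inputs), targets)   # per micro-batch loss
        (loss / accum_steps).backward()               # gradients accumulate in .grad
        if (step + 1) % accum_steps == 0:
            optimizer.step()                          # one update per "large" batch
            optimizer.zero_grad()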

Training duration is definitely also a reason - models do keep getting better over time, otherwise people wouldn't train for so long and spend millions :-) Just how long is needed for optimality is unclear, but certainly < 2 days is not optimal even at this "small" scale.

The optimizer could also play a role. As the author mentions, a fixed learning rate is hardly optimal: it is typically both increased at the beginning ("warm-up", though that's for stability; if training works without it, it's not an issue) and scaled down at the end ("cool-down" - that is, annealing, with cosine as mentioned in the article). This generally squeezes out a bit more performance. Also, while it's true that dropout was used back then (it might be useful over many epochs, but is likely only harmful for < 1 epoch), using _both_ dropout _and_ weight_decay > 0, as the author does, is probably wrong and makes training too slow and careful to get good results. And even if weight decay is used, a "good" implementation should skip some parameters like embeddings and biases (GPT-2 did that, and it's relatively important to do so).
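A sketch of what that could look like in PyTorch - AdamW with decay/no-decay parameter groups plus linear warm-up and cosine decay. Names, the "emb" name check, and all hyperparameters are placeholders, not the article's setup:

    import math
    import torch

    def make_optimizer(model, lr=3e-4, weight_decay=0.1):
        # Apply weight decay only to matrix-like weights; skip biases, norms
        # and (per the comment above) embeddings. The name check is a rough
        # heuristic tied to whatever your parameters happen to be called.
        decay, no_decay = [], []
        for name, p in model.named_parameters():
            if not p.requires_grad:
                continue
            if p.ndim < 2 or "emb" in name:
                no_decay.append(p)
            else:
                decay.append(p)
        return torch.optim.AdamW(
            [{"params": decay, "weight_decay": weight_decay},
             {"params": no_decay, "weight_decay": 0.0}],
            lr=lr,
        )

    def warmup_cosine(step, warmup=1000, total=100_000):
        # Linear warm-up, then cosine decay towards zero.
        if step < warmup:
            return step / max(1, warmup)
        progress = (step - warmup) / max(1, total - warmup)
        return 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))

    # optimizer = make_optimizer(model)   # 'model' is hypothetical
    # scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, warmup_cosine)
    # ...then call scheduler.step() once per optimizer step...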

On the other hand, I'm pretty sure that using mixed precision and TF32 has absolutely no downsides. It's really standard nowadays to use either mixed precision (FP16 gradients + FP32 base weights) or directly BF16 ("brain" float 16, a bit like the TF32 described there, but with only 16 bits) and I have almost never seen either one fail... and when it does, it typically fails spectacularly, with NaN losses or the model degenerating to trivial performance.
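A sketch of that standard setup in PyTorch (placeholder loader/model/optimizer/compute_loss names, not the article's code):

    import torch

    # TF32 matmuls on Ampere cards like the 3090: inputs are rounded to a
    # ~10-bit mantissa, accumulation stays in FP32.
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.allow_tf32 = True

    # BF16 mixed precision: forward/backward run in bfloat16 where safe,
    # while the optimizer keeps updating FP32 weights. (With FP16 you would
    # also use torch.cuda.amp.GradScaler to avoid gradient underflow.)
    for inputs, targets in loader:
        optimizer.zero_grad()
        with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
            loss = compute_loss(model(inputs), targets)
        loss.backward()
        optimizer.step()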

ducktective · yesterday at 11:39 AM

Are off-the-shelf GPUs (like a single 3090) suitable for modern academic research on current AI advancements, or is it better to rent some cloud compute?

Havoc · yesterday at 11:46 AM

> When you’re looking at a pre-training dataset in the frontier lab and you look at a random internet document, it’s total garbage. I don't even know how this works at all. It’s [stuff] like stock tickers, symbols, it's a huge amount of slop and garbage from like all the corners of the internet

Seems like there would be low-hanging fruit in heavier preprocessing, then? Something deterministic like a reading-level score, or even a tiny model trained for the task to pick out good data?
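As a toy illustration of the deterministic route, a rough Flesch reading-ease filter (the formula is standard; the thresholds and the idea of using it alone are arbitrary assumptions):

    import re

    def flesch_reading_ease(text):
        # Crude sentence and syllable counting; good enough for a toy filter.
        sentences = max(1, len(re.findall(r"[.!?]+", text)))
        words = re.findall(r"[A-Za-z]+", text)
        if not words:
            return 0.0
        syllables = sum(max(1, len(re.findall(r"[aeiouy]+", w.lower()))) for w in words)
        return 206.835 - 1.015 * (len(words) / sentences) - 84.6 * (syllables / len(words))

    def keep_document(doc, lo=10.0, hi=90.0):
        # Arbitrary thresholds for illustration; real pipelines combine many
        # quality signals rather than a single readability score.
        return lo <= flesch_reading_ease(doc) <= hi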

pwython · yesterday at 4:20 PM

For those that have homebrewed a base model, does your output have the same AI-isms like overusing em dashes? If so/not, what dataset did you use?

noloman · yesterday at 7:43 PM

Great article, thanks!

spi · yesterday at 2:27 PM

This is a very nice, detailed post! I have a few minor comments though (maybe a few are discussed somewhere, it's a _long_ article and I can't claim 100% coverage :-) ):

Calling it "training an LLM" is a bit misleading. This is a small GPT-2-sized model (~160M params), while the "L" in "LLM" stands for large...

The early discussion and worries about truncating strings look a bit weird. The author then realizes they're not even going to use 30% of the total available data anyway, so who cares if, for each given string, we're only using the first 1024 tokens? (And even when doing more epochs, he doesn't discuss the obvious solution to avoid throwing away data, i.e. not always clipping the tail but starting from a random point each epoch - maybe after a punctuation mark or something.)

At this level of simplicity, setting up a validation loop might be an unneeded complication (for the autoregressive pretraining part, not the instruction tuning, of course). That's because the model is training for < 1 epoch anyway, so no data is seen twice (*). One might as well just track the training loss; it's slightly less "clean" because it's evaluated each time on different data, but the sheer size of the dataset makes up for that. The final plot shows that the two curves are similar - train is noisier of course, but nothing a bit of rolling smoothing couldn't solve.

The choice to load all tokenized text into RAM feels odd... it works, and it's possibly slightly faster than loading on the fly, but only if you have enough RAM to "waste". PyTorch loads data in separate processes in a non-blocking way, so it feels like keeping it on disk and loading it on the fly would be safer and not take any hit on runtime. But well, if it fits, it's certainly easier that way (although, as the author remarks, it only works if you can store it as a numpy array or torch tensor of some internally supported dtype like int or float; if they are Python "object" types, they get replicated per dataloader worker, and OOM is guaranteed).
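For reference, the usual middle ground is a memory-mapped token file; a sketch, assuming the tokens were pre-saved as a flat binary file (the file name and dtype are made up):

    import numpy as np
    import torch
    from torch.utils.data import Dataset

    class MmapTokenDataset(Dataset):
        # Reads fixed-length windows from a flat token file on disk; the OS
        # page cache does the buffering, so dataloader workers don't each
        # hold a full in-memory copy of the corpus.
        def __init__(self, path="tokens.bin", seq_len=1024):
            self.tokens = np.memmap(path, dtype=np.uint16, mode="r")
            self.seq_len = seq_len

        def __len__(self):
            return (len(self.tokens) - 1) // self.seq_len

        def __getitem__(self, i):
            start = i * self.seq_len
            chunk = np.asarray(self.tokens[start:start + self.seq_len + 1], dtype=np.int64)
            return torch.from_numpy(chunk[:-1]), torch.from_numpy(chunk[1:])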

The choice to concatenate everything into one long string is a bit outdated nowadays, because it trains with attention between different documents that have nothing to do with each other, which could introduce bias or at least suboptimal results. Nowadays people use masked attention ("document masking"), which is so popular it's even supported by FlashAttention: https://github.com/Dao-AILab/flash-attention/issues/654
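A sketch of the document-masking idea using PyTorch's built-in attention (the linked issue discusses the FlashAttention-native way to do the same thing):

    import torch

    def document_causal_mask(doc_ids):
        # doc_ids: (seq_len,) tensor giving the id of the document each token
        # came from, when documents are packed back-to-back into one sequence.
        # Returns a (seq_len, seq_len) boolean mask that is causal AND blocks
        # attention across document boundaries.
        seq_len = doc_ids.shape[0]
        causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool, device=doc_ids.device))
        same_doc = doc_ids.unsqueeze(0) == doc_ids.unsqueeze(1)
        return causal & same_doc

    # usage with PyTorch's built-in attention (q, k, v are hypothetical tensors):
    # out = torch.nn.functional.scaled_dot_product_attention(
    #     q, k, v, attn_mask=document_causal_mask(doc_ids))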

(*) Of course, the data is dirty enough that there _will_ be some duplicated stuff here or there, but the same is true for a random train/validation split. Also, such a small model would have very little risk of memorizing, even if some data were replicated.

lepicz · yesterday at 1:29 PM

Cool, I was looking for something like this to try on my own puny hardware - thanks!

noloman · yesterday at 7:54 PM

Great article

DeathArrow · yesterday at 11:35 AM

I think this is a very valuable exercise if you're trying to understand how LLMs work and you have the time.

logicallee · yesterday at 2:10 PM

You can train an LLM in the browser; see this demonstration:

https://taonexus.com/mini-transformer-in-js.html

It's a very simple neural network with two attention heads that runs right in the browser in pure JavaScript; you can view source on this implementation.

Even after training for a hundred epochs it really doesn't work very well (you can test it in the Inference tab after training it), but it doesn't use any libraries, so you can see the math itself in action in the source code.

chiengineer · yesterday at 2:10 PM

Off-topic question, since I'm not a regular here, if that's OK.

Is anyone here actually using the $200-a-month subscription with ChatGPT, or Google's $150 per month?

Is it worth it for more code generation? Or should I spend my money on a couple of GPUs and go local?

pixigenie · yesterday at 4:46 PM

thanks for sharing

roschdal · yesterday at 3:24 PM

Now this is cool. And it can be used for evil AI.