This is a fantastic educational resource. I've always found that building a "toy" version of a complex system is the best way to actually understand the architecture.
Quick question for the author: did you experiment with different tokenization strategies, or did you stick to a simple character- or word-level split at this scale? I'm curious whether BPE or similar would even be worth the overhead for a 9M-parameter model.
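For context on the trade-off I'm asking about: a character-level tokenizer is basically free to implement, whereas BPE adds a vocabulary-training step and a merge table. A minimal sketch of the character-level side (my own illustration, not code from the post):

```python
# Minimal character-level tokenizer sketch -- my assumption of what a
# "simple character-level split" looks like, not the author's implementation.
class CharTokenizer:
    def __init__(self, text):
        # Vocabulary is just the set of unique characters in the corpus.
        chars = sorted(set(text))
        self.stoi = {c: i for i, c in enumerate(chars)}  # char -> id
        self.itos = {i: c for c, i in self.stoi.items()}  # id -> char

    def encode(self, s):
        return [self.stoi[c] for c in s]

    def decode(self, ids):
        return "".join(self.itos[i] for i in ids)

tok = CharTokenizer("hello world")
ids = tok.encode("hello")
assert tok.decode(ids) == "hello"  # round-trips cleanly
```

At 9M parameters the embedding table for a char-level vocab (often <100 entries) is tiny, while a BPE vocab of, say, 8k tokens would eat a noticeably larger share of the parameter budget — which is why I'm curious whether the author found it worthwhile.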