I've played with something similar with my M1 using Apple's MLX framework. The problem is I'm compute bound. I've never managed to get my M1 Max's GPU to process more than ~7.8k tokens per second at bf16 precision, so to train a 112M parameter model on ~20 billion tokens I'd need to run the model training for ~30 days.
One solution is to reduce the scope of the problem -- you can train on a smaller less diverse dataset such as TinyStories which is a collection of 1 billion tokens of chatGPT generated children's stories. After about 40 hours, less than one weekend, you'll have a model which can generate mostly grammatical children's stories.
If you have a newer mac and/or an ultra chip you'll have more and faster GPU cores, and might be able to train on FineWeb or a similar, larger and more diverse dataset.