This lines up with something I keep coming back to. Sara Hooker's research shows that compact models now outperform massive predecessors on many tasks, and that scaling laws reliably predict only pre-training loss, not downstream performance. A minimal transformer learning 10-digit addition is a neat data point for that thesis. I wrote about the broader implications (2).
The trillion-dollar scaling bet increasingly looks like it is running into diminishing returns.
(1) https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5877662
(2) https://philippdubach.com/posts/the-most-expensive-assumptio...