No, I'm saying there are quite a few more bottlenecks than that (I/O being a big one). Even in the more efficient training frameworks, there's per-op dispatch overhead in Python itself: all the boxing/unboxing of Python objects into C++ handles, dispatcher lookup + setup, all the autograd bookkeeping, etc.
All of these bottlenecks in sum are why you'd never get to 100% MFU (but I was conceding you probably don't need to in order to get value)
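For a rough sense of what I mean by per-op overhead, here's a tiny sketch (assumes PyTorch is installed; the op chain and timings are just illustrative and will vary wildly by machine):

```python
# Minimal sketch: on tiny tensors the math is ~free, so eager mode's
# fixed per-op cost (Python, dispatcher lookup, autograd bookkeeping)
# dominates the wall-clock time.
import time
import torch

x = torch.randn(8, 8, requires_grad=True)  # tiny tensor: compute is negligible

def chain(t, n=1000):
    for _ in range(n):
        t = t * 1.0001 + 0.0  # two dispatched ops + autograd bookkeeping each
    return t

start = time.perf_counter()
y = chain(x)
eager_s = time.perf_counter() - start
print(f"eager: {eager_s * 1e3:.1f} ms for 2000 tiny ops "
      f"(~{eager_s / 2000 * 1e6:.1f} us/op, mostly dispatch, not math)")
```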
That's kind of a moot point. Even if none of those overheads existed, you'd still be getting a fraction of peak MFU. Models are fundamentally limited by memory bandwidth, even in best-case scenarios like SFT or prefill.
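A quick roofline-style back-of-the-envelope (hypothetical H100-class specs below; swap in your own hardware's numbers): if a kernel's arithmetic intensity in FLOPs per byte moved from HBM sits below the hardware's ridge point, it's bandwidth-bound no matter how clean the software stack is.

```python
# Sketch under assumed specs: ~989 TFLOP/s BF16 dense, ~3.35 TB/s HBM.
PEAK_FLOPS = 989e12
PEAK_BW = 3.35e12
ridge = PEAK_FLOPS / PEAK_BW
print(f"ridge point: ~{ridge:.0f} FLOPs/byte to be compute-bound")

# Worst case for contrast, batch-1 decode through one d x d weight matrix:
# each BF16 weight (2 bytes) is read once and used for 2 FLOPs (mul + add),
# so arithmetic intensity is ~1 FLOP/byte, far below the ridge point.
d = 8192
flops = 2 * d * d
bytes_moved = 2 * d * d  # BF16 weights; activation traffic negligible here
print(f"decode matvec: {flops / bytes_moved:.1f} FLOPs/byte -> bandwidth-bound")
```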
And what are you doing that I/O is a bottleneck?