> Not sure what was unexpected about the multi GPU part. It's very well known that most LLM frameworks, including llama.cpp, split models by layers, which creates a sequential dependency, so in a multi-GPU setup all but one GPU ends up stalled at any given moment
Oh, I thought the point of transformers was being able to split the load vertically to avoid sequential dependencies. Is that true just for training, or not at all?
Just for training and for processing the existing context (the prefill phase). But during inference, token t has to be sampled before token t+1 can be computed, so it's still sequential.
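
If it helps to see it concretely, here's a toy sketch of that (the `toy_forward` model below is made up, not any real framework's API): the prefill pass can score the whole prompt in one batched call, but the decode loop can't do anything for token t+1 until token t has actually been sampled.

```python
import random

# Toy sketch, not a real API: prefill handles the whole prompt in one pass,
# but decode is a strictly serial loop, because token t must be sampled
# before the model can be run again to get logits for token t+1.

def toy_forward(token_ids, cache=None):
    """Pretend transformer step: returns fake logits and an updated cache."""
    cache = (cache or []) + list(token_ids)
    logits = [random.random() for _ in range(16)]   # fake vocabulary of 16 tokens
    return logits, cache

def generate(prompt_ids, n_new):
    logits, cache = toy_forward(prompt_ids)          # prefill: parallel over positions
    out = list(prompt_ids)
    for _ in range(n_new):
        next_id = max(range(len(logits)), key=logits.__getitem__)  # sample token t
        out.append(next_id)
        logits, cache = toy_forward([next_id], cache)  # can't start until t exists
    return out

print(generate([1, 2, 3], 5))
```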
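
And a similarly rough sketch of the layer-splitting point from the parent comment (again illustrative only, not llama.cpp's actual code): with the layers split across two GPUs, each new token has to pass through GPU 0's layers before GPU 1 has anything to do, so for single-stream decoding only one GPU is busy at a time.

```python
# Illustrative only: two "GPUs" each own half of the layers (layer/pipeline split).
def run_layers(hidden, layers, device):
    for layer in layers:
        hidden = layer(hidden)       # each layer needs the previous layer's output
        print(f"{device} ran a layer")
    return hidden

layers = [lambda h, k=k: h + k for k in range(8)]  # stand-ins for transformer blocks
gpu0, gpu1 = layers[:4], layers[4:]                # split the model by layers

h = 0.0
h = run_layers(h, gpu0, "gpu0")   # gpu1 sits idle during this
h = run_layers(h, gpu1, "gpu1")   # gpu0 sits idle during this
```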