> Frankly, unless you rewrite your models you don't really have a choice but to use Nvidia GPUs, thanks, ironically, to Facebook (the authors of PyTorch). There is PyTorch/XLA automatic translation to TPU, but it doesn't work for "big" models. And as a point of advice: you want stuff to work on TPUs?
I don't understand what you mean. Most models aren't anywhere near big in terms of code complexity: once you have the efficient primitives to build on (hardware-accelerated matmul, backprop, flash attention, etc.), these models are in sub-thousand-LoC territory, and you can even vibe-convert them from one environment to another.
It's kind of a shock to realize how simple the logic behind LLMs is.
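Just to illustrate how small the core is, here's a bare-bones decoder block in PyTorch. The sizes and layer choices are made up for illustration; real models swap in rotary embeddings, RMSNorm, gated MLPs, KV caching and so on, but the shape of the logic is roughly this:

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """Minimal pre-norm decoder block: the model 'logic' really is this small
    once matmul and attention are delegated to efficient primitives."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        # Causal mask: position i may only attend to positions <= i.
        T = x.size(1)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + attn_out
        x = x + self.mlp(self.norm2(x))
        return x

x = torch.randn(2, 16, 512)     # (batch, sequence, d_model)
print(DecoderBlock()(x).shape)  # torch.Size([2, 16, 512])
```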
I still agree with you: Google is most likely still using Nvidia chips in addition to TPUs.
> I don't understand what you mean. Most models aren't anywhere near big in terms of code complexity: once you have the efficient primitives to build on (hardware-accelerated matmul, backprop, flash attention, etc.), these models are in sub-thousand-LoC territory, and you can even vibe-convert them from one environment to another.
You're right, but that doesn't work in practice. Transformers won't perform well without an endless series of tricks, so endless that you can't realistically write that series of tricks yourself. You can't initialize the network correctly when starting from scratch, and you can't do the basic training that makes the models good (i.e. the trillions of tokens). Flash attention? That's from 2022, it's hand-tuned CUDA, and it only works (and is only fast) on Nvidia. There are now half a dozen versions of it, all written against Nvidia's stack.
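To be fair, from PyTorch you rarely touch those kernels directly anymore: `torch.nn.functional.scaled_dot_product_attention` dispatches to a fused flash-attention kernel when your hardware and dtype support it, which in practice means an Nvidia GPU, and otherwise silently falls back to the slow "math" path. A minimal sketch:

```python
import torch
import torch.nn.functional as F

# (batch, heads, sequence, head_dim) in fp16 on an Nvidia GPU: this is the
# case where PyTorch can pick a fused flash-attention kernel under the hood.
q = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)

# On unsupported hardware/dtypes this still runs, just via the slower
# unfused "math" implementation.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 8, 1024, 64])
```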
So what do you do? Well, you "start with a backbone", as they say. That used to always mean a Llama model, but Qwen is making serious inroads.
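Concretely, "start with a backbone" looks something like this with Hugging Face transformers; the checkpoint names below are just examples of publicly released models, not a recommendation:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Pick a pretrained backbone instead of training from scratch.
name = "Qwen/Qwen2.5-7B"  # or e.g. "meta-llama/Llama-3.1-8B" (gated)
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name, torch_dtype="auto", device_map="auto"  # device_map needs accelerate
)

inputs = tokenizer("Where is Paris?", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```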
The scary part is that this is what you do for everything now. After all, Llama and Qwen are text transformers: they answer "Where is Paris?". They don't do text-to-speech, speech recognition, object tracking, classification, time series, image in or out, OCR, ... and yet all SOTA approaches to all of these can be described, only slightly inaccurately, as "Llama/Qwen with a different encoder at the start".
That even has a big advantage: mixing becomes easy. All encoders produce a stream of tokens, and the same kind of tokens, so you can "just" have a text encoder, a sound encoder, an image encoder, and a time-series encoder and concatenate their token streams together (it's not quite that simple, but ...). That actually works!
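Here's a toy sketch of that "one token stream" idea. Everything in it is a placeholder: real systems use pretrained encoders, projection layers, modality embeddings and much more careful masking, but the concatenation really is the core trick:

```python
import torch
import torch.nn as nn

class ToyMultimodalLM(nn.Module):
    """Each modality is turned into a sequence of d_model-sized 'tokens',
    and the streams are simply concatenated before hitting the backbone."""
    def __init__(self, d_model=512):
        super().__init__()
        self.text_encoder = nn.Embedding(32000, d_model)   # token ids -> embeddings
        self.audio_encoder = nn.Linear(80, d_model)        # e.g. mel frames -> embeddings
        self.image_encoder = nn.Linear(768, d_model)       # e.g. patch features -> embeddings
        self.backbone = nn.TransformerEncoder(             # stand-in for the LLM backbone
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2,
        )

    def forward(self, text_ids, audio_frames, image_patches):
        tokens = torch.cat(
            [
                self.text_encoder(text_ids),
                self.audio_encoder(audio_frames),
                self.image_encoder(image_patches),
            ],
            dim=1,  # concatenate along the sequence axis
        )
        return self.backbone(tokens)

model = ToyMultimodalLM()
out = model(
    torch.randint(0, 32000, (1, 16)),  # 16 text tokens
    torch.randn(1, 100, 80),           # 100 audio frames
    torch.randn(1, 49, 768),           # 49 image patches
)
print(out.shape)  # torch.Size([1, 165, 512])
```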
So you need Llama or Qwen to work, and not just the inference: the training and fine-tuning too, with all the tricks, not just flash attention, half of which are written in hand-tuned CUDA, because that's what you start from. Speech recognition? SOTA is taking sounds, "encoding" them into phonemes, and having Qwen correct the result. Of course, you prefer to run the literal, exact training code from either Facebook or Alibaba with as few modifications as possible, which of course means Nvidia.
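Very roughly, that speech pipeline looks like the sketch below. `rough_transcribe` is a stand-in for whatever acoustic/phoneme model you actually use, the prompt is made up, and production systems usually feed the audio encoder's tokens straight into the LLM rather than round-tripping through text:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

def rough_transcribe(audio):
    # Placeholder for the acoustic model (CTC, phoneme recognizer, ...):
    # its output is typically noisy and unpunctuated.
    return "wear is paris"

def correct_with_llm(draft, name="Qwen/Qwen2.5-7B-Instruct"):
    tok = AutoTokenizer.from_pretrained(name)
    llm = AutoModelForCausalLM.from_pretrained(name, torch_dtype="auto", device_map="auto")
    messages = [{"role": "user", "content": f"Fix the transcription errors in: {draft}"}]
    inputs = tok.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(llm.device)
    output = llm.generate(inputs, max_new_tokens=64)
    # Decode only the newly generated tokens, not the echoed prompt.
    return tok.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)

print(correct_with_llm(rough_transcribe(audio=None)))
```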