logoalt Hacker News

rjzzleeptoday at 6:08 AM1 replyview on HN

Gentle reminder that while most money is spent on LLM inference, the vast majority of useful AI use is in fact not LLMs. Also, more and more work is poured into making small models. One thing I like about the whole export controls saga is that people are finding creative ways to squeeze performance out of these devices as witnessed in this post. But, if you then look at solutions like vLLM, vLLM will just fill whatever VRAM is available, no matter the context size, or the model size. So then you have two things to worry about:

First, where do you know exactly what the optimal VRAM assignment per model, per context size is, which seems to be currently based purely on experience and second how do you make sure that only that amount is available to your infra/containers, which is being handled by DRA and stuff like https://project-hami.io

While only tangentially related to the blog post here. The title is picked in such a way that I couldn't help, but put the shameless plug here. When he wrote popping the bubble, I thought we're talking about devices and reducing NVIDIA dependency, but this seems very focused on Cuda.

Disclaimer: I work with Dynamia.ai, the founders of which created HAMi.


Replies

esperenttoday at 9:21 AM

> the vast majority of useful AI use is in fact not LLMs

Can you explain what you mean here? Are you talking about small neural networks doing specific tasks?