Hacker News

ivape, last Wednesday at 12:54 PM

Is anyone actually running a modest-sized app who can share numbers on what one GPU gets you? Assuming something like vLLM handling concurrent requests, what kind of throughput are you seeing? Serving an LLM just feels like a nightmare.