Hacker News

zozbot234 · yesterday at 5:15 PM

Cloud hardware is not inherently more "proper" than what's being proposed here; there's nothing wrong per se with targeting slower inference speeds in an on-prem, single-user context.


Replies

Aurornis · yesterday at 5:19 PM

> Cloud hardware is not inherently more "proper" than what's being proposed here

Cloud hardware can run the original model at full precision. Quantization reduces quality, and the drop at Q4 is not trivial.

Cloud hardware is also massively faster in time to first token and token generation speed.

> there's nothing wrong per se about targeting slower inference speeds in a local single-user context.

If that's what the user wants and expects, then it's fine.

Most people working interactively with an LLM would suffer from slower turns.
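A back-of-envelope sketch of why slower turns hurt interactively (all numbers here are hypothetical, chosen only to illustrate the shape of the tradeoff, not measurements of any particular setup):

```python
# Rough turn latency: time-to-first-token plus decode time.
# All figures are illustrative assumptions, not benchmarks.
def turn_latency(ttft_s: float, new_tokens: int, tokens_per_s: float) -> float:
    """Wall-clock seconds for one reply of `new_tokens` tokens."""
    return ttft_s + new_tokens / tokens_per_s

# Hypothetical cloud endpoint vs. local quantized setup, 500-token reply:
cloud = turn_latency(ttft_s=0.5, new_tokens=500, tokens_per_s=80)  # ~6.8 s
local = turn_latency(ttft_s=5.0, new_tokens=500, tokens_per_s=8)   # ~67.5 s
print(f"cloud ≈ {cloud:.1f}s, local ≈ {local:.1f}s per turn")
```

At interactive speeds the decode rate dominates: a 10x slower generation rate turns a several-second reply into a minute-long wait per turn.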

cbg0 · yesterday at 5:20 PM

Quantization can be very detrimental for some models: their quality can drop considerably from the published benchmarks, which are probably run at bf16. This is why having plenty of RAM matters.
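A rough sketch of why RAM drives the quantization choice: weight footprint scales with bytes per parameter. The figures below are approximations for weights only (they ignore KV cache, activations, and per-block quantization overhead), and the 70B size is just an example:

```python
# Approximate bytes per parameter for common formats.
# Q4 is treated as 0.5 bytes/param; real GGUF Q4 variants carry
# some extra scale/offset overhead per block, so actual files are larger.
BYTES_PER_PARAM = {"bf16": 2.0, "q8": 1.0, "q4": 0.5}

def weight_gb(params_billion: float, fmt: str) -> float:
    """Approximate weight footprint in decimal GB."""
    return params_billion * 1e9 * BYTES_PER_PARAM[fmt] / 1e9

for fmt in ("bf16", "q8", "q4"):
    print(f"70B @ {fmt}: ~{weight_gb(70, fmt):.0f} GB")
```

So a 70B model needs roughly 140 GB of memory at bf16 but only around 35 GB at Q4, which is exactly why memory-constrained local setups reach for aggressive quantization despite the quality cost.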