Hacker News

Show HN: Find the best local LLM for your hardware, ranked by benchmarks

256 points by andyyyy64 | today at 9:19 AM | 52 comments

Comments

Aurornis today at 1:48 PM

1. The results of this tool are not good. It’s recommending outdated models like the Qwen2.5 series and missing good new ones.

2. This could have been a single web page that runs in your browser and lets you enter hardware specs, like all the other tools of this kind. Installing and running unknown projects like this on your computer is not a good idea these days.

3. The project is very obviously vibecoded, down to the README.

4. Every comment from this account appears to be AI generated too.

I would recommend not installing and running this on your computer. There is no advantage over other tools and everything about the account and project looks like low effort AI generated content.

wren6991 today at 2:51 PM

I also have a script to find the best LLM for your hardware. Here:

  echo "Qwen3.6-27B"

jordiburgos today at 11:12 AM

This is very helpful too: https://www.canirun.ai/

est today at 1:44 PM

Why can't you use a web page instead?

karmakaze today at 12:51 PM

Not perfect, but I find the artificialanalysis.ai "Intelligence vs. Output Tokens Used in Artificial Analysis Intelligence Index" chart[0] (scroll down to the chart with that title) to be of great use. A proper evaluation needs to compare three things together: score, speed, and verbosity. This chart plots score vs. verbosity.

[0] https://artificialanalysis.ai/?models=gpt-oss-120b%2Cgemma-4...

pornel today at 11:01 AM

It looks nice. I've been searching for something like this recently, and was frustrated with rankings that lack the latest models or don't clearly distinguish quantizations.

Showing quality loss per quantization is nice.

I'd prefer this as a website, since I'd handle running of the model with a dedicated inference server anyway.

It would be nice to see the maximum context length that can fit on top of the baseline.

I was surprised how much token generation speed tanks when using very long context. 30/s can drop down to 2/s. A single speed metric didn't prepare me for that.

I was also positively surprised that some models scale well with batch parallelism: I can get a 4x speed improvement by running 8 requests in parallel. But this affects memory requirements, and doesn't apply to all models and inference engines. It would be nice to show that. Some sites fold it into "what's your workflow", but that's too opaque.
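
For anyone who wants to measure this on their own setup, here's a minimal sketch; it assumes an OpenAI-compatible server already running on localhost:8000, and the model name is a placeholder:

  # Rough throughput probe: the same request at increasing parallelism.
  # Assumes an OpenAI-compatible endpoint (llama-server, vLLM, etc.).
  import asyncio, time
  import httpx

  URL = "http://localhost:8000/v1/completions"
  BODY = {"model": "my-local-model", "prompt": "Hello", "max_tokens": 128}

  async def one(client):
      r = await client.post(URL, json=BODY, timeout=300)
      return r.json()["usage"]["completion_tokens"]

  async def bench(n):
      async with httpx.AsyncClient() as client:
          t0 = time.perf_counter()
          tokens = await asyncio.gather(*(one(client) for _ in range(n)))
          dt = time.perf_counter() - t0
      print(f"{n} parallel: {sum(tokens)/dt:.1f} tok/s aggregate")

  for n in (1, 2, 4, 8):
      asyncio.run(bench(n))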

KV cache quantization also makes a difference for speed, VRAM usage and max usable context.
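
Back-of-the-envelope, the KV cache is linear in context length and in bytes per element, so quantizing it from fp16 to q8 halves it; a rough sketch (the layer/head geometry below is made up for illustration):

  # 2x is for keys + values; fp16 = 2 bytes/elem, q8 = 1, q4 ~ 0.5.
  def kv_cache_gb(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2):
      return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 1e9

  # e.g. 32 layers, 8 KV heads of dim 128, at 32k context:
  print(kv_cache_gb(32, 8, 128, 32_768))     # fp16: ~4.3 GB
  print(kv_cache_gb(32, 8, 128, 32_768, 1))  # q8:   ~2.1 GB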

On Apple Silicon, MLX-compatible model builds make a difference, so I'd like benchmarks to make clear they're based on the fastest implementation.

Multi-token prediction is another aspect that may substantially change speed.

zambelli today at 2:55 PM

Only thing I would add is the ability to point to new/uncataloged benchmarks. If I have a favorite benchmark that best matches my use-case, the ability to point to it and have it fuzzy match model names or what have you would be a neat feature.

Bigsy today at 10:41 AM

The brew install is broken.

It seems pretty rubbish, I have to say: it's recommending me loads of Qwen2.5 models, which are really old, while I'm easily running Qwen3.5 and 3.6 models on this Mac at decent quants.

desireco42 today at 3:07 PM

I got really good results when I asked Pi (agent) to install

https://github.com/AlexsJones/llmfit

and tell me which models are good for me. It organized them per use case, and the selection was solid.

armcat today at 1:01 PM

Interesting concept! A suggestion: `whichllm <USE_CASE>` would be more beneficial, e.g. `whichllm coding` or `whichllm text-to-video`.
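
Something like this hypothetical re-weighting is what I have in mind; the benchmark names, weights, and models here are all invented:

  # Hypothetical sketch of a use-case-aware ranking: re-weight each
  # model's benchmark scores per use case. Everything here is made up.
  import sys

  WEIGHTS = {
      "coding":  {"humaneval": 0.7, "mmlu": 0.3},
      "general": {"mmlu": 0.6, "ifeval": 0.4},
  }
  MODELS = [  # placeholder catalog entries
      {"name": "model-a", "scores": {"humaneval": 0.61, "mmlu": 0.72}},
      {"name": "model-b", "scores": {"mmlu": 0.80, "ifeval": 0.75}},
  ]

  use_case = sys.argv[1] if len(sys.argv) > 1 else "coding"
  weights = WEIGHTS[use_case]
  ranked = sorted(MODELS, reverse=True,
                  key=lambda m: sum(m["scores"].get(b, 0.0) * w
                                    for b, w in weights.items()))
  print(*(m["name"] for m in ranked), sep="\n")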

drzaiusx11 today at 1:49 PM

Where exactly are the metrics sourced from? Externally, or does this project run its own benchmarks somewhere? The latter would give more apples-to-apples comparisons, in case some external sources are biased.

llagerlof today at 10:52 AM

What’s new here compared to llmfit?

https://github.com/AlexsJones/llmfit

zkmon today at 12:48 PM

"Best LLM" doesn't really depend on hardware alone. It actually depends more on your needs - type of workload, context length needed etc.

karmakaze today at 1:15 PM

Is there any free hosting for Python scripts? That would be much more convenient for casual use.

Jasssss today at 10:27 AM

The plan command is clever. How do you handle the VRAM estimation for models with sliding window attention vs full context? Something like Mistral at 32k context uses way less KV cache than Llama at the same context length, but from the README it looks like the estimation is based on a fixed context size. Does it account for that?
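
For reference, the difference is basically whether the cached length is clamped to the attention window; a sketch with invented geometry (not real model numbers):

  # Full attention caches the whole context; sliding-window attention
  # only caches up to the window size. Geometry below is invented.
  def kv_gb(n_layers, n_kv_heads, head_dim, cached_tokens, bytes_per_elem=2):
      return 2 * n_layers * n_kv_heads * head_dim * cached_tokens * bytes_per_elem / 1e9

  ctx, window = 32_768, 4_096
  print("full attention:", kv_gb(32, 8, 128, ctx))               # ~4.3 GB
  print("sliding window:", kv_gb(32, 8, 128, min(ctx, window)))  # ~0.5 GB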

sleepyeldrazi today at 11:03 AM

I love this community. I started building a simple website for exactly this a couple of hours ago, and you've already made an even more advanced version. Hats off to you, sir.

If I ever decide to actually publish the site, is it alright if I mention you somewhere, along the lines of "If you want a more accurate estimation, check out this project: <your repo>"? I think there's value in a simple website that estimates this for you and gives you instructions and common flags for starting it yourself (plus a prompt crafted for you to optionally give to an LLM to set it up), but I'm going off a simple "choose an OS, GPU/VRAM, here's a list of options" flow rather than actually scanning the machine (which is a lot more accurate).

Bekamakhara today at 2:02 PM

Tried it on a 4060, got Qwen3-14B Q3_K_M. Matches what I actually run. The brew install failed for me too, though (macOS 14.5).

swaminarayan today at 2:34 PM

How does it select the model? Using AI?

rafram today at 1:32 PM

OP is a newish user, all of their responses here are copied straight from Claude, and this project has an LLM-slop README (count them: 48 em-dashes on the page!) and LLM code. Just not very interesting.

nurettin today at 2:49 PM

Find the best LLM for your local hardware.

*lifts mask*

It's qwen.

Jeremy1026 today at 1:49 PM

canirun.ai does something similar, but doesn't require any installs. I've found it to be pretty accurate for my setup too.

justindotdev today at 12:51 PM

It'd be nice if it had iGPU support; it can't even detect mine. Overall a great tool, though. Happy this exists.

wald3n today at 12:41 PM

Cool idea, thanks for making this

kramit1288 today at 10:45 AM

Accurate memory estimation is key here: it will crash if the estimate isn't accurate, and it can't be generic across all local LLMs, since each model has different context memory requirements.

andai today at 12:44 PM

Has anyone gotten the old gpt-oss models running? They scored very high on benchmarks but I constantly had strange problems with them.

So two questions there:

(1) Is it actually possible to get good results with them? (Some people said they did, which implies it might just be hard to get them running properly; but if you can, are they actually good?) Which leads to the second question:

(2) Are benchmarks a spook?

---

...Also, is OP Claude?

macwhisperer today at 10:53 AM

Can you add the other quants, like IQ3_M?

Also, my personal simple rule of thumb for local AI sizing is:

max model size (GB) = ram (GB) / 1.65
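
Applied to a few common configurations (purely illustrative):

  # Worked examples of the rule of thumb above.
  for ram in (16, 32, 64, 128):
      print(f"{ram} GB RAM -> max model size ~{ram / 1.65:.1f} GB")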

cyanydeez today at 11:42 AM

This doesn't correctly detect the unified memory architecture for:

  GPU 0: STRXLGEN — 8.0 GB (ROCm 6.19.8-200.fc43.x86_64) — BW: N/A
  CPU: AMD RYZEN AI MAX+ 395 w/ Radeon 8060S — 16 cores (AVX2, AVX-512)

The 8 GB is the reserved memory, not the total memory available to the GPU.

On Linux, the unified memory split is set like this: https://www.jeffgeerling.com/blog/2025/increasing-vram-alloc...

Don't feel bad though, nvtop doesn't do it correctly either.

pbronez today at 11:16 AM

Cool, but it looks like it doesn't actually test anything on your machine? It does hardware detection and then some lookups. Maybe I missed it, but I really want a tool like this to actually run a model on my machine to get the speed numbers.

I've been using RapidMLX for this. The integrated speed tests matter because the quality of the backend is a moving target, and the quantization / MLX format conversion also matters. It's not enough to say "oh, use this model family with X parameters"; you have to add the architecture-specific quantization too.

https://github.com/raullenchai/Rapid-MLX
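
For a rough DIY version of that, here's a minimal probe sketch; it uses llama-cpp-python as a generic stand-in rather than MLX, and the GGUF path is a placeholder:

  # Time one generation and report speed (includes prompt processing).
  # Assumes llama-cpp-python is installed and a GGUF exists at this path.
  import time
  from llama_cpp import Llama

  llm = Llama(model_path="model.Q4_K_M.gguf", n_ctx=4096, verbose=False)
  t0 = time.perf_counter()
  out = llm("Explain KV caches in one paragraph.", max_tokens=256)
  dt = time.perf_counter() - t0
  n = out["usage"]["completion_tokens"]
  print(f"{n} tokens in {dt:.1f}s -> {n/dt:.1f} tok/s")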

s3anw3 today at 1:22 PM

Good job.
