Launch HN: General Instinct (YC P26) – Frontier models on edge devices

15 points | by guanming0717 | today at 4:33 PM | 7 comments

Hey HN, Guanming and Bill here from General Instinct (https://general-instinct.com/).

After years of working in robotics, we kept running into the same problem: the best models never fit the hardware we actually had available.

The models that performed best were usually designed around datacenter assumptions: large GPUs, lots of memory bandwidth, and reliable network access. But most physical systems have the opposite constraints.

That led us down the path of figuring out how much of a frontier model could be preserved while still making it practical to run on edge hardware.

As part of that work, we recently open-sourced InstinctRazor (https://github.com/General-Instinct/InstinctRazor).

One result we're excited about is compressing Qwen3.5-122B-A10B, a roughly 245 GB BF16 MoE model, into a 48 GiB GGUF. The resulting model is actually smaller than Gemma-4-26B-A4B while outperforming it on benchmarks such as MMLU-Pro and GPQA-D. We preserve the parts that are always active (router, norms, Gated-DeltaNet/SSM layers, vision pathway, etc.) and quantize the routed experts much more aggressively. We then use on-policy distillation to recover capability lost during quantization.
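
For a rough sense of what those sizes imply, here is a back-of-the-envelope sketch in Python. It assumes the total parameter count is exactly the 122B in the model name and ignores GGUF metadata overhead, so the numbers are estimates rather than measured values.

```python
# Back-of-the-envelope check of the sizes quoted above.
# Assumptions (not stated in the post): total parameters are taken as exactly
# 122e9 from the model name, and GGUF metadata overhead is ignored.

GB = 10**9    # decimal gigabyte
GIB = 2**30   # binary gibibyte

total_params = 122e9        # "122B" in Qwen3.5-122B-A10B (assumed exact)
bf16_bytes_per_param = 2    # BF16 stores each weight in 16 bits

# Size of the uncompressed BF16 checkpoint, in decimal GB.
bf16_size_gb = total_params * bf16_bytes_per_param / GB

# Size of the compressed GGUF, in bytes.
gguf_size_bytes = 48 * GIB

# Average bits per weight across the whole file. Because the always-active
# parts are kept at higher precision, the routed experts must sit below this.
avg_bits_per_weight = gguf_size_bytes * 8 / total_params

# Overall compression ratio relative to BF16.
compression_ratio = (total_params * bf16_bytes_per_param) / gguf_size_bytes

print(f"BF16 size:           {bf16_size_gb:.0f} GB")
print(f"Average bits/weight: {avg_bits_per_weight:.2f}")
print(f"Compression ratio:   {compression_ratio:.2f}x")
```

Under these assumptions the file averages a little under 3.4 bits per weight, about 4.7x smaller than BF16. The BF16 estimate of 244 GB lines up with the "roughly 245 GB" figure.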

The model can also run in a "small GPU" configuration where experts are streamed from system RAM. With an 8k context window, peak VRAM usage is around 7.6–8 GB.

If you're interested in the technical details, we wrote up the approach here (https://general-instinct.com/blog/frontier-moe-sub-4-bit).

We're especially interested in hearing from people deploying models onto robots or other edge devices. What models are you trying to run locally today? What has been the biggest bottleneck in getting them into production?


Comments

rohansood15 | today at 5:40 PM

Have you benchmarked against other 3-bit dynamic quants like Unsloth? I am sorry but this framing against a full precision, newer, smaller MoE just seems misleading. Also, Gemma-4-26B-A4B is not the SOTA for edge. Even at launch, that would be the 31B.

XenophileJKO | today at 5:34 PM

I'm still kind of surprised that people are targeting edge deployment of MoE models. By definition they optimize for computation cost at the expense of memory efficiency. We generally need the opposite on the edge.

I'm hoping to see more work in the other direction with cyclic/looped transformers and other memory dense approaches.

VikRubenfeld | today at 4:50 PM

You've likely heard about this - he'd probably like to talk to you and might potentially give you some good PR.

https://www.youtube.com/watch?v=rAzT5lcezPs&t=467s
