Hacker News

Step 3.5 Flash – Open-source foundation model, supports deep reasoning at speed

109 points by kristianp today at 2:32 AM | 34 comments

Comments

anentropic today at 10:11 AM

> 51.0% on Terminal-Bench 2.0, proving its ability to handle sophisticated, long-horizon tasks with unwavering stability

I don't know anything about Terminal-Bench, but on the face of it, a 51% score on a test metric doesn't sound like it would guarantee "unwavering stability" on sophisticated long-horizon tasks.

culi today at 8:21 AM

It's nice to see more focus on efficiency. All the recent model releases have come with massive jumps in certain benchmarks, but when you dig into it, those results are almost always paired with a massive increase in token usage (ahem, Google Deep Think, ahem). For AI to be truly transformational, it needs to solve the electricity problem.

danieltanfh95 today at 5:05 AM

Hallucinates like crazy. Use with caution. Tested it with a simple "Find me championship decks for X pokemon" and "How does Y deck work". Opus 4.6, Deepseek, and Kimi all performed well, as expected.

kristianp today at 2:33 AM

A recent model, released a couple of weeks ago. "Mixture of Experts (MoE) architecture, it selectively activates only 11B of its 196B parameters per token". It beats Kimi K2.5 and GLM 4.7 on more benchmarks than it loses to them.

Edit: there are 4-bit quants that can be run on a 128GB machine like a GB10 [1], AI Max+ 395, or Mac Studio.

[1] https://forums.developer.nvidia.com/t/running-step-3-5-flash...
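The 128GB claim can be sanity-checked with back-of-the-envelope arithmetic. A minimal sketch, counting weights only (real deployments also need memory for the KV cache, activations, and quantization scales, which this ignores):

```python
# Rough weight-memory estimate for a 4-bit quantization of a
# 196B-parameter model (weights only, runtime overhead ignored).
total_params = 196e9
bits_per_weight = 4

weight_bytes = total_params * bits_per_weight / 8
print(f"quantized weights: {weight_bytes / 1e9:.0f} GB")  # ~98 GB, fits in 128 GB

# MoE routing activates only ~11B parameters per token, so each token
# only touches a small slice of those weights.
active_params = 11e9
active_bytes = active_params * bits_per_weight / 8
print(f"read per token: {active_bytes / 1e9:.1f} GB")  # ~5.5 GB
```

At ~98 GB of weights, the model fits in 128 GB of unified memory with room left for the KV cache, which is why those specific machines come up.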

janalsncm today at 9:37 AM

Number of params isn't really the relevant metric, IMO. Top models don't support local inference anyway. More relevant is tokens per dollar, or tokens per second.
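The metric the commenter suggests is trivial to compute. A sketch with made-up example prices purely for illustration (not real vendor pricing):

```python
def tokens_per_dollar(price_per_million_tokens: float) -> float:
    """How many output tokens one dollar buys at a given list price."""
    return 1_000_000 / price_per_million_tokens

# Hypothetical prices, for illustration only:
print(tokens_per_dollar(2.50))   # 400000.0 tokens per dollar
print(tokens_per_dollar(15.00))  # roughly 66,667 tokens per dollar
```

The same ratio framing works for throughput (tokens per second) once you have measured generation speed, which is the comparison that actually matters for hosted models.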

Mashimo today at 10:48 AM

Holy moly, I made a simple coding prompt and the amount of reasoning output could fill a small book.

> create a single html file with a voxel car that drives in a circle.

Compared to GLM 4.7 / 5 and Kimi 2.5 it took a while. The output was fast, but because it wrote so much I had to wait longer. Also, the output was... more bare-bones compared to the others.

mohsen1 today at 9:28 AM

SWE-bench Verified is nice, but we need better SWE benchmarks. Making a fair benchmark is a lot of work, and it takes a lot of money to run one continuously.

Most of "live" benchmarks are not running enough with recent models to give you a good picture of which models win.

The idea of a live benchmark is great! There are thousands of GitHub issues that are resolved with a PR every day.
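Harvesting those issues is straightforward to prototype. A sketch using GitHub's issue search API, where the `linked:pr` qualifier restricts results to issues closed with a linked pull request (auth, rate limits, and the actual HTTP call are omitted; the function name and parameters are illustrative, not from any existing benchmark):

```python
from urllib.parse import urlencode

def recent_resolved_issues_url(language: str = "python", per_page: int = 50) -> str:
    """Build a GitHub search API URL for recently closed issues
    that were resolved via a linked PR -- raw material for a
    continuously refreshed SWE benchmark."""
    query = f"is:issue is:closed linked:pr language:{language}"
    params = urlencode({"q": query, "sort": "updated", "per_page": per_page})
    return f"https://api.github.com/search/issues?{params}"

print(recent_resolved_issues_url())
```

The hard part isn't collecting candidates; it's filtering them into tasks that are self-contained, reproducible, and not already in model training data.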

amelius today at 10:54 AM

Does it pass the carwash test?

prmph today at 9:43 AM

Interesting.

Each time a Chinese model makes the news, I wonder: How come no major models are coming from Japan or Europe?

wmf today at 3:35 AM

That reversed x-axis sure is confusing.

sinenomine today at 9:33 AM

Works impressively well with pi.dev minimal agent.

SilverElfin today at 4:53 AM

So who exactly is StepFun? What is their business (how do they make money)? Each time I click “About Stepfun” somewhere on their website, it sends me to a generic landing page in a loop.

agentifysh today at 6:37 AM

What country is behind this one?
