CSMastermind • yesterday at 9:17 PM

DeepSWE is the benchmark you actually want to look out for. It's the only one that aligns with user-reported results from trying the models.

Replies

ryeguy • today at 12:14 AM

Did you read the blog post? They compare to DeepSWE and call it out as the worst one for false positives (the solution failed, but the benchmark assessed it as correct). It also has less language variance.
