Hey, thanks for kicking the tires! The run you’re describing was done in mid-April, right after GPT-4.1 went live. Since then OpenAI has refreshed the weights behind the “gpt-4.1” alias a couple of times, and one of those updates fixed the em-dash miss.
If you re-ran it today you'd see the same improved pass rate I'm getting now. That's the downside of benchmarking against a floating model alias: behaviour can change quietly under you unless you pin to a dated snapshot.
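In case it's useful, pinning is a one-word change with the standard openai Python client (a minimal sketch; the prompt is just a placeholder):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# "gpt-4.1" floats with whatever OpenAI currently serves under that alias;
# "gpt-4.1-2025-04-14" is frozen, so reruns stay comparable.
resp = client.chat.completions.create(
    model="gpt-4.1-2025-04-14",  # dated snapshot instead of the bare alias
    messages=[{"role": "user", "content": "placeholder prompt"}],
)
print(resp.choices[0].message.content)
```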
For bigger, noisier prompts (or on GPT-3.5-turbo, which hasn’t changed) TSCE still gives a solid uplift, so the framework’s value stands. Appreciate you checking it out!
> Since then OpenAI has refreshed the weights behind the “gpt-4.1” alias a couple of times, and one of those updates fixed the em-dash miss.
I don't know where you are getting this information from... The only snapshot of gpt-4.1 is gpt-4.1-2025-04-14 (mid-April), and the gpt-4.1 alias still points to it [1].
Just to be sure, I re-ran my test specifying that particular snapshot and am still getting a 100% pass rate.
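For anyone who wants to reproduce this, a check of this shape is all it takes (a sketch with the standard openai Python client; the prompt and trial count are illustrative, not the exact harness):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative stand-in for the real harness prompt.
PROMPT = "Answer in one short paragraph and do not use the em-dash character."
TRIALS = 20

passes = 0
for _ in range(TRIALS):
    resp = client.chat.completions.create(
        model="gpt-4.1-2025-04-14",  # the only dated snapshot, per [1]
        messages=[{"role": "user", "content": PROMPT}],
    )
    text = resp.choices[0].message.content or ""
    if "\u2014" not in text:  # U+2014 EM DASH
        passes += 1

print(f"pass rate: {passes}/{TRIALS}")
```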
[1]: https://platform.openai.com/docs/models/gpt-4.1