Hey, thanks for kicking the tires! The run you’re describing was done in mid-April, right after GPT-4.1 went live. Since then OpenAI has refreshed the weights behind the “gpt-4.1” alias a couple of times, and one of those updates fixed the em-dash miss.
If you re-ran it today you'd see the same improved pass rate I'm getting now. That's the downside of benchmarking against a floating model alias: behaviour can change quietly under you unless you pin to a dated snapshot.
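In case it's useful, pinning is a one-word change with the standard openai Python client (a minimal sketch; the prompt is just a placeholder):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# "gpt-4.1" floats with whatever OpenAI currently serves under that alias;
# "gpt-4.1-2025-04-14" is frozen, so reruns stay comparable.
resp = client.chat.completions.create(
    model="gpt-4.1-2025-04-14",  # dated snapshot instead of the bare alias
    messages=[{"role": "user", "content": "placeholder prompt"}],
)
print(resp.choices[0].message.content)
```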
For bigger, noisier prompts (or on GPT-3.5-turbo, which hasn’t changed) TSCE still gives a solid uplift, so the framework’s value stands. Appreciate you checking it out!
> Since then OpenAI has refreshed the weights behind the “gpt-4.1” alias a couple of times, and one of those updates fixed the em-dash miss.
I don't know where you are getting this information from... The only snapshot of gpt-4.1 is gpt-4.1-2025-04-14 (mid-April), and the gpt-4.1 alias still points to it [1].
Just to be sure, I re-ran my test specifying that particular snapshot and am still getting a 100% pass rate.
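For anyone who wants to reproduce this, a check of this shape is all it takes (a sketch with the standard openai Python client; the prompt and trial count are illustrative, not the exact harness):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative stand-in for the real harness prompt.
PROMPT = "Answer in one short paragraph and do not use the em-dash character."
TRIALS = 20

passes = 0
for _ in range(TRIALS):
    resp = client.chat.completions.create(
        model="gpt-4.1-2025-04-14",  # the only dated snapshot, per [1]
        messages=[{"role": "user", "content": PROMPT}],
    )
    text = resp.choices[0].message.content or ""
    if "\u2014" not in text:  # U+2014 EM DASH
        passes += 1

print(f"pass rate: {passes}/{TRIALS}")
```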
[1]: https://platform.openai.com/docs/models/gpt-4.1