First thoughts using gpt-5.3-codex-spark in Codex CLI:
Blazing fast, but it definitely has a small-model feel.
It's tearing up bluey bench (my personal agent speed benchmark), a file-system task where the agent generates transcripts for the untitled episodes of a season of Bluey, performs a web search to find the episode descriptions, and then matches the transcripts against the descriptions to generate file names and metadata for each episode.
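For anyone curious what the matching step looks like, here's a minimal sketch, assuming the transcripts are .txt files and the scraped descriptions are already in a dict keyed by episode title; the word-overlap scoring and the naming scheme are my own guesses, not the actual pipeline.

```python
# Hypothetical sketch of the transcript-to-description matching step.
# Paths, scoring, and naming below are assumptions, not the real setup.
from pathlib import Path


def overlap(transcript: str, description: str) -> int:
    """Score a transcript against a description by shared lowercase words."""
    return len(set(transcript.lower().split()) & set(description.lower().split()))


def match_episodes(transcript_dir: Path, descriptions: dict[str, str]) -> dict[Path, str]:
    """Assign each untitled transcript the best-matching episode title."""
    assignments: dict[Path, str] = {}
    for path in sorted(transcript_dir.glob("*.txt")):
        text = path.read_text()
        assignments[path] = max(descriptions, key=lambda t: overlap(text, descriptions[t]))
    return assignments


if __name__ == "__main__":
    # Descriptions would come from the web-search step; one shown for shape.
    descriptions = {"Hammerbarn": "Bluey and Bingo build their own Hammerbarn in the backyard..."}
    for path, title in match_episodes(Path("transcripts"), descriptions).items():
        print(f"{path.name} -> {title}")
```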
Downsides:
- It has to be prompted to follow rules in my media library's AGENTS.md that the larger models adhere to without additional prompting (hypothetical example after this list).
- It's less careful with how it handles context, which makes its actions less context-efficient. Combine that with the smaller context window and I'm seeing frequent compactions.
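For context, here's a hypothetical excerpt of the kind of AGENTS.md rule I mean (not my actual file):

```markdown
## Media library rules
- Name episode files as "Bluey - S02E07 - Title.mkv"; never overwrite existing files.
- After renaming, write the matched episode description into a sibling .nfo metadata file.
```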
Bluey Bench* (minus transcription time):

| Harness | Model | Reasoning effort | Time |
| --- | --- | --- | --- |
| Codex CLI | gpt-5.3-codex-spark | low | 20s |
| Codex CLI | gpt-5.3-codex-spark | medium | 41s |
| Codex CLI | gpt-5.3-codex-spark | xhigh | 1m 09s (1 compaction) |
| Codex CLI | gpt-5.3-codex | low | 1m 04s |
| Codex CLI | gpt-5.3-codex | medium | 1m 50s |
| Codex CLI | gpt-5.2 | low | 3m 04s |
| Codex CLI | gpt-5.2 | medium | 5m 20s |
| Claude Code | opus-4.6 | no thinking | 1m 04s |
| Antigravity | gemini-3-flash | n/a | 1m 40s |
| Antigravity | gemini-3-pro | low | 3m 39s |

*Season 2, 52 episodes

Can you compare it to Opus 4.6 with thinking disabled? It seems to have very impressive benchmark scores. Could also be pretty fast.
I wonder why they named it so similarly to the normal codex model when it's much worse, though still cool of course.
can we please make the bluey bench the gold standard for all models always