First thoughts using gpt-5.3-codex-spark in Codex CLI:
Blazing fast, but it definitely has a small-model feel.
It's tearing up bluey bench (my personal agent speed benchmark), a file-system task where the agent generates transcripts for the untitled episodes of a season of Bluey, performs a web search to find the episode descriptions, and then matches the transcripts against the descriptions to generate file names and metadata for each episode.
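For anyone curious what the matching step looks like, here's a minimal sketch, assuming the transcripts are .txt files and the scraped descriptions are already in a dict keyed by episode title; the word-overlap scoring and the naming scheme are my own guesses, not the actual pipeline.

```python
# Hypothetical sketch of the transcript-to-description matching step.
# Paths, scoring, and naming below are assumptions, not the real setup.
from pathlib import Path


def overlap(transcript: str, description: str) -> int:
    """Score a transcript against a description by shared lowercase words."""
    return len(set(transcript.lower().split()) & set(description.lower().split()))


def match_episodes(transcript_dir: Path, descriptions: dict[str, str]) -> dict[Path, str]:
    """Assign each untitled transcript the best-matching episode title."""
    assignments: dict[Path, str] = {}
    for path in sorted(transcript_dir.glob("*.txt")):
        text = path.read_text()
        assignments[path] = max(descriptions, key=lambda t: overlap(text, descriptions[t]))
    return assignments


if __name__ == "__main__":
    # Descriptions would come from the web-search step; one shown for shape.
    descriptions = {"Hammerbarn": "Bluey and Bingo build their own Hammerbarn in the backyard..."}
    for path, title in match_episodes(Path("transcripts"), descriptions).items():
        print(f"{path.name} -> {title}")
```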
Downsides:
- It has to be prompted to follow rules in my media library's AGENTS.md that the larger models adhere to without additional prompting (hypothetical example after this list).
- It's less careful with how it handles context, which makes its actions less context-efficient. Combine that with the smaller context window and I'm seeing frequent compactions.
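For context, here's a hypothetical excerpt of the kind of AGENTS.md rule I mean (not my actual file):

```markdown
## Media library rules
- Name episode files as "Bluey - S02E07 - Title.mkv"; never overwrite existing files.
- After renaming, write the matched episode description into a sibling .nfo metadata file.
```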
Bluey Bench* (minus transcription time):

| Harness | Model | Reasoning effort | Time |
| --- | --- | --- | --- |
| Codex CLI | gpt-5.3-codex-spark | low | 20s |
| Codex CLI | gpt-5.3-codex-spark | medium | 41s |
| Codex CLI | gpt-5.3-codex-spark | xhigh | 1m 09s (1 compaction) |
| Codex CLI | gpt-5.3-codex | low | 1m 04s |
| Codex CLI | gpt-5.3-codex | medium | 1m 50s |
| Codex CLI | gpt-5.2 | low | 3m 04s |
| Codex CLI | gpt-5.2 | medium | 5m 20s |
| Claude Code | opus-4.6 | no thinking | 1m 04s |
| Antigravity | gemini-3-flash | n/a | 1m 40s |
| Antigravity | gemini-3-pro | low | 3m 39s |

*Season 2, 52 episodes

Can you compare it to Opus 4.6 with thinking disabled? It seems to have very impressive benchmark scores. Could also be pretty fast.
I wonder why they named it so similarly to the normal codex model when it's much worse, though still cool of course.
can we please make the bluey bench the gold standard for all models always