I gave Codex with GPT-5.5 High this prompt: Identify bugs in [datrs/varinteger...

moonset • today at 12:37 AM • 5 replies • view on HN

I gave Codex with GPT-5.5 High this prompt:

    Identify bugs in [datrs/varinteger](https://github.com/datrs/varinteger) . Do NOT look at the GitHub issues, just inspect the source

It also found the bug that Leanstral 1.5 found and the authors highlighted. I think this bug wasn't especially tricky; it's just a case of too few eyeballs on this repo.

Congrats on the release regardless! Excited for the direction Lean + automated AI proofs are headed.

Disclosure: I work at OpenAI.

Replies

adev_ • today at 2:14 AM

> It also found the bug that Leanstral 1.5 found and the authors highlighted

This is a little bit like someone pointing the moon and you look at the finger.

The formal proof domain goes way beyond just finding bugs.

It has tons of usages in term of functional safety, protocol validation, cryptography, etc...

The fact Mistral tackle this kind of problem is both smart and not so surprising.

Smart because it is niche enough that they do not front face the big competitors (yet).

No so surprising because the French labs have a well known and long time expertise with formal proof tools (Coq and all its Ocaml associated tools). It has been historically mainly pushed by the aerospace and train industries (Airbus, Dassault, Alsthom).

ilc • today at 1:55 AM

Given that they directly compare to GPT-5.5 in their documentation. This comes off as puppy kicking to me. They state it is not SOTA, even IN its domain!

Honestly: Think twice before dragging your firm into what you say.

Disclaimer: I speak for myself. Not any firm I am associated with.

noperator • today at 1:39 AM

Leanstral 1.5 has 6B active parameters. How many parameters does GPT-5.5 have?

➕ show 1 reply

InsideOutSanta • today at 6:34 AM

GPT-5.5 is what, a trillion+ parameter model? I think the insight here is that you can do this with a tiny model.

8note • today at 1:38 AM

the mechanism is whats interesting, rather than whether it could do it.

this sounds like a great tool to add to the toolbelt, as part of the "how do we handle all the code output from LLMs" problem

alt Hacker News

Replies