Some quotes from the article stand out: "Claude after working for some time seem to always stop to recap things" Question: Were you running out of context? That's why certain frameworks like intentional compaction are being worked on. Large codebases have specific needs when working with an LLM.
"I've never interacted with Rust in my life"
:-/
How is this a good idea? How can I trust the generated code?
I'm very skeptical, but this is also something that's easy to compare using the original as a reference implementation, right? Providing lots of random input and fixing any disparities is a classic approach for rewriting/porting a system.
Hopefully they have a test suite written by QA, otherwise they're for sure going to have a buggy mess on their hands. People need to learn that if you must rewrite something (often you don't actually need to), then an incremental approach is best.
His goal was to get a faster oracle that encoded the behavior of Pokemon that he could use for a different training project. So this project provides that without needing to be maintainable or understandable itself.
I think it could work if they have tests with good coverage, like the "test farm" described by someone who worked at Oracle.
My answer to this is to often get the LLMs to do multiple rounds of code review (depending on the criticality of the code, even reviewing every commit, though this was clearly a zero-impact hobby project).
They are remarkably good at catching things, especially if you do it every commit.
> How is this a good idea? How can I trust the generated code?
You don't. The LLM wrote the code and is absolutely right. /s
What could possibly go wrong?
Same way you trust any auto translation for a document. You wrote it in English (or whatever language you’re most proficient in), but someone wants it in Thai or Czech, so you click a button and send them the document. It’s their problem now.
The author says that he runs both the reference implementation and the new Rust implementation through 2 million (!) randomly generated battles and flags every battle where the results don't line up.
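For anyone who hasn't seen this kind of differential testing, here is a minimal sketch of the idea in Python. It is not the author's harness: `reference_damage`, `ported_damage`, and `find_mismatches` are hypothetical names, the "engine" is a toy damage formula, and the rounding bug in the port is planted on purpose so there is something to flag.

```python
import random


def reference_damage(attack: int, defense: int, power: int) -> int:
    """Hypothetical stand-in for the original engine: integer division truncates."""
    return (attack * power) // defense


def ported_damage(attack: int, defense: int, power: int) -> int:
    """Hypothetical stand-in for the port, with a planted bug: it rounds instead."""
    return round(attack * power / defense)


def find_mismatches(reference, candidate, trials: int, seed: int = 0):
    """Feed both implementations the same random inputs and collect disagreements."""
    rng = random.Random(seed)  # fixed seed so any flagged input can be replayed
    mismatches = []
    for _ in range(trials):
        # Input ranges are arbitrary assumptions for this toy example.
        args = (rng.randint(1, 255), rng.randint(1, 255), rng.randint(1, 150))
        expected, actual = reference(*args), candidate(*args)
        if expected != actual:
            mismatches.append((args, expected, actual))
    return mismatches


if __name__ == "__main__":
    bad = find_mismatches(reference_damage, ported_damage, trials=10_000)
    print(f"{len(bad)} of 10000 inputs disagree")
    for args, expected, actual in bad[:3]:
        print(f"  input={args} reference={expected} port={actual}")
```

The fixed seed is what makes this workable in practice: each flagged input can be replayed exactly while you fix the port, and you rerun until the mismatch list is empty. That only shows the two implementations agree on the inputs you sampled, not that either one is correct.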