I don’t think anyone serious would recommend it for production systems. I respect the Ralph technique as a fascinating learning exercise in understanding LLM context windows and how to squeeze more performance (read: quality) out of today’s models.
Even if the absolute ceiling remains low, it’s interesting how much good context engineering raises it.
How is it a “fascinating learning exercise” when the intention is to run the model in a closed loop with zero transparency? Running a black box inside a black box, to learn? What signals are you even listening to in order to determine whether your context engineering is good or whether quality has improved, aside from a brief glimpse at the final product? So essentially, every time I want to test a prompt, I waste $100 on Claude and have it build an entire project for me?
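To make that question concrete: here is a minimal sketch of the kind of per-iteration tracing that would give you signals beyond a glimpse at the final product. Everything here is illustrative, not something the Ralph technique prescribes: the `claude -p` invocation, `PROMPT.md`, the `pytest` check, and the `ralph_trace.jsonl` file are all assumptions you would swap for your own loop and your own external checks.

```python
import json
import subprocess
import time
from pathlib import Path

TRACE = Path("ralph_trace.jsonl")  # per-iteration signal log (illustrative name)

def run_iteration(i: int) -> dict:
    """Run one loop iteration and capture externally observable signals."""
    start = time.time()
    # Illustrative agent invocation; substitute whatever CLI or loop you use.
    agent = subprocess.run(
        ["claude", "-p", Path("PROMPT.md").read_text()],
        capture_output=True, text=True,
    )
    # Cheap external signal: does the test suite pass after this iteration?
    tests = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
    # Another cheap signal: how much churn did this iteration produce?
    diff = subprocess.run(
        ["git", "diff", "--shortstat"], capture_output=True, text=True
    )
    return {
        "iteration": i,
        "seconds": round(time.time() - start, 1),
        "agent_exit": agent.returncode,
        "tests_exit": tests.returncode,                    # 0 == all green
        "tests_tail": tests.stdout.strip().splitlines()[-1:],
        "diff_shortstat": diff.stdout.strip(),
    }

for i in range(10):
    record = run_iteration(i)
    with TRACE.open("a") as f:
        f.write(json.dumps(record) + "\n")
    if record["tests_exit"] == 0:
        break  # stop when an external check, not the model itself, says "done"
```

Even logging this little per iteration turns “run it overnight and peek at the result” into a trace you can actually compare across prompt variants, which is the whole point of the transparency complaint above.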
I’m all for AI, and it’s evident that the future of AI is more transparency (MLOps, tracing, mech interp, AI safety), not less.