The streetlight effect:
> A policeman sees a drunk man searching for something under a streetlight and asks what the drunk has lost. He says he lost his keys and they both look under the streetlight together. After a few minutes the policeman asks if he is sure he lost them here, and the drunk replies, no, and that he lost them in the park. The policeman asks why he is searching here, and the drunk replies, "this is where the light is."
All of your suggestions are better, but they're hard, so someone casually evaluating an AI isn't going to do them.
The minute an open model breaks through and beats Claude Opus/Fable, it's over.
Far more opportunities can be served when the world's intellectuals have the raw weights and can fine-tune, splice, distill, and reapply them.
Imagine having raw, unfettered access to Fable. It can be refit to structural biology. It can be fine-tuned on the repo for smaller context requirements. It can be run more cheaply and air-gapped.
The world wants this.
Feels like a rather outdated little parable, since nowadays one would expect the police officer to either arrest or shoot the person.
Sure, for casual evaluation, I agree. But are there serious analyses that evaluate this kind of thing? I mean, these are the kinds of things I look at in my own work when a new model comes out, or when I'm evaluating a harness. But it's all very ad hoc and intuitive. I'd love to start bringing rigor to it, but I haven't found much prior art. In another thread someone said that's because it's probably impossible to do this rigorously, since too much of it is subjective. And that does match my intuition. But I continue to suspect that intuition is wrong.