This looks cool, but I wonder how well their trained compiler generalizes to new task families. They trained on 29 specific task types, with 800 subtasks and many rephrasings of each one (the specs). They hold out some specs for validation, but don’t seem to have held out a full task family, and maybe not even full subtasks?
If the compiler can’t generalize well to unseen tasks, then it’s effectively acting as a fancy router to one of 29 (or 800) predefined LoRAs.
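A rough sketch of what a family-level holdout could look like, as opposed to holding out only rephrasings. The task names and the `(family, subtask, spec)` tuple layout are made up for illustration and are not taken from the paper:

```python
def leave_one_family_out(specs):
    """Yield (held_out_family, train, test) splits.

    `specs` is a list of (family, subtask, spec_text) tuples. Each split
    removes *every* spec of one family from training, so the test set
    measures generalization to an unseen family, not to a new rephrasing.
    """
    families = sorted({family for family, _, _ in specs})
    for held_out in families:
        train = [s for s in specs if s[0] != held_out]
        test = [s for s in specs if s[0] == held_out]
        yield held_out, train, test


# Hypothetical toy data: 3 families, 2 rephrasings per subtask.
toy_specs = [
    ("regex", "email", "match an email address"),
    ("regex", "email", "detect strings that look like emails"),
    ("sentiment", "reviews", "label a review as positive or negative"),
    ("sentiment", "reviews", "is this review happy or unhappy?"),
    ("units", "length", "convert inches to centimeters"),
    ("units", "length", "turn a length in inches into cm"),
]

splits = list(leave_one_family_out(toy_specs))
for held_out, train, test in splits:
    print(held_out, len(train), len(test))
```

If accuracy on the held-out family collapses while accuracy on held-out rephrasings stays high, that would support the "fancy router" reading.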
Despite the appeal of such an approach, I find this extremely unsettling.
Imagine if we had declared that the math for FIR filter design in signal processing was too difficult, so we’d just test random FIR coefficients until something good came out.
That sounds pretty horrible, but at least the frequency response of the resulting filter would be known. We’d at least understand the behavior of the final product.
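To make that concrete, here is a minimal sketch (standard library only; the tap count, seed, and function name are arbitrary choices for illustration) showing that even randomly drawn FIR coefficients have a frequency response you can compute exactly from H(e^{jω}) = Σ h[n]·e^{-jωn}:

```python
import cmath
import math
import random


def frequency_response(taps, num_points=256):
    """Return a list of (omega, |H(e^{j*omega})|) for omega in [0, pi].

    Evaluates the DTFT of the FIR taps directly:
    H(e^{j*omega}) = sum_n taps[n] * exp(-j*omega*n)
    """
    response = []
    for k in range(num_points):
        omega = math.pi * k / (num_points - 1)
        h = sum(c * cmath.exp(-1j * omega * n) for n, c in enumerate(taps))
        response.append((omega, abs(h)))
    return response


# "Design" a filter by drawing random coefficients (the bad idea above).
random.seed(0)
random_taps = [random.uniform(-1.0, 1.0) for _ in range(16)]
random_resp = frequency_response(random_taps)

# Sanity check with a filter whose behavior is known: a 4-tap moving
# average passes DC with gain 1 and has a null at omega = pi.
avg_resp = frequency_response([0.25] * 4)
print("moving average gain at DC:", avg_resp[0][1])
print("moving average gain at pi:", avg_resp[-1][1])
```

However the taps were chosen, the magnitude at every frequency is fully determined by the coefficients, which is the property the LLM case lacks.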
With LLMs, we don’t even know what we’re getting out of them.
(And no, I don’t see anything wrong with adaptive filters and such. Their behavior can still be quantified.)
> PAW reframes the foundation model from a per-input problem solver into a tool builder: invoked once per function definition, it produces a small reusable artifact whose subsequent calls per function application are cheap and offline.
Umm, you can just get the LLM to spit out real functions instead of fuzzy functions and run those through a real interpreter, which is also "cheap" and "offline".
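A minimal sketch of that alternative. The `generated_source` string is a hand-written stand-in for what an LLM might return for the spec "count the vowels in a string"; no model is actually called here, and the function name is hypothetical:

```python
# Stand-in for LLM output: plain source code produced once per function
# definition. In a real setup this string would come from a model call.
generated_source = '''
def count_vowels(text):
    """Count the vowels (a, e, i, o, u) in text, case-insensitively."""
    return sum(1 for ch in text.lower() if ch in "aeiou")
'''

# "Compile" once: load the generated code into its own namespace.
# Note: exec() on untrusted model output should be sandboxed in practice.
namespace = {}
exec(generated_source, namespace)
count_vowels = namespace["count_vowels"]

# Every later call is an ordinary function call: cheap, offline, and
# deterministic, and the source can be read, tested, and diffed.
result = count_vowels("offline and cheap")
print(result)
```

The tradeoff is that this only works for tasks that can be written down as ordinary code; fuzzy tasks such as judging tone do not reduce to a few lines like this.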
I like the goal of this. As expected, I don't really understand the math/concept of this. It sounds like it caches some neural network activity and exports it to be run later. So I suppose this can't be used for things like image or video generation.