Non-expert here who likes reading lots of this kind of research. I have a few questions.
1. Why does it need a zeroth-order optimizer?
2. Most GAs I've seen use populations of thousands of solutions, sometimes ten thousand or more. What leads you to use 60,000 calls per iteration?
3. How do you use populations and "islands"? I never studied islands.
4. You said the smaller models are often better for "shorter" code. That makes sense. I've seen people extend a model's context window with additional training passes. Do you think it would help to do the reverse, shrinking a larger model down to a smaller context, instead of using the small models?
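For context on question 2, the kind of GA loop I'm used to looks roughly like this toy sketch (plain Python; the fitness function, settings, and names are my own illustration, not anything from your paper). The point is just that each generation costs about POP_SIZE fitness evaluations:

```python
import random

# Toy GA maximizing a trivial fitness (count of 1-bits in a bitstring),
# only to illustrate population size: one generation ~= POP_SIZE evaluations.
POP_SIZE = 1000     # the "thousands of solutions" I mentioned
GENOME_LEN = 32
GENERATIONS = 50
MUT_RATE = 0.02     # per-bit flip probability

def fitness(genome):
    return sum(genome)

def mutate(genome):
    return [1 - g if random.random() < MUT_RATE else g for g in genome]

def crossover(a, b):
    cut = random.randrange(1, GENOME_LEN)
    return a[:cut] + b[cut:]

random.seed(0)
pop = [[random.randint(0, 1) for _ in range(GENOME_LEN)]
       for _ in range(POP_SIZE)]

for gen in range(GENERATIONS):
    pop.sort(key=fitness, reverse=True)
    parents = pop[:POP_SIZE // 5]  # simple truncation selection
    pop = [mutate(crossover(random.choice(parents), random.choice(parents)))
           for _ in range(POP_SIZE)]

best = max(pop, key=fitness)
print(fitness(best))
```

So 60,000 calls per iteration sounded like either a huge population or many evaluations per candidate, and I'm curious which.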