Does anyone know what kind of RL environments they are talking about? They mention they used 15k environments. I can think of a couple hundred maybe that make sense to me, but what is filling that large number?
Every interactive system is a potential RL environment. Every CLI, every TUI, every GUI, every API. If you can programmatically take actions to get a result, and the actions are cheap, and the quality of the result can be measured automatically, you can set up an RL training loop and see whether the results get better over time.
Rumours say you do something like: