I don't think it's much of an issue
- RL envs + synthetic data + human-annotated data
- Usage data from Codex/Claude Code/Cursor
Most of the model abilities in coding come from post-training, not pretraining
A better question is what's left for those who don't have access to that. We went from publicly available data to data vacuumed up from private users