The study was designed to have devs who are comfortable with AI do 50% of their tasks with AI and 50% without. So the problem is that the population of "developers who use AI regularly but are willing to do tasks without AI" is shrinking.
>> Are they worried that by splitting devs into groups of AI experience they might be measuring some confounder that causes people to choose AI / not AI in their careers?
The developer sample was small (16 people in the original study), while the task sample is larger (~250 tasks). I think the worry is that if you compared separate groups of devs, the variance in productivity between developers would totally wash out any signal.
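To make that worry concrete, here's a rough simulation sketch. Only the 16 devs and ~250 tasks come from the study; the effect size (`true_effect`), the spread in developer baselines (`dev_sd`), and the per-task noise (`task_sd`) are numbers I made up for illustration, and the pooled difference-in-means estimator is my simplification, not the study's actual analysis.

```python
import random
import statistics


def simulate(design, n_devs=16, n_tasks=250, true_effect=-0.2,
             dev_sd=0.5, task_sd=0.3, rng=None):
    """Return one estimate of the AI effect on log completion time.

    design="within":  every task is randomly assigned AI / no AI.
    design="between": half the devs always use AI, the other half never do.
    All effect and noise parameters are hypothetical.
    """
    rng = rng or random.Random()
    # Each developer has their own baseline speed (log scale).
    baselines = [rng.gauss(0.0, dev_sd) for _ in range(n_devs)]
    ai_devs = set(rng.sample(range(n_devs), n_devs // 2))

    with_ai, without_ai = [], []
    for task in range(n_tasks):
        dev = task % n_devs  # spread tasks evenly across developers
        if design == "within":
            uses_ai = rng.random() < 0.5
        else:
            uses_ai = dev in ai_devs
        log_time = baselines[dev] + rng.gauss(0.0, task_sd)
        if uses_ai:
            log_time += true_effect
        (with_ai if uses_ai else without_ai).append(log_time)

    # Pooled difference in mean log time between the two conditions.
    return statistics.mean(with_ai) - statistics.mean(without_ai)


def spread(design, trials=500, seed=0):
    """Standard deviation of the estimate across repeated simulated studies."""
    rng = random.Random(seed)
    estimates = [simulate(design, rng=rng) for _ in range(trials)]
    return statistics.stdev(estimates)


if __name__ == "__main__":
    print("within-developer spread: ", round(spread("within"), 3))
    print("between-developer spread:", round(spread("between"), 3))
```

Under these assumptions the between-developer estimate is much noisier, because with only 8 devs per group the differences in baseline speed don't average out, whereas in the within-developer design each dev's baseline shows up in both arms and largely cancels.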