if datasets are the new codebases ,then the real IP can be dataset version control. how you fork ,diff ,merge and audit datasets like code. every team says 'we trained on 10B tokens' but what if we can answer 'which 5M tokens made reasoning better', 'which 100k made it worse'. then we can start being targeted leverage