I would really like to see what "appropriately licensed data" means. Cannot imagine they didn't copy all open repo's on GitHub, and can't imagine they asked for permission, or are reproducing license texts from these repo's now. It sounds hand wavy.
P.S. A fairly basic website otherwise, but it unfortunately seems to be hacking scroll for no good reason.
I assume they took the actual repos’ licenses info account. I don’t understand why they should ask for permission when the license would already allow for it.
Recently, GitHub has changed their terms of service to use all user data for AI training unless users explicitly opt out. This is probably the way Microsoft has obtained "appropriately licensed data".
Presumably their position remains that training on public repos is fair use and doesn't require a license. If it doesn't require a license it's still "appropriately licensed".