Hacker News

ks2048 today at 5:10 PM

I'll be looking at this in detail. I've started a company to do similar things, https://6k.ai

I'm currently concentrating on better data gathering for low-resource languages.

When you look in detail at data like Common Crawl, finepdfs, and fineweb, (1) they are missing a lot of quality data sources that you can find if you know where to look, and (2) the sources they do have are not processed "finely" enough (e.g. finepdfs classifies each page of a PDF as having one specific language, whereas many language-learning sources contain language pairs, etc.).
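
To make point (2) concrete, here is a toy sketch. It is not how finepdfs actually works: the `guess_script` helper and the sample page are made up for illustration, and the dominant Unicode script is only a crude stand-in for a real language identifier. It just shows how a single page-level label hides the second language on a bilingual page, while per-line labels keep the pairing visible.

```python
import unicodedata
from collections import Counter


def guess_script(text):
    """Hypothetical stand-in for language ID: the most common Unicode
    script name (e.g. LATIN, CYRILLIC) among the letters in `text`."""
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            # Character names look like "LATIN SMALL LETTER A";
            # the first word is used as a rough script tag.
            counts[unicodedata.name(ch, "UNKNOWN").split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else None


def page_label(page):
    """Page-level labeling: one tag for the whole page."""
    return guess_script(page)


def line_labels(page):
    """Finer-grained labeling: one tag per non-empty line."""
    return [(guess_script(line), line)
            for line in page.splitlines() if line.strip()]


# Made-up bilingual phrasebook page (English / Russian pairs).
page = (
    "hello, how are you?\n"
    "привет, как дела?\n"
    "good morning, everyone\n"
    "доброе утро\n"
)

print(page_label(page))           # a single tag; the minority language is lost
for tag, line in line_labels(page):
    print(tag, "|", line)         # alternating tags expose the language pairs
```

A real pipeline would swap `guess_script` for a proper language-ID model and would probably work on layout blocks rather than raw lines, but the granularity issue is the same.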


Replies

intended today at 5:28 PM

There are many nation states working on this. Have you looked into the availability of those data sets?

What languages are you prioritizing?
