ks2048 • today at 5:10 PM

I'll be looking at this in detail. I've started a company to do similar things, https://6k.ai

I'm currently concentrating on better data gathering for low-resource languages.

When you look in detail at data like Common Crawl, finepdfs, and fineweb, (1) they are really lacking quality data sources that are available if you know where to look, and (2) the sources they do have are not processed "finely" enough (e.g. finepdfs classifies each page of a PDF as having a specific language, whereas many language-learning sources contain language pairs, etc.).

Replies

intended • today at 5:28 PM

There are many nation states working on this. Have you looked into the availability of those data sets?

What languages are you prioritizing?
