Hacker News

Havoc · yesterday at 11:46 AM · 5 replies

> When you’re looking at a pre-training dataset in the frontier lab and you look at a random internet document, it’s total garbage. I don't even know how this works at all. It’s [stuff] like stock tickers, symbols, it's a huge amount of slop and garbage from like all the corners of the internet

Seems like there would be low-hanging fruit in heavier pre-processing then? Something deterministic like a reading-level score, or even a tiny model trained for the task to pick out good data?
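Something like this crude sketch of the deterministic option, combining a symbol-ratio check with a Flesch reading-ease score (the thresholds and the syllable heuristic are placeholders, just to show the idea, not a real pipeline):

    import re

    def count_syllables(word):
        # Very rough heuristic: count runs of vowels.
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    def flesch_reading_ease(text):
        sentences = max(1, len(re.findall(r"[.!?]+", text)))
        words = re.findall(r"[A-Za-z']+", text)
        if not words:
            return 0.0
        syllables = sum(count_syllables(w) for w in words)
        return 206.835 - 1.015 * (len(words) / sentences) - 84.6 * (syllables / len(words))

    def looks_like_prose(doc, min_alpha_ratio=0.6, min_score=30.0):
        if not doc.strip():
            return False
        # Symbol/number soup (tickers, markup fragments) fails this ratio check.
        alpha_ratio = sum(c.isalpha() or c.isspace() for c in doc) / len(doc)
        if alpha_ratio < min_alpha_ratio:
            return False
        return flesch_reading_ease(doc) >= min_score

    docs = ["AAPL 182.3 +0.4% MSFT 411.2 -0.1% ...",
            "The cat sat on the mat and watched the rain."]
    print([looks_like_prose(d) for d in docs])  # the ticker line fails, the prose passes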


Replies

qrios · yesterday at 4:17 PM

"low hanging" is relative. At least from my perspective. A significant part of my work involves cleaning up structured and unstructured data.

An example: More than ten years ago a friend of mine was fascinated by the German edition of the book "A Cultural History of Physics" by Károly Simonyi. He scanned the book (600+ pages) and created a PDF with (nearly) the same layout.

Against my advice, he used Adobe tools for it instead of creating an EPUB or something like DocBook.

The PDF looks great, but the text inside is impossible to use as training data for a small LLM. Lines from the two columns are mixed together and a lot of spaces are randomly placed (which makes it particularly difficult because mathematical formulas often appear in the running text).

After many attempts (with regexes and LLMs), I gave up, rendered each page as an image, and had a large LLM extract the text.
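Roughly the shape of that last step, sketched here with PyMuPDF for the rendering and an OpenAI-style vision model for the extraction (which renderer and which model you use doesn't matter much; the prompt and model name below are just stand-ins):

    import base64
    import fitz  # PyMuPDF, for rendering pages to images
    from openai import OpenAI

    client = OpenAI()

    def extract_pages(pdf_path):
        pages = []
        for page in fitz.open(pdf_path):
            pix = page.get_pixmap(dpi=200)  # render the page as a PNG image
            b64 = base64.b64encode(pix.tobytes("png")).decode()
            resp = client.chat.completions.create(
                model="gpt-4o",  # any vision-capable model
                messages=[{
                    "role": "user",
                    "content": [
                        {"type": "text",
                         "text": "Transcribe this two-column book page as plain text, "
                                 "column by column. Write formulas as LaTeX."},
                        {"type": "image_url",
                         "image_url": {"url": f"data:image/png;base64,{b64}"}},
                    ],
                }],
            )
            pages.append(resp.choices[0].message.content)
        return pages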

azath92 · yesterday at 12:51 PM

For small models this is for sure the way forward; there are some great small datasets out there (check out the TinyStories dataset, which limits vocabulary to what a young child would know but keeps the core reasoning inherent in even simple language: https://huggingface.co/datasets/roneneldan/TinyStories https://arxiv.org/abs/2305.07759)

I have fewer concrete examples, but my understanding is that dataset curation is where many improvements are gained at any model size. Unless you are building a frontier model, you can use a better model to help curate or generate that dataset. TinyStories was generated with GPT-4, for example.
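If anyone wants to poke at it, the dataset loads straight from the Hub with the datasets library (the stories live in a "text" column, as far as I recall):

    from datasets import load_dataset

    ds = load_dataset("roneneldan/TinyStories", split="train")
    print(ds)  # row count and column names
    for ex in ds.select(range(3)):
        print(ex["text"][:200], "...")  # each row is one short synthetic story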

embedding-shape · yesterday at 12:41 PM

Makes me wonder what kind of model we could get if we just trained on Wikidata and similar datasets, but pre-processed to be natural language rather than just triplets of data.
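Something along these lines, i.e. templating (subject, property, object) triples into sentences; the properties and templates below are toy examples, not Wikidata's real schema:

    # Toy verbalizer: turn knowledge-graph triples into plain sentences.
    TEMPLATES = {
        "instance of": "{s} is a {o}.",
        "author": "{s} was written by {o}.",
        "capital": "The capital of {s} is {o}.",
    }

    def verbalize(subject, prop, obj):
        template = TEMPLATES.get(prop, "The {p} of {s} is {o}.")
        return template.format(s=subject, p=prop, o=obj)

    triples = [
        ("Douglas Adams", "instance of", "human"),
        ("The Hitchhiker's Guide to the Galaxy", "author", "Douglas Adams"),
        ("France", "capital", "Paris"),
    ]
    print(" ".join(verbalize(*t) for t in triples))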

haolez · yesterday at 12:17 PM

If you can create this filtering model, you have created Skynet and solved AGI :D

ACCount37 · yesterday at 12:34 PM

Data filtering. Dataset curation. Curriculum learning. All already in use.

It's not sexy, it's not a breakthrough, but it does help.
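None of it has to be fancy either. A curriculum can be as little as sorting the surviving documents by some difficulty proxy before batching; a toy sketch of that ordering, with a made-up proxy (average word length) standing in for whatever score you actually trust:

    def difficulty(doc):
        words = doc.split()
        return sum(len(w) for w in words) / max(1, len(words))

    def curriculum_order(docs):
        # "Easier" documents (lower score) come first in training.
        return sorted(docs, key=difficulty)

    docs = [
        "Thermodynamic fluctuation theorems generalize entropy production statistics.",
        "The dog ran to the park.",
        "Gradient descent nudges the weights a little at a time.",
    ]
    for d in curriculum_order(docs):
        print(f"{difficulty(d):5.2f}  {d}")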
