
Borealid yesterday at 11:41 PM

If all the training data consists of semantically meaningful sentences, it should be possible to build a network optimized for generating semantically meaningful sentences primarily or even exclusively.

But we don't appear to have entirely done that yet. It's just curious to me that the linguistic structure is there while the "intelligence", as you call it, is not.


Replies

dvt yesterday at 11:45 PM

> If all the training data consists of semantically meaningful sentences, it should be possible to build a network optimized for generating semantically meaningful sentences primarily or even exclusively.

Not necessarily. You can check this yourself by building a very simple Markov chain: feed it Moby-Dick (or any other text) to generate the transition weights, and the gap becomes much more obvious. Generated sentences will be "grammatically" correct but often semantically very wrong. LLMs are clearly far more sophisticated than a home-made Markov chain, but I think it's helpful to see the probabilities, in a sense, "leak through."
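A minimal sketch of such a chain, assuming simple bigram transitions and a tiny inline corpus standing in for Moby-Dick:

```python
import random
from collections import defaultdict

# Tiny stand-in corpus; in practice you'd feed in Moby-Dick or similar.
corpus = (
    "the whale swam in the sea . "
    "the sailor watched the whale . "
    "the sea was calm and the sailor slept ."
)
tokens = corpus.split()

# Count bigram transitions: for each word, the words observed to follow it.
transitions = defaultdict(list)
for prev, nxt in zip(tokens, tokens[1:]):
    transitions[prev].append(nxt)

def generate(start="the", length=8, seed=0):
    """Walk the chain, sampling each next word from the observed followers."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length - 1):
        followers = transitions.get(out[-1])
        if not followers:
            break
        out.append(rng.choice(followers))
    return " ".join(out)

print(generate())
```

Every adjacent word pair in the output is one that actually occurred in the corpus, so the text looks locally plausible while the sentence as a whole can be nonsense, which is exactly the gap being described.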

staticassertion yesterday at 11:45 PM

Sentences only have semantic meaning because you have experiences that they map to. The LLM isn't training on the experiences, just the characters. At least, that seems about right to me.

codebje today at 12:37 AM

Why would that be curious? The network is trained on the linguistic structure, not the "intelligence."

It's a difficult thing to produce a body of text that conveys a particular meaning, even for simple concepts, especially if you're seeking brevity. The editing process is not in the training set, so we're hoping to replicate it simply by looking at the final output.

How effectively do you suppose model training differentiates between low quality verbiage and high quality prose? I think that itself would be a fascinatingly hard problem that, if we could train a machine to do, would deliver plenty of value simply as a classifier.
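One hedged sketch of what such a classifier could look like in its simplest form: a naive Bayes model over word counts, trained on a handful of invented "high"/"low" quality examples (the labels and texts here are purely illustrative; a real attempt would need far more data and far better features than bag-of-words).

```python
import math
from collections import Counter

# Toy labeled examples, invented for illustration only.
examples = [
    ("the argument is clear and each claim is supported", "high"),
    ("this concise summary explains the tradeoffs well", "high"),
    ("lots of words words words saying basically nothing at all", "low"),
    ("great post wow so true totally agree amazing stuff", "low"),
]

counts = {"high": Counter(), "low": Counter()}
totals = {"high": 0, "low": 0}
for text, label in examples:
    words = text.split()
    counts[label].update(words)
    totals[label] += len(words)

vocab = {w for c in counts.values() for w in c}

def classify(text):
    """Naive Bayes with add-one smoothing over word counts."""
    scores = {}
    for label in counts:
        score = 0.0
        for w in text.split():
            p = (counts[label][w] + 1) / (totals[label] + len(vocab))
            score += math.log(p)
        scores[label] = score
    return max(scores, key=scores.get)

print(classify("each claim is supported"))  # leans "high" on this toy data
```

Whether word statistics alone can ever capture "quality" is, of course, exactly the open question being raised.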

thrownthatway today at 1:00 AM

I’m not up to speed on exactly what all the training data is.

If it contains the entire corpus of recorded human knowledge…

And most of everything is shit.