Hacker News

slt2021 · last Tuesday at 10:19 PM · 3 replies

Fantastic article by Rachel Thomas!

This is basically another argument that deep learning works only as [generative] information retrieval, i.e. a stochastic parrot, because the training data is a very lossy representation of the underlying domain.

Because the data/labels of genes do not always represent the underlying domain (biology) perfectly, the output can be false/invalid/nonsensical.

In cases where it works very well, there is data leakage, because by design LLMs are information retrieval tools. From an information theory standpoint, that leakage is a fundamental "unknown unknown" for any model.
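
Roughly what I mean by leakage, as a toy illustration (made-up data, not from the article): if near-identical records end up on both sides of a train/test split, good benchmark scores partly measure retrieval of memorized data rather than understanding of the domain:

    # toy example: leakage via duplicated records across the split
    train = {"ATGCCGTAGG", "GGCTTAACCA", "TTGACCGATT"}
    test = {"GGCTTAACCA", "CCATGGTACA"}

    leaked = train & test  # records the model has effectively already seen
    print(f"{len(leaked)}/{len(test)} test records overlap with the training set")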

My takeaway is that it's not a fault of the algorithm; it's more a fault of the training dataset.

We humans operate fluidly in the domain of natural language, and even a kid can read a text and evaluate whether it makes sense or not; this explains the success of models trained on natural language.

But in domains where the training data represents the underlying domain only lossily, the model will be imperfect.


Replies

ffwd · last Wednesday at 4:58 AM

This to me is the paradox of modern LLMs: they don't represent the underlying domain directly, but they can represent whatever information can be presented in text. So they do represent _some_ information, but it is not always clear what it is or how.

The embedding space can represent relationships between words, sentences and paragraphs, and since those things can encode information about the underlying domain, you can query those relationships with text and get reasonable responses. The problem is that it's not always clear what is being represented in those relationships, since text is a messy encoding scheme.
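
Roughly, querying those relationships looks like this (toy sketch; embed() here is a fake stand-in for a real embedding model, so the actual ranking below is meaningless, but the shape of the operation is the point):

    import numpy as np

    def embed(text: str) -> np.ndarray:
        # stand-in for a real text-embedding model; returns an arbitrary vector
        rng = np.random.default_rng(abs(hash(text)) % 2**32)
        return rng.normal(size=384)

    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    docs = ["BRCA1 is linked to breast cancer risk",
            "TP53 is a tumor suppressor gene",
            "the stock market fell on Tuesday"]
    query = embed("which genes are associated with cancer?")
    ranked = sorted(docs, key=lambda d: cosine(embed(d), query), reverse=True)
    print(ranked)  # with real embeddings, the gene sentences should rank first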

But another weakness is that, as you say, it is generative. To make it generative, instead of hardcoding a database of all possible questions and all possible answers, we offload some of the data to an algorithm (next-token prediction), which gives us the possibility of an imprecise, probabilistic question/prompt (useful because then you can ask anything).

But the problem is that no single algorithm can encode all possible answers to all possible questions in a domain-accurate way, so you lose some precision in the information. Or at least this is how I see LLMs atm.
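
To put that trade-off in toy form (my own illustration, not how any real system is built): a hardcoded table answers only questions it has verbatim, while a next-token loop can respond to anything, at the cost of precision:

    # retrieval as a hardcoded table: exact, but only for known questions
    qa_table = {"capital of France?": "Paris"}
    print(qa_table.get("what's the capital of France?"))  # None: brittle exact match

    # generation as a next-token loop (tiny fake "model" of bigram lookups)
    def next_token(prefix: list[str]) -> str:
        bigrams = {"the": "capital", "capital": "of", "of": "France",
                   "France": "is", "is": "Paris"}
        return bigrams.get(prefix[-1], "<eos>")

    tokens = ["the"]
    while tokens[-1] != "<eos>" and len(tokens) < 10:
        tokens.append(next_token(tokens))
    print(" ".join(tokens[:-1]))  # "the capital of France is Paris"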

dathinab · last Wednesday at 5:48 PM

> works only as [generative] information retrieval

But even if, for the sake of the argument, we assume that is true without question, LLMs are still here to stay.

Think about how junior devs with (in programming) average or less skill work: they "retrieve" the information about how to solve the problem from Stack Overflow, tutorials, etc.

So giving all your devs some reasonably well-done AI automation tools (not just a chat prompt!!) is like giving each of them a junior dev to delegate all the tedious simple tasks to, without having to worry about those tasks not letting a junior dev grow and learn. And to top it off, if there is enough tooling in place (static code analysis, tests, etc.), the AI tooling will handle the write code -> run tools -> fix issues loop just fine (rough sketch below). And the price for that tool is, what, 1/30th of that of a junior dev? That means more time to focus on the things which matter, including teaching your actual junior devs ;)
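
Something like this, as a rough sketch (generate_patch/apply_patch are placeholders for the LLM and its editing tool; the wiring is illustrative and not any specific product):

    import subprocess

    def generate_patch(task: str, feedback: str) -> str:
        # placeholder for a real LLM call that proposes a code change
        return f"# proposed change for: {task}\n# addressing: {feedback[:80]}"

    def apply_patch(patch: str) -> None:
        # placeholder: a real tool would edit files in the working tree
        print(patch)

    def run_checks() -> tuple[bool, str]:
        # tests (and static analysis) are the ground truth the loop works against
        result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
        return result.returncode == 0, result.stdout + result.stderr

    task, feedback = "add input validation to parse_config()", ""
    for _ in range(5):  # bounded retries, never loop forever
        apply_patch(generate_patch(task, feedback))
        ok, feedback = run_checks()
        if ok:
            break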

And while I would argue AI isn't fully there yet, I think the current foundation models _might_ already be good enough to get there with the right ways of wiring them up and combining them.

vixen99 · last Wednesday at 9:02 AM

I wonder to what extent the thought processes that lead to the situation described by Rachel Thomas are active in other areas. Important article, by the way; I agree!