Hacker News

DoctorOetker · today at 1:21 PM

Determinism is not the issue. Synonyms exist; there are multiple ways to express the same message.

When numeric models are fit to, say, scientific measurements, they do quite a good job of modeling the underlying probability distribution. With a corpus of text we are not modeling truths but claims. The corpus contains contradictory claims, because humans have conflicting interests.

Source-aware training (which can't be done as an afterthought LoRA tweak, but needs to happen during base model training, AKA pretraining) could enable LLMs to state which answers hold according to which sources. It could present a review of competing interpretations and opinions and source every belief, instead of relying on tool use / search engines.
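
As a rough sketch of what that could look like (the token format, source IDs, and corpus here are hypothetical, not any provider's actual pipeline): interleave a stable source identifier with each document in the pretraining stream, so that emitting the citation becomes part of the language-modeling objective itself rather than a post-hoc patch.

    # Hypothetical sketch of source-aware pretraining data preparation.
    # Source IDs, token format, and corpus contents are illustrative only.
    from dataclasses import dataclass

    @dataclass
    class Document:
        source_id: str  # stable corpus identifier, e.g. a DOI or URL hash
        text: str

    def to_training_example(doc: Document) -> str:
        # Wrap the content in citation tokens so that predicting the source
        # is learned during pretraining, not bolted on afterwards.
        return f"<src:{doc.source_id}> {doc.text} </src>"

    corpus = [
        Document("doi:10.1000/a", "School A explains the effect by ..."),
        Document("doi:10.1000/d", "School D explains the effect by ..."),
    ]

    stream = "\n".join(to_training_example(d) for d in corpus)
    # The base model is pretrained on `stream`; at inference time it can
    # then emit <src:...> spans next to the claims they support.
    print(stream)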

None of the base model providers would do it at scale since it would reveal the corpus and result in attribution.

In theory, entities like the European Union could mandate that LLMs used for processing government data, or sensitive citizen / corporate data, MUST be trained source-aware. That would improve the situation and make decisions and reasoning more traceable. It would also ease the arguments about copyright, since it would be clear that LLMs COULD BE MADE TO ATTRIBUTE THEIR SOURCES.

I also think it would be undesirable to eliminate speculative output; it should just be marked explicitly:

"ACCORDING to <source(s) A(,B,C,..)> this can be explained by ...., ACCORDING to <other school of thought source(s) D,(E,F,...)> it is better explained by ...., however I SUSPECT that ...., since ...."

If it could explicitly separate the schools of thought sourced from the corpus, and also separate its own interpretations and mark them as LLM-speculated suspicions, then we could have traceable references without losing the potential novel insights LLMs may offer.
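
A minimal sketch of an output structure that keeps those categories apart (the field names and rendering are assumptions for illustration, not an existing API):

    # Hypothetical schema separating sourced claims from model speculation.
    from dataclasses import dataclass, field

    @dataclass
    class Claim:
        text: str
        sources: list[str] = field(default_factory=list)  # empty => speculation

    answer = [
        Claim("this can be explained by ...", sources=["A", "B"]),
        Claim("it is better explained by ...", sources=["D", "E"]),
        Claim("I suspect that ..., since ..."),  # no sources: model's own view
    ]

    for c in answer:
        tag = "ACCORDING TO " + ", ".join(c.sources) if c.sources else "LLM-SPECULATED"
        print(f"[{tag}] {c.text}")

This keeps every sentence auditable: a claim either carries its corpus sources or is flagged as the model's own suspicion.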


Replies

jennyholzer · today at 1:28 PM

"chatGPT, please generate 800 words of absolute bullshit to muddy up this comments section which accurately identifies why LLM technology is completely and totally dead in the water."
