Hacker News

comeonbro, last Thursday at 7:59 PM

Imagine if I asked you how many '⊚'s are in 'Ⰹ⧏⏃'? (the answer is 3, because there is 1 ⊚ in Ⰹ and 2 ⊚s in ⏃)

Much harder question than if I asked you how many '⟕'s are in 'Ⓕ⟕⥒⟲⾵⟕⟕⢼' (the answer is 3, because there are 3 ⟕s there)

You'd need to read through like 100,000x more random internet text to infer that there is 1 ⊚ in Ⰹ and 2 ⊚s in ⏃ (when this is not something that people ever explicitly talk about), than you would need to figure out that there are 3 ⟕s when 3 ⟕s appear, or to figure out from context clues that Ⰹ⧏⏃s are red and edible.

The former is how tokenization makes 'strawberry' look to LLMs: https://i.imgur.com/IggjwEK.png

It's a consequence of an engineering tradeoff, not a demonstration of a fundamental limitation.
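
To make the analogy concrete, here is a minimal sketch of the two counting tasks. The vocabulary, the token IDs, and the split of 'strawberry' into 'str' + 'aw' + 'berry' are all made up for illustration; real tokenizers use different vocabularies and may split the word differently. The point is only that the model sees the IDs, not the letters inside them.

```python
# Toy vocabulary: hypothetical pieces and IDs, not those of any real tokenizer.
TOY_VOCAB = {"str": 101, "aw": 102, "berry": 103}
ID_TO_TEXT = {token_id: text for text, token_id in TOY_VOCAB.items()}


def toy_tokenize(word):
    """Split `word` into toy token IDs using greedy longest-match."""
    ids = []
    i = 0
    while i < len(word):
        # Try the longest remaining piece first, then shorter ones.
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in TOY_VOCAB:
                ids.append(TOY_VOCAB[piece])
                i = j
                break
        else:
            raise ValueError(f"no toy token covers {word[i:]!r}")
    return ids


def count_id(ids, token_id):
    """Easy case: count a symbol that is directly visible in the sequence."""
    return ids.count(token_id)


def count_letter_in_ids(ids, letter):
    """Hard case: requires knowing the spelling hidden behind each ID."""
    return sum(ID_TO_TEXT[token_id].count(letter) for token_id in ids)


strawberry_ids = toy_tokenize("strawberry")
print(strawberry_ids)                            # the opaque IDs the model sees
print(count_letter_in_ids(strawberry_ids, "r"))  # needs the ID -> spelling table
print(count_id([7, 9, 7, 7], 7))                 # needs nothing but the sequence
```

`count_letter_in_ids` only works because it is handed `ID_TO_TEXT`. A model gets no such table and has to infer each token's spelling from how the tokens are used in text, which is the "100,000x more reading" part.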


hansmayer, last Friday at 7:16 AM

I get the technical challenge. It's just that a system that has to be trained with petabytes of data, just to (sometimes) correctly solve a problem which a six- or seven-year-old kid is able to solve after learning to spell, may not be the right solution to the problem at hand? Haven't the MBAs been shoving it down our throats that all cost-ineffective solutions have to go? Why are we burning hundreds of billions of dollars on the development of tools whose most common use-case (or better said: plea by the VC investors) is a) summarising emails (I am not an idiot who cannot read) and b) writing emails (really, I know how to write too, and can do it better)? The only use-case where they are sometimes useful is taking out the boring parts of software development, because of the relatively closed learning context, and as someone who used them for over a year for this, I can say they are not reliable and have to be double-checked, unless you want to introduce more issues in your codebase.
