
ffsm8 today at 2:15 PM

While you're correct about what the audio models are - at least somewhat (they're not exactly like text-based LLMs) - you seem to brush his point aside too quickly, before fully exploring it.

This is a solvable issue; the current models and harnesses just aren't built with that assumption - hence they do "best effort while guessing if unsure".

Give it a few more months to a few years and things will likely settle the way he pitched - at least in the context of note-taking: only let it become "lore" if the model didn't have to guess a word.

Currently there is basically only one mode, and it's optimized for conversation. Note-taking is just glued on top, with that conversational functionality as the backbone - and that's probably not going to stay.
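For what it's worth, here's a rough sketch of that gating idea in Python - all the names and the 0.9 threshold are made up for illustration, and real recognizers expose per-word confidence in their own ways:

    from dataclasses import dataclass

    @dataclass
    class Word:
        text: str
        confidence: float  # 0.0..1.0, as reported by the recognizer

    def commit_segment(words: list[Word], threshold: float = 0.9):
        # Commit a segment to the notes ("lore") only if no word was
        # a guess; otherwise flag the uncertain words for review.
        if all(w.confidence >= threshold for w in words):
            return ("committed", " ".join(w.text for w in words))
        uncertain = [w.text for w in words if w.confidence < threshold]
        return ("needs_review", uncertain)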


Replies

repelsteeltje today at 2:48 PM

> Give it a few more months to a few years and things will likely settle the way he pitched - at least in the context of note-taking: only let it become "lore" if the model didn't have to guess a word.

I'm hesitant to concede even that. As with any computational linguistics problem, accuracy relies on coverage at all levels: from morphology, through syntax and semantics, to speech acts and world knowledge.

I worked with state-of-the-art speech recognition in a healthcare setting. The model was specifically trained on a small set of languages, with an emphasis on covering medical terminology.

It worked great for conversations most of the time, but sometimes it messed up very badly - for instance, when a patient would mention the name of a relative, a street address, or a phone number. Spelling out an email address would throw it off completely.

It's just like when you're a horrible typist and rely on spell checking: the red squiggles are gone, but the story no longer makes sense. Or when you "autofix" a syntax error, but the meaning diverges from your intention.

As the technology improves, the number of misrecognized words decreases, but the mistakes that remain get more severe.