
boothby yesterday at 5:03 PM

This is why misspellings and homophones are tells of human righting. LLMs strongly prefer word-level tokens, and word substitutions follow semantic similarity and not the more human auditory similarity.



omneity yesterday at 5:43 PM

Funny, I’ve been cracking[0] at this exact problem with a purpose-built model[1]:

0: https://huggingface.co/posts/omarkamali/593639295164067

1: https://omneitylabs.com/models/sawtone

jddj yesterday at 8:01 PM

Claude the other day wrote code where one of the bytes in the array was 0xO5.

That's zero ex oh (the letter) five
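For anyone wanting to catch this kind of slip mechanically, here is a rough sketch, assuming you just want to scan source text for hex-looking tokens that contain a non-hex character. The helper name `suspicious_hex_literals` and both regexes are made up for illustration; they are not from any existing linter.

```python
import re

# Anything that looks like a hex literal: "0x" followed by letters or digits.
HEX_LIKE = re.compile(r"\b0[xX][0-9A-Za-z]+\b")
# What an actual hex literal may contain after the prefix.
VALID_HEX = re.compile(r"0[xX][0-9a-fA-F]+")


def suspicious_hex_literals(source: str) -> list[str]:
    """Return hex-looking tokens that contain a non-hex character (hypothetical helper)."""
    return [tok for tok in HEX_LIKE.findall(source) if not VALID_HEX.fullmatch(tok)]


def parses_as_hex(token: str) -> bool:
    """True if Python's int() accepts the token as base 16."""
    try:
        int(token, 16)
        return True
    except ValueError:
        return False
```

Run on a string like `"data = [0x01, 0xO5, 0xff]"`, the scan flags only the token with the letter O, and `parses_as_hex` rejects it while accepting `0x05`.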

mejutoco yesterday at 5:29 PM

> righting.

> LLMs strongly prefer word-level tokens, and word substitutions follow semantic similarity and not the more human auditory similarity.

Is this an elaborate joke, or does your full-word misspelling of "writing" both agree with your statement (it is a word substitution) and contradict it (the similarity is in pronunciation only, not in meaning)?
