I guarantee you there's positional information one way or another. they just don't mention...

make3 • today at 5:33 PM • 2 replies

I guarantee you there's positional information one way or another. They just don't mention it because positional embeddings are extremely cheap computationally, so they're not worth mentioning.

Replies

neosat • today at 5:39 PM

Agreed. Audio is strongly temporal, so there is almost certainly some positional encoding one way or another.

mchinen • today at 5:52 PM

Ah yeah, thinking further, it's probably just using some positional embedding based on sequence numbering, added in the LLM layers. For vision it would need the patch location as well.
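
For anyone wondering what "positional embedding based on sequence numbering" could look like, here is a minimal sketch. It assumes the classic fixed sinusoidal scheme, which is only one option; nothing in the thread says which scheme the model actually uses. The helper names `sinusoidal_position` and `add_positions` are made up for illustration.

```python
import math

def sinusoidal_position(pos, dim):
    """Fixed sinusoidal encoding for one sequence index (hypothetical helper)."""
    vec = []
    for i in range(dim):
        # Each pair of dimensions shares a frequency: even index -> sin, odd -> cos.
        freq = 1.0 / (10000 ** ((i // 2) * 2 / dim))
        angle = pos * freq
        vec.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return vec

def add_positions(frames):
    """Add a position vector to each frame embedding: one addition per element."""
    dim = len(frames[0])
    return [
        [x + p for x, p in zip(frame, sinusoidal_position(pos, dim))]
        for pos, frame in enumerate(frames)
    ]

# Three dummy 4-dimensional "audio frame" embeddings, all zeros (assumed data),
# so the output is just the position vectors themselves.
frames = [[0.0] * 4 for _ in range(3)]
encoded = add_positions(frames)
```

The cost is a single elementwise add per frame, which fits the point about it being cheap. For vision, the single index `pos` would presumably be replaced by a (row, column) patch location.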
