robotswantdata • today at 12:21 PM • 4 replies

Very interesting.

The way I understand this works is that the researchers found a clever architectural hack to stop AI from hoarding memory when reading long documents.

Normally, when an AI transcribes a 100-page PDF, it tries to remember every single word it has already ingested. This short-term memory (the KV cache) grows linearly, O(N) in the number of tokens, until the model runs out of VRAM and crashes (or the context gets capped). To avoid this, developers are forced to build janky code that chops PDFs into individual pages, processes them one by one, and glues the text back together.
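To put rough numbers on that linear growth, here is a back-of-the-envelope sketch. The model dimensions (32 layers, 8 KV heads, head size 128, 2 bytes per value) are made-up placeholders, not figures from the paper, and `rswa_style_cache_bytes` only models the idea as I described it above: keep every reference token, but cap the cached generated tokens at a window.

```python
def kv_cache_bytes(num_tokens, num_layers=32, num_kv_heads=8,
                   head_dim=128, bytes_per_value=2):
    """Bytes needed to cache keys and values for num_tokens tokens.

    All default dimensions are hypothetical placeholders.
    """
    # The leading 2 accounts for one key and one value vector per head per layer.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * num_tokens


def rswa_style_cache_bytes(ref_tokens, generated_tokens, window, **dims):
    """Cache size if reference tokens are all kept but generated tokens
    are limited to the most recent `window` (an assumed simplification)."""
    kept = ref_tokens + min(generated_tokens, window)
    return kv_cache_bytes(kept, **dims)


if __name__ == "__main__":
    ref = 4_000  # hypothetical number of image (reference) tokens
    for generated in (1_000, 10_000, 100_000):
        full = kv_cache_bytes(ref + generated)
        capped = rswa_style_cache_bytes(ref, generated, window=128)
        print(f"{generated:>7} generated tokens: "
              f"full {full / 2**20:9.1f} MiB, windowed {capped / 2**20:9.1f} MiB")
```

The full cache keeps growing with every generated token, while the windowed version stays flat once the output is longer than the window.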

Unlimited OCR uses Reference Sliding Window Attention (R-SWA) to split the AI's focus into two paths:

Global Reference: The AI keeps full, uncompromised sight of the original document image so it never loses context.

Local Generation: The AI restricts its memory of its own typed text to a tight, moving window (like the last 128 words) and safely forgets the rest.
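The two paths can be pictured as an attention mask. This is a minimal sketch of my reading of the idea, not the paper's implementation: the function name, the token layout (reference tokens first, generated tokens after), and the tiny sizes are all assumptions for illustration.

```python
def reference_sliding_window_mask(num_ref, num_gen, window):
    """Build a boolean mask for the generated tokens.

    mask[i][j] is True if generated token i may attend to position j.
    Positions 0 .. num_ref-1 are reference (image) tokens, and positions
    num_ref .. num_ref+num_gen-1 are generated tokens. Hypothetical layout.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    mask = []
    for i in range(num_gen):
        # Global reference: every generated token sees all reference tokens.
        row = [True] * num_ref
        for j in range(num_gen):
            # Local generation: causal (j <= i) and within the last `window`
            # generated tokens, counting token i itself.
            row.append(i - window < j <= i)
        mask.append(row)
    return mask


if __name__ == "__main__":
    # 3 reference tokens, 5 generated tokens, window of 2.
    for row in reference_sliding_window_mask(3, 5, 2):
        print("".join("1" if allowed else "0" for allowed in row))
```

Each row has at most `num_ref + window` allowed positions, which is why the cached generated text can stay bounded no matter how long the output gets.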

Will be very interesting for local AI and can’t wait to see what the community builds and extends with it!

Replies

_puk • today at 2:11 PM

This hits a sweet spot for conversations too, I think. I've been playing (for quite a while) with ways to encapsulate long-running conversations.

You have the overriding context: facts that don't change very often at all, such as the participants' names, their backgrounds, etc.

Then you have some very fine grained facts (what they ate for breakfast this morning) which might be useful right now, but are irrelevant outside of a general trend over the longer term.

When trying to reconstruct a conversation you really need to find the right balance without pulling in everything that has ever been discussed.

This definitely is worth further investigation.


storywatch • today at 3:58 PM

Haven't read the full paper, but the local generation window seems a little small, especially since image inputs are so token-heavy. Depending on where the local attention layer is located, it would be nicer if it were bigger, e.g. at least 4096 words.

MattRogish • today at 4:01 PM

I do OCR of images, and that's exactly what I do. I take one big image and slice it into many smaller ones, and send those to the LLM. Perfect every time, unlike using the whole image which resulted in hot garbage.
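A minimal sketch of the slicing step, for anyone who wants to try it. The `tile_boxes` helper and the optional `overlap` parameter are hypothetical additions of mine (overlap is there so a line of text cut by a tile edge shows up whole in the neighboring tile). The boxes are `(left, top, right, bottom)` tuples, which is the same ordering Pillow's `Image.crop` takes.

```python
def tile_boxes(width, height, tile, overlap=0):
    """Return (left, top, right, bottom) crop boxes covering the image.

    Tiles are at most `tile` pixels on a side; neighboring tiles share
    `overlap` pixels. Edge tiles are clipped to the image bounds.
    """
    if tile <= 0 or not 0 <= overlap < tile:
        raise ValueError("need tile > 0 and 0 <= overlap < tile")
    step = tile - overlap
    boxes = []
    top = 0
    while True:
        bottom = min(top + tile, height)
        left = 0
        while True:
            right = min(left + tile, width)
            boxes.append((left, top, right, bottom))
            if right >= width:
                break  # reached the right edge of this row
            left += step
        if bottom >= height:
            break  # reached the bottom edge of the image
        top += step
    return boxes


if __name__ == "__main__":
    # Hypothetical 2480x3508 px scan, 1024 px tiles, 64 px overlap.
    for box in tile_boxes(2480, 3508, tile=1024, overlap=64):
        print(box)
```

Each box would then be cropped out and sent to the model as its own image, with the outputs concatenated in row-major order.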


d675 • today at 1:21 PM

See, leetcode is useful. As I do this leetcode grind, I’ve been learning why techniques exist and how they’re used IRL. Lots of interesting stuff there.

