Hacker News

Starting from scratch: Training a 30M Topological Transformer

129 points | by tuned | yesterday at 11:39 AM | 50 comments

Comments

kouteiheika | yesterday at 4:35 PM

If you want to prove a new alternative to attention (i.e. show that it works and/or is faster in a real-world scenario) without breaking the bank, then one of the best ways to do that would probably be to retrain an already existing model, just with swapped attention modules. Then, once you have such a model, you can do apples-to-apples benchmarks.

This has been done successfully in the past:

https://huggingface.co/featherless-ai/QRWKV-72B

Note that this is a 72B model which would be very expensive to train from scratch, but here they did the conversion for less than $2000.
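For illustration, here is a minimal PyTorch sketch of that conversion pattern on a toy encoder rather than a real checkpoint: freeze the trained teacher, copy it, swap each softmax-attention module for a cheaper replacement, and distill the copy against the teacher's outputs. LinearAttention, the model sizes, and the one-step training loop are all illustrative; this is not the QRWKV recipe.

    import copy
    import torch
    import torch.nn as nn

    class LinearAttention(nn.Module):
        """Illustrative drop-in for nn.MultiheadAttention's (q, k, v, ...) -> (out, weights) contract."""
        def __init__(self, d_model):
            super().__init__()
            self.q = nn.Linear(d_model, d_model)
            self.k = nn.Linear(d_model, d_model)
            self.v = nn.Linear(d_model, d_model)
            self.batch_first = True  # some TransformerEncoder fast-path checks read this flag

        def forward(self, query, key, value, **kwargs):
            q = torch.relu(self.q(query)) + 1e-6                   # positive feature maps
            k = torch.relu(self.k(key)) + 1e-6
            kv = torch.einsum("btd,bte->bde", k, self.v(value))    # O(T) mixing instead of O(T^2)
            z = torch.einsum("btd,bd->bt", q, k.sum(dim=1)) + 1e-6
            out = torch.einsum("btd,bde->bte", q, kv) / z.unsqueeze(-1)
            return out, None

    d_model = 64
    teacher = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2
    )
    teacher.eval()                                   # stand-in for the already-trained model

    student = copy.deepcopy(teacher)
    for layer in student.layers:
        layer.self_attn = LinearAttention(d_model)   # swap attention, keep everything else
    student.train()                                  # only the student gets updated

    opt = torch.optim.AdamW(student.parameters(), lr=1e-3)
    x = torch.randn(8, 32, d_model)                  # stand-in for a batch of real data
    with torch.no_grad():
        target = teacher(x)
    loss = nn.functional.mse_loss(student(x), target)  # distill: match the teacher's outputs
    loss.backward()
    opt.step()

On a real model the same pattern applies per decoder layer, with the swapped module trained to match the original layer's outputs before full fine-tuning.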

tuned | today at 6:32 AM

Thanks to everyone who has read it. I would be glad to answer further scoped questions on the content of the post and the paper. I have answered some comments below that may help clarify the ideas behind the redesign.

ashirviskas | yesterday at 1:42 PM

I wonder what would happen if we just crammed more into the "tokens"? I am running an experiment replacing discrete tokens with embeddings plus a small byte encoder/decoder. That way you can use the embedding space much more efficiently and have it carry much more nuance.

Experiments I want to build on top of it:

1. Adding LSP context to the embeddings - that way the model could _see_ the syntax better, closer to how we use IDEs, and would not need to read/grep 25k lines just to find where something is used.

2. Experiments with different "compression" ratios. Each embedding could encode a different number of bytes, and we would not rely on a huge static token dictionary.

I'm aware that papers exist that explore these ideas, but so far no popular/good open-source models employ them. Unless someone can prove me wrong.
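As a rough sketch of one reading of that idea (not the commenter's actual code; the chunk size, dimensions, and module names are made up): a small byte-level encoder compresses each fixed-size chunk of raw bytes into one continuous embedding that the backbone consumes in place of a token-embedding lookup, and a byte decoder head predicts the bytes back out.

    import torch
    import torch.nn as nn

    CHUNK, D = 8, 256   # bytes per "token slot" and embedding width (illustrative)

    class ByteEncoder(nn.Module):
        """Compress a chunk of raw bytes into a single continuous embedding."""
        def __init__(self):
            super().__init__()
            self.byte_emb = nn.Embedding(256, D)
            self.gru = nn.GRU(D, D, batch_first=True)

        def forward(self, byte_chunks):                 # (B, T, CHUNK) byte ids
            b, t, c = byte_chunks.shape
            x = self.byte_emb(byte_chunks.view(b * t, c))
            _, h = self.gru(x)                          # final hidden state summarises the chunk
            return h[-1].view(b, t, D)                  # (B, T, D): one embedding per chunk

    class ByteDecoder(nn.Module):
        """Predict the CHUNK bytes back out of one embedding (used as the output head)."""
        def __init__(self):
            super().__init__()
            self.proj = nn.Linear(D, CHUNK * 256)

        def forward(self, h):                           # (B, T, D)
            return self.proj(h).view(*h.shape[:-1], CHUNK, 256)  # per-byte logits

    encoder, decoder = ByteEncoder(), ByteDecoder()
    backbone = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(D, nhead=4, batch_first=True), num_layers=2
    )

    raw = torch.randint(0, 256, (2, 16, CHUNK))         # 2 sequences of 16 byte chunks
    h = backbone(encoder(raw))                          # backbone sees embeddings, not token ids
    logits = decoder(h)                                 # (2, 16, CHUNK, 256) byte predictions
    loss = nn.functional.cross_entropy(logits.reshape(-1, 256), raw.reshape(-1))

Extra context (such as the LSP signals mentioned above) could in principle be concatenated to the chunk embedding before it enters the backbone, though that part is pure speculation here.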

lostmsu | yesterday at 12:48 PM

Comparison with a vanilla transformer of the same size/FLOPs budget?

keyle | yesterday at 1:01 PM

Does this make any sense, to anyone?

geoffbp | yesterday at 1:49 PM

I dug into this a bit (with AI ofc) and it spat this out. I found it an easy way to visualise and start to understand the idea:

> Standard AI models (like GPT-4) treat data using Global Geometry. They imagine every word as a point floating in a massive, flat, high-dimensional room. To see how two words relate, they draw a straight line between them.

> Local Topology changes the "room" into a landscape (a manifold). Instead of a flat void, the data exists on a curved surface that has hills, valleys, and paths.
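One way to make that contrast concrete is a toy numpy sketch (my own illustration, not the paper's construction): compare the "straight line through a flat room" view (cosine similarity) with a distance that only trusts local neighbourhoods and chains them together, i.e. shortest paths through a k-nearest-neighbour graph, a crude geodesic over the data "landscape".

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 16))        # 50 toy "word" vectors in a 16-D space

    # Global geometry: draw a straight line between any two points (cosine similarity).
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    cosine_sim = Xn @ Xn.T

    # Local topology: only trust distances to nearby neighbours, then chain them.
    # Build a k-nearest-neighbour graph and measure distance as the shortest path
    # through it, a crude geodesic over the surface the data lives on.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # pairwise Euclidean
    k = 5
    graph = np.full_like(d, np.inf)
    np.fill_diagonal(graph, 0.0)
    for i in range(len(X)):
        nn = np.argsort(d[i])[1:k + 1]   # k nearest neighbours of point i (skip itself)
        graph[i, nn] = d[i, nn]
        graph[nn, i] = d[i, nn]          # keep the graph symmetric

    # Floyd-Warshall shortest paths; pairs unreachable through the graph stay at inf.
    geodesic = graph.copy()
    for m in range(len(X)):
        geodesic = np.minimum(geodesic, geodesic[:, m:m + 1] + geodesic[m:m + 1, :])

    print(cosine_sim[0, 1], geodesic[0, 1])   # flat-space similarity vs manifold-aware distance

The two measures can disagree: points that are close in a straight line may be far apart when you have to travel along the local neighbourhood structure.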
