Hacker News

keyle · yesterday at 1:01 PM · 5 replies

Does this make any sense to anyone?


Replies

kannanvijayan · yesterday at 1:17 PM

I think this is an attempt to enrich the locality model in transformers.

One of the weird things you do in transformers is add a position vector which captures the distance between the token being attended to and some other token.

This is obviously not powerful enough to express non-linear relationships - like graph relationships.

This person seems to be experimenting with pre-processing the input token set to linearly reorder it by some other heuristic that might map more closely to the actual underlying relationships between tokens.
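
For concreteness, here's a minimal sketch (my own illustration, not the paper's method) of one such reordering heuristic: spectral ordering, where you build a token-affinity graph, take its graph Laplacian, and sort tokens by the Fiedler vector so that strongly related tokens end up adjacent in the 1D sequence.

```python
import numpy as np

def spectral_reorder(affinity: np.ndarray) -> np.ndarray:
    """Symmetric (N, N) affinity matrix -> permutation that groups related tokens."""
    degree = np.diag(affinity.sum(axis=1))
    laplacian = degree - affinity                  # unnormalized graph Laplacian
    _, eigvecs = np.linalg.eigh(laplacian)         # symmetric eigendecomposition, ascending eigenvalues
    fiedler = eigvecs[:, 1]                        # eigenvector of the 2nd-smallest eigenvalue
    return np.argsort(fiedler)                     # 1D ordering that respects graph locality

# Toy usage: tokens 0/2 and 1/3 are strongly related, everything else weakly.
A = np.array([[0.0, 0.1, 0.9, 0.1],
              [0.1, 0.0, 0.1, 0.9],
              [0.9, 0.1, 0.0, 0.1],
              [0.1, 0.9, 0.1, 0.0]])
print(spectral_reorder(A))  # e.g. [1 3 0 2] or [0 2 1 3]: related pairs end up adjacent
```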

tuned · today at 6:43 AM

it made sense to me, as it's a very simple idea I guess: causal self-attention computes QKV distances on the full vectors for Q, K, and V; the topological transformer can provide the same computation using Q, a scalar K, and V. Instead of [N², N², N²], [N², N, N²] is used. If generation is confirmed to be on par in terms of quality, the gains are evident.
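
One possible reading of that bracket notation (my own bookkeeping with assumed N and d, not figures from the repo): the N×N score grid and the value mixing stay, but the per-token key shrinks from a d-dimensional vector to a single scalar.

```python
# Rough size accounting under the reading above; N and d are assumed values.
N, d = 1024, 64

pairwise_scores    = N * N   # N x N attention score grid (present in both variants)
full_key_storage   = N * d   # d-dim key vector per token (standard attention)
scalar_key_storage = N       # one scalar per token (claimed variant)

print(pairwise_scores, full_key_storage, scalar_key_storage)  # 1048576 65536 1024
```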

liteclient · yesterday at 1:17 PM

it makes sense architecturally

they replace dot-product attention with topology-based scalar distances derived from a Laplacian embedding - that effectively reduces attention scoring to a 1D energy comparison, which can save memory and compute
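
A minimal sketch of that reading (my interpretation, not the repo's actual code), assuming each token gets one scalar coordinate from a Laplacian eigenvector and scores are just 1D distances between those coordinates:

```python
import numpy as np

def laplacian_1d_coords(affinity: np.ndarray) -> np.ndarray:
    """Symmetric (N, N) token-affinity matrix -> (N,) scalar coordinate per token."""
    degree = np.diag(affinity.sum(axis=1))
    laplacian = degree - affinity                     # unnormalized graph Laplacian
    _, eigvecs = np.linalg.eigh(laplacian)
    return eigvecs[:, 1]                              # Fiedler vector as the 1D "energy" coordinate

def topo_attention(V: np.ndarray, coords: np.ndarray) -> np.ndarray:
    # "1D energy comparison": closer scalar coordinates -> higher attention weight
    dist = np.abs(coords[:, None] - coords[None, :])  # (N, N) scalar distances, no d-dim dot products
    logits = -dist
    attn = np.exp(logits - logits.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)          # row-wise softmax
    return attn @ V                                   # (N, d) mixed values

rng = np.random.default_rng(0)
N, d = 8, 4
affinity = rng.random((N, N))
affinity = (affinity + affinity.T) / 2                # symmetrize the toy affinity graph
np.fill_diagonal(affinity, 0.0)
out = topo_attention(rng.standard_normal((N, d)), laplacian_1d_coords(affinity))
print(out.shape)                                      # (8, 4)
```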

that said, i’d treat the results with a grain of salt given there is no peer review, and benchmarks are only on a 30M-parameter model so far

bee_rider · yesterday at 5:20 PM

I haven’t read the paper yet, but the graph Laplacian is quite useful in reordering matrices, so it isn’t that surprising if they managed to get something out of it in ML.

pwndByDeath · yesterday at 2:29 PM

No, it's a new form of alchemy that turns electricity into hype. The technical jargon is more of a thieves' cant to help identify other conmen to one another.
