
janalsncm · 04/02/2025

As I understand it, BLT (Byte Latent Transformer) uses a small neural network to group bytes into patches instead of a fixed tokenizer, but doesn't change the attention mechanism. MTA (Multi-Token Attention) keeps traditional BPE tokenization but changes the attention mechanism itself. You could use both (latency be damned!) — roughly along the lines of the sketch below.
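
To make "use both" concrete, here's a rough PyTorch sketch of how the two ideas might compose: BLT-style entropy patching on the input side, and an MTA-style convolution over attention logits on top. The entropy threshold, the mean-pooling of patches, and the fixed 3x3 averaging kernel are my own toy simplifications, not the papers' actual code.

```python
import torch
import torch.nn.functional as F

def entropy_patch_boundaries(byte_logits: torch.Tensor, threshold: float = 2.0):
    """BLT-style patching (simplified): start a new patch wherever the
    small byte model's next-byte entropy exceeds a threshold."""
    # byte_logits: (seq_len, 256) next-byte logits from a tiny byte-level LM
    probs = F.softmax(byte_logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1)      # (seq_len,)
    boundaries = entropy > threshold
    boundaries[0] = True                                          # first byte always opens a patch
    return boundaries

def pool_patches(byte_emb: torch.Tensor, boundaries: torch.Tensor):
    """Mean-pool byte embeddings into one vector per patch."""
    patch_ids = boundaries.cumsum(0) - 1                          # (seq_len,)
    n_patches = int(patch_ids.max().item()) + 1
    d = byte_emb.size(-1)
    patches = torch.zeros(n_patches, d).index_add_(0, patch_ids, byte_emb)
    counts = torch.zeros(n_patches).index_add_(0, patch_ids, torch.ones(len(patch_ids)))
    return patches / counts.unsqueeze(-1)

def mta_like_attention(q, k, v, kernel_size: int = 3):
    """MTA-style attention (simplified): convolve the pre-softmax logits
    over neighbouring (query, key) positions so each score can mix
    information from nearby pairs, then mask and softmax as usual."""
    d = q.size(-1)
    logits = q @ k.transpose(-1, -2) / d**0.5                     # (n, n)
    mask = torch.ones_like(logits, dtype=torch.bool).tril()       # causal mask
    logits = logits.masked_fill(~mask, 0.0)                       # hide future keys before mixing
    # The paper learns this kernel per head; a fixed averaging kernel here
    # just shows the plumbing.
    kernel = torch.full((1, 1, kernel_size, kernel_size), 1.0 / kernel_size**2)
    logits = F.conv2d(logits[None, None], kernel, padding=kernel_size // 2)[0, 0]
    logits = logits.masked_fill(~mask, float("-inf"))             # mask again before softmax
    return F.softmax(logits, dim=-1) @ v

if __name__ == "__main__":
    seq_len, d_model = 32, 64
    byte_emb = torch.randn(seq_len, d_model)                      # stand-in byte embeddings
    byte_logits = torch.randn(seq_len, 256)                       # stand-in tiny byte-LM output
    patches = pool_patches(byte_emb, entropy_patch_boundaries(byte_logits))
    out = mta_like_attention(patches, patches, patches)           # one "latent" attention layer
    print(patches.shape, out.shape)
```

The latency point stands: you'd be running a byte-level model over the raw input and a 2D convolution over every attention map, on top of the usual transformer work.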