Linear-time attention doesn’t work, as a matter of principle. It’s a dead-end pursuit. There’s a lot of great research on more efficient quadratic-time inference.
What about n log n?