> N tokens looking at N tokens is quadratic
Convolving two arrays can be done perfectly accurately in O(n log n), despite every element being combined with every other element.
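A minimal sketch (assuming NumPy is available) that checks the FFT route against the naive double loop; same answer, but the transform path is O(n log n):

    import numpy as np

    def conv_direct(a, b):
        # Direct convolution: every a[i] meets every b[j], so O(len(a) * len(b)).
        out = np.zeros(len(a) + len(b) - 1)
        for i, ai in enumerate(a):
            for j, bj in enumerate(b):
                out[i + j] += ai * bj
        return out

    def conv_fft(a, b):
        # Same result via the convolution theorem: multiply spectra, invert.
        n = len(a) + len(b) - 1
        return np.fft.irfft(np.fft.rfft(a, n) * np.fft.rfft(b, n), n)

    a, b = np.random.rand(512), np.random.rand(512)
    assert np.allclose(conv_direct(a, b), conv_fft(a, b))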
Or consider the even more basic sum of products a[i] * b[j] for all possible i, j:
total = 0
for i in range(len(a)):
    for j in range(len(b)):
        total += a[i] * b[j]
This can be computed in linear time as sum(a) * sum(b). Your logic that 'the result contains terms of all pairs, therefore the algorithm must be quadratic' simply doesn't hold.
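A quick check in plain Python that the distributive-law rewrite gives the same total (integer inputs so the comparison is exact):

    import random

    a = [random.randint(0, 9) for _ in range(200)]
    b = [random.randint(0, 9) for _ in range(300)]

    quadratic = sum(x * y for x in a for y in b)  # touches every (i, j) pair
    linear = sum(a) * sum(b)                      # O(len(a) + len(b)) work

    assert quadratic == linear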
This brings me back to DSP class; man, learning about the FFT was eye-opening.
That's like saying sorting can be done in O(n) because radix sort exists. If you assume some structure, you lose generality, i.e. there will be some problems it can no longer solve. A sub-quadratic attention replacement can no longer approximate an arbitrary function that needs perfect memory over the sequence.
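To make the radix sort analogy concrete, here's a rough sketch of LSD radix sort for non-negative integers; the near-linear bound only exists because we assume bounded-width integer keys, which is exactly the kind of structural assumption that costs generality:

    def radix_sort(nums, base=10):
        # LSD radix sort: O(d * (n + base)) where d is the digit count.
        # Works only because the keys are assumed to be non-negative
        # integers of bounded width; arbitrary comparable objects need not apply.
        if not nums:
            return []
        digits = len(str(max(nums)))
        for place in range(digits):
            buckets = [[] for _ in range(base)]
            for x in nums:
                buckets[(x // base**place) % base].append(x)
            nums = [x for bucket in buckets for x in bucket]
        return nums

    data = [170, 45, 75, 90, 802, 24, 2, 66]
    assert radix_sort(data) == sorted(data)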
One of my favorite bits of my PhD dissertation was factoring an intractable 3-dimensional integral: the integrand happened to split as f(x, y, z) = g(x, y) h(y, z), so
\iiint f(x, y, z) \, dx \, dy \, dz = \int \left[ \int g(x, y) \, dx \right] \left[ \int h(y, z) \, dz \right] dy,
which greatly accelerated the numerical integration (O(n^2) grid evaluations rather than O(n^3)).
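I obviously can't reproduce the real integrand, but assuming it splits as f(x, y, z) = g(x, y) * h(y, z) (the g and h below are made up), a grid-based sketch of the same trick looks like this:

    import numpy as np

    # Hypothetical separable integrand: f(x, y, z) = g(x, y) * h(y, z).
    g = lambda x, y: np.exp(-(x - y) ** 2)
    h = lambda y, z: 1.0 / (1.0 + (y + z) ** 2)

    n = 100
    xs = ys = zs = np.linspace(0.0, 1.0, n)
    dx = dy = dz = xs[1] - xs[0]

    # Brute force O(n^3): evaluate f on the full 3-D grid.
    X, Y, Z = np.meshgrid(xs, ys, zs, indexing="ij")
    brute = np.sum(g(X, Y) * h(Y, Z)) * dx * dy * dz

    # Factored O(n^2): do the inner x- and z-integrals for each y,
    # then a single outer integral over y.
    inner_x = np.array([np.sum(g(xs, y)) * dx for y in ys])  # \int g(x, y) dx
    inner_z = np.array([np.sum(h(y, zs)) * dz for y in ys])  # \int h(y, z) dz
    factored = np.sum(inner_x * inner_z) * dy

    assert np.isclose(brute, factored)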
My advisor was not particularly impressed and objectively I could have skipped it and let the simulations take a bit longer (quite a bit longer--this integration was done millions of times for different function parameters in an inner loop). But it was clever and all mine and I was proud of it.