Look into RWKV.
Yeah, RWKV is definitely related in spirit (recurrent state for long context). Here I’m combining local windowed attention with a gated recurrent path plus KV cache compression, so it’s more of a hybrid than a full replacement for attention.
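To make the hybrid idea concrete, here is a rough toy sketch in plain Python. It is not the actual implementation: the function name `hybrid_forward`, the fixed `decay` constant, the sigmoid gate, and the choice to "compress" the cache by folding evicted values into a single running state are all assumptions made for illustration.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def hybrid_forward(queries, keys, values, window=4, decay=0.9):
    """Toy hybrid layer (illustrative only): local attention over the last
    `window` tokens plus a gated running summary of evicted tokens."""
    dim = len(values[0])
    cache = []            # (key, value) pairs, never more than `window`
    state = [0.0] * dim   # recurrent summary of everything evicted so far
    outputs = []
    for q, k, v in zip(queries, keys, values):
        cache.append((k, v))
        if len(cache) > window:
            _, old_v = cache.pop(0)
            # Assumed compression scheme: fold the evicted value into the
            # state with an exponential moving average.
            state = [decay * s + (1.0 - decay) * x for s, x in zip(state, old_v)]
        # Scaled dot-product attention restricted to the local window.
        scale = 1.0 / math.sqrt(len(q))
        weights = softmax([dot(q, ck) * scale for ck, _ in cache])
        attn = [sum(w * cv[i] for w, (_, cv) in zip(weights, cache))
                for i in range(dim)]
        # Scalar sigmoid gate (computed from the query and the state) that
        # controls how much of the long-range summary is mixed in.
        gate = 1.0 / (1.0 + math.exp(-dot(q, state)))
        outputs.append([a + gate * s for a, s in zip(attn, state)])
    return outputs, len(cache)
```

The point of the sketch is only the shape of the design: attention cost and cache size stay bounded by `window`, while anything older survives only through the fixed-size recurrent state. A real model would presumably use learned gates and projections rather than a constant `decay`.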