This also adds new registers to operate on (more state): at least 1 KB more state (512 bits x 16).
It reuses AMX registers, so I think the only new state is the block scale register (1024 bits)?