Wider SIMD would be useful, especially with AVX-512-style improvements: 1024- or even 2048-bit-wide operations.
Of course, memory bandwidth would have to increase proportionally, otherwise the cores might have no data to process.
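To put rough numbers on the bandwidth point, here is a back-of-the-envelope sketch. The one-load-per-cycle rate and the 3 GHz clock are made-up illustrative values, not figures for any real chip, and `streaming_bandwidth_gb_s` is just a hypothetical helper name.

```python
def streaming_bandwidth_gb_s(vector_bits: int, loads_per_cycle: int, clock_ghz: float) -> float:
    """Bandwidth (GB/s, 1 GB = 1e9 bytes) needed to keep one core fed
    if it issues `loads_per_cycle` full-width vector loads every cycle."""
    bytes_per_cycle = (vector_bits // 8) * loads_per_cycle
    # bytes/cycle * (1e9 cycles/s per GHz) / (1e9 bytes per GB) = GB/s
    return bytes_per_cycle * clock_ghz


# Hypothetical core: one vector load per cycle at 3 GHz.
for bits in (512, 1024, 2048):
    print(f"{bits:4d}-bit vectors: {streaming_bandwidth_gb_s(bits, 1, 3.0):6.1f} GB/s per core")
```

Under those assumptions the requirement scales linearly with the vector width, so doubling the width doubles what each core has to pull in to stay busy.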
Much better to burn the area on multiple smaller units. It's a bit more area for front-end handling, but worth it for the flexibility (see Apple's M-series chips vs. Intel's AVX*).
I would love to be able to fit small matrices (4x4 or 16x16 depending on precision) in SIMD registers together with intrinsics for matrix arithmetic.
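For a sense of which matrix sizes would fit, a quick sketch of the arithmetic. `max_square_dim` is a hypothetical helper; it only counts bits and ignores any layout or padding a real ISA might impose.

```python
import math


def max_square_dim(register_bits: int, element_bits: int) -> int:
    """Largest n such that an n x n matrix of `element_bits`-wide
    elements fits in a single `register_bits`-wide register."""
    elements = register_bits // element_bits
    return math.isqrt(elements)


for register_bits in (512, 1024, 2048):
    for element_bits in (8, 16, 32, 64):
        n = max_square_dim(register_bits, element_bits)
        print(f"{register_bits:4d}-bit register, {element_bits:2d}-bit elements: up to {n}x{n}")
```

By this count a 4x4 matrix of 64-bit values fills a 1024-bit register exactly, and a 16x16 matrix of 8-bit values fills a 2048-bit one.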
AMX registers are 1024 *bytes* each.
I wouldn't mind, but x86 might need a larger cache line size, since an AVX-512 register already spans a full 64-byte cache line.
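A small sketch of the cache-line arithmetic, assuming the usual 64-byte x86 line and a load aligned to its own width. `lines_per_load` is a hypothetical helper, not a measurement of any real hardware.

```python
CACHE_LINE_BYTES = 64  # assumed: the common x86 cache line size


def lines_per_load(vector_bits: int, cache_line_bytes: int = CACHE_LINE_BYTES) -> int:
    """Number of cache lines touched by one naturally aligned
    vector load of `vector_bits` bits."""
    vector_bytes = vector_bits // 8
    # Ceiling division: a partial line still counts as a whole line.
    return -(-vector_bytes // cache_line_bytes)


for bits in (256, 512, 1024, 2048):
    print(f"{bits:4d}-bit load touches {lines_per_load(bits)} cache line(s)")
```

With 64-byte lines, a 512-bit load covers exactly one line, while 1024-bit and 2048-bit loads would span two and four lines respectively unless the line size grows with them.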