C# is very fast (see the multicore rating). The implementation is based on SIMD (vector), memory spans, stackalloc, source generators and what have you. Modern C# allows you to go very low-level and very fast.
Probably even faster under .NET 10.
Though using Stopwatch for the benchmark is killing me :-) I wonder if multiple runs via BenchmarkDotNet would show better times (also due to JIT optimizations). For example, the Java code had more warm-up iterations before measuring.
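To show what I mean by warming up before measuring, here is a rough sketch that still uses Stopwatch but throws away a few runs first. Everything in it is made up for illustration: `Work` is a hypothetical stand-in for the real code, and the run counts are my assumptions, not what the original benchmark or the Java version uses.

```csharp
using System;
using System.Diagnostics;

static class WarmupTiming
{
    // Hypothetical workload standing in for the code under test.
    static long Work()
    {
        long sum = 0;
        for (int i = 0; i < 1_000_000; i++) sum += i % 7;
        return sum;
    }

    static void Main()
    {
        const int warmupRuns = 20;   // assumed count, not from the original benchmark
        const int measuredRuns = 10; // assumed count

        long sink = 0; // accumulate results so the work cannot be optimized away

        // Warm-up: give the JIT a chance to compile and optimize the hot path.
        // These timings are deliberately not recorded.
        for (int i = 0; i < warmupRuns; i++) sink += Work();

        // Measured runs: time each one separately and keep the best.
        double bestMs = double.MaxValue;
        for (int i = 0; i < measuredRuns; i++)
        {
            var sw = Stopwatch.StartNew();
            sink += Work();
            sw.Stop();
            bestMs = Math.Min(bestMs, sw.Elapsed.TotalMilliseconds);
        }

        Console.WriteLine($"best of {measuredRuns}: {bestMs:F3} ms (sink={sink})");
    }
}
```

This is still a poor man's version. BenchmarkDotNet handles warm-up, iteration counts, and the statistics for you, so it would be the better choice for a fair comparison.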