You likely tripped over a difference in power management profiles (and capabilities) between Intel and ARM.
You're testing "variability" and latency, and you even mention that "modern Intel CPUs tend to ramp frequency..." but entirely neglect to mention which specific Windows Power Profile you were using.
Fundamentally, you're benchmarking a server operating system on laptop and/or desktop-class hardware, and the machines aren't the same spec either. That is, you're not controlling for differences in memory bandwidth, SSD performance, etc.
Even on server hardware the power profiles matter! A lot more than you think!
One of my gimmicks in my consulting gig is to change Intel server power settings from "Balanced" to "Maximum Performance" and gloat as the customer makes the Shocked Pikachu face because their $$$ "enterprise grade server" instantly triples in performance for the cost of a button press.
Not to mention that by testing this in VMs, you're benchmarking three layers: the outer OS (and its power management), the hypervisor stack, and the inner guest OS.
Both Windows 11 systems are configured with the “High performance” power plan, as are the two Windows Server VMs. In hindsight, I should have included this detail explicitly in the original post instead of only alluding to it.
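For anyone repeating these tests, here is a minimal sketch of how the active power plan could be recorded alongside each benchmark run, so the detail does not get lost. It assumes the English-language output of `powercfg /getactivescheme`, and the helper names `parse_active_scheme` and `active_power_scheme` are hypothetical, not part of any existing tool. It only sees the OS it runs in, so on a virtualized setup it would need to be run on the host and in each guest separately.

```python
import re
import subprocess

# Matches the line printed by `powercfg /getactivescheme`, e.g.
# "Power Scheme GUID: 8c5e7fda-e8bf-4a96-9a85-a6e23a8c635c  (High performance)"
# Assumption: the "GUID: <guid>  (<name>)" layout; localized builds may differ.
_SCHEME_RE = re.compile(r"GUID:\s*([0-9a-fA-F-]{36})\s*\((.+?)\)\s*$")


def parse_active_scheme(output):
    """Return (guid, name) parsed from powercfg output, or None if nothing matches."""
    for line in output.splitlines():
        match = _SCHEME_RE.search(line)
        if match:
            return match.group(1).lower(), match.group(2)
    return None


def active_power_scheme():
    """Run powercfg (Windows only) and return the active (guid, name) pair."""
    result = subprocess.run(
        ["powercfg", "/getactivescheme"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_active_scheme(result.stdout)
```

Calling `active_power_scheme()` at the start of a benchmark script and writing the returned name into the results file would make the configuration explicit in every published number.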