A test that runs for 15 hours on a high powered rig is going to be hard to reproduce or scale. But I think this addresses a widespread concern, which affects all kinds of cloud services. What you ping is not necessarily what you get.
You can run the whole suite once at the start for each vendor, then roll through each part of it over a two or four week cycle, mimicking regular use. That jeeps the evaluation up to date over time.
My reading of the article is that the first audience for this test is the vendors themselves. The test is long and comprehensive to give the vendor confidence in its own hosting.