I am very interested in this and would like an authoritative answer. I even went as far as buying some books on code optimization in the context of HFT, and I was not impressed. Not a single snippet of assembly; how are you optimizing anything if you don't look at what the compiler produces?
But on Java specifically: every Java object still carries a header, 12 to 16 bytes on 64-bit HotSpot before alignment padding. How does that not thrash your cache?
The advice on avoiding allocations in Java also results in terrible code. For example, in math libraries, you'll often see void Add(Vector3 a, Vector3 b, Vector3 out) as opposed to the more natural Vector3 Add(Vector3 a, Vector3 b). There you go: function composition goes out the window, and the resulting code is garbage to read and write. Not even C is that bad; the compiler will optimize the temporaries away. So you end up with Java that is worse than a low-level imperative language.
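To make the complaint concrete, here is a minimal sketch of the two styles side by side. Vec3, add, addInto, and Vec3Demo are hypothetical names made up for illustration, not taken from any real math library. Whether the temporary in the composed version actually reaches the heap depends on the JIT (HotSpot's escape analysis can sometimes remove it), so treat the allocation comments as the worst case rather than a guarantee.

```java
// Hypothetical mutable 3-vector; all names here are illustrative only.
final class Vec3 {
    double x, y, z;

    Vec3(double x, double y, double z) {
        this.x = x;
        this.y = y;
        this.z = z;
    }

    // Natural style: returns a new object, so calls compose like expressions.
    static Vec3 add(Vec3 a, Vec3 b) {
        return new Vec3(a.x + b.x, a.y + b.y, a.z + b.z);
    }

    // Allocation-avoiding style: the caller supplies the destination.
    // Each component is read before it is written, so out may alias a or b.
    static void addInto(Vec3 a, Vec3 b, Vec3 out) {
        out.x = a.x + b.x;
        out.y = a.y + b.y;
        out.z = a.z + b.z;
    }
}

final class Vec3Demo {
    // a + b + c as one expression; in the worst case this allocates
    // one temporary for (a + b) plus the result object.
    static Vec3 sumComposed(Vec3 a, Vec3 b, Vec3 c) {
        return Vec3.add(Vec3.add(a, b), c);
    }

    // The same sum with out-parameters: no allocation inside the method,
    // but the expression has to be unrolled into statements by hand.
    static void sumInto(Vec3 a, Vec3 b, Vec3 c, Vec3 out) {
        Vec3.addInto(a, b, out);
        Vec3.addInto(out, c, out);
    }

    public static void main(String[] args) {
        Vec3 a = new Vec3(1, 2, 3);
        Vec3 b = new Vec3(10, 20, 30);
        Vec3 c = new Vec3(100, 200, 300);

        Vec3 composed = sumComposed(a, b, c);

        Vec3 scratch = new Vec3(0, 0, 0); // reused across calls in real code
        sumInto(a, b, c, scratch);

        System.out.println(composed.x + " " + composed.y + " " + composed.z);
        System.out.println(scratch.x + " " + scratch.y + " " + scratch.z);
    }
}
```

Both versions compute the same sum; the difference is that the second one forces every intermediate result to have a name and a lifetime managed by the caller, which is exactly the readability cost I'm complaining about.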
And, as far as I know, the best GC for Java still incurs pauses of no less than 1 ms? I think the stock ones are as bad as 10 ms. How anyone does low-latency anything in Java then boggles my mind.