While working on my game, I tried implementing a custom priority queue.
It ended up being > 2x faster in the debug build, but 2x-5x slower in the release build (??!!?) [1]. I haven't learned much about compilers/"lower level" C++, so I moved on at that point.
How it worked:
1.) The P.Q. created a vector and resized it to known bounds.
2.) The P.Q. kept track of and updated the "active sorting range" each time an element was inserted or popped.
2B.) So each time an element was added, it used the next unused vector slot and advanced the ".end" of the range to sort
2C.) Each time an element was removed, it advanced the ".start" of the range
3.) In theory this should have saved reallocation overhead.
[1] I believe Visual Studio uses /Od (no optimization) for debug builds, and /O2 for release.
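Very roughly, the shape of it was something like this (a simplified sketch, not my actual code -- I'm also glossing over exactly when the active range got re-sorted by just sorting on every push):

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // One vector resized up front; [start, end) is the "active sorting range".
    template <typename T>
    class RangePQ {
    public:
        explicit RangePQ(std::size_t max_elements) : buf_(max_elements) {}

        void push(const T& value) {
            buf_[end_++] = value;                 // take the next unused slot
            std::sort(buf_.begin() + start_,      // keep the active range ordered,
                      buf_.begin() + end_);       // smallest element first
        }

        T pop() { return buf_[start_++]; }        // shrink the range from the front

        bool empty() const { return start_ == end_; }

    private:
        std::vector<T> buf_;
        std::size_t start_ = 0;   // first live element
        std::size_t end_   = 0;   // one past the last live element
    };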
I'm curious what you did with the "active sorting range" after a push/pop event. Since it's a vector underneath, I don't see any option other than to re-sort the entire range after each event, which is O(N log N) for a full sort (or O(N) if you only shift the new element into its sorted position). That would surely destroy performance, right?
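For comparison, the standard heap algorithms (which std::priority_queue is specified in terms of) keep the same flat-vector storage but only pay O(log N) per operation instead of touching the whole range:

    #include <algorithm>
    #include <vector>

    int main() {
        std::vector<int> heap;
        int values[] = {3, 1, 4, 1, 5};

        // push: append, then sift the new element up -- O(log N) per push
        for (int v : values) {
            heap.push_back(v);
            std::push_heap(heap.begin(), heap.end());
        }

        // pop: move the max to the back, then shrink the heap -- O(log N) per pop
        std::pop_heap(heap.begin(), heap.end());
        int top = heap.back();   // 5
        heap.pop_back();
        (void)top;               // silence unused-variable warnings
    }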
These kinds of things are really interesting, and a good example of why performance testing should be done on optimized builds. I encountered something similar and realized the issue was that my structures had different sizes in debug vs. release, which changed how they mapped onto CPU cache lines and caused extra cache invalidation. Though that sounds unlikely here.
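A cheap way to catch that kind of thing (Particle below is just a made-up example struct) is to print, or static_assert, the sizes in both configurations and compare:

    #include <cstdio>

    struct Particle {
        float position[3];
        float velocity[3];
    };

    int main() {
        // Run this in both debug and release; padding changes, debug-only members,
        // or different packing settings show up immediately as a size difference.
        std::printf("sizeof(Particle) = %zu, alignof = %zu\n",
                    sizeof(Particle), alignof(Particle));
        // static_assert(sizeof(Particle) == 24, "layout changed between builds");
    }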