That's a brilliant trick. The setup overhead and permission requirements for perf_event might be heavy for arbitrary threads, but for long-lived threads it looks pretty awesome! Thanks for sharing!
Yes, you need some lazy setup in thread-local state to use this. And short-lived threads should be avoided anyway :)
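For anyone wondering what that lazy setup could look like, here is a rough sketch, not the exact code behind the trick. It assumes Linux and, purely for illustration, the software task-clock event; the thread doesn't say which event is actually used. The names `thread_counter_fd` and `thread_counter_read` are made up for this example. Each thread opens its own perf_event descriptor on first use and caches it in thread-local storage, so the setup cost is paid once per thread.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Per-thread state: 0 = not tried yet, 1 = ready, -1 = setup failed. */
static _Thread_local int tl_state = 0;
static _Thread_local int tl_fd = -1;

/* Hypothetical helper: returns this thread's perf_event fd, opening it on
 * first use. Returns -1 if the kernel refused (e.g. perf_event_paranoid or
 * a seccomp filter); the failure is cached so we do not retry every call. */
static int thread_counter_fd(void)
{
    if (tl_state == 0) {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof attr);
        attr.size = sizeof attr;
        attr.type = PERF_TYPE_SOFTWARE;
        attr.config = PERF_COUNT_SW_TASK_CLOCK; /* assumed event, for illustration */

        /* pid = 0, cpu = -1: count for the calling thread on any CPU.
         * glibc has no wrapper for perf_event_open, hence the raw syscall. */
        tl_fd = (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
        tl_state = (tl_fd >= 0) ? 1 : -1;
    }
    return tl_fd;
}

/* Hypothetical helper: stores the current counter value in *out.
 * Returns 0 on success, -1 if setup or the read failed. */
static int thread_counter_read(uint64_t *out)
{
    int fd = thread_counter_fd();
    if (fd < 0)
        return -1;
    return read(fd, out, sizeof *out) == (ssize_t)sizeof *out ? 0 : -1;
}
```

Note that this sketch never closes the descriptor when a thread exits. A real version would probably register a destructor (for example via `pthread_key_create`) to do that, which is one more reason this fits long-lived threads better than short-lived ones.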