> It's also not very great at meeting summaries especially those where many speakers are in a room on the same microphone.
It is astonishingly poor at this. My intuition was that it should be good at this (it is basically a translation problem, right? And LLMs are fundamentally translation systems), but the practical results are so poor. It's not just misidentifying speakers (frequently saying PersonX responded to PersonX) but arriving at conclusions that are the complete opposite of what was actually said.
I'm genuinely intrigued as to what approaches have been taken in this space and what the "hard problem" is that is stopping it from being good.