Isn't that something different? If I prompt an LLM to identify the speaker, that's different from keeping track of speaker while processing a different prompt.