yes it transcribes inputs automatically, but not in realtime.
outputs are sent in text + audio but you'll get the text very quickly and audio a bit slower, and of course the audio takes time to play back. the text also doesn't currently have timing cues so its up to you if you want to try to play it "in sync". if the user interrupts the audio, you need to send back a truncation event so it can roll its own context back, and if you never presented the text to the user you'll need to truncate it there as well to ensure your storage isn't polluted with fragments the user never heard.