Alternatively: some people are just better at / more comfortable thinking in auditory mode than visual mode & vice versa.
In principle I don't see why typed and spoken messages should carry different amounts of thought. That'd be bounded by how much time it takes to produce the message, I think. Typing permits backtracking via editing, but speaking permits 'semantic backtracking' (restating or revising aloud what you just said), which isn't equivalent but can accomplish similar things. Language is powerful.
And importantly, to backtrack through visual media I tend to need to re-saccade through the text with physical eye motions, whereas with audio my brain keeps an internal buffer I can access at the speed of thought.
Typed messages might have a higher _density_ of thought per token, but how valuable is that, really, in LLM contexts? There are diminishing returns on polishing a prompt toward perfection.
Also, audio permits a higher-bandwidth mode: one can scan text and speak at the same time.