I think that can pretty realistically be done with Gemma or Qwen, although maybe with some delay. They run great on Android in the Edge Gallery app.
Further, you could allow for voice input by running Whisper STT locally, then doing a small context-aware correction pass with Gemma or Qwen to fix the words it got wrong.
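To make the correction pass a bit more concrete, here is a rough Python sketch of how it might look. Everything in it is an assumption for illustration: `generate` stands in for whatever local Gemma or Qwen call you end up with, `context_terms` is a hypothetical list of words pulled from the user's existing notes, and the 0.6 similarity cutoff is an arbitrary starting point. The transcript itself would come from Whisper, which is not shown here.

```python
from difflib import SequenceMatcher
from typing import Callable, List


def build_correction_prompt(transcript: str, context_terms: List[str]) -> str:
    """Build a prompt asking the model to fix misheard words only."""
    terms = ", ".join(context_terms) if context_terms else "(none)"
    return (
        "Fix only words that were misrecognized in the transcript below. "
        "Do not rephrase or add anything.\n"
        f"Terms that appear in the user's notes: {terms}\n"
        f"Transcript: {transcript}\n"
        "Corrected transcript:"
    )


def correct_transcript(
    transcript: str,
    context_terms: List[str],
    generate: Callable[[str], str],  # hypothetical wrapper around the local LLM
    min_similarity: float = 0.6,     # assumed cutoff, tune on real data
) -> str:
    """Run the correction pass, falling back to the raw transcript."""
    candidate = generate(build_correction_prompt(transcript, context_terms)).strip()
    if not candidate:
        return transcript
    # Small models sometimes rewrite or answer instead of correcting, so
    # reject any output that drifts too far from what was actually said.
    similarity = SequenceMatcher(None, transcript.lower(), candidate.lower()).ratio()
    return candidate if similarity >= min_similarity else transcript
```

The fallback is the important part of the design: if the model returns nothing or something unrelated, the user still gets the raw Whisper text rather than a hallucinated note.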
The issue with those solutions is that they would balloon my app size, because I'd need to either embed the model or add a mechanism to download the models afterwards, for something that is essentially a note-taking app. But maybe I can make it an option and word it effectively. That is an idea! Your idea for doing a context-aware correction pass on STT is very interesting and something I hadn't thought of yet. Thank you for your thoughts!
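For the opt-in download route, a minimal sketch of the logic might look like the following. It is written in Python only to show the flow, not as the actual app code. All names here are hypothetical: `fetch` is whatever function downloads the model bytes, `user_opted_in` would come from a settings toggle, and `expected_sha256` is assumed to be a checksum shipped with the app.

```python
import hashlib
from pathlib import Path
from typing import Callable, Optional


def ensure_model(
    model_path: Path,
    user_opted_in: bool,
    fetch: Callable[[], bytes],  # hypothetical downloader returning the model bytes
    expected_sha256: str,        # assumed to be bundled with the app
) -> Optional[Path]:
    """Return the model path if available, downloading it only on opt-in."""
    if model_path.exists():
        return model_path
    if not user_opted_in:
        # The feature stays off and the app stays small.
        return None
    data = fetch()
    if hashlib.sha256(data).hexdigest() != expected_sha256:
        raise ValueError("downloaded model failed checksum verification")
    model_path.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary name first so an interrupted write never
    # leaves a half-written file at the final path.
    tmp_path = model_path.with_suffix(model_path.suffix + ".part")
    tmp_path.write_bytes(data)
    tmp_path.replace(model_path)
    return model_path
```

With this shape the base install carries no model at all, and the voice feature only appears once `ensure_model` returns a path. A real model file would be large enough that streaming it to disk makes more sense than holding it in memory as done here.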