This is what VLA (vision-language-action) models are for. They would work much better. They'd need a bit of fine-tuning, but probably not much. There's lots of literature out there on using VLAs to control drones.
Did some research and found a model that does exactly that: https://cognitivedrone.github.io/