>Is it possible to ask the vision agent to "map"
No. Most vision models focus on a subset of the image at a time when doing image -> text.
Image -> image uses the whole image.