That actually sounds like something Claude could do pretty easily.
Yegge's book describes his coauthor's first vibe coding project. It went through screenshots he'd saved of YouTube videos, read the timestamp in each with OCR, looked up the transcripts, and generated video snippets with subtitles added. (I think this was before YouTube added subtitles itself.) He had it done in 45 minutes.
And using agents to control other applications is pretty common.