What I'd love to see is how these tools perform on shallow depth-of-field shots, e.g. one actor in focus with another actor out of focus in front of them, standing in front of a street with moving traffic.
These kinds of "cinematic" shots are where automatic masking tools typically fall apart.
This paper is for you: https://arxiv.org/abs/2411.02844