Was the abstract written by ChatGPT? It's an unreadable wall of text.
It was fun trying out the demo. With the "coffee kettle pouring" video it did really well, segmenting the man's hand and arm correctly in every frame and tracking them throughout. But with the "find the ball cup game" video it lost the tracked cup in a strange way: it followed it correctly while it passed behind other cups, but once it was no longer occluded, it switched to another cup.
It's still impressive to me that it kept track through two occlusions, but strange that it lost the cup when it wasn't occluded at all.
Does anyone know of a way to plug the output of models like this into traditional video editing software like Adobe Premiere?
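The most practical bridge I've found is to render the per-frame masks as a grayscale PNG sequence and bring that into Premiere as a luma matte via the Track Matte Key effect. A minimal sketch, assuming you already have the masks as per-frame boolean numpy arrays; the masks dict and the export_matte_sequence name are just placeholders for whatever your model gives you:

    # Dump per-frame binary masks (e.g. from a SAM2-style video predictor)
    # as a grayscale PNG sequence usable as a luma matte in Premiere.
    import os
    import numpy as np
    from PIL import Image

    def export_matte_sequence(masks, out_dir="matte"):
        """masks: dict mapping frame_idx -> HxW boolean array for one object."""
        os.makedirs(out_dir, exist_ok=True)
        for frame_idx, mask in sorted(masks.items()):
            # White where the object is, black elsewhere.
            frame = mask.astype(np.uint8) * 255
            Image.fromarray(frame, mode="L").save(
                os.path.join(out_dir, f"matte_{frame_idx:05d}.png")
            )

In Premiere you then import the PNG sequence as a single clip, place it on the track above your footage, and apply Track Matte Key (Composite Using: Matte Luma) to the footage so effects or cut-outs are limited to the tracked object.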
What I'd love to see is how these tools perform on low depth of field shots, e.g. one actor in focus and another out of focus in front of them, standing in front of a street with moving traffic.
This kind of "cinematic" shot is where automatic masking tools typically fall apart.
"On mobile devices such as iPhone 15 Pro Max, our EfficientTAMs can run at ~10 FPS for performing video object segmentation with reasonable quality"
This is pretty impressive! Lowering the compute requirements will make many more applications feasible.
Interesting. I saw this on here a few days back: https://arxiv.org/pdf/2411.11922, but I haven't actually read either paper. Anyone who's looked at both care to give us a TL;DR?
I wish these things were described more clearly. Is this single-object tracking or multi-object tracking? Just a week ago SAMURAI was posted here, which is kind of the same thing, promising SOTA tracking performance using SAM2. But it only allows single-object tracking, which makes it useless for many medical imaging tasks.
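For what it's worth, plain SAM2's video predictor already handles multiple objects: you prompt each object with its own obj_id and run one propagation pass, so presumably a SAM2-style model like this one could expose the same interface. From memory of the facebookresearch/sam2 repo (untested here, so treat the exact config/checkpoint names and signatures as assumptions), it looks roughly like this:

    # Rough sketch of multi-object tracking with the SAM2 video predictor,
    # written from memory of facebookresearch/sam2; names may be slightly off.
    import numpy as np
    from sam2.build_sam import build_sam2_video_predictor

    predictor = build_sam2_video_predictor("sam2_hiera_l.yaml", "sam2_hiera_large.pt")
    state = predictor.init_state(video_path="./cup_game_frames")

    # One positive point prompt per object, each with its own obj_id.
    for obj_id, (x, y) in enumerate([(210, 350), (410, 360)], start=1):
        predictor.add_new_points_or_box(
            inference_state=state,
            frame_idx=0,
            obj_id=obj_id,
            points=np.array([[x, y]], dtype=np.float32),
            labels=np.array([1], dtype=np.int32),  # 1 = positive click
        )

    # A single propagation pass yields masks for every prompted obj_id per frame.
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        masks = {oid: (mask_logits[i] > 0).cpu().numpy() for i, oid in enumerate(obj_ids)}

So if SAMURAI restricts this to a single object, that seems to be its own design choice rather than something inherited from SAM2.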