Hacker News

ptx 12/08/2024

It certainly looks like some amount of image matching is going on. Can the model really hear the white/green sign to the left in the first example in figure 3? Can it hear the green sign to the right and red things to the left in the last example?


Replies

sdenton4 12/08/2024

Yeah, I also saw that sign and thought - 'yeah, this is bullshit.' It's got exactly the same placement in the frame - which would require some next-level beamforming capability - and also has the same color, which is impossible. There's some serious data leakage going on here.

[edit] The bottom right image is even more suspect. There's a vertical green sign in the same place on the right side of the image, but also some curious red striping in the distance in both images. One could argue 'street signs are green' but the red striping seems pretty unique, and not something where one would just guess the right color.

jebarker 12/08/2024

That would be explained by data leakage too, e.g. sampling frames in the train and test data from the same video sequences. There's nothing in the writeup that suggests the model is explicitly matching audio to ground truth images.
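
For illustration, here is a minimal sketch of how this kind of leakage is usually avoided: split by video rather than by frame, so no video contributes frames to both train and test. Nothing here comes from the writeup. The `(video_id, frame_index)` layout, the function name `split_by_video`, and the parameters are all hypothetical.

```python
import random
from collections import defaultdict


def split_by_video(frames, test_fraction=0.2, seed=0):
    """Split frames into train/test so that no video appears in both.

    `frames` is a list of (video_id, frame_index) tuples; this layout is an
    assumption made for the example, not something taken from the writeup.
    """
    # Group frames by the video they came from.
    by_video = defaultdict(list)
    for video_id, frame_index in frames:
        by_video[video_id].append((video_id, frame_index))

    # Shuffle video IDs (not individual frames) with a fixed seed.
    video_ids = sorted(by_video)
    random.Random(seed).shuffle(video_ids)

    # Hold out whole videos for the test set.
    n_test = max(1, round(len(video_ids) * test_fraction))
    test_ids = set(video_ids[:n_test])

    train = [f for v in video_ids if v not in test_ids for f in by_video[v]]
    test = [f for v in video_ids if v in test_ids for f in by_video[v]]
    return train, test


# Example: 10 videos with 5 frames each.
frames = [(v, i) for v in range(10) for i in range(5)]
train, test = split_by_video(frames)
```

A random per-frame split would instead put near-duplicate neighboring frames on both sides, which is enough to reproduce the same sign in the same place without any explicit image matching.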
