
Qwen3-VL can scan two-hour videos and pinpoint nearly every detail

170 points by thm | last Sunday at 7:27 AM | 56 comments

Comments

coppsilgold | today at 2:22 AM

> The test works by inserting a semantically important "needle" frame at random positions in long videos, which the system must then find and analyze.

This seems somewhat unwise. Such an insertion would qualify as an anomaly. And if the model is also trained that way, wouldn't you just be training it to find artificial frames that don't belong?

Would it not have been better to find a set of videos where something specific (common, rare, surprising, etc.) happens at a known time and ask the model about that?
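
A crude version of the insertion step helps illustrate the concern (a sketch only; the benchmark's actual pipeline isn't described beyond the quoted sentence, and the needle image and frame handling here are made up). The spliced frame shares no temporal continuity or compression history with its neighbours, so it can be detectable as an anomaly regardless of its content:

    import random
    import cv2

    def insert_needle(video_path, needle_path, out_path):
        # Splice one "needle" frame into a random position and keep the
        # ground-truth timestamp. Illustrative sketch, not the benchmark's code.
        cap = cv2.VideoCapture(video_path)
        fps = cap.get(cv2.CAP_PROP_FPS)
        w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
        h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
        total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        needle = cv2.resize(cv2.imread(needle_path), (w, h))

        insert_at = random.randrange(total)
        out = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
        for i in range(total):
            ok, frame = cap.read()
            if not ok:
                break
            if i == insert_at:
                out.write(needle)  # the artificial frame the model must later locate
            out.write(frame)
        out.release()
        cap.release()
        return insert_at / fps  # ground-truth time of the needle, in seconds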

mikae1 | today at 1:07 AM

Hope this will one day be used for auto-tagging all video assets with time codes. The dream: search for "running horse" and find the clip that contains a running horse at 4m42s, somewhere in a library of thousands of clips.

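A rough sketch of what that could look like, assuming you caption fixed-length segments with a VLM and search the captions afterwards; caption_segment is a placeholder for whatever Qwen3-VL or Gemini call you would actually use:

    # Sketch: caption every 10 s segment, keep (clip, timestamp, caption) rows,
    # then even a plain substring search gives "running horse -> clip X @ 4m42s".
    from dataclasses import dataclass

    @dataclass
    class Tag:
        clip: str
        start_s: float
        caption: str

    def index_clip(clip_path, duration_s, caption_segment, window_s=10.0):
        # caption_segment(clip_path, start_s, end_s) -> str is a placeholder
        # for the actual VLM call.
        tags = []
        t = 0.0
        while t < duration_s:
            caption = caption_segment(clip_path, t, min(t + window_s, duration_s))
            tags.append(Tag(clip_path, t, caption))
            t += window_s
        return tags

    def search(tags, query):
        q = query.lower()
        return [(t.clip, t.start_s) for t in tags if q in t.caption.lower()]

In practice you would likely swap the substring match for embedding search, but the shape of the index stays the same.
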
re5i5tor | today at 5:47 AM

For anyone using Qwen3-VL: where are you running it? I had tons of reliability problems with Qwen3-VL inference providers on OpenRouter — based on uptime graphs I wasn’t alone. But when it worked, Qwen3-VL was pack-leading good at AI Vision stuff.

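For reference, the basic call is just the OpenAI-compatible API pointed at OpenRouter; a minimal sketch (the model slug is a guess and should be checked against OpenRouter's catalog, and per-provider reliability is a separate problem this doesn't solve):

    from openai import OpenAI

    client = OpenAI(
        base_url="https://openrouter.ai/api/v1",
        api_key="sk-or-...",  # your OpenRouter key
    )

    resp = client.chat.completions.create(
        model="qwen/qwen3-vl-235b-a22b-instruct",  # slug is a guess; verify it
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
            ],
        }],
    )
    print(resp.choices[0].message.content)
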
chhxdjsj | today at 3:37 AM

Not so relevant to the thread, but I've been uploading screenshots from Citrix GUIs and asking Qwen3-VL for the appropriate next action, e.g. a mouse click. While it knows what to click, it struggles to return accurate pixel coordinates for the click. Anyone know a way to get accurate pixel coordinates back?

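One common gotcha (hedging here, since the convention has differed across Qwen-VL releases): the model answers in its own coordinate space, either a 0-1000 normalized grid or the pixel space of the processor-resized image, so the returned point has to be mapped back to the original screenshot resolution before clicking. Something like:

    def to_screen_coords(x, y, orig_w, orig_h,
                         model_w=None, model_h=None, normalized_1000=True):
        # Map a point returned by the model back to the original screenshot.
        # Assumption: coords are either on a 0-1000 grid (normalized_1000=True)
        # or in the resized-image space the processor fed the model
        # (pass model_w/model_h in that case).
        if normalized_1000:
            return x / 1000 * orig_w, y / 1000 * orig_h
        return x / model_w * orig_w, y / model_h * orig_h

    # e.g. model says (512, 387) for a 2560x1440 screenshot on a 0-1000 grid:
    print(to_screen_coords(512, 387, 2560, 1440))  # -> (1310.72, 557.28)

Heavy downscaling by the image processor also costs precision, which may be part of what you're seeing.
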
djmips | yesterday at 10:19 PM

Does anyone else worry about this technology being used for Big Brother-style surveillance?

clusterhacks | today at 1:45 AM

I was playing around with Qwen3-VL to parse PDFs, meaning some OCR data extraction from a reasonably well-formatted PDF report. It failed miserably, although I was using the 30B-A3B model instead of the larger one.

I like the Qwen models and use them for other tasks successfully. It is so interesting how LLMs will do quite well in one situation and quite badly in another.

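For what it's worth, these models see pages as images, so the usual pattern is render-then-prompt; a minimal sketch (pdf2image needs poppler installed, and ask_qwen is a placeholder for whatever inference path you use):

    from pdf2image import convert_from_path

    PROMPT = ("Extract all text from this report page as markdown. "
              "Preserve tables as markdown tables.")

    def pdf_to_markdown(pdf_path, ask_qwen, dpi=200):
        # ask_qwen(image, prompt) -> str is a placeholder for your VLM call
        # (local transformers, vLLM, an API endpoint, ...).
        pages = convert_from_path(pdf_path, dpi=dpi)  # list of PIL images, one per page
        return "\n\n".join(ask_qwen(page, PROMPT) for page in pages)

A too-low render DPI is a common silent failure mode for this kind of extraction, worth ruling out before blaming the model.
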
visioninmyblood | yesterday at 10:13 PM

I was using this for video understanding with inference from vlm.run infra. It has definitely outperformed Gemini, which is generally much better than OpenAI or Claude on videos. The detailed extraction is pretty good. With agents you can also crop into a segment and do more operations on it. Have to see how the multimodal space progresses.

link to results: https://chat.vlm.run/c/82a33ebb-65f9-40f3-9691-bc674ef28b52

Quick demo: https://www.youtube.com/watch?v=78ErDBuqBEo

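The segment-cropping step mentioned above doesn't need anything exotic; a generic sketch (plain ffmpeg, not vlm.run's actual agent tooling) would be:

    import subprocess

    def cut_segment(src, start, end, dst):
        # Cut [start, end] out of src with a stream copy (fast, but cuts snap
        # to keyframes), then re-prompt the VLM on the much shorter clip.
        subprocess.run(
            ["ffmpeg", "-y", "-i", src, "-ss", start, "-to", end, "-c", "copy", dst],
            check=True,
        )

    cut_segment("match.mp4", "00:04:30", "00:05:10", "segment.mp4")
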
eurekin | yesterday at 10:57 PM

Insane if true... now I wonder if I could use it to go through an old dance routine video catalogue to recognize and write out the individual move lists.

CSMastermind | today at 4:57 AM

Still not great at the use cases I tested it for, but Gemini isn't either. I think we're still very early on in video comprehension.

m00dy | today at 5:49 AM

I've used Qwen3-VL on deepwalker [0] lately. All I can say is that this model is so underrated.

[0]: https://deepwalker.xyz

thot_experiment | last Sunday at 9:22 AM

Anyone have a tl;dr for me on the best way to get the video comprehension stuff going? I use qwen-30b-vl all the time locally as my go-to model because it's just so insanely fast. Curious to mess with the video stuff; the vision comprehension works great and I use it for OCR and classification all the time.

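Roughly: video goes through the same chat-template path as images, with qwen-vl-utils handling frame sampling. A sketch based on the Qwen2.5-VL model-card recipe (Qwen3-VL needs recent transformers and qwen-vl-utils versions, and the checkpoint id below should be double-checked against the model card):

    # pip install -U transformers accelerate qwen-vl-utils
    from transformers import AutoModelForImageTextToText, AutoProcessor
    from qwen_vl_utils import process_vision_info

    model_id = "Qwen/Qwen3-VL-30B-A3B-Instruct"  # verify the exact repo name
    model = AutoModelForImageTextToText.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto"
    )
    processor = AutoProcessor.from_pretrained(model_id)

    messages = [{
        "role": "user",
        "content": [
            {"type": "video", "video": "file:///path/to/clip.mp4"},
            {"type": "text", "text": "Summarize this video and list key events with timestamps."},
        ],
    }]

    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                       padding=True, return_tensors="pt").to(model.device)

    out = model.generate(**inputs, max_new_tokens=512)
    trimmed = [o[len(i):] for i, o in zip(inputs.input_ids, out)]
    print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
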
spwa4 | yesterday at 10:22 PM

It's so weird how that works with transformers.

Fine-tuning an LLM "backbone" (if I understand correctly: a fully trained but not instruction-tuned LLM, usually a small one because it's students doing the training) with OCR tokens bests just about every dedicated OCR network out there.

And it's not just OCR. Describing images. Bounding boxes. Audio, both ASR and TTS: it all works better that way. Many research papers now are really only about how to encode image/audio/video so it can be fed into a Llama or Qwen model.

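The encoding piece is conceptually small, which may be part of why the approach generalizes so well; a toy sketch of the pattern (dimensions and module interfaces are made up, and real models such as LLaVA or Qwen-VL differ in the details): the only genuinely new trainable part is a projector mapping vision features into the LLM's embedding space.

    import torch
    import torch.nn as nn

    class ToyVLM(nn.Module):
        # Toy illustration of the pattern: a vision encoder's patch features are
        # projected into the LLM's embedding space and spliced into the token
        # sequence. Not any specific model's architecture.
        def __init__(self, llm, vision_encoder, vision_dim=1024, llm_dim=4096):
            super().__init__()
            self.llm = llm                # pretrained decoder-only LM (HF-style interface assumed)
            self.vision = vision_encoder  # e.g. a ViT, often kept frozen
            self.proj = nn.Sequential(    # the small newly trained "adapter"
                nn.Linear(vision_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
            )

        def forward(self, pixel_values, input_ids):
            patches = self.vision(pixel_values)              # (B, N_patches, vision_dim)
            vis_embeds = self.proj(patches)                  # (B, N_patches, llm_dim)
            txt_embeds = self.llm.get_input_embeddings()(input_ids)
            inputs_embeds = torch.cat([vis_embeds, txt_embeds], dim=1)
            return self.llm(inputs_embeds=inputs_embeds)     # trained with the usual LM loss
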
moralestapia | yesterday at 9:37 PM

To me, this already qualifies as some sort of ASI.