Hope this one day will be used for auto-tagging all video assets with time codes. The dream: being able to search for "running horse" and find the clip containing a running horse at 4m42s, somewhere in thousands of clips.
For anyone using Qwen3-VL: where are you running it? I had tons of reliability problems with Qwen3-VL inference providers on OpenRouter — based on uptime graphs I wasn’t alone. But when it worked, Qwen3-VL was pack-leading good at AI Vision stuff.
Not so relevant to the thread, but I've been uploading screenshots from Citrix GUIs and asking Qwen3-VL for the appropriate next action (e.g. a mouse click). While it knows what to click, it struggles to accurately return which pixel coordinates to click. Anyone know a way to get accurate pixel coordinates back?
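In case it helps: with the earlier Qwen-VL releases the grounding coordinates usually weren't in the original image's pixel space (as far as I recall, Qwen2-VL used a 0-1000 normalized space and Qwen2.5-VL reports coordinates relative to the resized image the processor saw). Assuming Qwen3-VL behaves similarly, a rescale like this is needed before clicking - the function and the coordinate space here are my assumptions, not anything official:

    # Sketch: map a click point predicted by the model back onto the real screenshot.
    # Assumption: the model reports (x, y) in a 0-1000 normalized space; if it instead
    # reports coordinates relative to a resized input image, pass that size as coord_space.
    def to_screen_coords(x, y, screen_w, screen_h, coord_space=(1000, 1000)):
        """Rescale model-space coordinates into real screen pixels."""
        space_w, space_h = coord_space
        return round(x * screen_w / space_w), round(y * screen_h / space_h)

    # e.g. the model says "click (512, 130)" and the Citrix screenshot is 1920x1080:
    px, py = to_screen_coords(512, 130, 1920, 1080)   # -> (983, 140)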
Does anyone else worry about this technology being used for Big Brother-type surveillance?
I was playing around with Qwen3-VL to parse PDFs - meaning, do some OCR data extraction from a reasonably well-formatted PDF report. It failed miserably, although I was using the 30B-A3B model instead of the larger one.
I like the Qwen models and use them for other tasks successfully. It is so interesting how LLMs will do quite well in one situation and quite badly in another.
I was using this for video understanding with inference from vlm.run's infra. It has definitely outperformed Gemini, which is generally much better than OpenAI or Claude on videos. The detailed extraction is pretty good. With agents you can also crop into a segment and do more operations on it. We'll have to see how the multimodal space progresses.
link to results: https://chat.vlm.run/c/82a33ebb-65f9-40f3-9691-bc674ef28b52
Quick demo: https://www.youtube.com/watch?v=78ErDBuqBEo
Insane if true... now I wonder if I could use it to go through some old dance routine video catalogue to recognize and write out individual move lists.
Still not great at the use cases I tested it for but Gemini isn't either. I think we're still very early on video comprehension.
I've used Qwen3-VL on deepwalker lately. All I can say is that this model is so underrated.
Anyone have a tl;dr on the best way to get the video comprehension stuff going? I use qwen-30b-vl all the time locally as my go-to model because it's just so insanely fast. Curious to mess with the video stuff; the vision comprehension works great and I use it for OCR and classification all the time.
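Not a complete answer, but the lowest-effort way I've seen is to skip native video input entirely: sample a handful of frames with OpenCV and send them as images to whatever OpenAI-compatible server you already run the model on (vLLM, llama.cpp server, etc.). Rough sketch below - the model id, endpoint and frame count are placeholders, and proper video input through the model's own processor will handle timing better than this:

    import base64
    import cv2
    from openai import OpenAI

    def sample_frames(path, n=8):
        """Grab n roughly evenly spaced frames and return them base64-encoded as JPEG."""
        cap = cv2.VideoCapture(path)
        total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        frames = []
        for i in range(n):
            cap.set(cv2.CAP_PROP_POS_FRAMES, int(i * total / n))
            ok, frame = cap.read()
            if not ok:
                continue
            ok, buf = cv2.imencode(".jpg", frame)
            if ok:
                frames.append(base64.b64encode(buf.tobytes()).decode())
        cap.release()
        return frames

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")  # local server
    content = [{"type": "text", "text": "Describe what happens in this clip."}]
    content += [{"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{f}"}}
                for f in sample_frames("clip.mp4")]
    resp = client.chat.completions.create(
        model="qwen3-vl-30b",  # placeholder id, use whatever your server exposes
        messages=[{"role": "user", "content": content}],
    )
    print(resp.choices[0].message.content)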
It's so weird how that works with transformers.
Finetuning an LLM "backbone" (if I understand correctly: a fully trained but not instruction-tuned LLM, usually a small one since it's mostly students doing the work) with OCR tokens beats just about every dedicated OCR network out there.
And it's not just OCR. Describing images, bounding boxes, audio (both ASR and TTS) - it all works better that way. Many research papers now are really only about how to encode image/audio/video so it can be fed into a Llama or Qwen model.
To me, this qualifies as some sort of ASI already.
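A toy sketch of the recipe described above (the module and dimensions are made up; real systems use something like a SigLIP/CLIP encoder feeding a Llama or Qwen backbone): a frozen encoder turns the image or audio into feature vectors, a small projector maps them into the LLM's embedding space, and the result is prepended to the text embeddings before the usual next-token loss.

    import torch
    import torch.nn as nn

    class ModalityAdapter(nn.Module):
        """Projects encoder features into the LLM's embedding space ('soft tokens')."""
        def __init__(self, enc_dim=1024, llm_dim=4096):
            super().__init__()
            # the "projector" is often just a small MLP
            self.proj = nn.Sequential(
                nn.Linear(enc_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
            )

        def forward(self, enc_feats, text_embeds):
            # enc_feats:   (batch, n_patches, enc_dim) from a frozen vision/audio encoder
            # text_embeds: (batch, n_tokens, llm_dim) from the LLM's embedding table
            soft_tokens = self.proj(enc_feats)
            # the concatenated sequence goes to the LLM via inputs_embeds; finetuning uses
            # the usual next-token cross-entropy on the text (e.g. the OCR transcript)
            return torch.cat([soft_tokens, text_embeds], dim=1)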
> The test works by inserting a semantically important "needle" frame at random positions in long videos, which the system must then find and analyze.
This seems somewhat unwise. Such an insertion would qualify as an anomaly, and if the model is also trained that way, aren't you just training it to find artificial frames where they don't belong?
Would it not have been better to find a set of videos where something specific (common, rare, surprising, etc) happens at some time and ask the model about that?
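For concreteness, here is roughly what that insertion step looks like as I read the description (file names are placeholders) - a single foreign frame spliced into the stream, which is exactly why it reads as an anomaly rather than a naturally occurring event:

    import random
    import cv2

    def insert_needle(video_path, needle_path, out_path):
        """Splice one needle frame into the video at a random spot; return its timestamp."""
        cap = cv2.VideoCapture(video_path)
        fps = cap.get(cv2.CAP_PROP_FPS)
        w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
        h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
        total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        needle = cv2.resize(cv2.imread(needle_path), (w, h))
        pos = random.randrange(total)  # ground-truth frame index

        out = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
        for i in range(total):
            ok, frame = cap.read()
            if not ok:
                break
            if i == pos:
                out.write(needle)      # the single inserted needle frame
            out.write(frame)
        cap.release()
        out.release()
        return pos / fps               # timestamp (seconds) the model is asked to find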