When I use Codex/Claude to complete a computer vision task, such as extracting assets from an i...

leoncos • last Saturday at 8:09 AM • 10 replies

When I use Codex/Claude to complete a computer vision task, such as extracting assets from an image, OpenCV is their default solution. However, I believe that using YOLO and other methods is outdated. The best solution now is to directly use Nano Banana or other AI image models. A paper has proven that image generation models can perform most CV tasks well. I believe the new OpenCV should become a wrapper for VLM or AI image models.

Replies

nicolailolansen • today at 7:18 AM

Once you can run a model like Nano Banana or another vision LLM within the same compute and time constraints as an OpenCV or YOLO call, you can make that comparison. Until then, I would not call YOLO and OpenCV outdated; that's simply wrong. There's a time and place for big V-LLMs, just as there is a time and place for more "traditional" computer vision methods.

wongarsu • today at 8:06 AM

I can get great results from a YOLO model with 30M to maybe 300M params. To get decent CV from an LLM, 8B params is the absolute minimum, and closer to 30B for interesting tasks.

I might be on board with LLMs being the future of OCR (though many would disagree), but for general CV they are very inefficient for very limited benefit.

regularfry • today at 7:34 AM

I've built hardware with a Pi Zero 2 + Pi Cam running a mildly fine-tuned YOLO doing local-only object detection as a USB-OTG device. This was a use case where any off-device API calls would have been totally unacceptable, and where the object detection was part of the human interaction loop, with a hard ceiling of 300ms on the total interaction time, of which the object detection was only one process among many.

We're not going to fit Nano Banana or anything like it on a device with 512MB RAM and a GPU old enough to be irrelevant, and again, API calls just aren't on the menu.

mirsadm • today at 7:29 AM

That is a very uninformed view. Real time CV is not going to be doing that anytime soon.

sebmellen • today at 8:17 AM

Great, let me know when those models can run on-server and process/analyze streams of ID images with less than 100ms of latency. You'll need to make sure you have a massive set of training data, including all manner of slightly blurred and slightly distorted ID cards.

serf • last Saturday at 1:28 PM

do you realize how many edge or unconnected nodes do OpenCV work?

some SBC w/ an industrial camera that is doing pick-and-place or go/no-go operations on a conveyor belt against a single object type doesn't need a huge image-gen/LLM model governing it.

I mean, have you even considered the kind of performance an OpenCV function can get w/ just mask-matching? Even with a fancy YOLO model, these answers get thrown out in 1.5-50ms; this is just a wholly different time scale.
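
To make the "mask-matching" point concrete, here is a minimal sketch of exhaustive template matching by sum of absolute differences, written in plain Python so it runs without dependencies. It assumes "mask-matching" means sliding a known part template over a frame, which the comment does not spell out; the function name `match_template_sad`, the toy frame, and the go/no-go threshold are all hypothetical. A real system would more likely call an optimized routine such as OpenCV's `cv2.matchTemplate`.

```python
def match_template_sad(image, template):
    """Slide `template` over `image` (both lists of rows of grayscale ints).

    Returns (row, col, score) for the window with the lowest sum of
    absolute differences; a score of 0 means a pixel-exact match.
    """
    ih, iw = len(image), len(image[0])
    th, tw = len(template), len(template[0])
    best = None
    for r in range(ih - th + 1):
        for c in range(iw - tw + 1):
            score = sum(
                abs(image[r + dy][c + dx] - template[dy][dx])
                for dy in range(th)
                for dx in range(tw)
            )
            # Strict "<" keeps the first window found at the lowest score.
            if best is None or score < best[2]:
                best = (r, c, score)
    return best


# Toy 6x6 frame (hypothetical data): a bright 2x2 part at row 3, column 2.
frame = [[0] * 6 for _ in range(6)]
for y in (3, 4):
    for x in (2, 3):
        frame[y][x] = 255

part = [[255, 255], [255, 255]]
best_match = match_template_sad(frame, part)

# Hypothetical go/no-go rule: accept only near-exact matches.
MAX_SCORE = 100
is_go = best_match[2] <= MAX_SCORE
print(best_match, "GO" if is_go else "NO-GO")
```

The work per frame is fixed and small (window count times template size), which is why this family of methods fits a single-object conveyor check on modest hardware.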

Qhemlomo • today at 9:48 AM

100,000 pictures take a lot of time with LLMs.

It's a lot better, faster, and cheaper to use LLMs for initial labeling together with hand fine-tuning, and then train YOLO on the result.

Training YOLO takes a few hours and is then very fast.
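
A rough back-of-the-envelope sketch of that batch cost, assuming strictly sequential processing. The 100,000-image count and the 1.5-50ms detector range come from comments in this thread; the 2 seconds per image for a hosted model is purely an illustrative assumption, not a measurement, and `batch_hours` is a hypothetical helper.

```python
def batch_hours(num_images, seconds_per_image):
    """Wall-clock hours to process a batch one image at a time."""
    return num_images * seconds_per_image / 3600


NUM_IMAGES = 100_000  # figure quoted in the thread

# Seconds per image. The detector values are the 1.5-50 ms range quoted
# in the thread; the hosted-model value is an ASSUMED placeholder.
scenarios = {
    "detector, 1.5 ms/image": 0.0015,
    "detector, 50 ms/image": 0.050,
    "hosted model, 2 s/image (assumed)": 2.0,
}

estimates = {
    name: batch_hours(NUM_IMAGES, seconds)
    for name, seconds in scenarios.items()
}

for name, hours in estimates.items():
    print(f"{name}: {hours:.2f} h")
```

Under these assumptions the detector finishes in minutes to under two hours, while the hosted model needs more than two days of sequential calls. Parallel requests would shorten the wall-clock time but not the total compute or cost.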

_the_inflator • today at 11:16 AM

"When I use..."

Dude, in business we think in terms of large numbers; internationally, that easily means processing images billions of times. This wouldn't cut it.

Also, do you buy the mega-expensive, individually designed shoes from the best shoemaker there is to march through some dirt, or do you simply stick to gumboots?

OpenCV is used behind the scenes for much of the fancy stuff those major AI providers pretend to do. Claude is a huge system and not just an LLM anymore.

kryptiskt • today at 8:04 AM

If I want to identify and measure the size of round things in my orange sorter machine, I shouldn't have to resort to an unnecessarily complicated solution just because some AI bros can't understand that not everything needs to be an AI model.

Like, the AI model tools already exist. All that would be accomplished if OpenCV pivoted would be to take it away from people who want to do low-level vision programming. It wouldn't add anything useful to the world, just destroy an excellent library.
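
As an illustration of how little machinery the orange-sorter case needs, here is a minimal dependency-free sketch that finds connected blobs in a binary mask and reports each blob's area and equivalent diameter (the diameter of a circle with the same area). It assumes the image has already been thresholded into a 0/1 mask; `measure_blobs`, `disc_mask`, and the test discs are hypothetical names and data. In practice one would more likely reach for OpenCV routines such as `cv2.findContours` or `cv2.HoughCircles` rather than hand-rolled loops.

```python
import math
from collections import deque


def measure_blobs(mask):
    """Return [(area_px, equivalent_diameter_px), ...] for each
    4-connected blob of nonzero pixels, in row-major discovery order."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    results = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill to count this blob's pixels.
            area = 0
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                area += 1
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            # Diameter of the circle whose area equals the pixel count.
            results.append((area, 2 * math.sqrt(area / math.pi)))
    return results


def disc_mask(size, discs):
    """Build a size x size 0/1 mask containing filled discs (cy, cx, radius)."""
    mask = [[0] * size for _ in range(size)]
    for cy, cx, radius in discs:
        for y in range(size):
            for x in range(size):
                if (y - cy) ** 2 + (x - cx) ** 2 <= radius ** 2:
                    mask[y][x] = 1
    return mask


# Two synthetic "oranges" with radii 10 px and 6 px (hypothetical data).
mask = disc_mask(64, [(16, 16, 10), (45, 45, 6)])
blobs = measure_blobs(mask)
for area, diameter in blobs:
    print(f"area={area} px, equivalent diameter={diameter:.1f} px")
```

Converting pixels to millimetres would take one calibration constant for a fixed camera, which is the kind of low-level control the comment argues should stay available.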

TZubiri • today at 7:22 AM

I am confused: how can functions that output images help with functions that should take images as input?
