Hacker News

saltcured, yesterday at 5:44 PM (1 reply)

In theory, a computer should be able to do the same. It could do sensor fusion with even more sensory modalities than we have. It could have an array of cameras and potentially outdo our stereo vision, or perhaps even use some light-field magic to (virtually) analyze the same scene along multiple optical paths.

However, there is also a lot of interaction between our perceptual system and cognition. Just for depth perception, we're doing a lot of temporal analysis. We track moving objects and infer distance from assumptions about scale and object permanence. We don't just repeatedly make depth maps from 2D imagery.
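To make the "infer distance from assumptions about scale" point concrete, here is a minimal sketch using the standard pinhole-camera relation Z = f * H / h. The function name, the focal length, and the assumed 1.5 m car height are all hypothetical values chosen for illustration, not anything a real driving stack is known to use.

```python
def distance_from_known_size(focal_length_px: float,
                             real_height_m: float,
                             image_height_px: float) -> float:
    """Estimate distance to an object of assumed real-world height.

    Pinhole model: an object of height H at distance Z projects to
    h = f * H / Z pixels, so Z = f * H / h.
    """
    if image_height_px <= 0:
        raise ValueError("image height must be positive")
    return focal_length_px * real_height_m / image_height_px


if __name__ == "__main__":
    # Hypothetical numbers: 1000 px focal length, car assumed to be 1.5 m tall.
    # As the tracked car shrinks in the image, the inferred distance grows.
    for h_px in (150.0, 100.0, 75.0):
        z = distance_from_known_size(1000.0, 1.5, h_px)
        print(f"{h_px:.0f} px tall -> roughly {z:.1f} m away")
```

The estimate is only as good as the scale assumption: if the object is really a toy car or a truck, the inferred distance is wrong, which is part of why this cue has to be combined with tracking over time rather than used frame by frame.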

The brute-force approach is something like training vision-language models (VLMs). For example, you could train on lots of movies and learn to predict "what happens next" in the visual world.

But compared to LLMs, VLMs face a bigger gap between the model and the application domain. It may seem like LLMs are being applied to lots of domains, but most are just tiny variations on the same task of "writing what comes next", which is exactly what they were trained on. Unfortunately, driving is not "painting what comes next" in the same way as all these LLM writing hacks. There is still a big gap between that predictive layer and the planning and execution that driving requires. Our giant corpus of movies does not really provide ready-made training data for going after those bigger problems.


Replies

dcrazy, yesterday at 8:43 PM

Putting your point another way, in order to replicate an average human driver’s competence you would need to make several strong advancements in the state of the art in computer vision _and_ digital optics.