I don't think they meant literally cameras only can create reliable distance measures. At the risk of putting words in their mouth, I would guess they meant "cameras as the only input to a distance model". the "model" doing all the heavy lifting, covering the points that you quite rightly point out are needed
Several companies, most notably Tesla, have done this well enough to drive in all manner of traffic. I'm not going to comment about if lidar is strictly needed or not to achieve better-than-human safety, that's yet to be proven one way or another by anyone. The point is that cameras + local inference can do a pretty good job at distance estimation