ajhai • 04/23/2025

It is inference latency most of the time. These VLA models take in an image + state + text and spit out a set of joint angle deltas.

Depending on the model being used, we may get just one set of joint angle deltas or a series of them. To complete a task, we need to capture images from the cameras and read the current joint angles, then send them to the model along with the task text to get the joint angle changes to apply. Once the joint angles are updated, we need to check whether the task is complete (this signal can come from the model too). We run this loop until the task is complete.
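
Roughly, the loop looks like the sketch below. This is only an illustration: `Observation`, `run_task`, and the callback names are made up for this example, and the policy signature (returning a list of delta sets plus a done flag) is an assumption, not the API of any particular VLA model.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Observation:
    image: bytes               # camera frame (placeholder type for this sketch)
    joint_angles: List[float]  # current joint angles


# Hypothetical policy signature: (observation, task text) -> (delta sets, done flag).
# A model that returns a single set of deltas would return a list of length 1.
Policy = Callable[[Observation, str], Tuple[List[List[float]], bool]]


def run_task(
    policy: Policy,
    read_observation: Callable[[], Observation],
    apply_joint_angles: Callable[[List[float]], None],
    task_text: str,
    max_steps: int = 100,
) -> int:
    """Run the observe -> infer -> act loop; return the number of model calls."""
    for step in range(1, max_steps + 1):
        # 1. Capture images and current joint angles.
        obs = read_observation()
        # 2. One (slow) inference call per loop iteration.
        delta_sets, done = policy(obs, task_text)
        # 3. Apply each set of deltas in order.
        angles = list(obs.joint_angles)
        for deltas in delta_sets:
            angles = [a + d for a, d in zip(angles, deltas)]
            apply_joint_angles(angles)
        # 4. Stop when the task is reported complete (here, by the model).
        if done:
            return step
    raise TimeoutError("task not completed within max_steps")
```

Every pass through the loop pays for one inference call, so the number of passes times the model latency is a lower bound on how long the task takes.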

Combine this with the motion planning that has to happen to make sure the joint angles we are getting are safe and do not result in collisions with the surroundings, and you get overall slowness.
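
To see how the pieces add up, here is a back-of-the-envelope sketch. The function name and all the timings are invented for illustration (not measurements), and it assumes planning runs once per model call and the stages run one after another with no overlap.

```python
def control_rate_hz(
    capture_s: float,
    inference_s: float,
    planning_s: float,
    actuation_s: float,
    actions_per_call: int = 1,
) -> float:
    """Joint-angle updates per second for one serial loop iteration.

    Assumes capture, inference, and planning happen once per model call,
    and actuation happens once per returned set of deltas.
    """
    per_call_s = capture_s + inference_s + planning_s + actions_per_call * actuation_s
    return actions_per_call / per_call_s


# Illustrative numbers only: 20 ms capture, 200 ms inference,
# 50 ms planning, 30 ms to apply one set of joint angles.
single = control_rate_hz(0.02, 0.2, 0.05, 0.03, actions_per_call=1)
chunked = control_rate_hz(0.02, 0.2, 0.05, 0.03, actions_per_call=8)
print(f"one delta set per call:    {single:.1f} updates/s")
print(f"eight delta sets per call: {chunked:.1f} updates/s")
```

With numbers like these, inference dominates each iteration, and a model that returns a series of deltas spreads that cost over several updates.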
