logoalt Hacker News

bob1029today at 6:08 AM0 repliesview on HN

> Most AI runtimes can only execute one inference call to a model at a time.

I don't know if the LMAX article makes much sense in this space. The latency of calling the models dominates any sort of local effects and the volume of the requests tends to be very low relative to any kind of meaningful system limits.

More broadly, parallel agent systems are of questionable utility compared to ones that operate over the problem serially. Parallelism bringing any uplift implies that the effects of different parts of the problem don't interact much. If things do depend on other things over time, then you have something that can only be solved with one serial narrative.

The case of high frequency trading works really well because the shared resource (global sequence) is extremely contentious and local. You actually get meaningful benefits out of these principles. OpenClaw is not the same kind of problem. Running an actual inference model on the CPU is maybe a different story.