14× faster embeddings: how we rebuilt the ONNX path in Manticore

47 points • by snikolaev • today at 3:49 AM • 7 comments

Comments

Unlike GPUs, CPUs aren't designed for massive parallelism, so batching inference won't necessarily give you a speed boost here. It can even slow things down.
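A quick way to check this on your own hardware is to time the same workload at several batch sizes rather than assume. A minimal sketch: `embed_fn` is a hypothetical stand-in for whatever actually runs the model (it is not from the article), and the helper name is made up for illustration.

```python
import time
from typing import Callable, Dict, List, Sequence


def time_batch_sizes(
    embed_fn: Callable[[List[str]], object],  # hypothetical: runs the model on one batch
    texts: List[str],
    batch_sizes: Sequence[int],
    repeats: int = 3,
) -> Dict[int, float]:
    """Return throughput (texts per second) for each batch size."""
    results: Dict[int, float] = {}
    for size in batch_sizes:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            # Feed the whole workload through the model in chunks of `size`.
            for i in range(0, len(texts), size):
                embed_fn(texts[i:i + size])
            # Keep the fastest run to reduce noise from other processes.
            best = min(best, time.perf_counter() - start)
        # Guard against a zero elapsed time on very fast stand-ins.
        results[size] = len(texts) / max(best, 1e-9)
    return results
```

If throughput at batch size 1 matches or beats the larger sizes, batching is not buying anything on that machine.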

Instead, I'd recommend exploring CPU-specific AI optimizations. For instance, using AVX512_BF16 instructions could cut inference time by a factor of two to three compared to the results in the article. OpenVINO supports this really well on Intel CPUs, and converting an ONNX model to OpenVINO is straightforward.
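Roughly what that looks like, as a sketch rather than a tuned setup. It assumes Linux (reading `/proc/cpuinfo`), a recent `openvino` Python package (2023.1 or later), and a hypothetical `model.onnx` path; the helper names are made up for illustration.

```python
from typing import Set


def parse_cpu_flags(cpuinfo_text: str) -> Set[str]:
    """Extract the CPU feature flags from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def has_avx512_bf16(cpuinfo_text: str) -> bool:
    """True if the CPU advertises the AVX512_BF16 extension."""
    return "avx512_bf16" in parse_cpu_flags(cpuinfo_text)


def compile_onnx_with_openvino(onnx_path: str, use_bf16: bool):
    """Convert an ONNX model and compile it for the CPU device."""
    # Imported lazily so the flag helpers above work without OpenVINO installed.
    import openvino as ov

    model = ov.convert_model(onnx_path)
    # Ask the CPU plugin for bf16 execution only when the hardware supports it.
    precision = "bf16" if use_bf16 else "f32"
    return ov.Core().compile_model(
        model, "CPU", {"INFERENCE_PRECISION_HINT": precision}
    )


# Usage (assumes Linux and a hypothetical model file):
# with open("/proc/cpuinfo") as f:
#     compiled = compile_onnx_with_openvino("model.onnx", has_avx512_bf16(f.read()))
```

Whether bf16 is accurate enough for a given embedding model is something to verify by comparing outputs against the f32 run.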


minimaxir • today at 6:38 AM

We really need a replacement for all-MiniLM-L12-v2 that can create more robust embeddings with the same compute.

You can technically apply Q4 quantization to larger embedding models, but I'm not sure that plays nicely with ONNX.
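For anyone unfamiliar with the term, Q4 means storing weights as 4-bit integers plus a scale. A toy sketch of symmetric blockwise 4-bit quantization follows; this is a generic illustration with made-up function names and an assumed block size of 32, not the storage format of ONNX Runtime or any particular library.

```python
from typing import List, Tuple

Block = Tuple[float, List[int]]  # (scale, 4-bit codes in [-8, 7])


def quantize_q4(weights: List[float], block_size: int = 32) -> List[Block]:
    """Symmetric 4-bit quantization with one float scale per block."""
    blocks: List[Block] = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        max_abs = max(abs(w) for w in block)
        # Map the largest magnitude in the block to code 7; avoid a zero scale.
        scale = max_abs / 7.0 if max_abs > 0 else 1.0
        codes = [max(-8, min(7, round(w / scale))) for w in block]
        blocks.append((scale, codes))
    return blocks


def dequantize_q4(blocks: List[Block]) -> List[float]:
    """Reconstruct approximate weights from scales and codes."""
    return [scale * code for scale, codes in blocks for code in codes]
```

Each reconstructed weight is off by at most half of its block's scale, which is why a single outlier in a block hurts the precision of every other weight in that block.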


electroglyph • today at 6:01 AM

ONNX is my first suggestion to people looking for speed gains on CPU.
