We really need a replacement for all-MiniLM-L12-v2 that can create more robust embeddings with the same compute.
You can technically do Q4 quantization for larger embedding models but I am not sure if that plays nice with ONNX.
it's a pain in the ass to do properly.
what we really need is something like auto-round for ONNX
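
for anyone wondering what "Q4" boils down to, here is a rough sketch of plain block-wise 4-bit round-to-nearest quantization in pure Python. It is not ONNX code and not how auto-round works; as far as I understand it, auto-round tunes the rounding instead of just rounding to nearest, which is exactly the part that is missing here. The function names and the block size of 32 are made up for illustration.

```python
# Sketch only: symmetric block-wise 4-bit round-to-nearest (RTN) quantization.
# Pure Python, no ONNX involved. Helper names and block_size are illustrative
# assumptions, not part of any library.

def quantize_q4_block(block):
    """Map a list of floats to ints in [-8, 7] sharing one per-block scale."""
    max_abs = max(abs(x) for x in block)
    # Use 7 as the positive limit so the largest value maps to +/-7.
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    q = [max(-8, min(7, round(x / scale))) for x in block]
    return q, scale


def dequantize_q4_block(q, scale):
    """Recover approximate floats from the 4-bit integers and the scale."""
    return [v * scale for v in q]


def quantize_q4(weights, block_size=32):
    """Quantize a flat list of weights block by block."""
    return [
        quantize_q4_block(weights[i:i + block_size])
        for i in range(0, len(weights), block_size)
    ]


def dequantize_q4(blocks):
    """Concatenate the dequantized blocks back into one flat list."""
    out = []
    for q, scale in blocks:
        out.extend(dequantize_q4_block(q, scale))
    return out


def max_abs_error(original, restored):
    """Largest absolute difference between original and restored weights."""
    return max(abs(a - b) for a, b in zip(original, restored))
```

you would call `quantize_q4(weights)` on a flattened weight row and compare `dequantize_q4(...)` against the original with `max_abs_error`. With this scheme the per-weight error is bounded by half the block's scale, so one outlier in a block makes every other weight in that block coarser, which is one reason naive Q4 can hurt embedding quality.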