Machine vision has always been resource-intensive... and if you are doing trained ML projects, the hardware choices are actually very limited.
To enable Intel TBB, CUDA, and CPU-specific compiler optimizations... you will almost certainly need to rebuild the library and customize your application build.
Some tasks degrade in performance on a GPU, and others are 740 times faster... YMMV. =3
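
Since results swing that far in both directions, the only safe approach is to time each task on your own hardware. Here is a minimal sketch of that, using only the Python standard library. `best_time`, `speedup`, and the two stand-in workloads are hypothetical names for illustration; in practice you would swap in your own CPU and GPU code paths.

```python
import time


def best_time(fn, repeats=5):
    """Return the fastest wall-clock time in seconds over `repeats` calls of fn()."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def speedup(baseline, candidate, repeats=5):
    """Return baseline time / candidate time.

    A value above 1 means the candidate is faster; below 1 means it is slower.
    """
    # One untimed warm-up call each, so one-time setup cost is not counted.
    baseline()
    candidate()
    t_base = best_time(baseline, repeats)
    # Guard against a zero reading on very coarse timers.
    t_cand = max(best_time(candidate, repeats), 1e-12)
    return t_base / t_cand


if __name__ == "__main__":
    # Hypothetical stand-ins for a CPU path and an accelerated path.
    def cpu_path():
        return sum(i * i for i in range(200_000))

    def accelerated_path():
        return sum(range(1_000))

    print(f"speedup: {speedup(cpu_path, accelerated_path):.1f}x")
```

For a real GPU comparison, the timed callable should include the host-to-device copies and wait for the device to finish. GPU launches are typically asynchronous, so without that wait the number looks far better than it is, and small tasks that spend most of their time on transfers are exactly the ones that come out slower.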