AsmJit has only one place where a lot of time is spent - bin-packing. It's the least optimized ...

Asm2D • last Thursday at 8:44 PM • 1 reply • view on HN

AsmJit has only one place where a lot of time is spent - bin-packing. It's the least optimized part, which has quadratic complexity (at the moment), which starts to show when you have like hundreds of thousands of virtual registers. There is even a benchmark in AsmJit called `asmjit_bench_regalloc`, which shows that a single function that has 16MB alone, with 65k labels and 200k virtual registers takes 2.2 seconds to generate (and 40ms of that is time to just call `emit()`).

If this function is optimized, or switched to some other implementation when there is tens of thousands of virtual registers, you would get orders of magnitude faster compilation.

But realistically, which query requires tens of megabytes of machine code? These are pathological cases. For example we are talking about 25ms when it comes to a single function having 1MB of machine code, and sub-ms time when you generate tens of KB of machine code.

So from my perspective the ability to generate SIMD code that the CPU would execute fast in inner loops is much more valuable than anything else. Any workload, which is CPU-bound just deserves this. The question is how much the CPU bound the workload is. I would imagine databases like postgres would be more memory-bound if you are processing huge rows and accessing only a very tiny part of each row - that's why columnar databases are so popular, but of course they have different problems.

I worked on one project, which tried to deal with this by using buckets and hashing in a way that there would be 16 buckets, and each column would get into one of these, to make the columns closer to each other, so the query engine needs to load only buckets used in the query. But we are talking about gigabytes of RAW throughput per core in this case.

Replies

vladich • last Thursday at 8:53 PM

I have a test of 200Kb query that AsmJit takes 7 seconds to compile (that's not too bad both LLVM and MIR take ~20s), while sljit does it in 50ms. 200Kb is a pathological case, but it's not unheard of in the area I'm working on. It's realistic, although a rare case. Last 10-15 years most OLTP workloads became CPU bound, because active datasets of most real databases fully fit in memory. There are exceptions, of course.

➕ show 1 reply

alt Hacker News

Replies