Thanks for the feedback. I have already replied in another thread about O_DSYNC, which a lot of folks suggested, so I will not repeat it here.
As for the benchmark results, they were mainly due to metadata management. We implemented our own KV store (see the internals here [1]), which is more efficient than ext4 namespace management, even after very aggressive fs tuning for the ext4 setup [2] (plus 65536-way sharding at each directory level).
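To make the ext4 baseline a bit more concrete, here is a rough sketch of what 65536-way sharding per directory level could look like. The hash function, the two-level depth, and all names (`sharded_path`, `LEVELS`, `FANOUT_HEX_CHARS`) are assumptions for illustration only and are not taken from the linked commit.

```python
import hashlib
from pathlib import PurePosixPath

FANOUT_HEX_CHARS = 4  # 4 hex chars = 16**4 = 65536 shards per level
LEVELS = 2            # assumption: the actual depth is not stated above


def sharded_path(root: str, key: str) -> PurePosixPath:
    """Map an object key to a path under `root` with 65536-way fan-out per level."""
    # Hash the key so objects spread evenly across shard directories.
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    # Consecutive 4-char slices of the digest name the directory at each level.
    shards = [
        digest[i * FANOUT_HEX_CHARS:(i + 1) * FANOUT_HEX_CHARS]
        for i in range(LEVELS)
    ]
    # The full digest is used as the file name to avoid collisions within a shard.
    return PurePosixPath(root, *shards, digest)


if __name__ == "__main__":
    print(sharded_path("/data", "bucket/object.txt"))
```

Even with this kind of fan-out, every lookup still walks one directory per level, which is the namespace cost being compared against the KV store.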
[1] https://fractalbits.com/blog/metadata-engine-for-our-object-...
[2] https://github.com/fractalbits-labs/fractalbits/commit/12109...