Here are some food-for-thought questions:
First, a correctness question:
- if a struct contains padding, the values of those padding bytes are unspecified. What happens if you use such a struct as a key?
And then some integration questions:
- how do you make this typesafe against mixing up two distinct uses of hashmaps in the application?
- how do you make this typesafe for its keys and values, i.e. remove accident-prone casts on read access?
- how do you avoid calling the compare and hash functions through function pointers? (Calling through function pointers is both performance-relevant and a target for exploitation by overwriting the pointer.)
(None of these have "nice" answers. But thinking about them is IMHO extremely valuable, since these are among the most challenging aspects of organizing and engineering an actual codebase.)
A bug, I believe: If you "put" three colliding strings A and then B and then C, and then "delete" B, you won't be able to find C anymore.
Thanks for sharing. Simplicity is a good thing.
In order to speed it up by reducing the number of malloc() calls, it may be worth adding a simple arena allocation scheme: allocate one larger block (e.g. 1 MB) initially, then double the size each time you run out. All malloc()/calloc() calls can then become local salmagundi_alloc() calls: macro invocations that return a pointer into the arena buffer and bump that pointer as a side effect.
I also recommend you have a look at Chris Hanson's book "C: Interfaces and Implementations", which has a few neat C API tricks that your code could benefit from (e.g. for reducing name space pollution, for avoiding null pointer argument errors, API method naming etc.).
Line 100 of hm_put() clears the entire item struct, thereby losing the pointer to where the previous value was stored. But line 107 needs that pointer to pass it to realloc().
Additionally, in the overwrite case, the memory previously allocated for the key is leaked.
All in all, I don't think line 100 is really needed, and removing it wouldn't do any harm.
Finally, deletes are hard in hash maps. Either you decide that deletes are not allowed, or you implement them with tombstones or re-insertions. A half-baked solution is not an option.
Hope these suggestions help you.
Considering there seems to be at least one memory bug (hm_put/key from another comment), I would strongly recommend running the tests in valgrind or similar if you haven't already. Doing C without it usually ends in some kind of disaster for me.
I'm curious about the #ifdef guard, which is of a form I've never encountered.
#ifndef BD9DF82A4540BB19368E48E4747C0706
#define BD9DF82A4540BB19368E48E4747C0706
Does some IDE do that for you? Did you just pick a random sequence? Why not just a variant of #ifndef _salmagundi_h that is more common?

Other than that, I strongly echo others' advice about avoiding so many mallocs.
This is neat, and remarkably small. I personally need more comments to follow, say, the growth logic. I see that it rehashes, but I don't see how it avoids potentially overwriting entries along the way.
Is the algorithm for hash collisions just to find the next empty slot (linear probing)? What happens when the original entry is removed? Are the entries in subsequent slots shifted backwards? I don't see that...
Also the name is new to me! TIL 'salmagundi' is an English word (not even a loan-word) for a type of salad made of many ingredients: an excellent name for a hash map that can contain many effectively random items.
My C is quite rusty, so apologies for stupid questions.
In the hm_put function, when you overwrite, why do you malloc and copy the key again, and what happens to the old key pointer and the old value pointer? (There's no free for the key, and the memset zeros everything?)
In the code example, I guess the variable `k_sz` is undefined in the call `hm_get(map, k, k_sz)`. Hope that helps.
In reviewing hm_put, a few things stand out in how allocation failures are handled. The first problem is at lines 113-116, where memory is allocated and then written to with memcpy before the allocations are validated at line 117; an allocation failure would cause a crash. The second is more about semantics: at line 105, item->v is freed after it is found to be NULL, which to me is counterintuitive and unnecessary. Reading further (and back to the first area of concern), at lines 118-119 both pointers are freed after an allocation failure, which to me is acceptable and convenient. In light of that, it seems line 105 is there for the sake of a reviewer who might not like the apparent asymmetry of the two sites.
I want to thank everyone here who pointed out deficiencies in this little library.
Good catches :)
I followed the link in the docs to the page on the hash function. There's a DJB-2 algorithm on that page, but no DJB-1.
If you’re interested in this, I wrote a blog post on a simple C hash table without deletion.
Purely out of interest, and probably my bad, but what is:
(void)k_sz;
doing? I've seen fussy people doing: (void) printf( "hello" );
but not what you have written. As I say, probably ignorance on my part.

I suggest improving the probing a little. Something like
idx = (idx + 1) % map->cap;

becomes

iterCount++;
idx = (idx + iterCount) % map->cap;
This web site is amazing. We have, on one hand, people explaining machine sentience and quantum breakthroughs, and on the other hand, we have people refusing to use "#pragma once" in C because it's still too novel and doesn't work in every compiler yet.
Never change, HN.
I recently coded a linear-probing hash map in C, and I highly recommend using fuzz tests. They evolve naturally from unit tests: unit tests, then table-driven tests, then property tests, then fuzzing. Every step is incremental, hence easy.
Having a fuzz test was in-va-lu-ab-le. I caught several bugs right there. The delete function was particularly tricky.
I ended up with two fuzz tests, as my hash table has a key feature: convergence. Given the same contents, the buffer holds exactly the same bits; in other words, it is independent of insertion/deletion order. For this, I added another fuzz test, and I would add a third if I found an important invariant I had not fuzz-tested yet. That is not much work, but it is so useful!
https://github.com/gritzko/librdx/blob/master/abc/fuzz/HASH....
https://github.com/gritzko/librdx/blob/master/abc/fuzz/HASHd...
P.S. In your case, I recommend clang's libfuzzer with ASAN.
https://llvm.org/docs/LibFuzzer.html#fuzzer-usage