This blog is about some of my thinking that went into the “non-state-machine” parts of the Non-Blocking Hash Table. As with all engineering, “it depends”. What makes sense in one context can lose out in another situation. Your Mileage May Value, Caveat Emptor, etc.
Bruce Holt writes:
But two things give me the sh*ts! Non-prime table size, and linear probing.
I see how linear probing benefits your algorithm, especially as having a link field would expand your state diagram quite a lot. I guess you must run at a pretty low load factor – I’m working on very memory-constrained devices and want to keep my load factor high.
But what really makes me nervous is the non-prime table size. That means that if you’re getting a lot of collisions to the same bucket and chaining a lot then you still will after a resize – the only thing that will improve is that if you’re lucky those collisions will now be going to two buckets instead of to one.
Ok, the 2 decisions I assumed would give me the most flak. Let’s take on the link-field vs closed-table one.
First up: memory footprint. Assuming a prime-sized table with a 75% load factor, and each entry in the table has an entry object. Each object in the table has a 2-word object headers plus the memoized hash code, the link field, and of course the key & value. On with HotSpot you’ll round up the object size to a 2-word boundary. This means that for each Entry object you’ll use 2 (header)+ 1(hash)+ 1(link) + 2 (k/v) = 6 words. Plus also the table is 3/4 full and thus you need 4/3rds word per entry in the table itself – about 7.33 words per table entry.
For the closed power-of-2 table, I’m using a load factor of 25%. I get to skip out on the headers and link word. So I pay (1 (hash) + 2 (k/v))*4 = 12 words per entry; for a load factor of 50% it’s 6 words per entry. I waffled back and forth here quit a bit, eventually deciding for desktop and server machines the larger footprint was worth the better reprobe rate. But I can well imagine doing the reverse here on an embedded device – in that case the 6 words per entry are LOWER than the 7.33 words per entry of the prime table.
Next: reprobing costs. Break up the space of all hash tables into 4 categories: cold tables, those with bad hash functions (including ones with the same value for all Keys – don’t laugh, I’ve debugged customer performance problems that stemmed from this!), large tables that don’t fit in cache, small tables that fit in cache.
- cold tables: Ignore ‘em. Premature optimization.
- hot, but bad hash: No one is going to be expected-constant-time lookup here. The link-field version turns into a gruesome pointer-chasing game. The size-1 reprobe turns into a linear scan. For specific applications, you can make sure that you Don’t Care about this option because you can make sure all your hot tables are well-behaved. For a generic library that has to deal with all comers, including the losers, the linear scan is a better fall-back option.
- hot, Big table: The cost of the cache-miss dwarfs the cost of the probe function, be it MOD or AND. The cost of a miss & re-probe for the MOD is Yet-Another-Cache-Miss, but because MOD has good spread you rarely take this option. Cost estimate: 1.01 (reprobe rate)*(300 (cache miss) + 30 (MOD) + 3 (load/cmp/branch))=336. The cost of a stride-1 reprobe is a tiny cache-hitting load; you can afford to take lots of these. Cost estimate: 300 (cache miss) + 1.2 (reprobe rate) *(1 (AND) + 3 (load/cmp/branch)) = 305. _Cache-miss dominates and the reprobe fcn cost is ignorable._I pulled the 1.2 number out of my hat; in the Misty Past I achieved this reprobe rate on a power-of-2 table (25% load factor) for all the strings in the source code of a large C program. I totally SWAG’d the MOD reprobe rate. YMMV here, but if you have a decent hash function at all then the MOD-vs-AND question is ignorable.
- hot small table: The MOD cost dominates: 1.01*(30 (MOD) + 3 (ld/cmp/br)) = 33 vs 1.2 *(1 (AND) + 3 (ld/cmp/br)) = ~5.
So either the MOD cost dominates or is ignorable (cold table or constant-hash or cache-miss dominates).
The reprobing argument also applies to the question of what happens when you get local hotspots in the table. On average, such hotspots have not led to large reprobe counts (anecdotal evidence, my experience only), and a linear-reprobe can sustain lots of reprobes at low cost. The idea of excessive reprobing leading to a table-grow is a new experiment I’m trying out here. The problem here is that in a concurrent table computing the exact size is expensive and everybody has to agree exactly on when to regrow the table (lest somebody decide to grow the table instead of inserting in it and another thread decide to insert instead of grow).
Bruce Holt writes:
The thing is … with a hash table, the table stays the same size for quite a while, so what you’re taking the modulus with respect to stays the same for quite a while, so you can computer an integer reciprocal of the size when you resize the hash table, and then use the same reciprocal for a long time. So for a hash code of H and a table size of N (with reciprocal R = 2^32/N) you’re looking at:
bucket = H – ((RH)»32)N
bucket -= N if bucket >= N
OK, this is about ten times the cost of a simple AND, assuming you can grab just the upper half of a multiply, and cheap conditional execution (both of which I have on ARM).
What do you think? Did you consider and reject this?
Nope! This is an interesting idea as a way to get a cheaper mod. I have been using a different approach that gets me reprobe rates closer to MOD but with the cheap AND function: reprobe using a different value for each key, something that’s prime modulo a power of 2 (i.e., any odd value). I use the original key OR’d with 1. So if the first probe is at index 16, I’m reprobing with offsets of 17; if the first probe is at index 12345 I’m reprobing with 12345. Same bad locality as MOD on a reprobe but waaay cheaper to compute. Since every key gets a different reprobe chain you don’t get the local hotspots that a stride-1 probe does and thus get the better average reprobe rate.
Bruce Holt writes:
But .. you’re doing calls to Object.hashCode() before it, and Object.equals() after it so I’ve got to think the speed difference might not be that significant.
Object.hashCode() is well optimized in HotSpot (load the object header, some bit-fiddling & a branch). I assume it’s cheap on IBM’s and BEA’s JVMs as well. The equal’s call is guarded as heavily as I can (must have equal hashes, unequal key pointers, etc) but in the end everybody must call equals() and this is generally a virtual-call as well. On Azul hardware v-calls cost roughly as much as a AND-LD/CMP/BR reprobe cost. On other hardware v-calls cost roughly as much as a MOD-LD/CMP/BR reprobe.
My final bit of advice to you: Since you’re looking at the embedded space, I assume you have a closed application set and a finite number of hashtables. Inspect your hot hashtables for having good hashes, and use the closed power-of-2 table with a 50% load factor with more generous table-grow requirement so your table tolerates longer reprobe chains. The 50% load factor will use less memory than a prime-table-with-links, and with a good hash function you don’t need the better spread that MOD gives you. If the tables are small use my variant-reprobe-size trick, if the tables are large stick with the stride-1. And always, profile first!