[Part 2 : ‘put’ of my NonBlockingHashMap presentation will appear later]
I presented my NonBlockingHashMap at Google yesterday. Overall, the talk went over very well with lots of intelligent questions. This bodes well for presenting this stuff at JavaOne. To be blunt, I was very concerned that I’d be talking over most folks’ heads at JavaOne, but maybe not.
As part of the Q&A somebody asked me about a narrow race that happens when installing a new table. The HashTable auto-resizes, so periodically a new table has to be installed. During the install process many threads can independently decide that a table-resize is required, and then all go create a new empty table and attempt to install it. Only one succeeds in the install (via a CAS); the rest just drop their extra tables on the floor and let GC reclaim them. I had supposed that the extra tables created here never amounted to very much: the race seems very narrow and is clearly very rare for modest-sized tables (based on a bunch of profiling). And I said as much during Q&A.
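The racy install step can be sketched roughly like this (names are hypothetical, not the real NonBlockingHashMap code): every racing thread allocates a new table, one CAS wins, and the losers' tables are simply left for GC.

```java
import java.util.concurrent.atomic.AtomicReference;

class TableInstaller {
    // Null until some thread installs a new table.
    final AtomicReference<Object[]> _newtable = new AtomicReference<>();

    Object[] resize(int oldLen) {
        Object[] t = new Object[oldLen << 1]; // every racer pays the allocate-and-zero cost
        if (_newtable.compareAndSet(null, t)) // one winner installs its table...
            return t;
        return _newtable.get();               // ...losers drop theirs on the floor for GC
    }
}
```

For small tables the wasted allocations in the losing threads are cheap, which is why the race looked harmless under profiling of modest-sized tables.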
But for high thread & CPU counts (500 to 750+) and very large tables (4 million entries, times 2 words (Key, Value), times 8 bytes (64-bit VM): a 64 MByte array) I had observed some odd timing problems which I suspected were GC related. So I profiled a little more… and Lo! The Google engineer was right (sorry I didn’t get your name!)… a 64 MByte array takes some time to create, because it has to be zero’d. During that time more threads figure out that the table needs resizing, so they also make 64 MByte arrays. Together they start triggering various slow-path GC issues (emergency heap-growth, large-object allocation space, GC decisions, etc.), which gives more time for still more threads to discover that a table resize is needed. Pretty soon I’m making a few hundred 64 MByte arrays, all of which are dead (except the one winner), and the GC suffers a major hiccup.
My fix goes against the spirit of Non-Blocking algorithms: I stall all but the first few threads making huge arrays. The stalled threads are basically waiting to see if a new table shows up. The algorithm is still non-blocking in the end: if the proposed new table doesn’t show up the stalled threads eventually get around to making one themselves. But by stalling a little I nearly always avoid the problem; threads that have “promised” to make a new array are in fact busy making one – I just need to give them a little more time.
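A sketch of that stall-on-resize fix (the thresholds here are made-up for illustration, not the real ones): late-arriving resizers with a huge table briefly wait for a promised table to show up, but will eventually allocate one themselves, so the algorithm stays non-blocking in the end.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

class StallingInstaller {
    final AtomicReference<long[]> _newtable = new AtomicReference<>();
    final AtomicInteger _resizers = new AtomicInteger(); // count of racing resizers

    long[] resize(int newLen) {
        int racer = _resizers.getAndIncrement();
        if (racer >= 2 && newLen >= (1 << 20)) { // not among the first few, and the array is huge:
            // stall a little; somebody has "promised" a table, give them time to deliver
            for (int i = 0; i < 100 && _newtable.get() == null; i++) {
                try { Thread.sleep(1); } catch (InterruptedException ignore) { }
            }
        }
        long[] t = _newtable.get();
        if (t != null) return t;                 // the promised table showed up
        t = new long[newLen];                    // no table appeared: make (and zero) our own
        return _newtable.compareAndSet(null, t) ? t : _newtable.get();
    }
}
```

The bounded wait is why this is only *against the spirit* of non-blocking: no thread waits forever on another, it just politely yields the allocation work to the first few racers.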
This issue points out one of the interesting discrepancies between Java and C. In C, I could malloc a few hundred 64 MByte arrays relatively quickly: they would all end up mmap’ing virtual memory but NOT physical memory. Then when I freed the extras I’d get all that virtual memory back, without ever having touched any physical memory. Once a winner was determined, I could initialize that one 64 MByte array in parallel. In Java, I only get pre-initialized arrays, and the zeroing is normally done by a single thread. It takes substantial time to zero a 64 MByte array.
I could also go with ArrayLets in Java: basically an array of arrays. For a 4M-element array I’d actually make a single 2048-element outer array and zero only that. The 2048-element inner arrays would get lazily created on demand. This would let me spread out the cost of zero’ing both over time and across threads. Biggest downside: every HashTable access would require an extra indirection (and range check) to get at the real elements.
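The ArrayLet idea might look something like this sketch (chunk size and names are illustrative assumptions): only the small outer array is zeroed eagerly, and each inner chunk is created, and zeroed, by whichever thread first touches it.

```java
import java.util.concurrent.atomic.AtomicReferenceArray;

class Arraylet {
    static final int CHUNK_BITS = 11;           // 2048 elements per inner chunk
    static final int CHUNK = 1 << CHUNK_BITS;
    final AtomicReferenceArray<long[]> _chunks; // the outer array: the only part zeroed up front

    Arraylet(int len) {
        _chunks = new AtomicReferenceArray<>((len + CHUNK - 1) >> CHUNK_BITS);
    }

    // The extra indirection every access pays: find (or lazily create) the chunk.
    long[] chunk(int idx) {
        int c = idx >> CHUNK_BITS;
        long[] inner = _chunks.get(c);
        if (inner == null) {
            inner = new long[CHUNK];                       // lazily create (and zero) the chunk
            if (!_chunks.compareAndSet(c, null, inner))
                inner = _chunks.get(c);                    // lost the race; use the winner's chunk
        }
        return inner;
    }

    long get(int idx)          { return chunk(idx)[idx & (CHUNK - 1)]; }
    void set(int idx, long v)  { chunk(idx)[idx & (CHUNK - 1)] = v; }
}
```

The CAS on the chunk slot keeps lazy creation race-safe, but every `get`/`set` now does two array loads and two range checks instead of one, which is exactly the downside noted above.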
At the moment, I’m going to live with my sleazy stall-on-resize-of-huge-arrays. It appears to work really well, and it’s really simple and fast.
Moral: Peer review is Good, and I’ll be a little less smug about saying ‘that race is really rare’ in the future!