Recently I re-did HotSpot’s internal locking mechanism for Azul’s JVM. The old locking mechanism is approaching 15 years old and features a number of design decisions that are now outdated:
- Recursion counts are kept as a NULL word on the stack for every recursion depth (i.e., counting in base-1 math) in order to save a few instructions and a few bits of memory. Both are now in vast plentiful supply. On the first lock of an object, its header is moved into the stack word instead of a NULL, which means that GC or other locking threads (or threads installing a hash code) all need to find and update the header word – which can now be “displaced”. This mechanism is complex, racy and error-prone.
- The existing mechanism requires a strong memory fence after a Compare-And-Swap (CAS) op, but on most machines the CAS also includes a memory fence. I.e., HotSpot ends up fencing twice for each lock acquire: once to CAS the header, and again when moving the displaced header to the stack. Each memory fence costs about a cache-miss on most X86 CPUs.
- The existing mechanism uses “Thin Locks” to optimize for the very common case of a locked object never being contended. New in Java7, +UseBiasedLocking is on by default. This optimizes the common case even more by not using any fencing for locks which have never (yet) changed threads. (See this nice IBM paper on how to do it.) The downside in the OpenJDK implementation is that when an object DOES have to change thread-ownership, the cost is so high that Sun has chosen to disable biased locking for whole classes of locks to avoid future thread-ownership-change costs.
- When a lock does see contention it “inflates” and then the “inflated” lock is much more expensive than a fast-path “thin lock”. So even the smallest bit of contention will cause a lock to be much more expensive than the good case.
- JVM internal locks and locked Java objects use 2 utterly different code bases. This adds a lot of complexity to an already complex system. The two classes of locks are used in slightly different ways and do have different requirements, BUT they both fundamentally implement a fast-path locking protocol over the OS-provided locking abstraction. At Azul Systems, we found that these two locking systems have a lot more in common than in difference.
My new locking implementation has met a number of goals:
- No “displaced header”, ever
- Only one strong memory fence to lock contended locks instead of 2.
- The normal fast-path is as good as the +BiasedLocking case. Revoking a biased-lock is cheap enough to do it on a case-by-case basis.
- Inflation happens due to a one-time contention, but then low-contention or no-contention behavior quickly reverts back to a cost which is nearly as good as the normal fast-path. i.e., uncontended inflated locks pay only an extra cache-hitting indirection load.
- One code base for all locks
As with the OpenJDK, Azul’s new locks are implemented as some bits in every Object’s header word, plus an ObjectMonitor structure when the lock “inflates”. Internal VM locks use a plain Monitor which ObjectMonitor subclasses (actually most VM locks are a subclass of Monitor called Mutex which supports only plain lock actions; Monitors support wait & notify as well).
Within the 64-bit object header word (Azul uses only 1 header word instead of 2), we reserve 32 bits for locking & hashcodes. If all 32 bits are zero, the object is unlocked & un-hashed; new objects appear this way. If the low bit is set, the remaining 31 bits are a hashCode. If the low 2 bits are ‘00’ the object is BiasedLocked to a thread, and the remaining 30 bits are the thread ID of owning thread (we can compute the thread ID of a thread in 1 or 2 clock cycles). If the low 2 bits are ‘10’ the remaining 30 bits are the address of an ObjectMonitor structure; e.g. the low 32 bits of the header contains a C++ pointer to an ObjectMonitor, plus 2.
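The tag rules above can be sketched as a small decoder. This is a minimal C sketch with hypothetical names (`decode`, `lock_state` and friends are mine for illustration, not Azul’s source); it only applies the bit patterns described in the previous paragraph:

```c
#include <stdint.h>

/* Hypothetical decoder for the low 32 bits of the header word. */
enum lock_state { UNLOCKED_UNHASHED, HASHED, BIASED, MONITOR };

static enum lock_state decode(uint32_t bits, uint32_t *payload) {
    if (bits == 0) {            /* new objects: never locked, never hashed */
        *payload = 0;
        return UNLOCKED_UNHASHED;
    }
    if (bits & 1) {             /* low bit set: remaining 31 bits are the hashCode */
        *payload = bits >> 1;
        return HASHED;
    }
    if ((bits & 3) == 0) {      /* low 2 bits '00': 30-bit owning thread ID */
        *payload = bits >> 2;
        return BIASED;
    }
    /* low 2 bits '10': C++ pointer to an ObjectMonitor, plus 2 */
    *payload = bits - 2;        /* strip the tag to recover the pointer */
    return MONITOR;
}
```

Note the nice property that the common states (never-locked and biased-to-owner) are testable with a plain load and compare, no atomics required.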
The low 32-bit patterns of the object header are:
00000000 – The most common case is that an object is never locked and never hashed; its low 32 bits of header remain zero throughout its lifetime.
xxTIDx00 – The next most common case is that the object is locked but not hashed, and never changes owners. Many many Strings fall into this category. The first lock acquire will CAS the owning thread into the header, and the object is now Biased-Locked. Future lock attempts will simply do a load/compare/branch to confirm that the object remains biased-locked to the owning thread. Literally the code to do this is:
```
// R01 - holds object
// R02 - holds self-thread-ID
  ldu4  R03,[R01+4]               // load low 32-bits
  bne   R03,R02,slow-path-locking // thread-id bits do not match?
  // bits match, so we own the lock!
```
xxHASHx01 – The next case is that the object is hashed but not locked, and we are trying to get the hashCode:
```
// R01 - holds object
  ldu4  R02,[R01+4]    // load low 32-bits
  beq   R02,0,not_hash // zero header so no hash
  shr4  R02,1          // low-bit to carry flag
  bcc   not_hash       // "branch carry clear": if low bit was zero it was not hashed
  // R02 - holds 31 bits of hash
```
xxMONx10 – The next case is that the object requires an ObjectMonitor, either because it is both locked AND hashed, or because it was biased-locked once and saw contention. The monitor is a 32-bit pointer (so limited to the low 4Gig), but can be directly used. This snippet of code assumes we already failed the fastest path lock-acquire given above (the snippet is actually a called subroutine so the code size does not really matter):
```
// R01 - holds object
// R02 - holds self-thread-ID
// R03 - holds loaded header word which might be a monitor:
slow-path-locking:
  extract R04,R03,2         // test bit#1 for being set
  bne     R04,1,not_monitor // branch if we need to inflate the lock
  // Here we have code to check for self-biased in the monitor:
  ldu4    R04,[R03-2+owner_field_offset] // load ObjectMonitor.owner
  beq     R04,R02,lock_is_owned          // thread-id bits match?
  // and here we carry on with slow-path locking,
  // including testing for recursive locks and
  // attempting CAS acquire, short spin-waiting, etc.
```
And so on. If all the various fast-path attempts fail we eventually end up calling into the VM and blocking on the OS-provided mutex. Along the way we’ll attempt a bunch of other tests (e.g. recursion & spin-waiting), and if all things fail and we need to block, we’ll also do a fast/short crawl of the call stack for profiling. There are a lot of other interesting issues here, including heuristics on when to make a lock not-biased by default (producer/consumer design patterns make objects which always change hands at least once), when to deflate an inflated lock that has “settled” into a new thread (or unlocked state), or how to achieve a modicum of fairness under extreme contention. Since this blog is already getting long (and with raw assembly code to boot!) I’ll stick with a short discussion of the bias-lock revoke mechanism.
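The layering of the fast paths can be sketched in C. Everything below is hypothetical naming – a single-threaded model of the tag checks and the one biasing CAS, not the real VM code, which also handles hashing, monitors, recursion, spinning and OS blocking:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of an object's low 32 header bits. */
typedef struct { _Atomic uint32_t lock_bits; } object_hdr;

/* self_tid is assumed to fit in 30 bits; stored shifted so the
 * low 2 bits stay '00' (the biased-lock tag). */
static bool try_fast_lock(object_hdr *o, uint32_t self_tid) {
    uint32_t bits = atomic_load_explicit(&o->lock_bits, memory_order_acquire);
    if (bits == (self_tid << 2))
        return true;            /* biased to us: load/compare/branch only */
    if (bits == 0) {            /* never locked or hashed: one CAS biases it */
        uint32_t expect = 0;
        return atomic_compare_exchange_strong(&o->lock_bits, &expect,
                                              self_tid << 2);
    }
    return false;               /* hashed, other owner, or monitor: slow path */
}
```

The point of the design shows up in the first branch: after the one-time biasing CAS, every later acquire by the owner is fence-free.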
Revoking a Biased Lock
Suppose thread T1 holds the lock on Object O – via the very fast-path mechanism: O’s header word contains the bits “xxT1xx00”, and that thread T2 needs to acquire the lock on O. What kind of interactions are possible? For starters, assume T2 has failed the fast-path attempts (it’s got the wrong thread ID!) and already entered the slower-paths in the VM. A simple inspection shows T2 that O is biased-locked to T1. What does T1 know about this situation? Turns out that thread T1 knows almost nothing:
- T1 may have acquired O’s lock in the misty past and may not now have access to a pointer to O.
- T1 may be rapidly acquiring and releasing the lock on O, all without fencing or atomic operations. Moment by moment, cycle by cycle, the actual Java state for T1 may have it holding & releasing O’s lock. T1 isn’t really tracking this, he is just periodically observing that he holds the bias lock on O.
- T1 may be blocked in long latency I/O operations or on other locks.
- T1 may be a dead exited thread.
- The id for thread T1 may have been recycled to a new thread, T1’, which has never seen a pointer to O cross its JVM state, ever. Yet the lock for O is now biased to T1’!
What T2 can do is to ask T1 to reach a safepoint – a point in the code which matches some Java bytecode – and then decide if T1 really holds the lock on O or not. T2 does this by setting a self-service task on T1; T1 periodically polls for such tasks at safepoints, and when it sees the request it breaks out of its normal execution and handles the task (the periodic poll costs typically 1 or 2 cycles every few thousand instructions). T2 then waits for T1 to “do something” with the lock (but not really: what if thread T1 has already exited?).
Suppose T1 is still running and spots the poll request. It then does a self-stack-crawl to count the number of times it’s supposed to hold the lock. JIT’d code includes annotations to describe what locks are held (and how often, if recursive), and the interpreter counts lock acquires in any case (with slow simple base-1 counting, but we don’t care about speed in the interpreter). If this lock count is positive (T1 really holds the lock now!) T1 inflates the lock, slaps the recursion count into the new ObjectMonitor struct, and goes back to running… but now each unlock by T1 will lower the recursion count. When the lock is unlocked for the last time, T1 does the usual wakeup on T2 and T2 then competes to grab the now-free lock. If this lock count is zero (T1 holds the lock biased but not for-real) then it releases the lock and notifies T2 as normal. In any case, the bias on O is revoked and the lock reverts to a state where T2 is allowed to compete for ownership.
This is all great if T1 is actively running and polling for self-service tasks, but what if T1 is blocked or exited? After T2 prods T1 with the poll request, T2 then attempts to do the very same self-service task on T1’s stack remotely. To do this T2 has to acquire the “lock” on T1’s stack. All Azul threads unlock their stack whenever they block (or otherwise stop polling) – this same stack-lock mechanism is used by GC threads. If T2 can get T1’s stack-lock, then T2 can safely crawl T1’s stack (and T1 is not allowed to execute random Java code until he gets his own stack-lock back). If T2 canNOT get T1’s stack-lock, it must be because T1 is busy executing Java code – and hence T1 is also polling for self-service tasks and will (shortly) get around to crawling his own stack. In any case either T1 or T2 will shortly get around to crawling T1’s stack and discovering the proper lock recursion count for Object O.
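The hand-off above – whoever holds T1’s stack-lock does the crawl, never both at once – can be sketched with a single CAS. The types and names here are hypothetical illustrations, not Azul’s actual stack-lock code:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-thread state: a thread holds its own stack-lock
 * while running Java code, and releases it whenever it blocks. */
typedef struct {
    _Atomic bool stack_locked;
} thread_state;

/* T2's side of the revoke. If T1's stack is unlocked (T1 is blocked,
 * not polling, or exited), T2 takes the stack-lock and crawls remotely.
 * If the CAS fails, T1 is busy executing Java code - and therefore
 * polling for self-service tasks - so T1 will crawl its own stack. */
static bool try_remote_revoke(thread_state *t1) {
    bool expect = false;
    if (atomic_compare_exchange_strong(&t1->stack_locked, &expect, true)) {
        /* ... crawl T1's stack here: count held locks, inflate if > 0 ... */
        atomic_store(&t1->stack_locked, false);
        return true;            /* T2 did the work */
    }
    return false;               /* T1 will do it at its next safepoint poll */
}
```

Either way exactly one thread crawls T1’s stack, which is why the protocol never needs to stop T1 at an arbitrary instruction.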
And if T1 is exited, T2 can also detect this from T1’s old thread-id (thread-id’s map back to some type-stable per-thread data). In this case, T2 can just freely revoke the bias on O and bias O to himself.
Well, that’s enough for now! Hope you enjoyed this (very long overly complex) discussion of our biased-locking implementation.