Various pending goodies…
Load-Linked/Store-Conditional vs Compare-And-Swap
Pretty much all modern CPUs include some form of instruction for doing an atomic update – required for shared-memory multi-CPU machines (X86 has lots and lots!). There was a long period of debate in the CompSci community one what constituted the “minimal” needed instruction to do useful multi-CPU work. Eventually the community has decided that the Compare-And-Swap (CAS) instruction and the Load-Linked/Store-Conditional (LL/SC) instruction combo both are (1) sufficient to do useful work (“infinite concensus number”) and (2) relatively easy to implement in real hardware. X86’s, Sparc’s and Azul’s CPUs use CAS. IBM’s Power, Alpha, MIPS & ARM6 all use LL/SC. ARM5 only has an atomic-swap.
LL/SC and CAS are slightly different in how they work, leading to subtly different requirements on algorithms. With LL/SC, you first Load-Linked a word. The hardware marks the cache line as “linked”. You then manipulate the word (e.g. add 1 to implement a simple shared atomic counter). Finally you issue a Store-Conditional to the same address. If the cache line is still “linked”, the store happens. If not, then not. The line remains linked as long as it does not leave this CPU’s cache; e.g. no other CPU requests the line. Any attempt to take the line away causes SC failure (retry is up to the algorithm being implemented). “Weak” LL/SC implementations can easily lead to live-lock – if Load-Linked requests the cache line in exclusive mode (required to do the Store-Conditional), then each LL causes all other CPUs to lose their “link” – and their SC’s will fail. I suspect most modern implementations of LL do not request the line in exclusive mode – avoiding the obvious live-lock failure. The downside is that a simple uncontended LL/SC on a word not in cache requires 2 cache-miss-costs: the original miss on the load, and a second miss to upgrade the line to exclusive for the SC.
With CAS, you typical first load a word with a normal load, then manipulate it. Finally you issue the CAS which compares the memory contents with the original loaded value: only if they match does the swap happen updating memory. CAS can succeed if the original cache line leaves the CPU and returns holding the same value – this allows the classic ABA bug. In some cases, Garbage Collection can nicely side-step the ABA bug; you can never find an aliased copy of an “A” pointer unless all copies of A die first – including the one loaded before the CAS. Similar to LL/SC, there can be 2 cache-miss costs: one for the original load and again to upgrade the line for the CAS. Azul has a load-exclusive instruction to avoid with this – a plain load but the line is loaded exclusively. With CAS you can issue any other instructions between the original load and the CAS; typically with LL/SC there’s a small finite number of operations that can happen between the LL and the SC without losing the “link”. E.g., guarding a one-shot init function by atomically moving pointer from NULL to not-NULL: with LL/SC you must load the line before the SC; with CAS no separate load is needed (e.g. an “infinite distance” between the original load and the CAS).
These atomic instructions only need to guarantee atomicity of update, not ordering w.r.t. other memory operations. Nothing in the academic literature requires any sort of global ordering on these atomic instructions. Instead the usual academic assumption is that all memory operations are strongly ordered – which is obviously not true on all modern hardware. Practitioners are required to insert memory fences as needed to achieve the desired ordering. Nonetheless most implementations of CAS include a strong ordering: X86 and Sparc certainly do. Azul’s CAS does not, and this allows Azul to e.g. implement simple performance counters that do not drop updates and also do not force ordering w.r.t. to unrelated memory operations. _(As an experiment, try writing a multi-threaded program to increment a simple shared counter in a tight loop without locking. Report back the %-tage of dropped counts and the throughput rate. Then try it with an Atomic counter. My simple tests show with a handful of CPUs it’s easy to achieve a 99%+ drop-rate – which basically makes the counter utterly useless). _I am less familiar with the fence properties of common LL/SC implementations. Any of my gentle readers wish to report back on the situation for Power, MIPs, ARM, etc?
Some time ago I reported on a build tool I’ve been using. Currently ‘build’ is being used to build HotSpot internally to Azul. We’ve got it building 400+ files, plus build rules for building portions of the JDK, compiling w/gcc & javac and also ‘javah’, tar, jar, strip, binary signing, etc. In addition to the typical parallel-make functionality, it includes auto-discovery of c++ header files, intelligent load-balancing of big builds, etc. It’s blazingly fast, it’s easy to hack (being written in plain-olde Java). I hunted around awhile and couldn’t find a good place to dump a medium-large blob of code, so here’s a sample build.java in 4 parts:
A “HotSpot -server” aka C2 compiler IR visualizer
No I didn’t do it! This deserving grad student did it.
Here is the reference to the visualizer for the ideal graph of the HotSpot server compiler:
It is a JNLP application hosted on a university server (http://ssw.jku.at/General/Staff/TW/ has also a test file to download), but the easiest way to get a first impression. The source code of the tool is hosted onhttp://kenai.com/projects/igv The server compiler instrumentation is part of OpenJDK and also included in Sun’s weekly builds of Java 7.
More C2 Goodies
… One of the things I am currently looking at is determining the right phase-ordering of optimizations applied to a particular program being optimized. I have some nice (un-published) results for JikesRVM,
but it would be nice to replicate the research for HotSpot…
I have strong arguments for nearly all orderings of our passes, so I’m curious to know about your phase-ordering results.
The only obvious transform we might be missing is PRE, but our peephole opts pretty nearly subsumes PRE (they are not mathematically equivalent – I can write programs where either PRE or the peephole stuff makes progress against the other). In practice, PRE will find essentially nothing once the peephole opts are done. You’ll notice that we do the peephole pass alot; it’s totally incremental and provably linear bounded. In other words, if there’s nothing to be gained then there’s no cost in trying. The peephole pass includes amongst other things all pessimistic forms of copy propagation, value equivalence, constant propagation (especially the null/not-null property), constant test folding, repeated test folding, dead code elimination, load-after-store opts (aliasing analysis is included for free during building of SSA form), algebraic properties, and lots more.
For HotSpot, the optimization ordering is:
- Build an inline tree, including class hierarchy analysis. This is the one pass I’d be willing to move, as it happens too early.
- (a single unified pass:) parse bytecodes, inline, build SSA form (the IR remains in SSA form always), do peephole opts over SSA form. This pass typically costs 40% of compile-time budget.
- Fixed-point the peephole opts over SSA form, once all backedges are known after parsing.
- Loop opts round 1: “beautify loops” (force polite nesting of ill-structured loops), build a loop-tree & dominator tree, split-if (zipper-peel CFG diamonds with common tests, plus some local cloning where I can prove progress), peel loops (required to remove loop-invariant null checks)
- Fixed-point the peephole opts over SSA form
- Loop opts round 2: “beautify loops” (force polite nesting of ill-structured loops), build a loop-tree & dominator tree, lock coarsening, split-if & peel – but if these don’t trigger because there’s nothing to gain, the do iteration-splitting for range-check-elimination & a 1st loop unrolling.
- Fixed-point the peephole opts over SSA form
- Conditional Constant Propagation (the optimistic kind, instead of the pessimistic kind done by the peephole pass)
- Iterate loop (split-if, peel, lock coarsen – but these typically never trigger again and take very little time to check), unrolling & peephole passes, until loops are unrolled “enough”. On last pass, insert prefetches. Typically this iterates once or twice, unless this is a microbenchmark and then unrolling might happen 8 or 16 times.
- Remove tail-duplication, and a bunch of other minor code-shaping optimizations e.g. absorb constant inputs into deoptimization-info in calls, or commuting Add ops so that 2-address machines can do update-in-place.
- Convert “ideal” IR into machine code IR.
- Build a real CFG for the 1st time, including a dominator tree, loop tree. Populate with frequencies from earlier profiling.
- Global latency-aware (loop-structure-aware) scheduling.
- Replace null-checks with memory references where appropriate.
- Register allocate. Internally this has many passes. This pass typically costs 40% of compile-time budget.
- Sort basic blocks to get good control flow ordering (forward branches predicted not-taken, backwards predicted taken, etc)
- Some last-minute peephole opts.
- Emit code into a buffer, including OOP-maps & deoptimization info.
I got too much travel coming up!
Apr 30, May 1st – DaCapo in Boston. Favorite small group; GC & Java focused. Website: http://www.cs.tufts.edu/research/dacapo/. I had to turn down the invite to anSSA seminar in France because the dates conflicted. Very sad. I hope one of the attendees will post a trip report.
May 11-May 15. IFIP WG2.4 (International Federation of Information Processors, Working Group 2.4 – a really old European group with a random mix of industry & academia.) It’s a nice group to preview JavaOne talks, and always the meetings are a week-long rambling discussion in some quaint resort.
June 2-June 5. JavaOne. I owe slides for 3 talks by end of this week. ‘enuf said.
July 6-10th – ECOOP, as a paid-for invited speaker. A free vacation to Italy! 🙂