2011 FCRC Trip Report
Or how I went to FCRC and all you got was this lousy blog…
This year’s travel report is really amazing!!!! I drove from my house to downtown, about 20 minutes away. I parked. I walked up the stairs. End of travel report. 🙂
As usual for these trip reports, I only go to a selection of talks, and I tend to brutally summarize, and I tend to summarize more and more as the conference runs on and on and my brain gets full. In this case I spent a good day at ISMM, then a full day at EVAL – which was really fun, but a workshop and I didn’t have time to write many notes.
Here are the titles in some sort of “Cliff found interesting” order.
No links to the talk reviews, because I can’t figure out how to make WordPress do internal links.
Keynote, Memory Management – Very nice talk from Christos
NUMA multi-core – correct throughput is loads/stores over time
GC object scanning: discover that 6 ref patterns cover 90% objects
Picking better data structures – neat
Object placement & cache – big speedups to spread out hot fields
GC for haskell – thread-local heaps
Cooperative GC on over-committed servers
Panel discussion, just my predictions
RECON – A better data-race finder & EXPLAINER
CSmith – random compiler test cases.
GC on a GPU
Programmer picks when to do approximate data – strange
SC-OK x-forms for data-race-free programs (but no programs are DRF!)
LLVM x-forms – no good, just as hard to write the prover as the compiler
LRU-MRU cache, yuck needs pre-profiling run
parallel trace-jit compilation – boring, HS done for 10yrs
PRE w/profile data for placement – boring, HS done for 15yrs
Memory Management Beyond free()
Goal: manage a critical resource w/out burdening the developer or adding overhead.
60’s: DRAM capacity & the complexity of overlays led to virtual-memory management in the OS (TLB’s).
90’s: DRAM capacity & the complexity of (de)allocation led to automatic garbage collection.
Requires: automatic & low overhead
But also: memory latency & bandwidth. Locality. New functionality.
You can buy bandwidth but not latency.
You can bribe Intel but not God (speed of light).
Memory is now the major differentiating factor.
True for cell-phones, servers, and data-centers.
64-bit FP op: 1ns latency, 20pJ energy.
local memory: 100ns latency, 20nJ energy (100x slower, 1000x more energy)
Interconnect has much of the latency & energy costs.
In the past you could count on caching, and on freq&power scaling.
Abstraction mismatch – the issues relate to the physical world, not the virtual one.
Programmers can express “what” but not “how” – but “how” counts for a lot.
Look at the problem from a systems/hardware guy’s point of view.
Data Center Scale
– 100k sq ft, 100k servers, 10MWatt to 100MWatt power, cooling.
– Used for search, games, mail, social networks, cloud…
– Sensitive to performance, reliability, cost
– The overhead of cooling, distribution, etc… is about 10%.
– Constrained by energy consumption by the servers themselves
Many apps use DRAM for distributed storage.
Primary motivation: low latency. NO outliers. 99% case is really good.
User queries involve 100’s of accesses with bad locality.
To meet 0.1sec human response time: cannot go to disk more than 10 times.
Other parts of the service (outside of disk) are getting faster, e.g. the network takes micro-secs.
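The arithmetic behind that disk budget is simple; a quick sketch with my own round illustrative numbers (not the speaker’s):

```python
# Back-of-the-envelope latency budget for one interactive query.
# All numbers here are my own round illustrative figures.
budget_ms = 100       # ~0.1s human response-time budget
disk_seek_ms = 10     # ~10ms per random disk access
dram_access_ns = 100  # ~100ns per local DRAM access

max_disk = budget_ms // disk_seek_ms
print(max_disk)                           # 10 serialized disk accesses, tops
print(400 * dram_access_ns / 1e6, "ms")   # hundreds of DRAM hits cost ~0.04ms
```

Which is why 100’s of accesses per query have to land in DRAM, not on disk.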
Search: mostly read, can use flash instead of disk.
Social Networking: 10’s of TBs in memory (Facebook: 150TB). Sharded and replicated across many servers. Read/write (unlike search). Now seeing people use large in-memory file-systems striped across many machines (RAMCloud@stanford, Sinfonia@HP).
What part of this should be automatically managed?
Degree of Sharding & Replication? Depends on desired latency, availability, server size, QoS…
Sharding function? Degree of batching? When should accesses be batched, grouped, and handled sequentially?
All these issues are managed manually: tedious, error-prone, and critical to performance.
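As one illustration of a manually-managed piece: a sharding function is often just consistent hashing. A minimal sketch, entirely my own (server names and vnode count are invented):

```python
import hashlib
from bisect import bisect

# Minimal consistent-hashing shard picker (illustrative sketch only).
# Each server gets several virtual points on a hash ring; a key is
# routed to the first virtual point clockwise from the key's own hash.
def build_ring(servers, vnodes=64):
    ring = []
    for s in servers:
        for v in range(vnodes):
            h = int(hashlib.md5(f"{s}#{v}".encode()).hexdigest(), 16)
            ring.append((h, s))
    ring.sort()
    return ring

def shard_for(ring, key):
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    i = bisect(ring, (h,)) % len(ring)   # wrap around the ring
    return ring[i][1]

ring = build_ring(["mem0", "mem1", "mem2"])
print(shard_for(ring, "user:12345"))     # same key always maps to same server
```

Adding or removing a server only remaps the keys near its virtual points, which is why this shape shows up so often in these in-memory stores.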
Bandwidth in data centers is typically UNDER utilized. Large scale analytics and search use 1% to 5% of bandwidth (but maybe 99% of CPU). Totally latency bound – over network typically. Could use Cell-phone memory! Cell-phone memory has about the same latency and capacity as server memory, but uses about 1/2 the energy and 1/2 the bandwidth. Can stack cell-phone memories (form factor wasn’t designed for stacking) similar to memory module sticks. May still want some high-bandwidth high-power memory.
Another place for automated memory management: which data goes in low-power low-bandwidth places, and vice-versa.
Multicore Memory Management
Can we do something with MapReduce?
Lots of parallelism. 32-core, 256 hw threads. (Sparc T2)
Local memory is 3x faster than remote memory.
Performance: scales up well to 64 cpus, then crashes hard above 64 cpus.
Dies from lack of locality & running out of interconnect bandwidth.
Fix: hand-tune much of the memory placement. Then you avoid the performance “crash” – throughput flatlines as you add more CPUs instead of falling.
What can we do automatically?
Can do locality-aware tasks & locality-aware task-stealing.
Worth 2x to 3x speedup.
Huge improvements in reducing bandwidth.
Conclusion: task-aware scheduling is worth tons
Memory Management in NUMA Multicore Systems
Trapped between Cache Contention and Interconnect bandwidth
Assume many CPUs, few worker threads, much shared cache, NUMA memory layout.
Can have all workers local to memory – but then the other half of the CPUs are idle.
Can have 1/2 the workers local & 1/2 remote – but then you pay in interconnect bandwidth.
CLIFF NOTES: “correct” throughput metric for an OS scheduler to use is not cycles/ticks allocated to a thread, but load/store instructions executed. Figure on average loads & stores will slow down if running cache-remote and run faster if closer to local memory – for the same worker thread. OS can bounce the worker thread around & measure average instructions/tick, and maybe figure out a better CPU mapping for each thread.
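That placement loop might look something like the following, though this is purely my speculation about the mechanism; `OBSERVED` stands in for a hypothetical perf-counter read of memory-ops-per-tick:

```python
# Sketch of the "loads+stores per tick" throughput metric from the notes.
# A real OS would read perf counters while bouncing the thread between
# CPUs; OBSERVED fakes those hypothetical measurements.
OBSERVED = {
    ("worker", "cpu_local"): 1.9,    # near its memory: mem-ops retire fast
    ("worker", "cpu_remote"): 0.7,   # remote: loads/stores stall more
}

def best_cpu(thread, cpus):
    # Keep the placement with the highest memory-ops-per-tick rate.
    return max(cpus, key=lambda c: OBSERVED[(thread, c)])

print(best_cpu("worker", ["cpu_local", "cpu_remote"]))  # cpu_local
```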
A Comprehensive Evaluation of Object Scanning Techniques
Visiting nodes in a directed graph. Transitive closure over a large graph. Describe the heap: a collection of references. Really only care about whether each word is a pointer or not. Object layout is in control of the runtime, so object layouts are routinely similar across object classes. Inheritance constrains layout (e.g. Sable: pointers added on left, non-pointers on right).
Measuring heap composition: live objects (vs allocated objects). Piggy-back on the collector during GC scan. Collected the 32 most common bitmasks of references. Observe: 6 patterns account for over 90% of all objects in all heaps. Indeed 2 patterns account for over 50% of all objects. So lots of room for optimizing certain common cases.
– Inlining common case
– Compiled vs interpreted. Specialize a method to scan each type. Invoke via virtual dispatch or Big Switch. Non-specialized ones are interpreted.
– Encoding & packing meta-data: Want to encode the few common cases, perhaps in the object header. Hotspot: layout helper word. He is proposing stealing bits from the klass ptr (which we already did with the KID work!)
– Indirection to metadata (zero if in header, 1 if in vtable, 2 for kid-to-klass-to-data, etc.)
Can save maybe 10% to 20% with specialized scanning.
Removing 1 level of indirection to metadata only saved 5%
Using bitmap vs a list of offsets is worth another 10%
Using 3 bits in the header for the most common 7 patterns (and 1 for fall-back) worth 20% increase in scanning speed.
CLIFF NOTES: For HotSpot, Azul’s work with a perl script for inlining might be further refined to make special methods for the hot 6 layout patterns.
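To make the header-bits trick concrete, here is a sketch of pattern-specialized scanning; the 3-bit encoding is from the talk, but the patterns and the data model are invented:

```python
# Sketch of pattern-specialized object scanning. An object's header
# carries 3 bits: values 1..7 name a hot reference-layout pattern and
# 0 means "fall back to the generic per-field metadata walk".
# The patterns themselves are invented for illustration.
HOT_PATTERNS = {
    1: [],        # no references at all
    2: [0],       # single ref in field slot 0
    3: [0, 1],    # two leading refs
    # ...patterns 4..7 would cover the other common layouts
}

def scan(header_bits, fields, generic_offsets, visit):
    pat = HOT_PATTERNS.get(header_bits)
    if pat is not None:              # fast path: unrolled hot layout
        for off in pat:
            visit(fields[off])
    else:                            # slow path: generic metadata walk
        for off in generic_offsets:
            visit(fields[off])

seen = []
scan(2, ["refA", 42], None, seen.append)   # hot pattern 2: one leading ref
print(seen)                                # ['refA']
```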
Brainy: Effective Data Structure Selection
Have a heuristic tool for picking data structures.
Did a massive check of 1000 programs to see if their heuristic agreed with the programs. Found ~50% disagreement, i.e., the tool thinks a different DS would do better. The heuristic uses both hardware (e.g. Core2Duo vs Atom vs Nehalem) and runtime profiling, and can select the DataStructure. Uses the C++ STL library to detect existing data structures. So compile the C++ program, then run once w/profiling, then get a report on which call-chains should use which different data structures.
Heuristic: looks at software features: # of invocations of which interface functions, data-type size, # of data elements touched, etc. Obtained by profiling.
Heuristic: looks at hardware perf counters: branch mispredicts, L2 behavior, etc.
Feature space is large: 30+ dimensions. Need to train heuristic!
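As a toy stand-in for what such a heuristic might decide (the real tool trains a classifier over those 30+ features; every rule and threshold below is invented):

```python
# Toy stand-in for a Brainy-style selection heuristic: map profiled
# features to a container choice with a hand-written rule. The real
# tool trains a classifier over 30+ features; everything here,
# thresholds included, is invented for illustration.
def pick_structure(n_elems, lookups, inserts, scans):
    ops = lookups + inserts + scans
    if lookups / ops > 0.7 and n_elems > 100:
        return "hash_set"    # lookup-heavy & big: O(1) membership wins
    if scans / ops > 0.7:
        return "vector"      # scan-heavy: contiguous & cache-friendly
    return "tree_set"        # mixed: ordered, no rehash latency spikes

print(pick_structure(10_000, lookups=900, inserts=50, scans=50))  # hash_set
```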
Object placement & caches
Object placement in memory can impact performance.
Shows an obvious direct-mapped cache example.
Shades of Azul CPUs placing 8 stacks over the same 4-way L2 lines
Assume you have power-of-2 free-list design. Then have many carefully aligned objects of a fixed size. These then have the perfect (bad) alignment for cache line conflicts. Also free-lists are perhaps kept on 4M alignment, which leads to TLB conflicts
Testing on a Niagara2 – L1 is 8KB, 4-way, 16B lines. Make a ring of 256 objects with 1 hot pointer to the next object. Rest of the object is untouched. Time how fast you can cycle the ring.
Performance on Solaris libumem sucks. When nodes*256 exceeds the cache indices (MUCH SMALLER THAN the cache size), throughput sucks. Hoard works much better – it tends to get a better memory layout for the hot fields. Same effect on a Nehalem: 32KB, 8-way, 64B lines.
Bump-ptr allocations and best-fit can also get hammered this way. Real issue: some (1) field is hot in a large count of same-sized objects. Allocating those same-sized objects tends to get those hot-fields to stride uniformly through the cache, which tends to lead to high cache-conflict rates.
Fix for standard free-list allocators.
Index-aware size classes: make sure the free list does layout such that all cache indices are touched before reusing addresses. Easy: choose free-list block sizes which are relatively prime to the cache-line size; e.g. odd-sized blocks. Might have some fragmentation issues for very small blocks, i.e., don’t want to make size-2 blocks into size-3 blocks. Instead lose 1 cache line every full set of blocks that would fill the cache (i.e., lose 16B out of every 16KB on a Niagara).
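The conflict effect and the odd-size fix are easy to demonstrate with a little index arithmetic; the cache geometry below matches the Niagara numbers above, the block sizes are my own:

```python
# Demonstrate why power-of-two block sizes alias in a set-indexed cache
# and why odd ("index-aware") sizes spread out. Geometry is the
# 8KB / 4-way / 16B-line Niagara-style L1; block sizes are invented.
LINE = 16                          # 16B cache lines
SETS = 8 * 1024 // LINE // 4       # 8KB, 4-way -> 128 sets

def sets_touched(block_bytes, nblocks):
    # One hot field at offset 0 in each block, blocks laid out back-to-back.
    idxs = {((i * block_bytes) // LINE) % SETS for i in range(nblocks)}
    return len(idxs)

print(sets_touched(256, 256))   # 8: power-of-two stride aliases into few sets
print(sets_touched(257, 256))   # 128: odd stride (coprime) touches every set
```

With 256-byte blocks the 256 hot fields pile onto 8 cache sets; bump the block to 257 bytes and they cover all 128.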
Seeing +/- 50% for a wide range of C apps; their allocator does quite well.
Superblock coloring – trying to get allocations from different threads to spread out throughout the shared-cache layers. If you have thread-local free-lists, then 2 threads can get the same layout for the same data structures and those structures can now collide in cache. Need to spread out the per-thread malloc free lists.
Multi-core GC for Haskell
Looking at cache-local young-gen. Well understood problem.
- Avoid context switch overhead: mutators do their own GC.
- local roots done locally (per-core remembered sets)
- Do not do load-balancing!
Nurseries are still very small, so collections are frequent. And they are Stop-The-World, so involve at least 2 barriers. GCs happen every millisecond. No load balancing, and threads allocate at different rates: so many threads stall waiting for the slowest thread doing GC. Also have issues with the standard slow-thread-hitting-safepoint problem.
So want a not-STW collector. Shared old-gen. Local heaps can be collected while other cores run. Not really a concurrent GC. Can we disallow inter-local-heap pointers (Azul’s Stack-Based Allocation)?
Truly small thread-local heaps, plus the large shared common old-gen. They looked at the write-barrier model, followed by an “escape” and fixup. Tried that, didn’t like it.
Allow local ptrs in the global heap. Use a Read-Barrier to detect when loading a Foreign-Local pointer. Ok to load Self-Local pointers from the global heap. Foreign-Local pointer loads force the foreign object to be promoted to the heap. Found a way – in Haskell ONLY – to make the read-barrier cheap. Basically Haskell already does a function-invoke per pointer load to do lazy evaluation, and they piggy-backed on that.
New local GC delivers about 15% better performance on average for these Haskell benchmarks than old collector, on a 24-cpu machine.
Read-Barrier is a huge win to avoid moving everything into the heap. Able to keep many more objects thread-local.
CLIFF NOTES: SBA does not need to move objects. Always allocate in the general heap, but maintain the thread-local invariant. An escape must visit transitive closure of escaping, but only needs to flip 1 header bit denoting local versus remote. How do we track allocation sites? Can I stick a pre-word (or post-word?) on thread-local objects stored in the general heap?
CLIFF NOTES: fix the SBA-vs-Java-heap model: allocate a Java large array and then do SBA inside of it like a TLAB. Escaped objects: fragment the large array into chunks around the escaped objects. End up with a linked list of allocation fragments. When a chunk gets too small – “leak” it to the general heap for normal GC to get. Should end up with a list of large chunks which do normal thread-local GC’s efficiently. Mark/Compact within these large chunks. Deals with tightly-compacted non-moving chunks well – but want to keep the thread-local notion.
“Waste not, want not” – Resource-based GC in a shared environment
GC-vs-custom free: GC can be competitive but usually needs 5x the memory. But paging kills GC. Want to use as much memory as you can… without paging. True available RAM changes constantly in virtualized systems.
Can hide a few page faults, but want to avoid a “fault storm”. Process checks the fault-rate. Ignore if at a low rate. At a high rate, trigger a full-GC, shrink heap till it fits in RAM.
But have many processes sharing RAM. Processes need to cooperate. So processes can do a FullGC when they detect swap on OTHER processes. Also use a shared whiteboard to serialize GCs amongst cooperating processes (because FullGCs temporarily need more memory). When the FullGC is done, the JVM’s working set is reduced (due to compaction) – even with no change to -Xmx flags.
But can do better: can predict a coming page-fault storm. Track the Resident Set Size, RSS. If RSS drops unexpectedly, then the OS is stealing pages from this process – this process is using too much RAM. Trigger a GC early; post-GC the process touches less RAM.
Better still: the original idea is very conservative: ANY process detecting slightly too-high paging would trigger a FullGC in ALL processes. New model: only the largest heap user (the “sucker”) triggers a FullGC. Lets the system perform well when over-subscribed, but smoothly transition to using all available RAM when memory is available.
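The RSS-drop trigger could be sketched like this (my illustration; in a real JVM the samples would come from the OS, e.g. /proc/self/statm, and the threshold is an invented parameter):

```python
# Sketch of the "RSS dropped -> OS is reclaiming our pages -> GC early"
# heuristic. In a real JVM rss_samples would come from the OS (e.g.
# /proc/self/statm); here it is just a list. drop_frac is invented.
def should_gc_early(rss_samples, drop_frac=0.10):
    # Trigger when the resident set shrinks by more than drop_frac
    # even though the process did not release memory itself.
    if len(rss_samples) < 2:
        return False
    prev, cur = rss_samples[-2], rss_samples[-1]
    return cur < prev * (1.0 - drop_frac)

print(should_gc_early([1000, 1000, 850]))  # True: OS stole ~15% of our pages
print(should_gc_early([1000, 990]))        # False: within noise
```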
Panel – memory predictions for the next 20 yrs
(my) Sure-bet predictions:
– it’ll be managed by DLmalloc
– HotSpot will be running on a high number of machines …
and running DLmalloc internally
– Processor-in-Memory will be real, and the processor will be running X86 natively
– Same trends keep going: memory is denser “bigger” “wider” but not much faster
– Memory is the new disk: versions of very dense “very slow” memory, e.g. flash
– Faster/smaller memory
– And still CPU cache hierarchies
– Tension between physical constraints and language abstractions will continue
– Placed memory vs “new”
– GPU’s running e.g. Java++
RECON: Finding and Understanding Concurrency Errors with Reconstructed Execution Fragments
Fun pictures of attempts to herd cats
Root cause of a bug is often far removed from the failure.
Full execution tools show “too much info”, need only the slice of 2 bad threads.
The “what happened” tools show “too little” – not enough info on how it happened.
Want to show the interleaving of the failing threads, just around the failing hand-off of broken data between threads.
Tool runs program-under-test many times. Collects “communication graphs” for program runs. Labels runs as broken or not-broken. Then tool sorts the inter-thread communications as “part of buggy runs” or not-buggy runs. Then they sort out the most likely communications leading to bugs. Collection slowdowns are around 15x to 20x.
Communication-graphs: nodes are instructions, edges are inter-thread communication via shared memory. Have a time-stamp for communicating nodes. A reconstruction is a time-ordered sequence of memory-ops around the failing communication edge.
“Reconstructions” are average behavior around a bug-crash run, just those mem-ops near a suspicious communication edge. Rank all the reconstructions and pick the highest ones. (lots of complex math here)
Nice idea here…
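The edge-ranking step could be approximated with a naive frequency-difference score; this is my simplification, not RECON’s actual statistics:

```python
from collections import Counter

# Naive stand-in for RECON's ranking: score each inter-thread
# communication edge by how much more often it shows up in buggy
# runs than in passing runs. The real tool's statistics are fancier.
def rank_edges(buggy_runs, good_runs):
    buggy = Counter(e for run in buggy_runs for e in run)
    good = Counter(e for run in good_runs for e in run)
    edges = set(buggy) | set(good)
    score = {e: buggy[e] / len(buggy_runs) - good[e] / len(good_runs)
             for e in edges}
    return sorted(edges, key=lambda e: -score[e])

# Edges are (writer-instr, reader-instr) pairs; the data is made up.
buggy_runs = [[("st A", "ld A"), ("st B", "ld B")],
              [("st A", "ld A")]]
good_runs = [[("st B", "ld B")], [("st B", "ld B")]]
print(rank_edges(buggy_runs, good_runs)[0])   # ('st A', 'ld A')
```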
CSmith – randomly generated test cases for compiler correctness. Very stable
and finding real bugs in GCC. Produces C99 only. Open source.
Iterative Data-Parallel Mark-Sweep on a GPU
“GC on a GPU”
Objects in GPU memory. Many local caches, either NUMA or real distributed.
Thread-local heaps with thread-local GCs.
GPU cores are very simple. Batches of 16 cores run in lock-step.
Has a language? Can allocate on a GPU. ‘new’ can return NULL. Programmer has to manually invoke GC on a GPU. Can invoke very-wide data-parallel invokes. Essentially hidden data-parallelism, not thread-parallelism.
GPU: best performance is when all cores run in lock-step.
Model: no need to crawl stack; cooperative GC invocation.
Can find root-set in parallel (all entries in some arrays).
Can now scan arrays and object fields in parallel.
Lousy performance for single-linked lists.
Typical sequential Java sucks.
Want arrays of objects, maps, hashes, high fan-out trees, etc.
Classic parallel marking with work-stealing. Problem: stealing is fairly expensive. Want to steal less often (and steal more work?). Implementation is exact but sweeping, not moving. Allocation is standard striped free-lists on powers-of-two.
More on marking: mark objects in parallel using field-by-field scanning. But want to handle big arrays in a different way… still in the SIMD model. Put aside large arrays for later. Then do the large arrays with an array GPU kernel, and set aside ordinary objects. Flip back and forth between large arrays and objects.
Not compacting – blows locality?
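The flip-flop between object marking and the array kernel might look like this CPU-side sketch (heap model and the large-array cutoff are invented; real code would be GPU kernels):

```python
# Sketch of the mark loop described above: mark ordinary objects
# field-by-field, but defer large arrays to a separate batched pass
# (the "array kernel"), alternating until both worklists drain.
LARGE = 4   # ref-count cutoff for a "large array"; invented threshold

def mark(roots, heap):
    marked = set()
    objs, arrays = list(roots), []
    while objs or arrays:
        while objs:                         # object pass: field-by-field
            o = objs.pop()
            if o in marked:
                continue
            marked.add(o)
            for r in heap[o]:
                # defer big arrays to the batched array pass
                (arrays if len(heap[r]) >= LARGE else objs).append(r)
        if arrays:                          # batched "array kernel" pass
            batch, arrays = arrays, []
            for a in batch:
                if a in marked:
                    continue
                marked.add(a)
                objs.extend(heap[a])        # children go back to obj pass
    return marked

heap = {"root": ["big"], "big": ["a", "b", "c", "d"],
        "a": [], "b": [], "c": [], "d": []}
print(sorted(mark(["root"], heap)))   # all six nodes reached
```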
Fun power savings modes: let programmer insert approximate-data keywords, insert approximate-math keywords, or cruddy-hardware. Get 20% to 50% power savings, e.g. when doing picture-related-stuff, while keeping perfect math for example in security & OS stuff.
Safe Optimizations for Shared-Memory Concurrent Programs
Many optimizations that are safe for sequential programs are NOT safe for concurrent ones. Looks like he’s working on C code and not Java with volatiles.
Ahh… ‘volatile’ in Java or ‘atomic’ in C++0x fixes his program. So everything is OK for Data-Race-Free programs, but most multi-threaded programs have data races. Many x-forms are “not safe” for data-race-free programs. GCC has over 200 compiler passes – ugh, tough problem to decide if any given pass causes a problem. But when refined to just memory traces, many fewer passes really re-order memory.
Mem-trace-preserving x-forms are always safe.
Removing a dead read is OK.
Adding a dead read is OK.
Removing a read-after-read is OK only if it does not span, e.g., a lock.
Still, using only safe-for-SC x-forms leaves some data-race-full programs undefined.
Evaluating Value-Graph Transform Validation for LLVM
(Does LLVM’s graph x-forms do the right thing?)
Another code-generator, but this one for LLVM.
Tries to see if the optimized code has similar semantics.
Using x-forms at the AST level.
This one looks painful: the AST x-forming has to have all the x-forms of the underlying optimizing compiler.
Potential of a LRU-MRU Collaborative Cache
Cache replacement policy
LRU – victim is least recently used. Practical but not optimal
OPT – victim is farthest future use. Optimal but not practical
Initial cache is strict LRU?
2ndary cache is MRU evict? (most-recently-used evict). The 2ndary cache is filtered by the LRU cache. Only sees things that are not already in LRU.
Profile cache behavior – some static accesses should use MRU, and some LRU.
Blah – profile required to drive cache behavior?
Using a fully associative cache?
Results presented as improvements to hit-ratio instead of performance.
I can’t tell how he chooses MRU vs LRU, but some form of early-eviction for never-be-used-again stuff helps
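The LRU-vs-OPT gap is easy to see in a tiny fully-associative cache simulator (my own sketch; the trace and cache size are made up to show the gap):

```python
# Tiny fully-associative cache simulator comparing LRU with OPT
# (Belady: evict the line whose next use is farthest in the future).
def hits(trace, size, policy):
    cache, h = [], 0
    for i, x in enumerate(trace):
        if x in cache:
            h += 1
            cache.remove(x)
            cache.append(x)          # refresh recency for LRU
            continue
        if len(cache) >= size:
            if policy == "LRU":
                victim = cache[0]    # least recently used is at the front
            else:                    # "OPT": Belady's clairvoyant policy
                future = trace[i + 1:]
                victim = max(cache, key=lambda c: future.index(c)
                             if c in future else len(future))
            cache.remove(victim)
        cache.append(x)
    return h

trace = ["a", "b", "c", "a", "b", "c", "d", "a", "b", "c"]
print(hits(trace, 2, "LRU"), hits(trace, 2, "OPT"))  # 0 3
```

On this cyclic trace LRU thrashes to zero hits while OPT gets 3, which is the gap an early-eviction ("never used again") hint tries to close.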
Parallel Trace-JIT Compilation
HS has done parallel compilation for 10yrs…
PRE – with profiling data for placement.
Sorta a yawner: HotSpot’s been doing GVN+GCM w/profile data for 15yrs and it’s morally equivalent (similar quality code generated, simpler, faster)