Ben Zorn and I had a long talk while working off dinner at CGO yesterday. One of the things we noticed was how Ben’s recent work on surviving data races and Azul’s work with detecting data races in the Java Collection classes had a lot in common: both came about because data races are killing the average programmer, both run with little overhead and detect only a limited set of data-races, and are easy enough for somebody to hand-code into their code today.
I assume the reader knows what a data-race is (2 threads accessing the same memory location and at least one is writing), and has wrestled with debugging ones in the past. The problem is that nobody really knows how to debug data-races. There are plenty of academic papers on how to detect them and they are all unrealistically slow, or don’t work on large real-world programs, or throw a zillion false positives, or are otherwise broken. (Ok, the REAL problem is that we don’t know the Right Way to do concurrent programming, but that’s a talk for another blog).
So Ben & Azul were both looking for some easy way to detect common data-races – something people could put into practice with little or no effort, and that wouldn’t kill performance and wouldn’t raise a zillion false alarms.
First Azul’s solution: we added a little code to the Java Collections. You have to install some code on the -Xbootclasspath or run Azul’s JVM with -XX:+UseLockedCollections. Then the newer Java 4 unlocked collections (e.g., HashMap) run locked – no data-races but no concurrency (or speed) either. If this fixes your bug, likely you are faced with calling an unlocked collection in a racey fashion. But where, in your million line program, are two threads calling into the same HashMap with one of them writing? Now run with the flag -XX:+UseLockedCollections2. This time, HashMap tracks when the collection is being both used and modified (by bumping an Atomic) – and throws a Java exception in both threads. You get stack trace-backs for both threads involved in the data-race. It’s a really low-cost technique, and although it only catches races in specially annotated code (e.g. HashMap) there are a few high-payoff areas in the Java libraries that seem to be involved in more than their fare share of data-races.
Now Ben’s solution “ToleRace”, which I am going to paraphrase badly. The idea here is that this piece of code in front of you, that you have pored over for the last day, is known correct but some other piece of unknown code is messing up your good code by breaking the rules and accessing the data in a racey way. Probably somebody just forgot to synchronize.
Turns out you can tolerate (hence “tole-race”) most races if you copy-in/copy-out around the locked region. I.e., at the start of the synchronized region you copy-in all interesting shared variables into local copies (all those things the locked region is going to read anyways). Then you run the critical section on the local copies. At the end, right before you unlock, you test to see if any of the shared variables (not your local copies) changed while you held the lock. If so, then you detected have a data-race – but only in this thread! The guilty party as escaped (compare this to Azul’s solution where you get stack traces for both threads)! But all is not lost… by looking at which values your locked code read and which values it wrote you can often choose to either copy-out your local variable to the original shared variable, or leave the original untouched. By doing so you effectively can choose to act “as if” your code ran either before or after the other unknown guilty party – instead of simultaneously. I.e., you have removed the effect of the race, or tolerated it. Not all races can be tolerated of course, and tracking all the variables becomes really tedious for more than a handful. Nonetheless it is possible to tolerate some kinds of data-races.
I’m sure Ben is using some kind of tool to auto-insert the right checks (making it rather hard for You to fixup your code right now), but perhaps you can do something similar to ToleRace or Azul’s -XX:UseLockedCollections2 if you need to hand-rescue a few synchronized blocks.
More on concurrent programming in another blog; it’s just too rich a topic to put down.