March 7th, 2008

Another round of Micro-benchmark Advice


I ran across this article, and since Heinz is a friend I thought I’d try to figure out what’s going on. Here’s what I came up with:
There are 3 or 4 conflicting effects, and which one dominates at any point in time “depends”. All of the effects can be removed with some care.

  • OSR: all code is in a loop in main.  The -server compiler makes good code for hot looping methods; the next time that method is called, the good code runs.  Alas, ‘main’ is never called again.  So after a while (slowly) interpreting the code, HotSpot makes mediocre code for “the middle of the method” and does an On-Stack Replacement, swapping the interpreter frame for the compiled frame.  The -client compiler is invoked for loop-containing methods immediately, but makes less optimized code.  Fix: structure all timing code as modest-count outer loops which call methods that themselves contain a long-trip-count loop:
    • for( int i=0; i<100; i++ ) test_one();
    • void test_one() { for( int i=0; i<1000000; i++ ) do_stuff(); }
  • Profiling ends compilation: after compiling the hot loop, the -server compiler notices that it’s reaching code that (1) has never been executed and (2) is full of classes that have never been loaded.  It stops compiling and issues an “uncommon trap” – HotSpot jargon for flipping from compiled code back to the interpreter.  The -client compiler usually compiles all the code in a method, no matter how hot or cold.  Fix: run all test code during the warmup period, which forces all classes to load.  Call all work methods from some top-level dispatch function, which will itself be profiled, hot, and compiled.
  • Inline Caches: HotSpot uses an inline cache for calls where the compiler cannot prove that only a single target can be called.  An inline cache turns a virtual (or interface) call into a static call plus a few cycles of work.  It is a 1-entry cache inlined in the code: the Key is the expected class of the ‘this’ pointer; the Value is the static target method matching the Key, directly encoded as a call instruction.  As soon as you need 2+ targets for the same call site, you revert to the much more expensive dynamic lookup (load/load/load/jump-register).  Both compilers use the same runtime infrastructure, but the -server compiler is more aggressive about proving a single target.  Fix: either expect the calls to be single-target and fast, OR force all calls to be multi-target and slow.  The multi-target solution is easier for this kind of test.
  • Bi-morphic (NOT poly-morphic) call-site optimization: where the -server compiler can prove that only TWO classes reach a call site, it inserts a type-check and then statically calls both targets (which may then further inline, etc.).  The -client compiler doesn’t do this optimization.  Fix: either do or do not allow exactly 2 targets at a call site.  Usually it’s easy to arrange for 1 target (the norm, and the inlined case) OR many more than 2 targets.
  • X86 BTB: some X86 chips include a branch-target-buffer prediction mechanism, which can sometimes predict the target of indirect branches.  Fix: this one’s harder to control, but a light-weight pseudo-random selection of targets will often defeat the hardware; e.g., make an array of Foo objects populated with various random selections of Foo subclasses, and make virtual calls against those.
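Putting the first two fixes together, a minimal harness might look like the sketch below.  The class name, the warmup counts, and the do_stuff payload are all hypothetical placeholders for whatever is actually under test; the point is only the shape: the long-trip-count loop lives in its own method (so -server gives it a normal, non-OSR compile), and a warmup pass runs everything first so all classes load and the dispatch path is hot and compiled before timing starts.

```java
// Hypothetical micro-benchmark harness sketch following the fixes above.
class Harness {
    static long sink;  // swallow results so the JIT can't dead-code the work

    // Placeholder for the code under test.
    static long do_stuff(long x) { return x * 31 + 7; }

    // Long trip-count loop in its own method: once hot, it gets a normal
    // (non-OSR) compile the next time it is called.
    static long test_one() {
        long acc = 0;
        for (int i = 0; i < 1000000; i++) acc = do_stuff(acc);
        return acc;
    }

    public static void main(String[] args) {
        // Warmup: force class loading, profiling, and compilation.
        for (int i = 0; i < 10; i++) sink = test_one();

        // Timed run: modest-count outer loop calling the hot method.
        long start = System.nanoTime();
        for (int i = 0; i < 100; i++) sink = test_one();
        System.out.println("ns per test_one(): " + (System.nanoTime() - start) / 100);
    }
}
```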
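For the last three effects, the common cure is the same: make the call site megamorphic on purpose.  Here is a sketch along those lines, with hypothetical Foo subclasses standing in for the classes under test.  With three or more receiver classes, neither the 1-entry inline cache nor the bi-morphic optimization applies, and a seeded pseudo-random mix of targets also tends to defeat the BTB’s indirect-branch prediction.

```java
import java.util.Random;

// Hypothetical class hierarchy: 3 subclasses force a megamorphic call site.
abstract class Foo { abstract int work(int x); }
class FooA extends Foo { int work(int x) { return x + 1; } }
class FooB extends Foo { int work(int x) { return x * 2; } }
class FooC extends Foo { int work(int x) { return x ^ 3; } }

class Megamorphic {
    // Populate an array with a repeatable pseudo-random mix of subclasses.
    static Foo[] buildFoos(int n) {
        Random rnd = new Random(42);   // fixed seed: same mix every run
        Foo[] foos = new Foo[n];
        for (int i = 0; i < n; i++) {
            switch (rnd.nextInt(3)) {  // 3 classes => megamorphic site
                case 0:  foos[i] = new FooA(); break;
                case 1:  foos[i] = new FooB(); break;
                default: foos[i] = new FooC(); break;
            }
        }
        return foos;
    }

    public static void main(String[] args) {
        Foo[] foos = buildFoos(1024);
        int acc = 0;
        for (int i = 0; i < 1000000; i++)
            acc += foos[i & 1023].work(acc);  // virtual call, 3 possible targets
        System.out.println(acc);              // keep the result live
    }
}
```

The fixed seed keeps runs comparable to each other while still giving the hardware an unpredictable sequence of branch targets.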

Good luck with those micro-benchmarks,
