March 7th, 2008

Another round of Micro-benchmark Advice

I ran across this article, and since Heinz is a friend I thought I’d try to figure out what’s going on. Here’s what I came up with:
There are 3 or 4 conflicting effects and which one dominates at any point in time “depends”.  All of the effects can be removed with some care.

  • OSR: all code is in a loop in main.  The -server compiler makes good code for hot looping methods; the next time that method is called the good code runs.  Alas, ‘main’ is never called again.  So after spending a while (slowly) interpreting the code, HotSpot makes mediocre code for “the middle of the method” and does an On-Stack-Replacement, swapping the interpreter frame for the compiled frame.  The -client compiler is invoked for loop-containing methods immediately, but makes less optimized code.  Fix: structure all timing code as a modest-count outer loop that calls methods which themselves contain a long trip-count loop:
    • for( int i=0; i<100; i++ ) test_one();
    • void test_one() { for( int i=0; i<1000000; i++ ) do_stuff(); }
  • Profiling ends compilation: after compiling the hot loop the -server compiler notices that it’s reaching code that’s (1) never been executed and (2) full of classes that have never been loaded.  It stops compiling and issues an “uncommon-trap”, HotSpot jargon for flipping from compiled code back to the interpreter.  The -client compiler usually compiles all the code in a method no matter how hot or cold.  Fix: run all test code during the warmup period, which forces all classes to be loaded.  Call all work methods from some top-level dispatch function which itself will be profiled, hot and compiled.
  • Inline Caches: HotSpot uses an inline-cache for calls where the compiler cannot prove only a single target can be called.  An inline-cache turns a virtual (or interface) call into a static call plus a few cycles of work.  It is a 1-entry cache inlined in the code; the Key is the expected class of the ‘this’ pointer, the Value is the static target method matching the Key, directly encoded as a call instruction.  As soon as you need 2+ targets for the same call site, you revert to the much more expensive dynamic lookup (load/load/load/jump-register).  Both compilers use the same runtime infrastructure, but the server compiler is more aggressive about proving a single target.  Fix: either expect the calls to be single-target and fast, OR force all calls to be multi-target and slow.  The multi-target solution is easier for this kind of test.
  • Bi-morphic (NOT poly-morphic) call site optimization: Where the -server compiler can prove only TWO classes reach a call site it will insert a type-check and then statically call both targets (which may then further inline, etc).  The -client compiler doesn’t do this optimization.  Fix: either Do or Do Not allow 2 targets for the result of calls.  Usually it’s easy to arrange for 1 target (the norm, and inlined case) OR many more than 2 targets.
  • X86 BTB: Some X86 chips include a branch-target-buffer prediction mechanism, which can sometimes predict the target of indirect branches.  Fix: this one’s harder to control, but a light-weight pseudo-random selection of targets will often defeat the hardware.  For example, make an array of Foo objects populated with various random selections of Foo subclasses, and make virtual calls against those.
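
The fixes in the list above can be combined into one small harness.  What follows is a minimal sketch under stated assumptions, not the benchmark from the article and not anything HotSpot provides: the class MicroBenchSketch, the Foo interface with its four subclasses, and the methods makeTargets and testOne are all hypothetical names, and the loop counts simply reuse the 100 and 1000000 from the OSR example.  It uses four receiver classes so the call site is neither single-target nor bi-morphic, picks them pseudo-randomly to make life harder for the BTB, and runs the same code in warmup and in the timed pass.

```java
import java.util.Random;

public class MicroBenchSketch {
    // Hypothetical hierarchy; four subclasses so the call site below
    // sees more than 2 receiver classes (neither mono- nor bi-morphic).
    public interface Foo { int work(int x); }
    public static final class FooA implements Foo { public int work(int x) { return x + 1; } }
    public static final class FooB implements Foo { public int work(int x) { return x ^ 3; } }
    public static final class FooC implements Foo { public int work(int x) { return x * 7; } }
    public static final class FooD implements Foo { public int work(int x) { return x - 5; } }

    // Fill an array with a pseudo-random mix of subclasses.
    // A fixed seed keeps runs repeatable.
    public static Foo[] makeTargets(int n, long seed) {
        Random r = new Random(seed);
        Foo[] foos = new Foo[n];
        for (int i = 0; i < n; i++) {
            switch (r.nextInt(4)) {
                case 0:  foos[i] = new FooA(); break;
                case 1:  foos[i] = new FooB(); break;
                case 2:  foos[i] = new FooC(); break;
                default: foos[i] = new FooD(); break;
            }
        }
        return foos;
    }

    // The long trip-count loop lives in its own method, so later calls
    // enter through the normal method entry instead of relying on OSR.
    // foos.length must be a power of two for the index mask to be valid.
    public static int testOne(Foo[] foos, int trips) {
        int mask = foos.length - 1;
        int sum = 0;
        for (int i = 0; i < trips; i++) {
            sum += foos[i & mask].work(i);   // multi-target virtual call
        }
        return sum;
    }

    public static void main(String[] args) {
        Foo[] foos = makeTargets(1024, 42L);
        int sink = 0;  // printed at the end so the work is not dead code

        // Warmup: runs exactly the code that is timed later, so every
        // class is loaded and testOne is hot before the clock starts.
        for (int i = 0; i < 100; i++) sink += testOne(foos, 1000000);

        // Timed pass: modest-count outer loop calling the inner-loop method.
        long start = System.nanoTime();
        for (int i = 0; i < 100; i++) sink += testOne(foos, 1000000);
        long elapsedNs = System.nanoTime() - start;

        System.out.println("ns per call: " + (elapsedNs / 1.0e8) + " (sink=" + sink + ")");
    }
}
```

Going the other way (a single Foo subclass everywhere) would measure the fast, inlined case instead; the point is to pick one regime deliberately rather than let the benchmark drift between them.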

Good luck with those micro-benchmarks,
