May 6th, 2012

What's Going On?

Category: Personal

As alluded to in my last blog, here’s my fun hack du jour: “What’s going on?”
I’ve got a multi-node setup with UDP packets slinging back and forth, and each node itself is a multi-CPU machine.  UDP packets are sliding by one another, or getting dropped on the floor, or otherwise confused.  I’m in a twisty maze of UDP packets all alike (yes, I played the game back in the day).  Then something crashes, and pretty quickly the network is filled with damage-control packets, repair & retry packets, and infinite millions more mirror-reflection packets.  What just happened?  I press my handy little button and…
… a broadcast of “dump, ship and die” hits the wires (a few extra times for good measure).  All my busy Nodes stop their endless chatter and dump the last several seconds of packets towards my laptop, slowly & reliably, via TCP.  Each node has been gathering all the packets it sent or received (well, the first 16 bytes of each) in a giant ring-buffer, along with time-stamp info and the other party involved.  Once every node has shipped its data to the one poor victim (the node I pressed my button on), every other node dies (to prevent further damage).
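To make the per-node recorder concrete, here is a minimal Python sketch of the idea, not the actual code behind this post. The names `PacketRecord`, `PacketRing`, `record` and `dump` are hypothetical, and so are the default capacity and the use of the wall clock for time stamps.

```python
import time
from collections import deque
from typing import NamedTuple


class PacketRecord(NamedTuple):
    timestamp_ns: int  # this node's own clock reading (assumption: wall clock)
    peer: str          # the other party involved
    sent: bool         # True if we sent the packet, False if we received it
    head: bytes        # only the first few bytes of the packet


class PacketRing:
    """Fixed-size ring buffer: the oldest records fall off as new ones arrive."""

    def __init__(self, capacity=1 << 16, head_bytes=16):
        self._ring = deque(maxlen=capacity)
        self._head_bytes = head_bytes

    def record(self, payload, peer, sent, now_ns=None):
        ts = time.time_ns() if now_ns is None else now_ns
        head = bytes(payload[:self._head_bytes])
        self._ring.append(PacketRecord(ts, peer, sent, head))

    def dump(self, window_ns, now_ns=None):
        """Return the records from the last window_ns nanoseconds, oldest first."""
        now = time.time_ns() if now_ns is None else now_ns
        cutoff = now - window_ns
        return [r for r in self._ring if r.timestamp_ns >= cutoff]
```

A node would call `record` on every send and receive, and `dump` once when the “dump, ship and die” broadcast arrives, shipping the result over TCP.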
The Last Survivor gathers up a bunch of very large UDP packet dumps and starts sorting them.  Of course, you can’t just sort on time, that would be too easy.  No, all the nodes are running with independent clocks; NTP only gets them so close in time to each other.  Instead I have to sort out a giant Happens-Before relationship amongst my packets.  I am helped (above and beyond some sort of home-brew Wireshark) by my application understanding its own packet structure.  I know certain packets must be strongly ordered in time, never mind what the clock says.  For example, I only send out an ACK for task#1273 strictly after I receive (and execute) task#1273.  Paxos voting protocols follow certain rules, etc, etc.
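One way to picture this kind of sort is a topological sort over happens-before edges, with each node's skewed clock used only to break ties. The Python sketch below is an illustration under stated assumptions, not the tool described here: the `Event` layout, the `task#N` / `ack#N` labels, the two rules shown, and the assumption that each label is sent exactly once are all hypothetical.

```python
import heapq
from typing import NamedTuple


class Event(NamedTuple):
    node: str      # node that logged the event
    clock_ns: int  # that node's own (possibly skewed) clock reading
    kind: str      # "send" or "recv"
    msg: str       # hypothetical label, e.g. "task#1273" or "ack#1273"


def happens_before_edges(events):
    """Return (i, j) index pairs meaning events[i] must precede events[j]."""
    sends = {}      # msg label -> index of its send (assumes one send per label)
    task_recv = {}  # (node, task number) -> index of that task's receive
    for i, e in enumerate(events):
        if e.kind == "send":
            sends[e.msg] = i
        elif e.msg.startswith("task#"):
            task_recv[(e.node, e.msg[len("task#"):])] = i
    edges = []
    for j, e in enumerate(events):
        # Rule 1: a packet is sent before it is received.
        if e.kind == "recv" and e.msg in sends:
            edges.append((sends[e.msg], j))
        # Rule 2: a node sends the ACK for a task only after receiving the task.
        if e.kind == "send" and e.msg.startswith("ack#"):
            i = task_recv.get((e.node, e.msg[len("ack#"):]))
            if i is not None:
                edges.append((i, j))
    return edges


def order_events(events):
    """Kahn's topological sort; among ready events, the earliest clock goes first."""
    succ = [[] for _ in events]
    indegree = [0] * len(events)
    for i, j in happens_before_edges(events):
        succ[i].append(j)
        indegree[j] += 1
    ready = [(e.clock_ns, i) for i, e in enumerate(events) if indegree[i] == 0]
    heapq.heapify(ready)
    ordered = []
    while ready:
        _, i = heapq.heappop(ready)
        ordered.append(events[i])
        for j in succ[i]:
            indegree[j] -= 1
            if indegree[j] == 0:
                heapq.heappush(ready, (events[j].clock_ns, j))
    if len(ordered) != len(events):
        raise ValueError("contradictory ordering rules (cycle) in the dumps")
    return ordered
```

If node B's clock runs behind node A's, a plain sort on time would show B receiving task#1273 before A sent it; the edges force the send first and leave the clocks to settle only the pairs the rules say nothing about.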
In the end, I build a very large mostly-correctly-ordered timeline of what was just going on, as seen by each Node itself, and then HTML’ify it and pop it up on the browser.  Voila!  There for all the world to see is the blow-by-blow confusion of what went wrong (and generally, the follow-on error “recovery” isn’t all that healthy, so more broken behavior follows hard on the heels of broken behavior).
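The HTML step can be as plain as one table row per event. A minimal Python sketch, assuming the timeline is already ordered and each entry is a hypothetical `(node, clock_ns, description)` tuple; the function names are illustrative only:

```python
import html
import pathlib
import tempfile
import webbrowser


def timeline_to_html(entries):
    """Render (node, clock_ns, description) tuples as a simple HTML table."""
    rows = ["<tr><th>Node</th><th>Clock (ns)</th><th>Event</th></tr>"]
    for node, clock_ns, description in entries:
        rows.append(
            "<tr><td>{}</td><td>{}</td><td>{}</td></tr>".format(
                html.escape(node), clock_ns, html.escape(description)
            )
        )
    return "<table>\n" + "\n".join(rows) + "\n</table>"


def show_in_browser(page):
    """Write the page to a temporary file and open it in the default browser."""
    with tempfile.NamedTemporaryFile(
        "w", suffix=".html", delete=False, encoding="utf-8"
    ) as f:
        f.write(page)
    webbrowser.open(pathlib.Path(f.name).as_uri())
```

Escaping matters here because packet contents are arbitrary bytes once decoded, and a stray `<` would otherwise mangle the table.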
Basically, I’m admitting I’m a tool-builder at heart.  As soon as I realized that standard debuggers don’t work in this kind of situation, and Wireshark couldn’t sort based on domain-specific info (and pretty-print the results, again using domain-specific smarts), I went into tool-building mode.  As of this blog, I’ve found several errors in my cloud setup already; e.g. a useless abort-and-restart of a Paxos vote if a heartbeat arrives mid-vote from an ex-cloud-member (that’s alive and well and wants to get back in the Cloud), and some infinite-chatter issues getting key replication settled out as nodes come and go.
On other fronts, my car came back from the body shop, only to turn around and go back to the engine shop: the timing belt had slipped.  The work was done under warranty and I’ll go pick up my car on Monday.  I can hardly wait!!!
My GF’s car’s brakes have been squealing for weeks; they finally started shuddering and we decided it was time to fix them.  She’s driving a 1993 Nissan Maxima with 220K miles on it; weird things start breaking at that age, but mostly the car just soldiers on.  But it was time for the brakes.  We pulled the rear pads & looked at the rotors: one of them was shot.  Fortunately a new rear rotor was only $25, plus another $22 for pads (with tax and brake grease, still under $50).  We couldn’t get the dang pistons to move back! We tried at least 5 different wrench/jig/clamp combos to no avail.
We figured the pistons must have been jammed with debris, so with great trepidation we disconnected the brake fluid line and the emergency brake cable and pulled the whole unit to my workbench.  I popped the piston out manually.  It looked clean and good… and had this funny thing in the middle… stupid me, failed to check the internet again… it’s the anti-slip mechanism for the emergency brake.  You have to spin the piston to screw it back into the cylinder.  Sigh.  It took us another half hour to find the right tool to spin the dang thing, but it finally went in without too much trouble.  After that it was another hour to reassemble all the parts, and then we had to bleed and bleed and bleed the line.  As of this writing, the pedal is still too soft; I suspect we need to bleed it some more.
Daughter at the Old Salts Regatta, plus a ton of driving to meet people for 0xdata, plus a much-needed dinner out… and being down 2 cars (GF’s brakes-in-progress and my car in the shop) made for a very complicated week.
Cliff
