May 6th, 2012

What's Going On?

Category: Personal

As alluded to in my last blog, here’s my fun hack du jour: “What’s going on?”
I’ve got a multi-node setup with UDP packets slinging back and forth, and each node itself is a multi-CPU machine.  UDP packets are sliding by one another, or getting dropped on the floor, or otherwise confused.  I’m in a twisty maze of UDP packets all alike (yes, I played the game back in the day).  Then something crashes, and pretty quickly the network is filled with damage-control packets, repair & retry packets, and infinite millions more mirror-reflection packets.  What just happened?  I press my handy little button and…
… a broadcast of “dump, ship and die” hits the wires (a few extra times for good measure).  All my busy Nodes stop their endless chatter and dump the last several seconds of packets towards my laptop, slowly & reliably, via TCP.  Each node has been gathering the first 16 bytes of every packet it sent or received in a giant ring-buffer, along with time-stamp info and the other party involved.  After every node ships all this data to the one poor victim (the one I pressed my button on), every other node dies (to prevent further damage).
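For the curious, here is a minimal Python sketch of that per-node ring-buffer idea. It is an illustration only, not the actual implementation: the names `PacketRecord`, `PacketRing`, `record` and `dump` are made up for this example, and a real multi-CPU node would need a thread-safe (or per-thread) version.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class PacketRecord:
    """One ring-buffer entry (hypothetical layout, for illustration)."""
    timestamp: float  # local clock reading when the packet was sent/received
    peer: str         # the other party involved
    sent: bool        # True if this node sent it, False if it received it
    head: bytes       # only the first 16 bytes of the packet


class PacketRing:
    """Fixed-size ring buffer: new records overwrite the oldest ones."""

    def __init__(self, capacity):
        self._slots = [None] * capacity
        self._next = 0  # total number of records ever written

    def record(self, peer, sent, payload, now=None):
        # 'now' lets a caller supply a timestamp; defaults to the local clock.
        ts = time.time() if now is None else now
        slot = self._next % len(self._slots)
        self._slots[slot] = PacketRecord(ts, peer, sent, bytes(payload[:16]))
        self._next += 1

    def dump(self):
        """Return the surviving records, oldest first."""
        n = len(self._slots)
        start = max(0, self._next - n)
        return [self._slots[i % n] for i in range(start, self._next)]
```

On “dump, ship and die”, each node would serialize the result of `dump()` and push it over TCP to the surviving node; that part is left out here.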
The Last Survivor gathers up a bunch of very large UDP packet dumps and starts sorting them.  Of course, you can’t just sort on time, that would be too easy.  No, all the nodes are running with independent clocks; NTP only gets them so close in time to each other.  Instead I have to sort out a giant Happens-Before relationship amongst my packets.  I am helped (above and beyond some sort of home-brew Wireshark) by my application understanding its own packet structure.  I know certain packets must be strongly ordered in time, never mind what the clock says.  For example, I only send out an ACK for task#1273 strictly after I receive (and execute) task#1273.  Paxos voting protocols follow certain rules, etc, etc.
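One way to sketch that kind of sort (not necessarily how my tool does it) is a topological sort over the known happens-before pairs, falling back to the untrustworthy timestamps only to break ties among events that have no ordering constraint between them. The function name `order_events`, the event ids and the input shapes below are all assumptions made up for the example.

```python
import heapq
from collections import defaultdict


def order_events(events, happens_before):
    """Order events so that every (earlier, later) pair is respected.

    events:         dict of event id -> local timestamp (clocks may be skewed)
    happens_before: iterable of (earlier_id, later_id) pairs derived from
                    protocol knowledge, e.g. "send precedes receive" or
                    "task received precedes its ACK being sent"
    """
    successors = defaultdict(list)
    pending = {eid: 0 for eid in events}  # unmet predecessors per event
    for earlier, later in happens_before:
        successors[earlier].append(later)
        pending[later] += 1

    # Events with no unmet predecessors, cheapest timestamp first.
    ready = [(events[eid], eid) for eid, n in pending.items() if n == 0]
    heapq.heapify(ready)

    timeline = []
    while ready:
        _, eid = heapq.heappop(ready)
        timeline.append(eid)
        for nxt in successors[eid]:
            pending[nxt] -= 1
            if pending[nxt] == 0:
                heapq.heappush(ready, (events[nxt], nxt))

    if len(timeline) != len(events):
        raise ValueError("happens-before constraints contain a cycle")
    return timeline


# Node B's clock runs behind node A's, so a plain time sort gets it wrong.
events = {
    "A sends task#1273": 10.0,
    "B receives task#1273": 9.5,
    "B sends ACK#1273": 9.6,
    "A receives ACK#1273": 10.2,
}
constraints = [
    ("A sends task#1273", "B receives task#1273"),
    ("B receives task#1273", "B sends ACK#1273"),
    ("B sends ACK#1273", "A receives ACK#1273"),
]
timeline = order_events(events, constraints)
```

Sorting those four events by timestamp alone would put B's receive and ACK ahead of A's send; with the constraints applied, the send comes first.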
In the end, I build a very large mostly-correctly-ordered timeline of what was just going on, as seen by each Node itself, and then HTML’ify it and pop it up on the browser.  Voila!  There for all the world to see is the blow-by-blow confusion of what went wrong (and generally, the follow-on error “recovery” isn’t all that healthy, so more broken behavior follows hard on the heels of broken behavior).
Basically, I’m admitting I’m a tool-builder at heart.  As soon as I realized that standard debuggers don’t work in this kind of situation, and Wireshark couldn’t sort based on domain-specific info (and pretty-print the results, again using domain-specific smarts), I went into tool-building mode.  As of this blog, I’ve found several errors in my cloud setup already; e.g. a useless abort-and-restart of a Paxos vote if a heartbeat arrives mid-vote from an ex-cloud-member (that’s alive and well and wants to get back in the Cloud), and some infinite-chatter issues getting key replication settled out as nodes come and go.
On other fronts, my car came back from the body shop, only to turn around and go back to the engine shop: the timing belt had slipped.  The work was done under warranty and I’ll go pick up my car on Monday.  I can hardly wait!!!
My GF’s car’s brakes have been squealing for weeks; they finally started shuddering and we decided it was time to fix them.  She’s driving a 1993 Nissan Maxima with 220K miles on it; weird things start breaking at that age, but mostly the car just soldiers on.  But it was time for the brakes.  We pulled the rear pads & looked at the rotors: one of them was shot.  Fortunately a new rear rotor was only $25, plus another $22 for pads (tax, brake grease, still under $50).  We couldn’t get the dang pistons to move back! We tried at least 5 different wrench/jig/clamp combos to no avail.
We figured the pistons must have been jammed with debris, so with great trepidation we pulled the brake fluid line and the emergency brake cable, and pulled the whole unit to my workbench.  I popped the piston out manually.  It looked clean and good… and had this funny thing in the middle… stupid me, failed to check the internet again… it’s the anti-slip mechanism for the emergency brake.  You have to spin the piston to screw it back into the cylinder.  Sigh.  It took us another 1/2hr to find the right tool to spin the dang thing, but it finally went in without too much trouble.  After that it was another hour to reassemble all the parts, and then we had to bleed and bleed and bleed the line.  As of this writing, the pedal is still too soft; I suspect we need to bleed it some more.
Daughter is at the Old Salts Regatta, plus a ton of driving to meet people for 0xdata, plus a much-needed dinner out… and being down 2 cars (GF’s brakes-in-progress and my car in the shop) made for a very complicated week.
Cliff
