As alluded to in my last blog, here’s my fun hack du jour: “What’s going on?”
I’ve got a multi-node setup with UDP packets slinging back and forth, and each node itself is a multi-cpu machine. UDP packets are sliding by one another, or getting dropped on the floor, or otherwise confused. I’m in a twisty maze of UDP packets all alike (yes, I played the game back in the day). Then something crashes, and pretty quickly the network is filled with damage-control packets, repair & retry packets, more infinite millions of mirror reflection packets. What just happened? I press my handy little button and…
… a broadcast of “dump, ship and die” hits the wires (a few extra times for good measure). All my busy Nodes stop their endless chatter and dump the last several seconds of packets towards my laptop, slowly & reliably, via TCP. Each node has been gathering all the packets it sent or received (well, the first 16 bytes of each) in a giant ring-buffer, along with time-stamp info and the other party involved. After every node has shipped all this data to the one poor victim (the one I pressed my button on), every other node dies (to prevent further damage).
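For the curious, the per-node recorder can be sketched in a few lines of Python. This is only an illustration of the idea, not my actual code: the names PacketRecord, PacketRing, record and dump are made up for this sketch, and the capacity is whatever covers "the last several seconds" on your network.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class PacketRecord:
    timestamp: float  # local clock reading, in seconds
    peer: str         # the other party involved
    sent: bool        # True if this node sent it, False if it received it
    head: bytes       # only the first 16 bytes of the packet are kept


class PacketRing:
    """Fixed-size ring buffer: new records overwrite the oldest ones."""

    def __init__(self, capacity):
        self._slots = [None] * capacity
        self._written = 0  # total number of records ever written

    def record(self, peer, sent, payload, now=None):
        ts = time.time() if now is None else now
        slot = self._written % len(self._slots)
        self._slots[slot] = PacketRecord(ts, peer, sent, bytes(payload[:16]))
        self._written += 1

    def dump(self):
        """Return the surviving records, oldest first."""
        n = len(self._slots)
        start = max(0, self._written - n)
        return [self._slots[i % n] for i in range(start, self._written)]
```

Recording is a constant-time overwrite, so it stays cheap on the hot path; all the expensive work (shipping and sorting) is deferred until the button gets pressed.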
The Last Survivor gathers up a bunch of very large UDP packet dumps and starts sorting them. Of course, you can’t just sort on time; that would be too easy. No, all the nodes are running with independent clocks; NTP only gets them so close in time to each other. Instead I have to sort out a giant Happens-Before relationship amongst my packets. I am helped (above and beyond some sort of home-brew Wireshark) by my application understanding its own packet structure. I know certain packets must be strongly ordered in time, never mind what the clock says. For example, I only send out an ACK for task#1273 strictly after I receive (and execute) task#1273. Paxos voting protocols follow certain rules, etc, etc.
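One way to do this kind of sort is a topological sort over the "must come before" rules, using the untrustworthy clocks only to break ties. The Python below is a sketch under my own assumptions, not the real tool: order_events is a made-up name, the rules arrive as plain (earlier, later) pairs of event ids, and I assume each node's own clock is at least consistent with itself, so events on one node keep their local order.

```python
import heapq
from collections import defaultdict


def order_events(events, must_precede):
    """Build a timeline that respects happens-before rules.

    events:       dict of event id -> (node name, local timestamp)
    must_precede: iterable of (earlier id, later id) pairs taken from
                  protocol knowledge, e.g. (send of task, receive of task)
    """
    successors = defaultdict(set)
    for earlier, later in must_precede:
        successors[earlier].add(later)

    # Assumption: within one node, the local clock orders events correctly.
    by_node = defaultdict(list)
    for eid, (node, ts) in events.items():
        by_node[node].append((ts, eid))
    for records in by_node.values():
        records.sort()
        for (_, a), (_, b) in zip(records, records[1:]):
            successors[a].add(b)

    indegree = {eid: 0 for eid in events}
    for succs in successors.values():
        for later in succs:
            indegree[later] += 1

    # Kahn's algorithm; among events that are free to go next,
    # pick the one with the smallest (untrusted) timestamp.
    ready = [(events[eid][1], eid) for eid, d in indegree.items() if d == 0]
    heapq.heapify(ready)
    timeline = []
    while ready:
        _, eid = heapq.heappop(ready)
        timeline.append(eid)
        for later in successors[eid]:
            indegree[later] -= 1
            if indegree[later] == 0:
                heapq.heappush(ready, (events[later][1], later))

    if len(timeline) != len(events):
        raise ValueError("ordering rules contradict each other")
    return timeline
```

For example, if node B's clock runs half a second behind node A's, a plain sort on time would show B receiving a task before A sent it; feeding in the pair (A's send, B's receive) puts them back in the right order. A contradiction in the rules shows up as a ValueError, which is itself a useful hint that something in the dump is wrong.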
In the end, I build a very large mostly-correctly-ordered timeline of what was just going on, as seen by each Node itself, and then HTML’ify it and pop it up on the browser. Voila! There for all the world to see is the blow-by-blow confusion of what went wrong (and generally, the follow-on error “recovery” isn’t all that healthy, so more broken behavior follows hard on the heels of broken behavior).
Basically, I’m admitting I’m a tool-builder at heart. As soon as I realized that standard debuggers don’t work in this kind of situation, and Wireshark couldn’t sort based on domain-specific info (and pretty-print the results, again using domain-specific smarts), I went into tool-building mode. As of this blog, I’ve found several errors in my cloud setup already; e.g. a useless abort-and-restart of a Paxos vote if a heartbeat arrives mid-vote from an ex-cloud-member (that’s alive and well and wants to get back in the Cloud), and some infinite-chatter issues getting key replication settled out as nodes come and go.
On other fronts, my car came back from the body shop, only to turn around and go back to the engine shop: the timing belt had slipped. The work was done under warranty and I’ll go pick up my car on Monday. I can hardly wait!!!
My GF’s car’s brakes have been squealing for weeks; they finally started shuddering and we decided it was time to fix them. She’s driving a 1993 Nissan Maxima with 220K miles on it; weird things start breaking at that age, but mostly the car just soldiers on. But it was time for the brakes. We pulled the rear pads & looked at the rotors: one of them was shot. Fortunately a new rear rotor was only $25, plus another $22 for pads (with tax and brake grease, still under $50). We couldn’t get the dang pistons to move back! We tried at least 5 different wrench/jig/clamp combos to no avail.
We figured the pistons must have been jammed with debris, so with great trepidation we pulled the brake fluid line and the emergency brake cable, and pulled the whole unit to my workbench. I popped the piston out manually. It looked clean and good… and had this funny thing in the middle… stupid me, failed to check the internet again… it’s the anti-slip mechanism for the emergency brake. You have to spin the piston to screw it back into the cylinder. Sigh. It took us another half hour to find the right tool to spin the dang thing, but it finally went in without too much trouble. After that it was another hour to reassemble all the parts, and then we had to bleed and bleed and bleed the line. As of this writing, the pedal is still too soft; I suspect we need to bleed it some more.
Daughter is at the Old Salts Regatta, plus a ton of driving to meet people for 0xdata, plus a much needed dinner out… and down 2 cars (GF’s brakes-in-progress and my car in the shop), made for a very complicated week.