2018/07/28 - Incident Report - Witness + Failover Crash

in #witness-update · 6 years ago (edited)

I just wanted to take a few moments now that the smoke is clearing to highlight what happened and why my witness missed so many blocks today. It was a combination of events that caused both my primary witness to crash and the failover to not react accordingly.

TLDR - It was my fault and I'm taking steps to prevent this in the future.

My primary witness node hit a segfault while running 0.19.10 at around 7am this morning, stopping the server (exception available here). My failover script was using http://wallet.steem.ws, a full node that had also failed recently and was in the middle of a replay. Requests to wallet.steem.ws were being proxied out to a secondary set of servers while the primary replayed, to avoid any interruptions in service.

So why didn't the failover trigger an account update and swap to one of the other 2 backup producer nodes I had running? Because my failover Python script's configuration pointed at http://wallet.steem.ws rather than https, and the upstream providers I had fallen back on during my node's replay now required https. The script wasn't just failing to run: since it couldn't even initialize (which requires fetching the current witness status), it never reached the point where it could raise an alarm. So the script was just infinitely restarting and trying to reinitialize.
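The silent restart loop described above can be avoided by bounding the startup retries and alerting when they run out. What follows is only a minimal sketch, not the actual failover script: `initialize_with_alert`, `fetch_status`, and `alert` are hypothetical names, and the real status call and alert channel are assumed to be passed in by the caller.

```python
import time
from typing import Callable, Optional


def initialize_with_alert(
    fetch_status: Callable[[], dict],
    alert: Callable[[str], None],
    max_attempts: int = 3,
    delay_seconds: float = 0.0,
) -> Optional[dict]:
    """Fetch the initial witness status, alerting if every attempt fails.

    fetch_status and alert are hypothetical callables supplied by the
    caller (for example an RPC request and a pager/SMS hook).
    """
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        try:
            # Success: hand the status back so monitoring can begin.
            return fetch_status()
        except Exception as exc:
            # Any startup failure (refused connection, bad scheme, ...)
            # is remembered and retried after a short pause.
            last_error = exc
            time.sleep(delay_seconds)
    # Retries exhausted: raise an alarm instead of restarting silently.
    alert(
        f"failover script could not initialize after "
        f"{max_attempts} attempts: {last_error!r}"
    )
    return None
```

The point of the design is that a failure to start is treated as an alertable event in its own right, rather than something that only shows up once monitoring is already running.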

So - the takeaways, and the changes needed to prevent this from happening again, are:

  • Rework the failover script to alert if the script can't even start and/or connect to any rpc node.
  • Rework the failover script to run in multiple locations against multiple different rpc nodes.
  • Rework the rpc upstream failover protocol/pool to force route http -> https traffic without errors.
  • Ensure that all future monitoring scripts use https by default.
  • Potentially force a 301 redirect on http to https traffic, provided that doesn't mess with the RPC connections.
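A couple of the items above (checking against several RPC nodes, defaulting to https, and alerting when nothing is reachable) could look roughly like the sketch below. It is an illustration under assumptions, not the reworked script: `force_https`, `first_working_node`, `probe`, and `alert` are hypothetical names, and any node URL other than wallet.steem.ws is a placeholder.

```python
from typing import Callable, Iterable, Optional, Tuple
from urllib.parse import urlsplit, urlunsplit


def force_https(url: str) -> str:
    """Rewrite an http:// URL to https://, leaving other URLs unchanged."""
    parts = urlsplit(url)
    if parts.scheme == "http":
        parts = parts._replace(scheme="https")
    return urlunsplit(parts)


def first_working_node(
    nodes: Iterable[str],
    probe: Callable[[str], dict],
    alert: Callable[[str], None],
) -> Optional[Tuple[str, dict]]:
    """Return (url, result) for the first node that answers the probe.

    probe is a hypothetical callable that performs one RPC request
    against the given URL and raises on failure. If no node answers,
    alert is called once with every error collected along the way.
    """
    errors = []
    for node in nodes:
        url = force_https(node)
        try:
            return url, probe(url)
        except Exception as exc:
            errors.append(f"{url}: {exc!r}")
    alert("no RPC node reachable: " + "; ".join(errors))
    return None
```

For example, `force_https("http://wallet.steem.ws")` yields `"https://wallet.steem.ws"`. Running the same check from multiple locations would mean deploying this on more than one host, which is outside what the sketch covers.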

The worst part was that this occurred at about 0700 UTC - which is right around when I fell asleep for the night. With no alarms going off and no indication there was a problem (besides the slack/discord messages, which weren't alerts), the outage was able to go on for nearly 7 hours, all while 2 other block production nodes stood idle, but ready, to assume their responsibilities.

Hopefully others can learn from my mistakes! Either that or they're going to end up having a rough Saturday like I did.


It happens! Sounds like you're on top of getting a more robust witness setup in response.

That's always the goal. As frustrating as it can be in the moment of a cascading failure like this, learning from those mistakes should ultimately be the goal!

BTW, the external link for @jesta listed on the Witness page is broken.

Should be fixed now, I apparently forgot to move my blog when I swapped servers /facepalm

Hi @jesta, I found you on the witness page but I'm not sure what this actually is. Can you please explain? Anyway, I'm glad I found you. I just looked through your profile and I quite like that you are helping newcomers. Keep it up.

Testing you, hidden unforeseen errors. Good to hear you caught up on sleep, not great waking up to this.

While you fought code, I wrangled with my camera watching the blood moon....

Thanks for letting us know what problems Witnesses run into, it is interesting to learn.

i like it

Dear @jesta (or should I say aaroncox), as you might (or might not) have heard "we" (too long to explain here, assume "a bunch of nutty guys") are in the process of customizing the steem blockchain code in order to use it in an institutional project, called European Financial Transparency Gateway or EFTG in short.

We are looking to reuse the code of steemdb next to it. steemdb is open source but what we found in github is 2 years old and seems not to have been updated since.

I was wondering whether we could find some kind of arrangement to reuse the latest version of the steemdb code in our project?

If you agree to explore this further, you can also reach me on Discord (@sorin.cristescu#0699)

Thanks in advance
Best regards
Sorin

The code itself isn't something I'm actively maintaining at the moment, it's in dire need of rebuilding, and will need it even more so after HF20.

If you're not looking to adopt HF20 for the fork, what exists on Github might be a good starting place for the code, but it's very likely it won't receive any further updates until after HF20. If you are planning on adopting HF20 in your fork, I'd recommend waiting for new explorer code or perhaps as the HF enters testnets, trying your own hand at it :)

Highly rEfiled.


Shouldn't you and the other top witnesses be setting price feed bias percentages, with the SBD Debt Ratio at 6+%? There is too much steem being printed.

Or is it that you and the witnesses know this and play dumb? Because at 10% the SBD floor will be gone, and a bail-in is just what you guys want, so that users and the community can be pickpocketed?

Who else is going to pay to bring that 15+ million SBD Debt down? Who is going to burn it?

Looking forward to your answer.

Greets,
I would like to know if your PHP repo on github is complete and up to date.
I would like to use it to bring steem to my website, which might include voting and such.
Thank you

Also, why didn't you include powerup in Vessel? :) Just to know

The PHP repo is not complete or up to date, it was a project that really didn't get much traction back when I was working on it.

As for powerup in Vessel, it's just a feature that I never got around to :D

ok, NP and thank you for your reply

Thanks a lot for being such a good witness. I updated my votes, which I do every 2-3 months, and I'm still voting for you as a witness. I also included you in this post to introduce you to a very good steemian: One of the best steemians