Lessons learned from failing as a witness

in #witness-update, 8 years ago (edited)

About a week ago, my witness node crashed abruptly. No errors, no trace. My guess is that the steemd process was killed by the kernel when the system ran out of RAM (32GB setup).

My setup had been a mess. A manually configured server, and a few scripts running on my PC (feed, killswitch). I had no backup node.

To make matters worse, when my node crashed, my kill-switch wasn't running and I was AFK, so I missed 90 blocks. I tried rebuilding the node, but steemd failed on the first attempt. I had to re-sync twice, which caused a day-long outage for my witness.

My negligence got me kicked out of the witness pool.

Never give up

At first, I was going to give up, and quit as a witness. But the thought of that alone made me really sad. Being a witness is part of my Steemit identity, and I've decided to turn my failure into an opportunity to create an awesome setup.

The setup

I now have a 4-server cluster, in 2 datacenters (Paris and Amsterdam).
1.) Witness node A, running the latest version of steemd (e.g. 0.19.0.rc1)
2.) Witness node B, running a known stable version of steemd (e.g. 0.18.2)
3.) Conductor node, publishing price feeds and handling failovers.
4.) A public seed node (seed.steemdata.com:2001)

Witness Nodes

The witness nodes are quad-core Xeons, with 32GB RAM each, and 2 SSDs.
I am hosting steemd in Docker, and its data volume is mounted on the second, larger SSD.
The first SSD also has a 16GB swap file.

The node setup is still manual, and fairly arduous, but fortunately it only has to be done once per server.

Management

Managing a witness can be messy, which is why I developed a neat command line app (conductor).

Generating Keys

I will begin by generating signing keys.
I will generate a new signing key every time I deploy a new node, both for security and to prevent double-signing.

conductor key-gen

keys.png

Here I generate 2 key-pairs. Each witness node gets its own private key. This is very important to get right, because having the same signing key on more than one node could lead to double-signing, which is a quick way to get into serious trouble.
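A quick sanity check along these lines can catch a reused key before deployment. This is only a sketch, not part of conductor: the `find_shared_keys` helper and the inventory dictionary are hypothetical, and the key names are placeholders rather than real keys.

```python
from collections import defaultdict


def find_shared_keys(node_keys):
    """Return {public_key: [node names]} for every key assigned to more than one node."""
    nodes_by_key = defaultdict(list)
    for node, key in node_keys.items():
        nodes_by_key[key].append(node)
    return {key: sorted(nodes) for key, nodes in nodes_by_key.items() if len(nodes) > 1}


# Hypothetical inventory of node name -> public signing key (placeholders only).
inventory = {"witness-a": "SK-A", "witness-b": "SK-B"}

duplicates = find_shared_keys(inventory)
if duplicates:
    # Two nodes signing with the same key risks double-signing, so stop here.
    raise SystemExit(f"signing key reused across nodes: {duplicates}")
print("each node has its own signing key")
```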

Setting up automated failover

Once both witness nodes are up and running, it's time to prepare the failover plan.

If my main witness node A is signing with key SK-A, and my backup witness node B is signing with key SK-B, I will enable my witness with SK-A, and make it fail over to SK-B.

conductor kill-switch --second-key <SK-B>

or more concretely (see screenshot above):

conductor kill-switch --second-key STM5kZnvU8R1zqSn6yg6ERiGfffDupxznRureyJrLQfSp6QcKsTUk

Here is what happens in the event of a node failure:
1.) Witness node A fails. Once 10 blocks are missed, kill-switch will change the signing node to backup node B, by assigning its public key (SK-B).
2.) If witness node B fails as well, my witness will be automatically disabled to avoid missing more blocks.
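The two steps above amount to a small state machine. Here is a minimal sketch of that decision logic, not conductor's actual implementation: `KillSwitchState` and `decide` are hypothetical names, and I am assuming the same 10-block threshold applies to the backup node, which the steps above do not state.

```python
from dataclasses import dataclass
from typing import Optional

MISSED_BLOCK_THRESHOLD = 10  # from the post: fail over once 10 blocks are missed


@dataclass
class KillSwitchState:
    baseline_missed: int        # total missed blocks when the current key took over
    backup_key: Optional[str]   # None once the backup key has already been used


def decide(state, total_missed):
    """Return (action, new_state) given the witness's current total of missed blocks."""
    missed_since = total_missed - state.baseline_missed
    if missed_since < MISSED_BLOCK_THRESHOLD:
        return "keep", state
    if state.backup_key is not None:
        # First failure: hand signing over to the backup node's key.
        return f"switch:{state.backup_key}", KillSwitchState(total_missed, None)
    # Assumption: the backup gets the same threshold before the witness is disabled.
    return "disable", state


# Simulated run: the witness had 100 missed blocks when monitoring started.
state = KillSwitchState(baseline_missed=100, backup_key="SK-B")
for total in (105, 110, 115, 120):
    action, state = decide(state, total)
    print(total, action)
```

Keeping the decision a pure function of the missed-block counter makes it easy to test without touching the chain; publishing the actual witness update would sit outside this function.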

Enabling the witness

Now that the failover is configured, it's time to enable the witness.
This is done with the conductor enable command, passing it the primary node's key (SK-A).

conductor enable <SK-A>

or more concretely (see screenshot above):

conductor enable STM5yjjAg4ApWoMHLTHbAss7EFC3LxyiEoAcivaTNKBtpSaD5WtyJ

Running the price feeds

The last component of running a witness is providing an accurate, reliable feed.

I've built a feed publishing tool into conductor as well:

conductor feed

You can see more options for feeds here.

Thank you

Thank you for supporting my witness. It means a lot to me. Any suggestions for further improvements are welcome.


Hey. Some time ago I was actually thinking about a way to help witnesses and other Steem app developers get informed whenever there are issues with their servers (downtime, functionality, etc.) while they aren't online. It would involve a phone call with a prerecorded message being placed to the owner of the witness/app, and the trigger would come from one of the other witnesses to avoid Sybil abuse.
What I thought of obviously preserves privacy and ensures a transparent flow to trigger a call to the affected owner. It would also be offered for free.
As much as we'd want, you can't be on 24/7 or even forecast issues.

Would something like this be of interest to you as a witness?

A Twilio SMS would be preferable to a phone call. Also, if it's something running on the server, it has to be open-source.

I've implemented something like this for my trading platform tymoraPRO that gives alerts and warnings whenever any network line, service, or datafeed goes down or doesn't check in within the designated amount of time. At that point, my server-monitor app sends an alert to prowlapp that immediately pops up on my mobile device (better than SMS, since you can include more information if necessary, and without potential SMS fees either). Of course, you could also easily link Twilio as well (and/or); as I recall, the Twilio API is pretty similar and straightforward.
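The "doesn't check in within the designated amount of time" part can be sketched roughly as below. This is not the tymoraPRO code; `HeartbeatMonitor` is a hypothetical class, and the `notify` callback is a stand-in for whatever push or SMS sender you plug in.

```python
import time


class HeartbeatMonitor:
    """Alert once for each service that stops checking in within the timeout."""

    def __init__(self, timeout_seconds, notify):
        self.timeout = timeout_seconds
        self.notify = notify      # callable taking a message string (push, SMS, ...)
        self.last_seen = {}       # service name -> time of last check-in
        self.alerted = set()      # services already reported as down

    def check_in(self, name, now=None):
        self.last_seen[name] = time.time() if now is None else now
        self.alerted.discard(name)  # service is back, so re-arm the alert

    def sweep(self, now=None):
        now = time.time() if now is None else now
        for name, seen in self.last_seen.items():
            if now - seen > self.timeout and name not in self.alerted:
                self.alerted.add(name)
                self.notify(f"{name} has not checked in for {int(now - seen)}s")


# Simulated timeline with explicit timestamps instead of the wall clock.
messages = []
monitor = HeartbeatMonitor(timeout_seconds=60, notify=messages.append)
monitor.check_in("witness-a", now=0)
monitor.sweep(now=30)    # within the window, no alert
monitor.sweep(now=90)    # overdue, one alert
monitor.sweep(now=120)   # still down, but already alerted
print(messages)
```

Alerting only once per outage keeps the phone from buzzing on every sweep; the next check-in re-arms it.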

https://www.prowlapp.com/
API: https://www.prowlapp.com/api.php

If you really want to go all the way with this, here are a few other open source projects that may already do most of the work for you and provide potentially much more robust (albeit more weighty) solutions as well:

libraries:
https://github.com/uniqush/uniqush-push
https://github.com/jreese/znc-push

complete systems:
https://github.com/huginn/huginn
https://github.com/Netflix/Hystrix
https://github.com/OpenNMS/opennms

If you need any help setting something up, let me know, I'd be happy to assist where I can.

This is a great idea, and some of the witnesses might want to implement it. The idea I had in mind had zero tech overhead for their servers, as it would rely on other humans triggering the alert through the Steem blockchain.

That aspect could relatively easily be incorporated as well, though you're still talking some tech overhead. You still need the monitor app or script that triggers the SMS, even if it's triggered by humans. But you do bring up an interesting point. Technically, the same app could monitor as many witnesses who'd also like their servers monitored for either discrepancies or outages, etc. as well.

It would definitely be open source and fully transparent in its process. SMS instead of a phone call is also an option. The idea behind a phone call is that you will generally sleep through a text, since it only buzzes the way an incoming email does.
Also, the idea of a phone call is that you can get a burner number, register that one with my app, and get it redirected to your real number without exposing that one to anyone.

It also depends on everyone's priorities; maybe someone doesn't want to be awakened by a particular app being down :) and would prefer to see it when they wake up.
OK. I'll think about it and work out a plan then.

Hi @furion

What are the qualifications to become a witness? I think I have the skill set to tackle this challenge; consider it another chapter of my IT life :)

Let me know sir.

Thank you,
@Yehey

It was me, i hax0red ur steem node, I couldnt help myself

here is a screenshot of my setup

and here is how i got into your Steemit server

Thanks for your honest account as a witness. We know the hard work all the witnesses put in to growing the community. It's great that you're learning and growing. I'm glad that you didn't give up and wish you the best of luck!

Are you logging both RAM and CPU every minute to a text file via cron? This would help you see how your server was doing up to a minute before the crash.
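For what it's worth, a minimal sketch of such a logger could look like this. It assumes a Linux host where /proc/meminfo exposes MemAvailable (kernel 3.14 or newer); the function names and the log format are my own choices for illustration.

```python
import os
import sys
import time


def parse_meminfo(text):
    """Parse /proc/meminfo content into {field: integer value} (kB for memory fields)."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def format_sample(meminfo, load1, timestamp):
    """Build one log line: UTC timestamp, percent of RAM in use, 1-minute load average."""
    total = meminfo["MemTotal"]
    available = meminfo["MemAvailable"]
    used_pct = 100.0 * (total - available) / total
    stamp = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(timestamp))
    return f"{stamp} mem_used_pct={used_pct:.1f} load1={load1:.2f}"


def main(log_path):
    with open("/proc/meminfo") as handle:
        meminfo = parse_meminfo(handle.read())
    load1 = os.getloadavg()[0]  # load average as a rough CPU pressure signal
    with open(log_path, "a") as log:
        log.write(format_sample(meminfo, load1, time.time()) + "\n")


# Only runs when a log path is given, e.g. from a cron entry.
if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```

A crontab entry such as `* * * * * /usr/bin/python3 /opt/monitor/log_resources.py /var/log/resources.log` would append one line per minute; both paths there are hypothetical.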

Also if you know what block you crashed at, you could go looking for it, to see what it contained.

Good idea, I should set up some monitoring/logging.

I love conductor @furion, started playing with it yesterday. And you fixed a bug I encountered within hours of reporting it!

<3

Sounds like many database admins I know - this happens on the backend frequently. Your new setup is very similar to an AG group and that will do well in time. You experienced failure, you learned, and you're bouncing back. That's winning in my book.

I really have no idea what you are talking about :-) but it is great that you got back up instead of giving up.

Yeah, this.
I was just reading on, pretending to myself that I understood, until your comment hit me in the face.

i appreciate your candor.

Onward ever forward as one endeavors to persevere my friend @furion and keep up the good work

That's the spirit! :-)
With an honest attitude and hard work, big mistakes magically turn into extra valuable experience....