Mainnet upgrade, July 2019



Scale

Key to the network’s growth is its ability to accept an increasing number of Hosts without compromising the speed at which vital coordination components calculate and synchronise topology.

Initial builds of the network manifest distributor operated on geographically spread Consul servers, adjacent to the Stargate services. Whilst decentralised in the sense of having no single point of failure, these nodes required consensus on all operations, including health checks, key/value updates and service metadata. Put simply, if a Host were to tell its nearest Consul server that it was up and healthy, that data would need to be fully propagated before the write was considered successful. Operationally, this latency just doesn’t scale, so we started to see some failing health checks whilst telemetry painted a far more serene landscape.

One of the largest single changes in the upcoming release is a shift away from globally propagated service data in favour of a multi-datacenter approach, more akin to your typical VPC. The services continue to register with their local Consul server, operating alongside the nearest Stargate, but that’s where the synchronisation stops. Consul offers a little flexibility in the way it connects to its peers, so whilst a multi-datacenter setup does not share service data globally, it does permit access by proxy, something the network relies upon on rare but vital occasions.
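As a rough illustration of access by proxy, Consul's HTTP API accepts a `dc` query parameter, which asks the local server to forward a query to another datacenter. The sketch below only builds the URLs; the agent address, the service name `gateway` and the datacenter name `eu-west` are assumptions for illustration, not names taken from the network.

```python
from urllib.parse import quote, urlencode


def health_service_url(base_url, service, datacenter=None, passing_only=True):
    """Build the URL for Consul's health endpoint for one service.

    With `datacenter` set, the local Consul server forwards the query to
    that datacenter instead of answering from its own catalogue.
    """
    params = {}
    if datacenter:
        params["dc"] = datacenter          # cross-datacenter query
    if passing_only:
        params["passing"] = "true"         # only instances with passing checks
    url = f"{base_url.rstrip('/')}/v1/health/service/{quote(service, safe='')}"
    return f"{url}?{urlencode(params)}" if params else url


# Hypothetical local agent address and names, for illustration only.
LOCAL_AGENT = "http://127.0.0.1:8500"
local_query = health_service_url(LOCAL_AGENT, "gateway")
remote_query = health_service_url(LOCAL_AGENT, "gateway", datacenter="eu-west")
```

The everyday path is the local query; the remote form would only be needed on the rare occasions described above.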

Performance

When debugging the latency experienced by the single-datacenter configuration, the team discovered a number of possible performance improvements in the way service data is stored. Whilst the refactors were complex, the result can be summarised in two key areas:

Health

The first health checks consisted of periodic writes to a key/value directory on a per-device basis. A pruning process existed to remove services that exceeded the maximum TTL.

The latest iteration of the health check process uses Consul’s built-in health check methods, including standard gRPC health endpoints for Stargates and Gateways, and TTL checks for Hosts. This meant that we were able to completely remove the key/value data, which significantly reduced latency due to the way Consul propagates service health.
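The two check styles can be sketched as the JSON bodies a service might send to Consul's agent registration endpoint (`PUT /v1/agent/service/register`). This is a minimal sketch, not the network's actual registration code: the service names, IDs, address, port and intervals are hypothetical, and only the `GRPC`, `Interval` and `TTL` check fields come from Consul's API.

```python
def grpc_service_registration(name, address, port, interval="10s"):
    """Registration body for a service that Consul probes over gRPC.

    Consul calls the standard gRPC health endpoint at address:port every
    `interval`, so the service itself never writes its own health.
    """
    return {
        "Name": name,
        "Address": address,
        "Port": port,
        "Check": {"GRPC": f"{address}:{port}", "Interval": interval},
    }


def ttl_service_registration(name, service_id, ttl="30s"):
    """Registration body for a service that reports its own health.

    The check turns critical unless the service sends a heartbeat
    within each `ttl` window.
    """
    return {"Name": name, "ID": service_id, "Check": {"TTL": ttl}}


def ttl_heartbeat_path(check_id):
    """Agent API path a Host would PUT to in order to mark its TTL check as passing."""
    return f"/v1/agent/check/pass/{check_id}"


# Hypothetical values, for illustration only.
gateway_body = grpc_service_registration("gateway", "10.0.0.5", 8080)
host_body = ttl_service_registration("host", "host-42")
# Assumes Consul's default check ID of "service:<service ID>" for a
# check defined inline with its service.
heartbeat = ttl_heartbeat_path("service:host-42")
```

In both cases health lives in Consul's own check state rather than in a key/value directory, so no separate pruning process is needed.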

Metadata

Until recently, all devices wrote information such as their current connections and build digest to Consul key/value. Whilst this appeared to be a largely efficient method, benchmarking showed us that storing this data within the core Service metadata reduced the number of Consul requests made by other services. If, for example, a Host wanted to find the closest operational Gateway, it would first need to make a call for all services of type ‘Gateway’, then individually request the KV metadata of each Gateway before an informed preference could be made. By moving this data to the Service itself, what could be 1+n queries became a single query.
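The difference in request count can be shown with a small in-memory stand-in for Consul that simply counts calls. Everything here is hypothetical: the `FakeConsul` class, the key layout and the rule of preferring the Gateway with the fewest connections are assumptions made for the sketch, not the network's real selection logic.

```python
class FakeConsul:
    """In-memory stand-in for Consul that counts every request made."""

    def __init__(self, services, kv):
        self.services = services  # service type -> list of service dicts
        self.kv = kv              # key -> value, all values are strings
        self.requests = 0

    def list_services(self, service_type):
        self.requests += 1
        return self.services.get(service_type, [])

    def kv_get(self, key):
        self.requests += 1
        return self.kv[key]


def pick_gateway_via_kv(consul):
    """Old layout: one catalogue call, then one KV call per Gateway (1+n)."""
    gateways = consul.list_services("gateway")
    loads = {
        gw["id"]: int(consul.kv_get(f"gateway/{gw['id']}/connections"))
        for gw in gateways
    }
    return min(loads, key=loads.get)


def pick_gateway_via_meta(consul):
    """New layout: metadata arrives with the catalogue call (1 request)."""
    gateways = consul.list_services("gateway")
    return min(gateways, key=lambda gw: int(gw["meta"]["connections"]))["id"]


def demo_catalog():
    """Three hypothetical Gateways with the same data in both layouts."""
    loads = {"gw-a": "12", "gw-b": "3", "gw-c": "7"}
    services = {
        "gateway": [{"id": i, "meta": {"connections": n}} for i, n in loads.items()]
    }
    kv = {f"gateway/{i}/connections": n for i, n in loads.items()}
    return FakeConsul(services, kv)
```

Both functions choose the same Gateway from the same data; only the number of round trips to Consul differs.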

Security

ACL tokens were initially created to offer distinguishably different permissions to the three core services, with the Host application having the lowest level of trust and the lowest staking. The Host token had read-only access to non-sensitive information, and a small amount of write access to its own part of the network manifest, making it almost impossible to use it for evil.

A single ACL token for all Hosts does come with downsides, which we discovered when migrating to multi-DC on testnet. Firstly, Consul doesn’t allow ACL migration, which poses a risk when embedding a single token into the Host. Secondly, the ACL token for a single machine cannot be revoked should an account be deactivated.

The next release will include a refactored method for retrieving the ACL tokens used with Consul. This required the introduction of Vault, an accompanying service which extends Consul ACL with policy templates. These templates act as a blueprint for the onboarding mechanics ahead of self-onboarding.
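One way such a flow could look is sketched below, assuming Vault's Consul secrets engine mounted at its default path `consul`; whether the network uses this exact engine is an assumption. The role name `host`, the addresses and the token values are all hypothetical. The requests are built but never sent, and the sample response only mimics the general shape of a Vault reply.

```python
import json
from urllib.request import Request


def consul_creds_request(vault_addr, vault_token, role):
    """Request asking Vault to issue a fresh Consul ACL token for `role`."""
    url = f"{vault_addr.rstrip('/')}/v1/consul/creds/{role}"
    return Request(url, headers={"X-Vault-Token": vault_token}, method="GET")


def parse_consul_creds(body):
    """Pull the Consul token and its Vault lease ID out of a response body."""
    payload = json.loads(body)
    return payload["data"]["token"], payload["lease_id"]


def revoke_lease_request(vault_addr, vault_token, lease_id):
    """Request revoking one lease, which invalidates that machine's token only."""
    url = f"{vault_addr.rstrip('/')}/v1/sys/leases/revoke"
    data = json.dumps({"lease_id": lease_id}).encode("utf-8")
    headers = {"X-Vault-Token": vault_token, "Content-Type": "application/json"}
    return Request(url, data=data, headers=headers, method="PUT")


# Illustrative response body; the values are made up.
SAMPLE_RESPONSE = json.dumps(
    {"lease_id": "consul/creds/host/abc123", "data": {"token": "example-consul-token"}}
)

creds_req = consul_creds_request("https://vault.example:8200", "example-vault-token", "host")
consul_token, lease_id = parse_consul_creds(SAMPLE_RESPONSE)
revoke_req = revoke_lease_request("https://vault.example:8200", "example-vault-token", lease_id)
```

Because each machine would hold its own token under its own lease, a single deactivated account could be cut off without touching any other Host.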

What do you need to do?

If you’re lucky enough to have a Founding Node in operation, you’ll need to have it up and running between the 8th and the 12th of July. During this time we’ll be issuing a unique token to the machine before we switch off the old ACL and migrate the network over. Don’t worry if you currently have a support ticket open or your device is disconnected under our instruction – we will reach out to you individually after the update and get you back up and running.