Automatic failover of blockchain nodes is a tough problem, and failover of validator nodes even more so. There are several reasons for that:
- increased complexity (blockchain is already quite complicated);
- redundant nodes need additional servers, and additional servers cost money;
- only some projects work well with High Availability (HA), so operators are likely to ignore such options;
- automated failover usually requires a single point of truth, which can be difficult to achieve since blockchain networks are decentralized.
That said, automatic failover, if implemented, puts staking providers like Everstake in a stronger position, provided you have researched and can handle everything stated above (always DYOR).
For those of you interested in setting up an HA design for your Solana nodes, Everstake DevOps engineers have prepared a set of recommendations on brewing a relatively simple automatic failover based on their own professional experience.
For more general advice on ensuring continuous validation for your nodes, be sure to refer to our earlier entry.
How to Concoct Solana Validator HA
The basic requirements are relatively simple. Here’s what you’ll need:
- two servers, both acting as dumb worker nodes with secondary identities in the start configuration;
- one Ops team member on duty;
- one awake DevOps engineer (optional);
- proper monitoring.
If possible, the servers should be in different geographic locations and with different providers, as this decreases the chance of a total outage.
We advise against using etcd to store the tower, since any disruption of the connection with etcd will cause the node to terminate and restart (check your systemd unit restart policy). On top of that, you would need one more virtual machine to host it.
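If your validator runs under systemd, it is also worth double-checking how the unit restarts the process before relying on any external dependency. A quick check (the sol.service unit name is just an example):

# Show the unit's restart behaviour; Restart=always or on-failure will bring
# the process back up automatically after it terminates.
systemctl cat sol.service | grep -E '^(Restart|RestartSec)='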
Having ensured all those things, proceed with the actual steps.
1. Generate and use secondary identities. We suggest grinding them with a recognizable prefix, since this helps later on if you need to debug identity changes.
solana-keygen grind --starts-with sub:5
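grind writes each generated keypair as <PUBKEY>.json in the current directory. You can then verify the prefixes and keep one keypair per standby server under a stable name; the keys/ path and slave01.json file name below are illustrative and match the start config that follows:

# Print the public key of every ground keypair and keep the first one
# as the standby node's secondary identity.
for f in sub*.json; do solana-keygen pubkey "$f"; done
mkdir -p ./keys
mv "$(ls sub*.json | head -n 1)" ./keys/slave01.json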
2. Prepare the start config for the secondary node as shown below:
--identity ./keys/slave01.json
--vote-account 9QU2QSxhb24FUX3Tu2FpczXjpK3VYrvRudywSZaM29mF
--authorized-voter ./keys/validator-keypair.json
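In context, these flags might sit in a start script along the lines of the sketch below; the ledger path and the remaining flags are illustrative and trimmed for brevity:

#!/usr/bin/env bash
# Hot-standby start script: the node boots with its own secondary identity but is
# authorized to vote with the primary keypair, so a later set-identity call can
# promote it without a restart.
exec solana-validator \
  --identity ./keys/slave01.json \
  --vote-account 9QU2QSxhb24FUX3Tu2FpczXjpK3VYrvRudywSZaM29mF \
  --authorized-voter ./keys/validator-keypair.json \
  --ledger /mnt/ledger \
  --limit-ledger-size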
The failover design per se is fairly simple. If you (or your monitoring) detect that the current primary validator node is misbehaving, unreachable, or has been delinquent for several minutes or more, just trigger the manual failover by assigning the primary identity to your hot-standby node as shown below:
solana-validator set-identity validator-keypair.json
Once you do that, the node will start voting as the primary. Still, we strongly recommend making sure that your previous validator node is stopped or at least restarted; otherwise, corner cases may occur.
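The whole manual sequence can be wrapped into a small script run on the standby node. The sketch below assumes the old primary is reachable over SSH as primary-host and that its validator runs as the sol.service systemd unit; both names are illustrative:

#!/usr/bin/env bash
set -euo pipefail

# 1. If the old primary is still reachable, stop it first so that two nodes
#    never vote with the primary identity at the same time.
ssh primary-host 'sudo systemctl stop sol.service' \
  || echo "old primary unreachable, proceeding anyway"

# 2. Promote this hot-standby node to the primary identity.
solana-validator set-identity ./keys/validator-keypair.json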
You can detect the identity change in the log file:
WARN solana_validator::admin_rpc_service] Identity set to EvnRmnMrd69kFdbLMxWkTn1icZ7DCceRhvmb2SJXqDo4
WARN solana_core::replay_stage] Identity changed from sLv1Y83JdyBCHUFuezKUgT5MNENZHSBNLyqiGDLMRLz to EvnRmnMrd69kFdbLMxWkTn1icZ7DCceRhvmb2SJXqDo4
You can also observe identity changes on the node by continuously running the following command:
solana-validator monitor
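If the node logs to the systemd journal, the same identity-change warnings can be followed there as well (again, the unit name is just an example):

# Follow the validator log and show only identity-related messages.
journalctl -u sol.service -f | grep -E 'Identity (set|changed)'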
Note that the Solana documentation puts it as follows: “It is not necessary to guarantee the primary validator has halted before failing over to the secondary, as the failover process will prevent the primary validator from voting and producing blocks even if it is in an unknown state.”
That said, if you need to perform a routine upgrade sequence, there’s a catch. Since transitioning between servers when the validator is about to produce a block can lead to skipping blocks, we recommend finding a maintenance window before triggering the failover. Here’s how you can do it:
solana-validator wait-for-restart-window --min-idle-time 10 --identity validator-keypair.json \
&& solana-validator set-identity validator-keypair.json
Also, consider transferring the tower file to the new node on transition, as shown below. The suggested design is also helpful for upgrading your validator node with virtually no downtime: upgrade the standby server first, then migrate the primary identity to it.
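A sketch of such a hand-off is shown below. It assumes the ledger lives in /mnt/ledger on both machines and the standby is reachable as standby-host (both names are illustrative); the --require-tower flag, available in recent releases, makes set-identity refuse to proceed if the saved tower state is missing:

# On the old primary: copy the saved tower state to the standby
# (the file name may vary slightly between releases).
scp /mnt/ledger/tower-1_9-*.bin standby-host:/mnt/ledger/

# On the standby: take over the primary identity only if the tower arrived.
solana-validator --ledger /mnt/ledger set-identity --require-tower validator-keypair.json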
The chart below shows the number of delinquent nodes over time, suggesting that having an HA design in place is quite important.
In particular, out of 1,900+ validators, the mean number of delinquent nodes within a seven-day timeframe is about 80. Since the network has to account for the stake delegated to those nodes in order to reach consensus, the corresponding delinquent stake share is about 0.46%.
On a final note, be sure to check out the official documentation before proceeding with any actual operations. We also highly recommend that you run proper tests on the testnet first.
Everstake has a dedicated Ops team that monitors the infrastructure and services 24/7/365, in addition to our professional DevOps team that manages blockchain services. We maintain backup hot-standby servers for our validators and have researched and rolled out a new key migration scheme to further minimize downtime. Check out our Twitter for the latest updates, and peruse our blog for further information, news, and best industry practices.