On 27 August 2020, Deribit experienced a system outage starting at 5:15 UTC. The platform was brought back up for clients at 9:22 UTC and, after final checks and a 5-minute order cancellation period, trading resumed at 9:27 UTC.
Our platform uses redundant load balancers to connect to multiple nodes, which act as gateways to the platform and in turn connect to a single master node.
Yesterday, we experienced a hardware failure in this master node.
Deribit is also preparing a disaster recovery facility in Zurich (ZH4) to act as an immediate failover in the event that multiple nodes are impacted. As a first step, we migrated our test environment there earlier this week. Unfortunately, the setup was not yet ready to act as a backup for production trading, although it will be soon.
Remote (IPMI) connectivity to the master node did not function, and engineers present in the LD4 facility were not able to restart the master node. The next failover option was therefore to promote one of the regular nodes to become the new master. This was done successfully, after which trading could resume.
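As a rough illustration of the promotion step described above, the following minimal sketch picks a healthy regular node and marks it as the new master when the current master is down. It is not Deribit's actual tooling: the `Node` type, the `promote_new_master` function, the node names, and the boolean health flag are all hypothetical and exist only for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    # Hypothetical model of a platform node; not a real Deribit type.
    name: str
    healthy: bool
    is_master: bool = False


def promote_new_master(nodes: List[Node]) -> Optional[Node]:
    """Return the node acting as master, promoting a regular node if needed.

    Returns None when the master is down and no healthy node can replace it.
    """
    current = next((n for n in nodes if n.is_master), None)
    if current is not None and current.healthy:
        return current  # master is fine, nothing to do

    # Pick the first healthy regular node as the promotion candidate.
    candidate = next((n for n in nodes if n.healthy and not n.is_master), None)
    if candidate is None:
        return None  # no healthy node left to promote

    if current is not None:
        current.is_master = False  # demote the failed master
    candidate.is_master = True
    return candidate


# Example: the master has failed and two regular nodes are healthy.
nodes = [
    Node("master-1", healthy=False, is_master=True),
    Node("node-2", healthy=True),
    Node("node-3", healthy=True),
]
new_master = promote_new_master(nodes)
print(new_master.name if new_master else "no healthy node available")
```

In a real deployment the health signal, the choice of candidate, and the state handover to the new master would be considerably more involved; the sketch only shows the decision order.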
In the future, we should be able to fail over to the Zurich facility or, preferably, switch to one of the other nodes faster. Once we also have instant redundancy for the master, downtime would be negligible.