Construction of a geo-redundant data center

The Schwarz Group from Neckarsulm is an instructive example: the leading international retail group built a DR strategy around Continuous Data Protection (CDP) to protect its critical virtual infrastructure against regional disasters over a long distance, even though its headquarters lies in the immediate vicinity of a nuclear reactor.

Many companies have invested in high-availability solutions in recent years to keep their IT services available at all times. In practice, this is still mostly done via synchronous mirroring between two or more clusters no more than 50 kilometers apart. If a component or a location fails, the virtual infrastructure, and in particular the critical workloads within it, can be restarted at another location. In principle, this also works for disaster recovery, when special circumstances (such as natural disasters) cause a data center to fail partially or completely.

Geo-redundancy for an entire site is, technologically, the highest level of resilience within a BC/DR strategy. With it, an organization can protect its entire virtual infrastructure against regional or even national disasters and, in an emergency, restore it in a DR data center more than 200 kilometers away. What sounds simple can hardly be implemented with traditional technologies. The Schwarz Group from Neckarsulm recently put a new geo-redundant data center into service. The modern technologies used there can serve as a reference architecture.

The metro cluster is a potential “single point of failure”

The Schwarz Group has its headquarters in Neckarsulm, Baden-Württemberg, just twenty kilometers from the Neckarwestheim nuclear power plant. The company operates a metro cluster of multiple data centers around its headquarters to ensure high availability (HA) of its virtual infrastructure. The virtualized footprint consists of more than 40,000 VMs, 5,000 of which are considered mission-critical and provide global IT services for all facilities of the group. However, the geographic concentration of these production workloads made the structure a potential single point of failure.

An accident at the nuclear power plant would affect not only the company headquarters but potentially every data center in the metro cluster as well. A regional disaster, whether an earthquake, a flood or an accident at the nearby reactor, could therefore hit the entire company, including its 12,900 branches worldwide. To remove this risk, the Schwarz Group tasked its internal IT organization as early as 2015 with eliminating the metro cluster as a potential single point of failure.

Minimum distance between sites of 200 kilometers

For this purpose, a new geo-redundant DR data center was to be set up that could take over all critical workloads in the event of a regional disaster. The team searched for a suitable location within a radius of 400 kilometers, with the requirement that the new data center be at least 200 kilometers away from the metro cluster at the headquarters in Neckarsulm. This also corresponds to the recommendation of the BSI, Germany's Federal Office for Information Security, which increased the minimum distance between two geo-redundant data centers from just five to 200 kilometers at the beginning of 2020.

After weighing numerous options during the search and planning phase, Schwarz-IT decided to build a completely new DR data center that could take over the critical workloads in an emergency. It was built on the site of a decommissioned coal-fired power plant in Riedersbach near Salzburg. In addition to being far enough from Neckarsulm (approx. 300 kilometers as the crow flies), the location offered ideal conditions for a data center: a reliable power supply at low prices, direct access to cooling water and a fast fiber optic connection, which made two dedicated 40 Gb/s lines possible.
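The distance requirement can be checked with a great-circle calculation. The following is a minimal sketch, not the method used by Schwarz-IT: the coordinates are rough values assumed for illustration, the function names are hypothetical, and only the 200 and 400 kilometer limits come from the article.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def great_circle_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points (haversine formula)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Approximate coordinates, assumed for illustration (not from the article).
NECKARSULM = (49.19, 9.22)
RIEDERSBACH = (48.03, 12.85)

MIN_DISTANCE_KM = 200  # minimum separation named in the article
MAX_RADIUS_KM = 400    # search radius named in the article

def site_qualifies(primary, candidate):
    """Return (qualifies, distance_km) for a candidate DR site."""
    d = great_circle_km(*primary, *candidate)
    return MIN_DISTANCE_KM <= d <= MAX_RADIUS_KM, d

ok, distance = site_qualifies(NECKARSULM, RIEDERSBACH)
print(f"{distance:.0f} km as the crow flies, qualifies: {ok}")
```

With these assumed coordinates the result lands close to the roughly 300 kilometers quoted above. The actual fiber route will be longer than the straight-line distance.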

Distance requires switching to asynchronous replication and CDP

The new DR data center was the first important building block of the new DR strategy. The second was the choice of replication solution. Beyond roughly 50 kilometers, synchronous replication as used within the metro cluster is no longer practical because of latency, and certainly not at a distance of more than 300 kilometers. In principle, several technologies on the market can achieve geo-redundancy. However, Schwarz-IT quickly ruled out hardware-based solutions at the storage level, since replication was to take place at the hypervisor level.
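The latency argument can be made concrete with a rough lower bound. The sketch below assumes that light travels through optical fiber at about 200,000 km/s (roughly two thirds of its speed in a vacuum) and ignores switching equipment, protocol overhead and the fact that fiber routes are longer than the straight-line distance. The function name is hypothetical.

```python
# Assumption: signal speed in optical fiber is about 200,000 km/s.
FIBER_SPEED_KM_PER_S = 200_000.0

def min_round_trip_ms(distance_km):
    """Lower bound on the round-trip time over fiber, in milliseconds."""
    one_way_s = distance_km / FIBER_SPEED_KM_PER_S
    return 2 * one_way_s * 1000

for km in (50, 300):
    # A synchronous write is only acknowledged after the remote site has
    # confirmed it, so every write pays at least one round trip.
    print(f"{km:>3} km: at least {min_round_trip_ms(km):.1f} ms per write")
```

Under these assumptions the floor is about 0.5 ms at 50 kilometers and about 3 ms at 300 kilometers, added to every single write. Asynchronous replication avoids this penalty because the application does not wait for the remote acknowledgment.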

The site protection solution offered by the vendor of the virtualization platform in use also fell short of the requirements. Its approach of replicating classic snapshots was neither state of the art nor workable given the size of the environment and the limited bandwidth. To implement the new geo-redundant DR strategy as intended, Schwarz-IT decided on asynchronous replication in which blocks are replicated continuously instead of as snapshots at regular intervals. After careful consideration, the Schwarz Group chose a specialized software solution whose replication is based on Continuous Data Protection (CDP) and runs at the hypervisor level.

Streaming blocks via CDP is the most sensible method

Replicating individual blocks with a CDP engine turned out to be far more sensible than the classic, interval-based replication of snapshots. Snapshot-based replication is difficult to use for BC/DR even in smaller environments, and it was out of the question for an environment of the size the Schwarz Group needed to replicate. In fact, continuous streaming of blocks via a CDP engine was the only technologically viable way to replicate the Schwarz Group's very large replication delta over a distance of 300 kilometers.
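A back-of-the-envelope model illustrates the difference. It is a deliberately simplified sketch: the change rate and the snapshot interval are invented figures, the function names are hypothetical, and the model ignores compression and the fact that a snapshot delta only carries the final state of blocks that were overwritten several times. Only the two 40 Gb/s lines and the 5 to 10 second checkpoints are taken from the article.

```python
def snapshot_worst_case_rpo_s(interval_s, change_rate_gbps, link_gbps):
    """Worst-case RPO for interval-based snapshot replication.

    Data written right after a snapshot waits one full interval for the
    next snapshot, then for that snapshot's delta to cross the link.
    """
    if change_rate_gbps > link_gbps:
        return float("inf")  # the link never catches up
    transfer_s = interval_s * change_rate_gbps / link_gbps
    return interval_s + transfer_s

def streaming_worst_case_rpo_s(checkpoint_interval_s, change_rate_gbps, link_gbps):
    """Worst-case RPO for continuous block streaming with checkpoints."""
    if change_rate_gbps > link_gbps:
        return float("inf")
    return checkpoint_interval_s

LINK_GBPS = 2 * 40           # two dedicated 40 Gb/s lines (from the article)
CHANGE_RATE_GBPS = 20        # hypothetical sustained change rate
SNAPSHOT_INTERVAL_S = 900    # hypothetical 15-minute snapshot schedule
CHECKPOINT_INTERVAL_S = 10   # upper end of the 5-10 s checkpoints

snap_rpo = snapshot_worst_case_rpo_s(SNAPSHOT_INTERVAL_S, CHANGE_RATE_GBPS, LINK_GBPS)
stream_rpo = streaming_worst_case_rpo_s(CHECKPOINT_INTERVAL_S, CHANGE_RATE_GBPS, LINK_GBPS)
print(f"snapshots: {snap_rpo:.0f} s, streaming: {stream_rpo:.0f} s")
```

Snapshots also send their delta in bursts, whereas a stream spreads the same volume evenly over time, which makes better use of a limited link.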

The Schwarz Group tested its preferred solution itself in a proof of concept (PoC) and put the new setup into production at the end of 2019. Since then, the solution has provided disaster recovery for all of the group's critical VMs with very short RTOs and RPOs. The CDP engine writes a checkpoint every 5 to 10 seconds, so all VMs can be recovered with an RPO of just 5 to 10 seconds. The recoverability of individual VMs can also be tested on the platform with just a few mouse clicks, should the audit department require such a test.
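How checkpoints translate into an RPO can be shown with a small sketch. It does not reproduce the product's actual journal format, which the article does not describe: checkpoints are modeled as a sorted list of timestamps in seconds, and the function name and the sample values are hypothetical.

```python
from bisect import bisect_right

def latest_checkpoint(checkpoints, failure_time):
    """Return the newest checkpoint at or before failure_time, or None.

    `checkpoints` must be sorted in ascending order.
    """
    i = bisect_right(checkpoints, failure_time)
    return checkpoints[i - 1] if i else None

# Hypothetical journal: one checkpoint every 5 seconds for one minute.
checkpoints = list(range(0, 61, 5))
failure_time = 47.3  # hypothetical moment of the outage, in seconds

recovery_point = latest_checkpoint(checkpoints, failure_time)
data_loss_window = failure_time - recovery_point
print(f"recover to t={recovery_point} s, "
      f"losing about {data_loss_window:.1f} s of writes")
```

The data loss window can never exceed the checkpoint interval, which is why checkpoints every 5 to 10 seconds yield an RPO in the same range.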

Conclusion: CDP helps to eliminate the single point of failure

Using a CDP solution to protect its geo-redundant virtual infrastructure made the Schwarz Group's new DR strategy possible, because it offers hypervisor-level replication over more than 300 kilometers without latency problems. In this way, the Schwarz Group added another level of redundancy to its existing HA layer and eliminated the potential single point of failure. Since then, all production workloads have been even better protected than before, whether against a nuclear incident, an earthquake, a flood or even a pandemic with a regional or national curfew.
