Everything you always wanted to know about data center upgrades (but were afraid to ask)

In the week of March 4th to 8th, 2024, a task force was dispatched from Munich to refurbish Retarus’ SEC1 data center in Secaucus, New Jersey, upgrading it to state-of-the-art EVPN/VXLAN network technology based on the Arista 7050X3 and 7280R3 series as well as installing fiber-optic cabling. All of this during ongoing operations. And best of all: Our customers didn’t even notice it happening.

This kind of “open-heart surgery” of course calls for meticulous planning and precision. Uwe Geuss, our Director Operations and a key member of the task force, has written up the following report:

EVPN/VXLAN – Exchanging switches during ongoing data center operations

Switches are the beating heart of every data center, as they direct data traffic and ensure that information flows smoothly between the various components. Upgrading this infrastructure is of critical importance when it comes to keeping pace with continually advancing technologies and maximizing data center performance.

Why are we replacing our network infrastructure?

The rapid evolution of technologies, escalating network performance requirements and a demand for higher capacity are just a few of the reasons we decided to upgrade our switch infrastructure. The benefits of employing cutting-edge, higher-performance Arista switches and EVPN/VXLAN technology include higher bandwidth, lower latency and much greater flexibility.

7050X3 Series
Description	Arista 7050X3 Series fixed configuration leaf and spine switches
Switching Throughput	6.4 Terabits/sec
Maximum Forwarding Rate	2 Bpps
40/100G Interfaces	Up to 32
10/25G Interfaces	Up to 128

Step 1: Planning is everything

Before we commenced physically exchanging the switches, it was essential to plan the procedure thoroughly. This involved analyzing the existing infrastructure in the data center, assessing the customer traffic and scheduling the steps required for exchanging the components. Setting up a detailed plan minimizes downtime and ensures a smooth transition, which is of crucial importance for the satisfaction of our customers.

The configuration of the network devices had already been prepared in advance using Ansible, ensuring that the transition itself could be achieved in a way that was uniform, quality-checked multiple times and automated as far as possible.

The actual work was preceded by a proof-of-concept (PoC) phase lasting 1.5 years, in which several manufacturers had to show that they could meet Retarus’ demanding requirements for the new network infrastructure.

Step 2: Exchanging the switches

Replacing the switch infrastructure on site is a complex process requiring meticulous coordination. It comprises migrating the servers, replacing network cards in existing systems, installing the new switch hardware, configuring the system and physically removing the old switches. The transition to the fundamentally different EVPN/VXLAN network technology involved wide-ranging adjustments and reorganization on the logical layer of the network. At this stage, a highly experienced team of experts from the fields of networks, infrastructure services, application management and data centers played a key role in ensuring everything ran smoothly.

Step 3: Testing, testing and more testing

Once the infrastructure has been completely replaced, comprehensive testing is crucial. By simulating various scenarios, it is possible to verify that the new switches fulfill all the requirements and function reliably in production. This step minimizes the risk of errors and outages in regular operations.

Step 4: Documentation and training

Detailed documentation of the new setup is essential for simplifying future maintenance tasks. Furthermore, the IT Operations team members need to be trained accordingly, so they become familiar with the new infrastructure and are able to react quickly should the need arise.

Regarding the first two steps, I’d like to go into more detail and describe the activities a bit more closely.

Planning

The planning was divided up into two distinct areas – server hardware and network.

With regard to server hardware, it was necessary to determine which systems had to be relocated physically in the data center and which network cards would need to be replaced due to the new requirements.

In addition, we specified via which switch and to which switch port each of the servers would be connected in future. This crucially enabled us to prepare the switch configuration in advance.
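To illustrate how such a port assignment can feed a prepared switch configuration, here is a minimal sketch in Python. It is not the tooling we used (the configuration itself was prepared with Ansible); the server, switch and port names are purely hypothetical. The idea is simply to group the cabling plan by switch and to catch a port that has been assigned twice before anyone touches a cable.

```python
from collections import defaultdict

# Hypothetical cabling plan: (server, switch, port). All names are illustrative.
CABLING_PLAN = [
    ("app-server-01", "leaf-a", "Ethernet1"),
    ("app-server-02", "leaf-a", "Ethernet2"),
    ("db-server-01", "leaf-b", "Ethernet1"),
]


def ports_by_switch(plan):
    """Group the plan by switch and reject a port that is assigned twice."""
    result = defaultdict(dict)
    for server, switch, port in plan:
        if port in result[switch]:
            raise ValueError(f"{switch} {port} is assigned twice")
        result[switch][port] = server
    # Convert to plain dicts so later lookups of unknown switches fail loudly.
    return {switch: dict(ports) for switch, ports in result.items()}


if __name__ == "__main__":
    for switch, ports in ports_by_switch(CABLING_PLAN).items():
        for port, server in sorted(ports.items()):
            print(f"{switch} {port} -> {server}")
```

A structure like this could then serve as input for whatever generates the per-switch configuration.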

At the same time, the order in which the systems would be updated was defined, because in regular operations we are always only able to disconnect a small portion of the devices from the network.
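One simple way to think about this ordering is to split the systems into waves, so that no wave takes too many systems offline and no service loses more than one of its servers at a time. The following Python sketch shows a greedy version of that idea. The two rules, the wave size and all names are assumptions made for illustration, not a description of our actual planning.

```python
def plan_waves(servers, max_per_wave):
    """Assign servers to migration waves.

    servers: list of (server_name, service_name) tuples.
    Each server goes into the first wave that still has room and does not
    already contain a server of the same service.
    """
    waves = []
    for name, service in servers:
        for wave in waves:
            has_room = len(wave) < max_per_wave
            service_free = all(s != service for _, s in wave)
            if has_room and service_free:
                wave.append((name, service))
                break
        else:
            # No existing wave fits, so open a new one.
            waves.append([(name, service)])
    return waves


if __name__ == "__main__":
    # Hypothetical inventory: two services with two servers each, one with one.
    inventory = [
        ("svc-a-01", "svc-a"),
        ("svc-a-02", "svc-a"),
        ("svc-b-01", "svc-b"),
        ("svc-b-02", "svc-b"),
        ("svc-c-01", "svc-c"),
    ]
    for number, wave in enumerate(plan_waves(inventory, max_per_wave=2), start=1):
        print(number, [name for name, _ in wave])
```

A greedy assignment like this is easy to review by hand, which matters more here than finding the smallest possible number of waves.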

Last, but not least, we were able to use this information to plan the schedule for the expert task force and supporting staff. For each system, an application manager first had to suspend the services, after which an infrastructure engineer had to carry out changes at the operating system level. The server could then be physically rebuilt in the data center. Subsequently, the system was booted up again, reconfigured, returned to service and checked.

The network planning, on the other hand, focused on all activities which physically and logically needed to be undertaken on the network infrastructure before, during and after the physical rebuild.

We proceeded as follows:

Initial preparations – all activities which could be carried out in advance without having a direct impact on ongoing operations, for instance automating the configuration processes using Ansible or interconnecting the new switches without yet integrating them into the existing infrastructure
Integrating the new switch infrastructure into the existing network fabric
Shifting the internet and data center connections from the old switch environment to the new one
Updating/Transferring the firewall configurations to the new infrastructure
Transferring the firewall zones (DMZs) into the new environment
Tidying up the configurations and decoupling from the old fabric
Redundancy and failover tests for network components

The whole plan for rebuilding the network comprised eight chapters and contained more than 300 individual steps. These tasks were carried out by four network experts, ensuring that the four-eyes principle could be maintained at all times. While the steps were being carried out, all of the colleagues involved participated in a group call in which each and every step was communicated clearly. In this way, we were able to avoid misunderstandings and optimize the coordination of the various activities.
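The four-eyes principle can also be expressed as a small data structure: a step only counts as done once a second person has confirmed it. The Python sketch below is an assumption about how such a checklist might be modelled, not the format of our actual plan; the class, field and engineer names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Step:
    chapter: int
    description: str
    executed_by: Optional[str] = None
    confirmed_by: Optional[str] = None

    def execute(self, engineer: str) -> None:
        self.executed_by = engineer

    def confirm(self, engineer: str) -> None:
        if self.executed_by is None:
            raise RuntimeError("step has not been executed yet")
        if engineer == self.executed_by:
            raise ValueError("four-eyes principle: confirmation needs a second person")
        self.confirmed_by = engineer

    @property
    def done(self) -> bool:
        return self.confirmed_by is not None


def next_open_step(steps):
    """Return the first step that is not yet confirmed, or None if all are done."""
    for step in steps:
        if not step.done:
            return step
    return None


if __name__ == "__main__":
    plan = [
        Step(1, "Connect new switches to each other"),
        Step(2, "Integrate new switches into the existing fabric"),
    ]
    plan[0].execute("engineer-1")
    plan[0].confirm("engineer-2")
    print(next_open_step(plan).description)
```

Walking the list strictly in order mirrors what the group call achieved verbally: everyone knows which step is current and who has signed it off.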

Conclusion: An eye on the future

Replacing the switch infrastructure in a data center is not only a technological advance, but also a strategic move to meet, and benefit from, the technical challenges the future holds. With detailed planning, highly competent employees and comprehensive testing, companies can ensure that their data centers are prepared to meet the requirements of the modern digital era.

A little anecdote to conclude with: Of course, even the most meticulous planning is of little use if you can’t get into the data center. And this is precisely what happened to our colleagues on March 3rd, even though their access cards had all the necessary permissions. Unfortunately, the external security service was unable to help them over the weekend either. Thanks to a personal escalation message to a good contact in the appropriate position, someone was already waiting for the team at 03:30 on Monday morning and their access to the Retarus cage was assured. Sometimes, it’s “who” you know that matters …

All in all, the upgrade was carried out from 03:45 to 14:00 local time, Monday through Wednesday. And it was physically as well as technically demanding. Over the three days, the smartwatch Uwe Geuss was wearing counted 8,600, 12,600 and 9,600 steps respectively. The other two task force members covered similar distances. The three of them still found the time to set up a camera to take a time-lapse video. We’re very pleased to present it to you here:

There is a whole raft of reasons why Retarus has always operated its own services rather than relying on hyperscalers. We consciously and deliberately take responsibility for the complete stack, from the rack to the servers and software – right through to the carriers. That is precisely why we only source an empty cage with electricity and cooled air from our data center providers. We create all other added value ourselves – to the benefit of the overall solution and our customers alike.
