Network outage
2024-07-31 · Author: www.adainese.it


In 2013, I presented a critical case that could cause the complete isolation of a datacenter. Eleven years later, the situation remains unchanged.

Let’s reflect on a few points.

Open Reflections

Complexity: Complexity has increased, with many more software layers than before. This complexity leads to misunderstandings, design errors, or unforeseen “blind spots” that can be particularly destructive. The term “complexity” has become a mantra, but we have brought it upon ourselves.

Simplification: Paradoxically the opposite of the previous mantra, simplification has been the driving force behind the introduction of technologies that enable “things” previously deemed incorrect. For those who remember, STP was born exactly this way, to make physically looped topologies survivable. To make processes more agile and faster, a series of software layers was introduced, which brings us back to the previous point.

Hybrid Devices: Thanks to the two previous mantras, devices were born that behave partly like hosts and partly like switches. In the slides, I refer specifically to the switches embedded in blade chassis; the corner case I described involves HP hardware, but the scenario is still valid for other vendors.

The Scenario

Let’s discuss the scenario at hand.

The scenario involves a datacenter network that can be implemented in legacy mode or via fabric. Connected to this network are blade chassis with the aforementioned hybrid devices. Virtualization systems are generally running on the blades, but this is not strictly necessary. If, for any reason, a blade creates an L2 loop, in the absence of protections, the loop propagates through the fabric and all connected devices and chassis.
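
To see why such a loop is so destructive, recall that Ethernet frames carry no TTL: every flooded copy is flooded again, so with a loop in the topology the frame count grows without bound. Below is a minimal Python sketch of this flooding behavior; the triangle topology and switch names are invented for illustration.

```python
# Minimal simulation of broadcast flooding in a looped L2 topology.
# Ethernet frames have no TTL, so nothing ever removes a flooded frame:
# with a loop present, the number of copies grows without bound.

from collections import deque

# Hypothetical topology: three switches wired in a triangle (a loop).
links = {
    "sw1": ["sw2", "sw3"],
    "sw2": ["sw1", "sw3"],
    "sw3": ["sw1", "sw2"],
}

def broadcast_storm(source, max_frames=50):
    """Count flooded copies of a single broadcast frame."""
    frames = deque([(source, None)])  # (current switch, ingress neighbor)
    flooded = 0
    while frames and flooded < max_frames:
        switch, ingress = frames.popleft()
        for neighbor in links[switch]:
            if neighbor != ingress:   # flood everywhere except ingress
                flooded += 1
                frames.append((neighbor, switch))
    return flooded

# Always hits the cap: the storm never terminates on its own.
print(broadcast_storm("sw1"))
```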

In the three cases I have experienced from 2013 to today, the fabric generally withstands the traffic, but the hybrid switches do not. If the hybrid switches also transport FCoE, the damage is obviously greater, because storage traffic melts down along with the data traffic.

Best practices generally require configuring protection mechanisms on the fabric, and this is exactly where the problem originates.

As explained in the slides, if a blade or VM creates a real or apparent loop, it triggers the fabric’s protection mechanism, which shuts down the port from which the loop originates. However, the failover mechanism of the hypervisor or of the hybrid switches moves the cause of the loop to a different interface, which is in turn shut down to protect the fabric. Within minutes, or even seconds, the entire blade chassis is isolated along with its entire workload.
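
A toy model of this cascade makes the outcome clear; the port names and the simplified failover behavior below are hypothetical, not any vendor’s actual logic.

```python
# Sketch of the shutdown/failover cascade: the fabric err-disables the
# port where the loop appears, the hypervisor's NIC teaming fails over
# to the next uplink, and the loop follows it until no uplinks remain.

def failover_cascade(uplinks):
    active = list(uplinks)    # uplinks from the blade chassis to the fabric
    while active:
        port = active.pop(0)          # loop traffic rides the active uplink
        print(f"loop detected on {port} -> err-disable")
        if active:
            print(f"hypervisor fails over to {active[0]}")
    print("all uplinks disabled: chassis isolated with its whole workload")

failover_cascade(["eth1/1", "eth1/2", "eth1/3", "eth1/4"])
```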

The problem is that, as network engineers, we tend to protect “the fabric” without realizing that the fabric actually extends to the hybrid switches and to the hypervisor’s virtual switches. Complexity leads to segmenting the infrastructure into separate domains: the fabric is the network team’s responsibility, while blades and virtualization belong to the computing team. Following this logic, the network team protects its perimeter from its own perspective, which is incomplete.

The solution should involve implementing loop protection mechanisms at the edge of the fabric, considered as a whole (virtual switches included), so as to isolate the single VM (or server) causing the loop rather than the entire blade chassis.
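
The blast-radius argument can be sketched as follows; the enforcement points and VM counts are invented for illustration, assuming a chassis of eight blades with four VMs each.

```python
# Hypothetical blast-radius comparison: what is lost when the loop
# protection triggers at different layers of the extended fabric.

CHASSIS = {f"blade-{b}": [f"vm-{b}-{v}" for v in range(1, 5)]
           for b in range(1, 9)}          # 8 blades, 4 VMs each

def isolated_by(enforcement_point, blade="blade-1", vm="vm-1-1"):
    """Return the workloads cut off when protection acts at this layer."""
    if enforcement_point == "fabric-edge":    # chassis uplink shut down
        return [v for vms in CHASSIS.values() for v in vms]
    if enforcement_point == "hybrid-switch":  # one blade's port shut down
        return CHASSIS[blade]
    if enforcement_point == "vswitch":        # one VM's vNIC shut down
        return [vm]

for point in ("fabric-edge", "hybrid-switch", "vswitch"):
    print(f"{point}: {len(isolated_by(point))} workloads isolated")
```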

Final Notes

Some final notes:

  • The loop prevention mechanism should be handled by the access switch closest to the potential loop source (which can be VMs or servers).
  • The loop prevention mechanism should include some sort of probe, since not all loops can be identified via BPDU guard (see the probe sketch after this list).
  • The described scenario can also occur with hypervisors installed on rack servers, not just on blades.
  • To date, I have only observed human errors and bugs, but if I were planning an effective DoS, this scenario would be a good candidate.
  • We cannot prevent 100% of problems; we can only improve our ability to identify and respond to them quickly.
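
Below is a minimal sketch of the probe idea from the second note, assuming scapy, root privileges, and a hypothetical interface name: flood a uniquely tagged broadcast frame and watch whether it comes back. A data-plane probe like this catches loops on devices that filter or drop BPDUs, which BPDU guard alone cannot see.

```python
# Loop probe sketch: send a uniquely tagged broadcast frame and check
# whether it is seen again on the same segment. IFACE is hypothetical;
# requires scapy and the privileges to sniff/send raw frames.

import os
import threading
from scapy.all import Ether, Raw, sendp, sniff

IFACE = "eth0"              # hypothetical probe interface
TOKEN = os.urandom(16)      # unique payload identifying this probe

def send_probe():
    # 0x88B5 is the IEEE local experimental EtherType, safe for probes.
    frame = Ether(dst="ff:ff:ff:ff:ff:ff", type=0x88B5) / Raw(load=TOKEN)
    sendp(frame, iface=IFACE, verbose=False)

# Start sniffing first, inject the probe shortly after.
threading.Timer(0.5, send_probe).start()
matches = sniff(
    iface=IFACE,
    timeout=3,
    lfilter=lambda p: p.haslayer(Raw) and bytes(p[Raw].load) == TOKEN,
)

# On Linux the outbound copy is usually captured too, so seeing the
# token more than once means the frame came back: a loop is suspected.
if len(matches) > 1:
    print(f"loop suspected: probe frame seen {len(matches)} times")
else:
    print("no loop detected on this segment")
```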
