title: "The Availability Blind Spot: What Eight Years of Server Rooms Taught Me About the A in CIA" description: "Most organisations treat High Availability as a cradlepoint problem. After eight years of 2am callouts and hundreds of server room visits, I can tell you it almost never is." date: "2026-05-01" category: "Architecture" tags: ["High Availability", "Network Architecture", "Infrastructure", "SPOF", "Redundancy", "CIA Triad"] author: "Stephen Nnamani" readingTime: "7 min" image: "/images/blog/high-availability-infrastructure.png"

The Availability Blind Spot: What Eight Years of Server Rooms Taught Me About the A in CIA

There is a pattern I have observed so consistently over eight years of deploying and supporting IT infrastructure, in small offices, enterprise environments, data centres and network closets across dozens of sites, that I have stopped being surprised by it. I have started being concerned by it instead.

When organisations think about High Availability, they think about the internet connection. They commission a second ISP. They install a Cradlepoint as failover. They tick a box, call it resilience, and move on. And then, at two in the morning, my phone rings.

I have been on the receiving end of those calls more times than I can count. A site is down. Production is stopped. Someone needs it fixed now. I troubleshoot. And the culprit, overwhelmingly, is not the ISP. It is a switch that failed. A firewall that crashed. A router that stopped responding. Hardware that sat inside the building, inside the rack, inside the network closet. Hardware that no one had thought to make redundant because the organisation's entire conceptual model of availability started and stopped at the internet pipe.

That is the blind spot. And it is expensive.


Availability Is a Property of the Entire Stack

The CIA triad (Confidentiality, Integrity, Availability) is foundational to how security and infrastructure should be designed. Most practitioners understand Confidentiality and Integrity at a reasonable depth. Availability is where the thinking tends to flatten.

Availability, in its correct meaning, is the guarantee that systems, services, and data are accessible to authorised users when they need them. Not the guarantee that the internet is up. Not the guarantee that packets can leave the building. The guarantee that the entire service chain, from the user's device through the switching fabric, through the routing layer, through the firewall, through the servers, through the storage, is operational end to end.

A single point of failure anywhere in that chain is an availability risk. And if you have one of anything in that chain, you have a single point of failure.

That is not a theoretical concern. That is the conversation I have had in server rooms with network engineers who have dual ISPs and a single core switch. The internet never goes down on their site. And yet the site still goes down. Because the thing that failed was not the internet.
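
One way to make that concrete is simple availability arithmetic. The Python sketch below uses an invented figure of 99.9% availability for every component in a five-link chain; the numbers are assumptions, not measurements, and the only point is the shape of the result: a chain is only as available as the product of its parts, so making the internet link redundant while everything behind it stays single barely moves the total.

```python
# Illustrative availability arithmetic. The 99.9% per-component figure is an
# assumption chosen for the example, not a measured or vendor number.
# Chain modelled: ISP link, router, firewall, core switch, server.

def series(*parts: float) -> float:
    """A chain is up only when every component in it is up."""
    total = 1.0
    for availability in parts:
        total *= availability
    return total

def pair(a: float) -> float:
    """An identical redundant pair is down only when both members are down."""
    return 1 - (1 - a) ** 2

A = 0.999                       # roughly 8.8 hours of downtime per year, per component
HOURS_PER_YEAR = 24 * 365

dual_isp_only   = series(pair(A), A, A, A, A)                          # redundant ISP, single everything else
redundant_chain = series(pair(A), pair(A), pair(A), pair(A), pair(A))  # redundancy at every layer

for label, availability in [("dual ISP only", dual_isp_only), ("redundant chain", redundant_chain)]:
    downtime = (1 - availability) * HOURS_PER_YEAR
    print(f"{label:16s} {availability:.6f}  ~{downtime:.1f} hours down per year")
```

With those assumed figures, the dual-ISP-only chain still loses around 35 hours a year to the single devices behind it, while redundancy at every layer brings the expected downtime to minutes.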


The Architecture of Genuine Redundancy

Eliminating single points of failure is not one decision. It is a discipline applied across every layer of the infrastructure stack. Each layer has its own failure modes, and each requires its own redundancy strategy.

At the network layer, the default gateway is one of the most overlooked single points of failure in mid-sized environments. Every device on the network sends its non-local traffic to a single IP address: the default gateway. If the router at that address fails, the network does not partially degrade. It stops. Protocols like HSRP and VRRP exist precisely to solve this problem: multiple physical routers share a single virtual IP, and if the active router fails, a standby takes over in seconds, transparently, without reconfiguration on any endpoint. This is not exotic technology. It has been available for decades. And yet I walk into environments where a single router carries that responsibility with nothing behind it.
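
The behaviour both protocols share is easy to sketch. The Python below is only a toy model of the election, with invented router names, priorities and addresses; there are no hello timers, groups or authentication here. The point it illustrates is the one that matters for availability: endpoints keep pointing at one virtual address while the live router with the highest priority answers for it.

```python
# Toy model of first-hop redundancy election (HSRP/VRRP-style), not an
# implementation of either protocol. Names, priorities and addresses are invented.
from dataclasses import dataclass

VIRTUAL_GATEWAY = "10.0.0.1"    # the address every endpoint uses as its gateway; it never changes

@dataclass
class Router:
    name: str
    priority: int               # higher priority wins the election
    alive: bool = True          # the real protocols track this with hello timers

def active_router(group: list[Router]) -> Router | None:
    """The live router with the highest priority answers for the virtual IP."""
    live = [r for r in group if r.alive]
    return max(live, key=lambda r: r.priority) if live else None

group = [Router("core-rtr-a", priority=110), Router("core-rtr-b", priority=100)]

print(f"{VIRTUAL_GATEWAY} is answered by {active_router(group).name}")   # core-rtr-a
group[0].alive = False          # the primary fails at two in the morning
print(f"{VIRTUAL_GATEWAY} is answered by {active_router(group).name}")   # core-rtr-b, no endpoint reconfigured
```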

Link redundancy deserves the same attention. Link Aggregation (LACP, EtherChannel) combines multiple physical connections into a single logical link. If one cable fails, or one port dies, the remaining links absorb the traffic. Coupled with Spanning Tree Protocol, which allows engineers to build physically redundant paths through the switching fabric without creating broadcast loops, the network layer can be made genuinely resilient at modest cost. The obstacle is almost never budget. It is the absence of the architectural thinking that recognises these as availability controls rather than optional enhancements.
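
The availability property of a link-aggregation group can be sketched in the same toy fashion. Real LACP negotiates membership with LACPDUs and hashes on frame headers; the Python below, with invented interface names and flow labels, only shows the consequence that matters here: losing a member link redistributes its flows onto the survivors rather than dropping them.

```python
# Toy model of flow distribution across a link-aggregation group.
# Real LACP negotiates membership and hashes on frame headers; this only
# shows why losing one member link degrades capacity rather than connectivity.
import zlib

def distribute(flows: list[str], members: list[str]) -> dict[str, str]:
    """Pin each flow to a surviving member link via a stable hash."""
    if not members:
        raise RuntimeError("all member links down: the bundle itself has failed")
    return {f: members[zlib.crc32(f.encode()) % len(members)] for f in flows}

flows = ["finance-vlan", "voip", "backup-job", "erp"]
members = ["Gi0/1", "Gi0/2"]

print(distribute(flows, members))   # traffic split across both links
members.remove("Gi0/1")             # a cable or port fails
print(distribute(flows, members))   # same flows, all carried by the survivor
```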

The hardware layer is where I encounter the most visible negligence. A single power supply feeding a core switch. A single UPS on a single power circuit. No A/B power feeds. High Availability infrastructure demands that critical hardware, specifically switches, routers, firewalls and servers, carry dual power supplies connected to independent power circuits, so that the loss of a UPS, a breaker, or a feed does not take the device with it. Switch stacking takes this further: multiple physical switches operating as a single logical unit, so that if one chassis fails, the stack continues without interruption. These are not exotic configurations. They are standard practice in environments that have genuinely thought through their availability posture.

At the server and application layer, the same principle extends upward. Load balancers distribute traffic across pools of servers; if one server fails, the load balancer routes around it, and the service continues. Clustering binds servers together so that failover from the active node to a passive node is automatic and transparent. RAID prevents a single disk failure from becoming a data loss event. Each of these mechanisms addresses the same root question: what happens to this service if this specific component stops working? The answer should never be "everything stops."
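
A minimal sketch of the first of those mechanisms, with invented server names: the balancer's availability contribution is simply that it stops handing requests to anything it believes is down. Everything a production balancer actually does (network health probes, connection draining, weighting) is omitted.

```python
# Minimal round-robin load balancing sketch. Server names are invented and the
# health check is reduced to a flag; a real balancer probes backends over the
# network, tracks state per node, and drains existing connections.
from itertools import cycle

class Pool:
    def __init__(self, servers: list[str]):
        self.servers = servers
        self.down: set[str] = set()

    def mark_down(self, server: str) -> None:
        self.down.add(server)

    def healthy(self) -> list[str]:
        return [s for s in self.servers if s not in self.down]

    def route(self, requests: int) -> list[str]:
        """Hand each request to the next healthy backend; a dead node is simply skipped."""
        backends = self.healthy()
        if not backends:
            raise RuntimeError("no healthy backends: the service is down end to end")
        rr = cycle(backends)
        return [next(rr) for _ in range(requests)]

pool = Pool(["app-01", "app-02", "app-03"])
print(pool.route(6))            # requests spread across all three servers
pool.mark_down("app-02")        # one server fails
print(pool.route(6))            # the service continues on the survivors
```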

The geographic layer is where availability thinking must extend when the threat model includes physical catastrophe. Multi-availability-zone deployment in cloud environments, and traditional disaster recovery sites for on-premises infrastructure, both address the scenario where an entire location becomes unavailable through fire, flood, power failure, or any other event that takes a building offline. Organisations that have not answered the question of where their business continues in that scenario have not completed their availability architecture. They have approximated it.


The Visit That Stays With Me

I have visited hundreds of server rooms over eight years. I have seen raised-floor data centres with precision cooling and redundant power grids. I have also seen network closets where the core switch shared a power strip with a printer, where the UPS battery had not been tested in four years, where a single firewall appliance with no failover, no spare, and no configuration backup stood between the business and complete network collapse.

But there is a subtler failure mode I have encountered that concerns me even more than the absence of redundant hardware: the presence of redundant hardware that has never been commissioned. I have walked into rack rooms and found a standby switch sitting fully racked, powered off, with no cables connected to it. Waiting. The intention was there. The procurement happened. Someone made a business case, got budget approval, and purchased the device. But it was never wired into the topology, never configured, never tested. The plan, as far as I could tell, was to commission it after the primary failed. Which is precisely the moment when you have no time, no preparation, and no margin for error.

I have seen the same pattern with spare devices still in their original packaging, sitting on a shelf in the server room. A router in a box. A switch in a box. Available in the sense that they exist on the premises, but not available in any operational sense. The organisation had done the hardest part, the purchasing, and stopped short of the part that actually creates resilience: integration, configuration, and testing under normal conditions, before anything breaks.

Redundancy that has not been commissioned is not redundancy. It is inventory. And inventory does not fail over.

In almost every case, the organisation knew they needed good internet connectivity. They had addressed that. What they had not addressed was everything between that internet connection and their users' screens. The assumption, rarely stated but structurally embedded in their architecture, was that the internal infrastructure was reliable by default. That hardware does not fail. That availability is something you buy from an ISP, not something you engineer into the stack.

Hardware fails. It fails at two in the morning. It fails during the busiest hour of the trading day. It fails without warning and without consideration for the business impact.


What the A in CIA Actually Demands

Availability is not a feature you add at the end of an infrastructure project. It is an architectural discipline applied from the beginning, across every layer, with the same intentionality that security teams bring to access controls or encryption.

The question is not whether you have a failover internet connection. The question is: if any single component in your infrastructure, any one switch, any one power supply, any one firewall, fails right now, what stops working, and for how long?
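
That question can even be asked mechanically. The Python below walks an invented topology, removes each component in turn, and tests whether users can still reach the service; anything whose removal breaks the path is, by definition, a single point of failure.

```python
# Sketch of a single-point-of-failure audit over an invented topology: delete
# each component in turn and test whether "users" can still reach "service".
from collections import deque

topology = {
    "users":       ["access-sw-1"],
    "access-sw-1": ["core-sw-1"],       # one access switch, one core switch
    "core-sw-1":   ["fw-a", "fw-b"],    # the firewalls are a redundant pair
    "fw-a":        ["service"],
    "fw-b":        ["service"],
    "service":     [],
}

def reachable(graph: dict[str, list[str]], src: str, dst: str) -> bool:
    """Breadth-first search: is there still a path from src to dst?"""
    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

for component in topology:
    if component in ("users", "service"):
        continue
    pruned = {node: [n for n in neighbours if n != component]
              for node, neighbours in topology.items() if node != component}
    verdict = "single point of failure" if not reachable(pruned, "users", "service") else "redundant"
    print(f"{component:12s} {verdict}")
```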

If the honest answer is "everything, until we can source and replace the hardware," that organisation does not have a High Availability architecture. It has an internet availability arrangement. And at some point, probably at two in the morning, those two things will be revealed as very different.


Stephen Nnamani is a cybersecurity analyst and infrastructure practitioner with over eight years of experience deploying and supporting IT infrastructure across enterprise and SMB environments. Connect on LinkedIn or explore his technical work at cloudtechengine.com.