Designing a highly available architecture means combining systematic redundancy with automated failover so that no single point of failure can interrupt business operations.
The starting point for high availability design is very simple: no single device, link, or path should become a fatal weakness in the entire system. Based on this idea, there are two golden rules in design:
Redundancy: Provide backups for all critical components. Critical devices (such as core switches, firewalls, and routers) should have backup devices, critical links should have backup physical paths, and data centers should ideally have backup sites.
Fast, automatic fault detection and switching: Redundant devices or links cannot be merely decorative. There must be a reliable mechanism (such as protocols or software) that can monitor the status in real time. Once a failure of the primary component is detected, it should automatically switch traffic and tasks to the backup component within a timeframe that is almost imperceptible to users (usually within seconds).
Architectures designed around these principles can realistically achieve availability of 99.99% or higher, which corresponds to less than roughly an hour of unplanned downtime per year.
Layered design: From physical connections to upper-layer services
A robust, highly available network needs to be built in layers, with each layer providing protection from bottom to top.
Layer 1: Equipment and Link Redundancy
This is the physical foundation of high availability, addressing failures of network equipment and physical lines. Equipment redundancy covers critical devices such as core and aggregation switches, egress routers, and firewalls: each should be deployed in pairs or in a redundant mode (e.g., active/standby or active/active). Link redundancy connects servers and switches over multiple physical links. The most common method is Ethernet link aggregation (LACP, IEEE 802.3ad), which bundles several physical links into one logical channel, increasing bandwidth and failing over seamlessly when a member link goes down.
# Example of configuring link aggregation (Bonding) on a Linux server (mode=4, i.e., LACP)
# Edit the network configuration file and create the bond0 interface
sudo nano /etc/netplan/01-netcfg.yaml
# Add a configuration similar to the following (adjust interface names to match your hardware):
network:
  version: 2
  ethernets:
    enp3s0:
      dhcp4: false                   # member NICs carry no addresses of their own
    enp4s0:
      dhcp4: false
  bonds:
    bond0:
      interfaces: [enp3s0, enp4s0]   # bond the two physical network interfaces
      addresses: [192.168.1.10/24]
      gateway4: 192.168.1.1
      parameters:
        mode: 802.3ad                # LACP mode
        mii-monitor-interval: 100    # millisecond-level link status monitoring
# Apply the configuration
sudo netplan apply
Cross-device link aggregation: A more advanced deployment (often called MLAG or MC-LAG) terminates the server's aggregated links on two different switches, so the server remains reachable over the network even if one switch fails completely.
Layer 2: Network Layer Redundancy and Intelligent Routing
Once physical devices and basic links are redundant, protocols are needed to enable the network to "learn" to automatically select a backup path when the primary path fails.
Dynamic Routing Protocols: In complex networks, especially when multiple branches or data centers are interconnected, static routes cannot adapt to change; dynamic routing protocols such as OSPF (Open Shortest Path First) or BGP (Border Gateway Protocol) are required. They exchange network path information in real time: when a link fails, the routers recalculate and converge on a new best path within seconds to tens of seconds, so traffic is rerouted automatically (a minimal OSPF configuration is sketched below).
First-Hop Redundancy Protocols: For terminal devices (such as servers and PCs) within a LAN, a single default gateway (usually the core switch) is itself a single point of failure. A first-hop redundancy protocol such as VRRP or HSRP lets two or more routers act as one "virtual router" whose virtual IP address serves as the terminals' gateway. The terminals always point at this virtual IP; when the primary router fails, the backup router takes over the virtual IP immediately, with no configuration change on the terminals (a Keepalived-based example is sketched below).
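As an illustration of the dynamic-routing piece, on a Linux router running the open-source FRRouting suite, a minimal OSPF setup might look like the following sketch. The router ID, interface name, and prefixes are assumptions made for the example, not values taken from this architecture.

# /etc/frr/frr.conf -- minimal OSPF sketch for FRRouting (illustrative values)
router ospf
 ospf router-id 10.0.0.1
 ! Advertise the two uplinks and the local server subnet into backbone area 0
 network 10.0.0.0/30 area 0
 network 10.0.0.4/30 area 0
 network 192.168.1.0/24 area 0
!
interface eth0
 ! Tighten timers so a failed neighbor is detected in about 4 seconds instead of 40
 ip ospf hello-interval 1
 ip ospf dead-interval 4

For the first-hop redundancy piece, VRRP on Linux routers and firewalls is typically provided by Keepalived. A minimal two-node virtual gateway might be sketched as follows; the interface name, virtual router ID, priorities, and virtual IP are likewise assumptions for the example.

# /etc/keepalived/keepalived.conf on the primary node (illustrative values)
vrrp_instance GATEWAY_V1 {
    state MASTER              # the peer node is configured with "state BACKUP"
    interface eth0            # interface that carries the VRRP advertisements
    virtual_router_id 51      # must match on both nodes
    priority 150              # the peer uses a lower value, e.g. 100
    advert_int 1              # send an advertisement every second
    virtual_ipaddress {
        192.168.1.1/24        # the virtual gateway IP that terminals point to
    }
}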
Layer 3: Service Layer and Load Balancing
Even with network connectivity, services (such as websites or app interfaces) may still experience issues. This is where the load balancer mentioned earlier comes in.
Load Balancer Cluster: The load balancer itself must not become a single point of failure. Tools such as Keepalived are used to group two or more load balancers into a cluster that exposes a single virtual IP (VIP) to the outside world. All user requests are sent to this VIP, and the cluster decides internally which device actually handles them; when the primary device fails, the VIP "drifts" to the backup device (see the Keepalived sketch below).
Intelligent Traffic Distribution and Health Checks: The load balancer distributes user requests across the backend application server cluster. It continuously health-checks each backend server using probes such as HTTP GET requests or TCP connection attempts; if a server times out or returns an error (such as HTTP 500), it is immediately removed from the service pool, so user requests only ever reach healthy servers (an HAProxy example is sketched below).
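With Keepalived on the load balancers, the VIP can be bound not only to the health of the machine but also to the load-balancing process itself, so that a crashed Nginx likewise triggers failover. A configuration sketch, with illustrative names and addresses, might look like this:

# Keepalived on the primary load balancer (illustrative values)
vrrp_script chk_nginx {
    script "/usr/bin/pidof nginx"   # a non-zero exit code means Nginx is down
    interval 2                      # run the check every 2 seconds
    weight -30                      # demote this node's priority while the check fails
}
vrrp_instance LB_VIP {
    state MASTER
    interface eth0
    virtual_router_id 60
    priority 150
    advert_int 1
    track_script {
        chk_nginx
    }
    virtual_ipaddress {
        203.0.113.10/24             # the VIP that receives all user requests
    }
}

The health-check syntax itself depends on the product in use. As one concrete illustration, an HAProxy backend with active HTTP health checks could be sketched as follows; the pool name, health-check path, and server addresses are assumptions.

# haproxy.cfg fragment: active health checks on the backend pool (illustrative values)
backend web_pool
    balance roundrobin
    option httpchk GET /healthz             # probe this path on every server
    http-check expect status 200            # any other status (e.g. 500) marks the server as down
    default-server inter 2s fall 3 rise 2   # check every 2s; 3 failures = out, 2 successes = back in
    server app1 10.0.2.11:8080 check
    server app2 10.0.2.12:8080 check
    server app3 10.0.2.13:8080 check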
Layer 4: Data and State Synchronization
Stateful services (such as user login sessions) require special design.
Stateless Applications: Best practice is to design applications to be as stateless as possible, keeping session data in an external shared service such as a Redis cluster or a MySQL database. If any application server then fails, the load balancer forwards the user's requests to the remaining servers, which can still retrieve the session, so the user experiences no interruption.
Master-Slave Replication and Clustering: Stateful services such as databases and caches require master-slave replication, multi-master replication, or clustering technologies to synchronize data across multiple nodes, and to provide a unified access point and failover in conjunction with VIPs or proxy middleware.
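For the caching tier specifically, Redis Sentinel is a common way to get this behavior: several sentinel processes watch the master, and once a quorum agrees the master is down, one replica is promoted automatically and clients are redirected to it. A minimal sentinel.conf sketch, with illustrative addresses and timers, might be:

# sentinel.conf on each of (at least) three sentinel nodes (illustrative values)
sentinel monitor mymaster 10.0.3.10 6379 2       # watch this master; a quorum of 2 declares it down
sentinel down-after-milliseconds mymaster 5000   # consider the master unreachable after 5s of silence
sentinel failover-timeout mymaster 60000         # allow up to 60s for a failover to complete
sentinel parallel-syncs mymaster 1               # resync replicas to the new master one at a time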
Integration and Validation: Building a Deployable Reference Architecture
Combining the above layered designs, a typical high-availability architecture for medium to large-scale internet applications becomes clear:
1. Access Layer: Terminals connect via dual uplinks to two access switches running VRRP.
2. Core Layer: Two core switches are interconnected via multiple 10 Gigabit links, running dynamic routing protocols (such as OSPF), and establishing full connectivity with access switches and data center egress routers.
3. Service Layer: A group of load balancers (such as an Nginx + Keepalived cluster) receives traffic and distributes it to a backend cluster of dozens or even hundreds of application servers. The servers themselves are dual-connected via network bonding.
4. Data Layer: The database uses master-slave replication, combined with a VIP or middleware (such as ProxySQL) to achieve read/write separation and failover; caching uses Redis Sentinel or cluster mode; static files are stored in object storage.
5. Cross-Data Center: The above complete architecture is deployed in two or more geographically isolated data centers, interconnected via dedicated lines, using BGP or global load balancing to achieve cross-city traffic switching in disaster recovery scenarios.
After the design is complete, it must undergo rigorous testing and verification: simulate failures by unplugging core switch cables, shutting down a load balancer, and running `kill -9` on the database master process. Then observe whether the monitoring system raises timely alerts, whether business traffic switches over automatically, whether the overall recovery time meets the recovery time objective (RTO), and whether any data loss stays within the recovery point objective (RPO).
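The drills themselves can be scripted so that they are repeatable. The commands below are only an illustrative sketch; the interface name, service unit, and process name are assumptions, and each drill should be run in a planned test window with the monitoring dashboards open.

# Drill 1: drop one bonded uplink on an application server; traffic should stay up over the other link
sudo ip link set enp3s0 down

# Drill 2: stop the primary load balancer; the VIP should drift to the backup within seconds
sudo systemctl stop keepalived

# Drill 3: kill the database master process; a replica should be promoted automatically
sudo kill -9 "$(pgrep -o mysqld)"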
In summary, designing a highly available network architecture is a systematic undertaking. It requires identifying and eliminating every single point of failure, from physical cabling to application services, and using mature protocols and tools to bind the redundant components into a coherent whole that fails over automatically.