
Resilient Architecture Pattern: Tier 1

📊 Tier 1: Metro Active-Active + Out-of-Region Warm Standby

Tier 1 provides high availability across a metropolitan pair of data centers with a warm standby in an out-of-region location. Primary and Secondary metro sites operate in active-active mode for production traffic. Data for critical platforms is synchronously replicated within the metro pair and asynchronously replicated to the out-of-region site.

This pattern is designed for workloads that require strong availability and low data loss within the metro region, combined with a survivable regional recovery option. It is less extreme than Tier 0 in operational rigidity, but still requires mature processes and disciplined change management.

The architecture diagram above illustrates the complete Tier 1 configuration: active-active metro sites plus an out-of-region warm standby, with both synchronous and asynchronous replication paths.


Purpose and Positioning

Tier 1 supports continuous service within a metro region and controlled recovery in a regional disaster.

Compared to Tier 0:

  • RTO and RPO expectations are slightly relaxed
  • Not every supporting component must operate in fully synchronous mode
  • The out-of-region site is explicitly warm standby, not part of the active traffic path
  • Failover to the out-of-region site is orchestrated and event-driven, not automatic under normal conditions

Architecture Summary

Metro Pair: Primary and Secondary

  • Both sites are active for production traffic
  • Application tiers are distributed across both locations
  • Critical data platforms use synchronous replication across the metro pair
  • Non-critical components may use asynchronous or single-site patterns if appropriate
  • Load balancing or global DNS distributes user traffic between the two metro sites

Out-of-Region Site

  • Operates as a warm standby location
  • Receives asynchronous replication for critical datasets
  • Hosts pre-provisioned application stacks at reduced or right-sized capacity
  • Does not receive production traffic in normal conditions
  • Activated only during regional-level incidents or controlled failover events
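
The three-site layout above can be captured as simple configuration data that failover tooling reads, as in the minimal Python sketch below. The site names, roles, and replication links are assumptions for illustration, not references to any specific product.

  from dataclasses import dataclass

  @dataclass(frozen=True)
  class Site:
      name: str
      role: str             # "active" or "warm-standby"
      serves_traffic: bool  # receives production traffic in normal operation

  @dataclass(frozen=True)
  class ReplicationLink:
      source: str
      target: str
      mode: str             # "synchronous" (metro) or "asynchronous" (out of region)

  SITES = [
      Site("metro-primary", "active", True),
      Site("metro-secondary", "active", True),
      Site("out-of-region", "warm-standby", False),
  ]

  LINKS = [
      ReplicationLink("metro-primary", "metro-secondary", "synchronous"),
      ReplicationLink("metro-primary", "out-of-region", "asynchronous"),
  ]

  # Sites eligible for production traffic in normal operation:
  print([s.name for s in SITES if s.serves_traffic])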

Traffic Flow and DNS Behavior

Normal Operations

  • Global DNS or GSLB advertises endpoints for both Primary and Secondary
  • Health checks and policies determine weighting between metro sites
  • Sessions may be affinity-bound to a particular site, while capacity remains available on both
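
As a rough illustration of the weighting behavior described above, the sketch below shows one way a GSLB-style policy could choose between healthy metro sites. The health results and weights are assumed inputs; real GSLB platforms implement this in their own configuration rather than in application code.

  import random

  # Assumed health-check results and traffic weights for the two metro sites.
  METRO_POOL = {
      "metro-primary":   {"healthy": True, "weight": 50},
      "metro-secondary": {"healthy": True, "weight": 50},
  }

  def pick_site(pool: dict) -> str:
      """Weighted choice among healthy sites, mimicking a GSLB-style policy."""
      healthy = {name: cfg for name, cfg in pool.items() if cfg["healthy"]}
      if not healthy:
          raise RuntimeError("no healthy metro site; a DR decision is required")
      names = list(healthy)
      weights = [healthy[n]["weight"] for n in names]
      return random.choices(names, weights=weights, k=1)[0]

  print(pick_site(METRO_POOL))

Session affinity, where used, would sit on top of a selection like this, pinning an established session to the site that first served it while both sites keep spare capacity.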

Out-of-Region Standby

  • DNS records for the out-of-region endpoint exist but are not active in normal operation
  • Activation of the out-of-region site requires an explicit DR action, which may include:
    • Updating DNS to point to out-of-region VIPs
    • Adjusting GSLB pools to include the out-of-region site
    • Scaling out application instances to production capacity
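
These activation steps lend themselves to an orchestrated runbook. The sketch below outlines one possible sequence, assuming hypothetical helpers (update_dns, add_to_gslb_pool, scale_application) that would wrap whatever DNS, GSLB, and provisioning tooling is actually in place.

  # Hedged sketch of an out-of-region activation runbook. The helper
  # functions are hypothetical wrappers around real DNS, GSLB, and
  # provisioning tooling; names and targets are illustrative only.

  def update_dns(record: str, target_vip: str) -> None:
      print(f"DNS: point {record} at {target_vip}")

  def add_to_gslb_pool(pool: str, site: str) -> None:
      print(f"GSLB: add {site} to pool {pool}")

  def scale_application(app: str, site: str, instances: int) -> None:
      print(f"Scale: {app} at {site} -> {instances} instances")

  def activate_out_of_region(apps):
      """Explicit, operator-driven DR action; never triggered automatically."""
      update_dns("app.example.com", "oor-vip-1")
      add_to_gslb_pool("app-pool", "out-of-region")
      for app in apps:
          scale_application(app, "out-of-region", instances=6)

  activate_out_of_region(["orders", "payments", "identity"])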

Data Replication Model

Metro Synchronous

  • Critical transactional databases and storage volumes replicate synchronously between Primary and Secondary
  • Write operations must be committed at both sites before acknowledgment
  • Delivers near-zero RPO within the metro region
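
A minimal sketch of the acknowledgment rule above: a write is confirmed to the caller only after both metro sites report a durable commit. The SiteWriter class is an illustrative stand-in; in practice this behavior lives inside the database or storage replication layer.

  # Illustrative only: real systems enforce this rule inside the database
  # or storage replication layer, not in application code.

  class SiteWriter:
      def __init__(self, name: str):
          self.name = name
          self.committed = []

      def commit(self, payload: bytes) -> bool:
          self.committed.append(payload)  # assume durable on success
          return True

  def synchronous_write(payload: bytes, primary: SiteWriter, secondary: SiteWriter) -> bool:
      """Acknowledge the write only if both metro sites committed it."""
      ok_primary = primary.commit(payload)
      ok_secondary = secondary.commit(payload)
      return ok_primary and ok_secondary  # near-zero RPO inside the metro

  ack = synchronous_write(b"order-1234",
                          SiteWriter("metro-primary"),
                          SiteWriter("metro-secondary"))
  print("acknowledged:", ack)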

Out-of-Region Asynchronous

  • Replication to the out-of-region site is asynchronous to avoid latency constraints
  • RPO is non-zero and defined by replication frequency and network conditions
  • Suitable for recovery from metro-wide events rather than sub-second continuity
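
Because the out-of-region copy trails the metro sites, the effective RPO at any moment is roughly the current replication lag. Below is a small sketch of that calculation, assuming the replication tooling exposes the timestamp of the last change applied out of region.

  from datetime import datetime, timedelta, timezone
  from typing import Optional

  def current_rpo(last_applied_out_of_region: datetime,
                  now: Optional[datetime] = None) -> timedelta:
      """Approximate RPO: how far the out-of-region replica trails the metro copy."""
      now = now or datetime.now(timezone.utc)
      return now - last_applied_out_of_region

  # Example: the most recent replicated change was applied 90 seconds ago.
  last_applied = datetime.now(timezone.utc) - timedelta(seconds=90)
  print(current_rpo(last_applied))  # roughly 0:01:30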

Tier Distinction

Tier 0 expects universal synchronous behavior for all Tier 0-designated components inside the metro.

Tier 1 allows tiering within the tier:

  • Tier 1 core systems can remain fully synchronous
  • Adjacent or supporting services can use less aggressive replication if justified

Failure Scenarios and Outcomes

Single Site Failure in Metro Pair

  • Remaining metro site continues to handle production traffic
  • Load balancers and DNS health checks remove the failed site from rotation
  • No data loss for synchronously protected workloads
  • RTO usually in minutes due to automatic or script-driven failover of traffic

Partial Service Degradation

  • Individual components or subsets of the platform can be failed over between metro sites
  • Maintenance windows can use this behavior for rolling updates and patching
  • Operational runbooks must define site preference per application
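
One lightweight way to record the per-application site preference mentioned above is a small mapping that both runbooks and tooling consult. The application names and site labels below are illustrative assumptions.

  # Assumed per-application site preferences consulted during partial
  # failover or maintenance; application and site names are illustrative.
  SITE_PREFERENCE = {
      "orders":   {"preferred": "metro-primary",   "alternate": "metro-secondary"},
      "identity": {"preferred": "metro-secondary", "alternate": "metro-primary"},
      "reports":  {"preferred": "metro-secondary", "alternate": "metro-primary"},
  }

  def target_site(app: str, draining_site: str) -> str:
      """Where an application should run while its preferred site is drained."""
      pref = SITE_PREFERENCE[app]
      if pref["preferred"] == draining_site:
          return pref["alternate"]
      return pref["preferred"]

  # Example: drain metro-primary for patching.
  for app in SITE_PREFERENCE:
      print(app, "->", target_site(app, draining_site="metro-primary"))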

Regional Metro Failure

  • The out-of-region site is invoked according to DR runbooks
  • Data is restored or promoted from asynchronous replicas
  • RPO equal to replication lag at time of event
  • RTO depends on automation maturity and pre-staged capacity:
    • Best case: tens of minutes
    • More typically: one to several hours

Operational Considerations

Platform and Network Requirements

  • Low-latency dedicated links between metro sites for synchronous replication
  • Sufficient bandwidth for both synchronous and asynchronous streams
  • Segregated replication networks or QoS policies to prevent contention with user traffic
  • Consistent IP addressing or routing strategies that allow rapid redirection of traffic

Runbooks and Governance

Maintain documented procedures for:

  • Failing over traffic between metro sites
  • Declaring a regional disaster and invoking the out-of-region site
  • Returning service from the out-of-region site to the restored metro region

Maintain a regular test schedule that exercises:

  • Metro-only failover
  • Partial failover per application
  • Out-of-region DR scenario at least annually

Monitoring and Health

  • Health checks for application endpoints, database clusters, and replication links
  • Alerting on replication lag, link saturation, and cluster quorum conditions
  • Dashboards that clearly show which site is considered authoritative for each service
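
A hedged sketch of the alerting logic implied above, evaluating replication lag, link utilization, and quorum against assumed thresholds. The metric names and threshold values are placeholders; a real deployment would source these from its monitoring platform.

  # Illustrative alert evaluation; metric names and thresholds are assumptions.
  THRESHOLDS = {
      "replication_lag_seconds": 300,    # asynchronous lag budget to out-of-region
      "replication_link_util_pct": 80,   # saturation warning level
  }

  def evaluate_health(metrics: dict) -> list:
      """Return alert messages for the conditions described above."""
      alerts = []
      if metrics["replication_lag_seconds"] > THRESHOLDS["replication_lag_seconds"]:
          alerts.append("out-of-region replication lag exceeds RPO budget")
      if metrics["replication_link_util_pct"] > THRESHOLDS["replication_link_util_pct"]:
          alerts.append("replication link approaching saturation")
      if not metrics["db_cluster_has_quorum"]:
          alerts.append("database cluster has lost quorum")
      return alerts

  sample = {
      "replication_lag_seconds": 420,
      "replication_link_util_pct": 65,
      "db_cluster_has_quorum": True,
  }
  print(evaluate_health(sample))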

Appropriate Workloads

Tier 1 is suitable for:

  • Core line-of-business applications with strict but not absolute continuity needs
  • Customer-facing portals where brief metro failovers are acceptable
  • Transactional systems that require zero or near-zero RPO within a region but can tolerate a defined RPO in regional disasters
  • Identity, directory, and API platforms that support the broader environment
  • Regulatory-sensitive workloads where regional survivability is a hard requirement

Unsuitable Workloads

Avoid Tier 1 for:

  • Systems that truly cannot tolerate any downtime or data loss under any condition
  • Platforms that have no support for synchronous replication or multi-site deployment models
  • Low-criticality workloads where the cost of a metro pair plus out-of-region standby is unjustified
  • Highly stateful legacy applications that cannot tolerate split-site operation or session distribution

Risks and Tradeoffs

  • Higher capital and operational cost compared to Tier 2 and below
  • Increased complexity in routing, load balancing, and replication topologies
  • Risk of misaligned configurations between metro sites leading to unexpected behavior during failover
  • Non-zero data loss risk when failing over to the out-of-region site
  • Requires disciplined operational practices to keep three sites logically aligned

Summary

Tier 1 provides a practical high availability and disaster recovery pattern for enterprises that need strong continuity within a region and credible recovery options out-of-region. It preserves active-active behavior across a metro pair, while acknowledging realistic constraints on out-of-region replication and failover.

It forms the primary pattern for business-critical workloads that do not justify the extreme rigidity of Tier 0, but still require more than simple backup and rehydration.