
Resilient Architecture Pattern: Tier 1

📊 Tier 1: Metro Active-Active + Out-of-Region Warm Standby

Tier 1 provides high availability across a metropolitan pair of data centers with a warm standby in an out-of-region location. Primary and Secondary metro sites operate in active-active mode for production traffic. Data for critical platforms is synchronously replicated within the metro pair and asynchronously replicated to the out-of-region site.

This pattern is designed for workloads that require strong availability and low data loss within the metro region, combined with a survivable regional recovery option. It is less extreme than Tier 0 in operational rigidity, but still requires mature processes and disciplined change management.

The architecture diagram above illustrates the complete Tier 1 configuration: active-active metro sites plus an out-of-region warm standby, with both synchronous and asynchronous replication paths.


Purpose and Positioning

Tier 1 supports continuous service within a metro region and controlled recovery in a regional disaster.

Compared to Tier 0:

  • RTO and RPO expectations are slightly relaxed
  • Not every supporting component must operate in fully synchronous mode
  • The out-of-region site is explicitly warm standby, not part of the active traffic path
  • Failover to the out-of-region site is orchestrated and event-driven, not automatic under normal conditions

Architecture Summary

Metro Pair: Primary and Secondary

  • Both sites are active for production traffic
  • Application tiers are distributed across both locations
  • Critical data platforms use synchronous replication across the metro pair
  • Non-critical components may use asynchronous or single-site patterns if appropriate
  • Load balancing or global DNS distributes user traffic between the two metro sites

Out-of-Region Site

  • Operates as a warm standby location
  • Receives asynchronous replication for critical datasets
  • Hosts pre-provisioned application stacks at reduced or right-sized capacity
  • Does not receive production traffic in normal conditions
  • Activated only during regional-level incidents or controlled failover events
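
The three-site layout above can be captured as simple configuration data that failover tooling reads, as in the minimal Python sketch below. The site names, roles, and replication links are assumptions for illustration, not references to any specific product.

  from dataclasses import dataclass

  @dataclass(frozen=True)
  class Site:
      name: str
      role: str             # "active" or "warm-standby"
      serves_traffic: bool  # receives production traffic in normal operation

  @dataclass(frozen=True)
  class ReplicationLink:
      source: str
      target: str
      mode: str             # "synchronous" (metro) or "asynchronous" (out of region)

  SITES = [
      Site("metro-primary", "active", True),
      Site("metro-secondary", "active", True),
      Site("out-of-region", "warm-standby", False),
  ]

  LINKS = [
      ReplicationLink("metro-primary", "metro-secondary", "synchronous"),
      ReplicationLink("metro-primary", "out-of-region", "asynchronous"),
  ]

  # Sites eligible for production traffic in normal operation:
  print([s.name for s in SITES if s.serves_traffic])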

Traffic Flow and DNS Behavior

Normal Operations

  • Global DNS or GSLB advertises endpoints for both Primary and Secondary
  • Health checks and policies determine weighting between metro sites
  • Sessions may be affinity-bound to a particular site, while capacity remains available on both
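
As a rough illustration of the weighting behavior described above, the sketch below shows one way a GSLB-style policy could choose between healthy metro sites. The health results and weights are assumed inputs; real GSLB platforms implement this in their own configuration rather than in application code.

  import random

  # Assumed health-check results and traffic weights for the two metro sites.
  METRO_POOL = {
      "metro-primary":   {"healthy": True, "weight": 50},
      "metro-secondary": {"healthy": True, "weight": 50},
  }

  def pick_site(pool: dict) -> str:
      """Weighted choice among healthy sites, mimicking a GSLB-style policy."""
      healthy = {name: cfg for name, cfg in pool.items() if cfg["healthy"]}
      if not healthy:
          raise RuntimeError("no healthy metro site; a DR decision is required")
      names = list(healthy)
      weights = [healthy[n]["weight"] for n in names]
      return random.choices(names, weights=weights, k=1)[0]

  print(pick_site(METRO_POOL))

Session affinity, where used, would sit on top of a selection like this, pinning an established session to the site that first served it while both sites keep spare capacity.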

Out-of-Region Standby

  • DNS records for the out-of-region endpoint exist but are not active in normal operation
  • Activation of the out-of-region site requires an explicit DR action, which may include:
    • Updating DNS to point to out-of-region VIPs
    • Adjusting GSLB pools to include the out-of-region site
    • Scaling out application instances to production capacity
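
These activation steps lend themselves to an orchestrated runbook. The sketch below outlines one possible sequence, assuming hypothetical helpers (update_dns, add_to_gslb_pool, scale_application) that would wrap whatever DNS, GSLB, and provisioning tooling is actually in place.

  # Hedged sketch of an out-of-region activation runbook. The helper
  # functions are hypothetical wrappers around real DNS, GSLB, and
  # provisioning tooling; names and targets are illustrative only.

  def update_dns(record: str, target_vip: str) -> None:
      print(f"DNS: point {record} at {target_vip}")

  def add_to_gslb_pool(pool: str, site: str) -> None:
      print(f"GSLB: add {site} to pool {pool}")

  def scale_application(app: str, site: str, instances: int) -> None:
      print(f"Scale: {app} at {site} -> {instances} instances")

  def activate_out_of_region(apps):
      """Explicit, operator-driven DR action; never triggered automatically."""
      update_dns("app.example.com", "oor-vip-1")
      add_to_gslb_pool("app-pool", "out-of-region")
      for app in apps:
          scale_application(app, "out-of-region", instances=6)

  activate_out_of_region(["orders", "payments", "identity"])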

Data Replication Model

Metro Synchronous

  • Critical transactional databases and storage volumes replicate synchronously between Primary and Secondary
  • Write operations must be committed at both sites before acknowledgment
  • Delivers near-zero RPO within the metro region
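
A minimal sketch of the acknowledgment rule above: a write is confirmed to the caller only after both metro sites report a durable commit. The SiteWriter class is an illustrative stand-in; in practice this behavior lives inside the database or storage replication layer.

  # Illustrative only: real systems enforce this rule inside the database
  # or storage replication layer, not in application code.

  class SiteWriter:
      def __init__(self, name: str):
          self.name = name
          self.committed = []

      def commit(self, payload: bytes) -> bool:
          self.committed.append(payload)  # assume durable on success
          return True

  def synchronous_write(payload: bytes, primary: SiteWriter, secondary: SiteWriter) -> bool:
      """Acknowledge the write only if both metro sites committed it."""
      ok_primary = primary.commit(payload)
      ok_secondary = secondary.commit(payload)
      return ok_primary and ok_secondary  # near-zero RPO inside the metro

  ack = synchronous_write(b"order-1234",
                          SiteWriter("metro-primary"),
                          SiteWriter("metro-secondary"))
  print("acknowledged:", ack)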

Out-of-Region Asynchronous

  • Replication to the out-of-region site is asynchronous to avoid latency constraints
  • RPO is non-zero and defined by replication frequency and network conditions
  • Suitable for recovery from metro-wide events rather than sub-second continuity
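
Because the out-of-region copy trails the metro sites, the effective RPO at any moment is roughly the current replication lag. Below is a small sketch of that calculation, assuming the replication tooling exposes the timestamp of the last change applied out of region.

  from datetime import datetime, timedelta, timezone
  from typing import Optional

  def current_rpo(last_applied_out_of_region: datetime,
                  now: Optional[datetime] = None) -> timedelta:
      """Approximate RPO: how far the out-of-region replica trails the metro copy."""
      now = now or datetime.now(timezone.utc)
      return now - last_applied_out_of_region

  # Example: the most recent replicated change was applied 90 seconds ago.
  last_applied = datetime.now(timezone.utc) - timedelta(seconds=90)
  print(current_rpo(last_applied))  # roughly 0:01:30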

Tier Distinction

Tier 0 expects universal synchronous behavior for all Tier 0-designated components inside the metro.

Tier 1 allows tiering within the tier:

  • Tier 1 core systems can remain fully synchronous
  • Adjacent or supporting services can use less aggressive replication if justified

Failure Scenarios and Outcomes

Single Site Failure in Metro Pair

  • Remaining metro site continues to handle production traffic
  • Load balancers and DNS health checks remove the failed site from rotation
  • No data loss for synchronously protected workloads
  • RTO usually in minutes due to automatic or script-driven failover of traffic

Partial Service Degradation

  • Individual components or subsets of the platform can be failed over between metro sites
  • Maintenance windows can use this behavior for rolling updates and patching
  • Operational runbooks must define site preference per application
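
One lightweight way to record the per-application site preference mentioned above is a small mapping that both runbooks and tooling consult. The application names and site labels below are illustrative assumptions.

  # Assumed per-application site preferences consulted during partial
  # failover or maintenance; application and site names are illustrative.
  SITE_PREFERENCE = {
      "orders":   {"preferred": "metro-primary",   "alternate": "metro-secondary"},
      "identity": {"preferred": "metro-secondary", "alternate": "metro-primary"},
      "reports":  {"preferred": "metro-secondary", "alternate": "metro-primary"},
  }

  def target_site(app: str, draining_site: str) -> str:
      """Where an application should run while its preferred site is drained."""
      pref = SITE_PREFERENCE[app]
      if pref["preferred"] == draining_site:
          return pref["alternate"]
      return pref["preferred"]

  # Example: drain metro-primary for patching.
  for app in SITE_PREFERENCE:
      print(app, "->", target_site(app, draining_site="metro-primary"))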

Regional Metro Failure

  • The out-of-region site is invoked according to DR runbooks
  • Data is restored or promoted from asynchronous replicas
  • RPO equal to replication lag at time of event
  • RTO depends on automation maturity and pre-staged capacity:
    • Best case: tens of minutes
    • More typically: one to several hours

Operational Considerations

Platform and Network Requirements

  • Low-latency dedicated links between metro sites for synchronous replication
  • Sufficient bandwidth for both synchronous and asynchronous streams
  • Segregated replication networks or QoS policies to prevent contention with user traffic
  • Consistent IP addressing or routing strategies that allow rapid redirection of traffic

Runbooks and Governance

Maintain documented procedures for:

  • Failing over traffic between metro sites
  • Declaring a regional disaster and invoking the out-of-region site
  • Returning service from the out-of-region site to the restored metro region

Maintain a regular test schedule that exercises:

  • Metro-only failover
  • Partial failover per application
  • Out-of-region DR scenario at least annually

Monitoring and Health

  • Health checks for application endpoints, database clusters, and replication links
  • Alerting on replication lag, link saturation, and cluster quorum conditions
  • Dashboards that clearly show which site is considered authoritative for each service
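
A hedged sketch of the alerting logic implied above, evaluating replication lag, link utilization, and quorum against assumed thresholds. The metric names and threshold values are placeholders; a real deployment would source these from its monitoring platform.

  # Illustrative alert evaluation; metric names and thresholds are assumptions.
  THRESHOLDS = {
      "replication_lag_seconds": 300,    # asynchronous lag budget to out-of-region
      "replication_link_util_pct": 80,   # saturation warning level
  }

  def evaluate_health(metrics: dict) -> list:
      """Return alert messages for the conditions described above."""
      alerts = []
      if metrics["replication_lag_seconds"] > THRESHOLDS["replication_lag_seconds"]:
          alerts.append("out-of-region replication lag exceeds RPO budget")
      if metrics["replication_link_util_pct"] > THRESHOLDS["replication_link_util_pct"]:
          alerts.append("replication link approaching saturation")
      if not metrics["db_cluster_has_quorum"]:
          alerts.append("database cluster has lost quorum")
      return alerts

  sample = {
      "replication_lag_seconds": 420,
      "replication_link_util_pct": 65,
      "db_cluster_has_quorum": True,
  }
  print(evaluate_health(sample))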

Appropriate Workloads

Tier 1 is suitable for:

  • Core line-of-business applications with strict but not absolute continuity needs
  • Customer-facing portals where brief metro failovers are acceptable
  • Transactional systems that require zero or near-zero RPO within a region but can tolerate a defined RPO in regional disasters
  • Identity, directory, and API platforms that support the broader environment
  • Regulatory-sensitive workloads where regional survivability is a hard requirement

Unsuitable Workloads

Avoid Tier 1 for:

  • Systems that truly cannot tolerate any downtime or data loss under any condition
  • Platforms that have no support for synchronous replication or multi-site deployment models
  • Low-criticality workloads where the cost of a metro pair plus out-of-region standby is unjustified
  • Highly stateful legacy applications that cannot tolerate split-site operation or session distribution

Risks and Tradeoffs

  • Higher capital and operational cost compared to Tier 2 and below
  • Increased complexity in routing, load balancing, and replication topologies
  • Risk of misaligned configurations between metro sites leading to unexpected behavior during failover
  • Non-zero data loss risk when failing over to the out-of-region site
  • Requires disciplined operational practices to keep three sites logically aligned

Summary

Tier 1 provides a practical high availability and disaster recovery pattern for enterprises that need strong continuity within a region and credible recovery options out-of-region. It preserves active-active behavior across a metro pair, while acknowledging realistic constraints on out-of-region replication and failover.

It forms the primary pattern for business-critical workloads that do not justify the extreme rigidity of Tier 0, but still require more than simple backup and rehydration.