
Resilient Architecture Pattern: Tier 2

📊 Tier 2: Single Primary + Out-of-Region Warm Standby

Tier 2 provides a cost-optimized resiliency model built around a single primary production site and a warm out-of-region standby kept current through asynchronous replication. Unlike Tier 1, there is no active-active or synchronous metro pairing; all continuity is achieved through replication-based protection and orchestrated failover.

This model is widely used in enterprises that require regional survivability but do not need the operational or financial overhead of multi-site active-active designs.

The architecture diagram above shows the Tier 2 configuration: a single primary production site handles all traffic while replicating continuously and asynchronously to an out-of-region warm standby.


Purpose and Positioning

Tier 2 enables recovery from data center loss through asynchronous replication and structured disaster recovery procedures. It strikes a balance between capability and cost, reducing infrastructure requirements while preserving recoverability.

Compared to Tier 1:

  • There is no metro synchronous pair
  • Only one site runs production traffic
  • The standby site operates with reduced or dormant capacity
  • RTO and RPO are higher, but costs and complexity are significantly lower

Architecture Summary

Primary Site

  • Hosts the full production stack
  • Handles all user traffic
  • Maintains authoritative data and application state
  • Provides the replication source for the out-of-region standby
  • Operates independently without real-time dependency on the remote site

Out-of-Region Standby

  • Receives continuous asynchronous replication
  • May maintain warm application instances or templates
  • No production traffic flows to this site in steady state
  • Activated during loss of the primary site or controlled DR testing
  • May have scaled-down compute until DR invocation

Traffic Flow and DNS Behavior

Normal Operations

  • DNS points exclusively to the primary site
  • Health checks validate service endpoints only at the primary site
  • Failover behavior is not automatic; DNS changes occur only during DR invocation
  • Global load balancing (if used) keeps the standby disabled or unadvertised
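
A minimal sketch of the steady-state health check, using only the Python standard library (the endpoint URLs and /healthz path below are placeholders, not names from this architecture):

```python
import urllib.request
import urllib.error

# Placeholder URLs: substitute the real service endpoints at the
# primary site. Only the primary is probed in steady state.
PRIMARY_ENDPOINTS = [
    "https://app.primary.example.com/healthz",
    "https://api.primary.example.com/healthz",
]

def is_healthy(url: str, timeout: float = 5.0) -> bool:
    """Return True if the endpoint answers HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

for url in PRIMARY_ENDPOINTS:
    print(f"{url}: {'healthy' if is_healthy(url) else 'FAILED'}")
```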

DR Activation

  • DNS updates redirect traffic to the out-of-region site
  • Application capacity is scaled out if warm services exist
  • Infrastructure components (firewalls, reverse proxies, identity endpoints) are promoted from standby state
  • DNS cutover time and propagation are part of the RTO and must be accounted for in design
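
As an illustration, if DNS were hosted in AWS Route 53 (an assumption; any DNS provider with an API works the same way), the cutover might look like the sketch below. The zone ID, record name, and standby IP are placeholders. Note that the record's TTL, set long before any DR event, bounds how long cached answers keep pointing at the failed primary and is therefore part of the RTO.

```python
import boto3

HOSTED_ZONE_ID = "Z0000000EXAMPLE"   # placeholder
RECORD_NAME = "app.example.com."     # placeholder
STANDBY_IP = "203.0.113.10"          # placeholder

def cut_over_dns() -> str:
    """UPSERT the production A record to point at the standby site."""
    route53 = boto3.client("route53")
    resp = route53.change_resource_record_sets(
        HostedZoneId=HOSTED_ZONE_ID,
        ChangeBatch={
            "Comment": "DR invocation: fail over to out-of-region standby",
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": RECORD_NAME,
                    "Type": "A",
                    "TTL": 60,  # low TTL limits client-side caching delay
                    "ResourceRecords": [{"Value": STANDBY_IP}],
                },
            }],
        },
    )
    return resp["ChangeInfo"]["Id"]
```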

Data Replication Model

Asynchronous Replication

  • All critical data is replicated using asynchronous methods
  • RPO is non-zero and defined by replication interval and network conditions
  • Acceptable for most business-critical but not mission-critical workloads
  • May be based on:
    • Database-level asynchronous log shipping
    • Storage system asynchronous volume replication
    • Hypervisor or platform replication mechanisms
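
For example, if the database-level option above is PostgreSQL log shipping (chosen here purely for illustration), the standby's replication lag, and therefore the instantaneous RPO, can be sampled directly; the connection string is a placeholder:

```python
import psycopg2  # PostgreSQL driver; assumes log-shipping replication

STANDBY_DSN = "host=standby.example.com dbname=app user=monitor"

def replication_lag_seconds() -> float:
    """Seconds between now and the last transaction replayed on the
    standby; this is the data at risk (the instantaneous RPO) if the
    primary failed right now."""
    with psycopg2.connect(STANDBY_DSN) as conn:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT EXTRACT(EPOCH FROM "
                "now() - pg_last_xact_replay_timestamp())"
            )
            (lag,) = cur.fetchone()
            # NULL means nothing has been replayed yet; treat as unknown.
            return float(lag) if lag is not None else float("nan")

print(f"Current replication lag: {replication_lag_seconds():.1f}s")
```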

Application Consistency

  • Application-aware snapshots or replication should be used when possible
  • Systems relying solely on crash-consistent replication require additional validation after failover
  • Out-of-region data must be periodically tested to ensure recoverability
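
To make the distinction concrete, here is a minimal sketch of an application-aware snapshot, assuming a database that can be quiesced and a storage array with a snapshot API (flush_and_pause_writes, resume_writes, and snapshot are hypothetical names, not calls from any real library):

```python
from contextlib import contextmanager

@contextmanager
def quiesced(db):
    """Flush and pause writes so the snapshot captures an
    application-consistent state rather than a crash-consistent one."""
    db.flush_and_pause_writes()   # hypothetical call
    try:
        yield
    finally:
        db.resume_writes()        # hypothetical call

def app_consistent_snapshot(db, array, volume: str) -> str:
    # Taken outside the quiesced() block, this snapshot would be merely
    # crash-consistent and would need extra recovery validation after
    # failover.
    with quiesced(db):
        return array.snapshot(volume)  # hypothetical storage API
```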

Failure Scenarios and Outcomes

Primary Site Outage

  • DR invocation is required
  • Out-of-region standby is brought online following runbook procedures
  • DNS or GSLB is updated to direct traffic to the standby site
  • RTO commonly ranges from 30 minutes to several hours depending on automation maturity
  • RPO equals replication lag at the moment of failure

Partial Failure or Service Degradation

  • Individual components can be failed over if replication supports application-level recovery
  • More commonly, full site failover is used for simplicity

Network Partition

  • Production remains active at the primary
  • Standby remains unavailable for promotion until link restoration
  • Split-brain is avoided because the standby is never promoted while the primary remains active; DR invocation requires positive confirmation that the primary is down

Operational Considerations

Replication and Bandwidth Planning

  • Replication traffic must be isolated or rate-limited to protect production workloads
  • Bandwidth sizing must accommodate peak replication bursts
  • Replication lag must be monitored continuously against the agreed RPO
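
A back-of-the-envelope sizing check, using illustrative numbers (the daily change rate and burst factor below are placeholders to be replaced with measured values):

```python
def required_mbps(daily_change_gb: float,
                  replication_window_hours: float = 24.0,
                  burst_factor: float = 2.0) -> float:
    """Average link rate needed to keep pace with the data change
    rate, with a multiplier as headroom for peak replication bursts."""
    megabits = daily_change_gb * 8 * 1000        # GB -> Mb (decimal)
    seconds = replication_window_hours * 3600
    return megabits / seconds * burst_factor

# Illustration: 500 GB of changed data per day, replicated
# continuously, with 2x burst headroom -> roughly 93 Mbps.
print(f"{required_mbps(500):.0f} Mbps")
```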

Runbooks and Automation

DR runbooks should document:

  • Failover steps
  • DNS updates
  • Application promotion procedures
  • Storage or database recovery actions
  • Validation steps before declaring DR stable

Automation is recommended for:

  • Bootstrapping application clusters
  • Runbook execution
  • DNS updates
  • Health verification after failover
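
A skeleton of what that orchestration might look like; every step function here is a hypothetical placeholder for site-specific tooling, and the ordered steps plus the verification gate are the point, not the function names:

```python
import sys

def promote_database_replica() -> None: ...
def bootstrap_application_clusters() -> None: ...
def promote_network_services() -> None: ...   # firewalls, proxies, identity
def update_dns_to_standby() -> None: ...

def verify_service_health() -> bool:
    """Placeholder: probe application endpoints at the standby site."""
    return True  # replace with real post-failover checks

RUNBOOK = [
    ("Promote replicated database to primary role", promote_database_replica),
    ("Bootstrap application clusters", bootstrap_application_clusters),
    ("Promote network and identity services", promote_network_services),
    ("Cut DNS over to the standby site", update_dns_to_standby),
]

def invoke_dr() -> None:
    for description, step in RUNBOOK:
        print(f"-> {description}")
        step()
    # Do not declare DR stable until validation passes.
    if not verify_service_health():
        sys.exit("Post-failover validation failed; escalate before go-live")
    print("DR invocation complete; standby is now serving production")

if __name__ == "__main__":
    invoke_dr()
```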

Testing Requirements

Scheduled DR testing validates:

  • Replication integrity
  • Application recovery behavior
  • DNS cutover processes
  • Ability to sustain expected production load

Annual full failover tests are recommended for Tier 2 environments.


Appropriate Workloads

Tier 2 is suitable for:

  • Business-critical applications with moderate continuity requirements
  • Systems requiring regional recovery but not real-time metro redundancy
  • Applications where a small amount of data loss is acceptable
  • Platforms with robust restart behavior
  • Systems that support warm start operations

Unsuitable Workloads

Avoid Tier 2 for:

  • Mission-critical platforms requiring zero or near-zero RPO
  • Low-tolerance transactional systems where asynchronous lag is unacceptable
  • Applications that cannot be restarted cleanly from replicated data
  • Systems requiring always-active multi-site operation
  • Solutions with strict regulatory uptime requirements (depending on sector)

Risks and Tradeoffs

  • RPO is non-zero and must be understood by business stakeholders
  • Failover is not automatic and typically requires runbook validation
  • Standby capacity requires ongoing maintenance despite being unused in steady state
  • Application compatibility with asynchronous replication must be validated
  • DNS cutover contributes measurable delays to RTO

Summary

Tier 2 provides a balanced resiliency model that supports regional recovery without the cost and complexity of active-active architectures. It is well suited for organizations seeking robust continuity capabilities within budget and operational constraints.

The pattern relies on asynchronous replication, warm standby infrastructure, and structured failover procedures. This tier is ideal when organizations require dependable recovery capabilities but do not need continuous multi-site availability.