
Resilient Architecture Pattern: Tier 2

📊 Tier 2: Single Primary + Out-of-Region Warm Standby

Tier 2 provides a cost-optimized resiliency model built around a single primary production site and a warm out-of-region standby kept current through asynchronous replication. Unlike Tier 1, there is no active-active or synchronous metro pairing; all continuity is achieved through replication-based protection and orchestrated failover.

This model is widely used in enterprises that require regional survivability but do not need the operational or financial overhead of multi-site active-active designs.

The architecture diagram above shows the Tier 2 configuration: a single primary production site handles all traffic while replicating continuously and asynchronously to an out-of-region warm standby.


Purpose and Positioning

Tier 2 enables recovery from data center loss through asynchronous replication and structured disaster recovery procedures. It strikes a balance between capability and cost, reducing infrastructure requirements while preserving recoverability.

Compared to Tier 1:

  • There is no metro synchronous pair
  • Only one site runs production traffic
  • The standby site operates with reduced or dormant capacity
  • RTO and RPO are higher, but costs and complexity are significantly lower

Architecture Summary

Primary Site

  • Hosts the full production stack
  • Handles all user traffic
  • Maintains authoritative data and application state
  • Provides the replication source for the out-of-region standby
  • Operates independently without real-time dependency on the remote site

Out-of-Region Standby

  • Receives continuous asynchronous replication
  • May maintain warm application instances or templates
  • No production traffic flows to this site in steady state
  • Activated during loss of the primary site or controlled DR testing
  • May have scaled-down compute until DR invocation

Traffic Flow and DNS Behavior

Normal Operations

  • DNS points exclusively to the primary site
  • Health checks validate service endpoints only at the primary site
  • Failover behavior is not automatic; DNS changes occur only during DR invocation
  • Global load balancing (if used) keeps the standby disabled or unadvertised
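
A minimal sketch of the steady-state health check, using only the Python standard library (the endpoint URLs and /healthz path below are placeholders, not names from this architecture):

```python
import urllib.request
import urllib.error

# Placeholder URLs: substitute the real service endpoints at the
# primary site. Only the primary is probed in steady state.
PRIMARY_ENDPOINTS = [
    "https://app.primary.example.com/healthz",
    "https://api.primary.example.com/healthz",
]

def is_healthy(url: str, timeout: float = 5.0) -> bool:
    """Return True if the endpoint answers HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

for url in PRIMARY_ENDPOINTS:
    print(f"{url}: {'healthy' if is_healthy(url) else 'FAILED'}")
```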

DR Activation

  • DNS updates redirect traffic to the out-of-region site
  • Application capacity is scaled out if warm services exist
  • Infrastructure components (firewalls, reverse proxies, identity endpoints) are promoted from standby state
  • DNS cutover time and propagation are part of the RTO and must be accounted for in design
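
As an illustration, if DNS were hosted in AWS Route 53 (an assumption; any DNS provider with an API works the same way), the cutover might look like the sketch below. The zone ID, record name, and standby IP are placeholders. Note that the record's TTL, set long before any DR event, bounds how long cached answers keep pointing at the failed primary and is therefore part of the RTO.

```python
import boto3

HOSTED_ZONE_ID = "Z0000000EXAMPLE"   # placeholder
RECORD_NAME = "app.example.com."     # placeholder
STANDBY_IP = "203.0.113.10"          # placeholder

def cut_over_dns() -> str:
    """UPSERT the production A record to point at the standby site."""
    route53 = boto3.client("route53")
    resp = route53.change_resource_record_sets(
        HostedZoneId=HOSTED_ZONE_ID,
        ChangeBatch={
            "Comment": "DR invocation: fail over to out-of-region standby",
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": RECORD_NAME,
                    "Type": "A",
                    "TTL": 60,  # low TTL limits client-side caching delay
                    "ResourceRecords": [{"Value": STANDBY_IP}],
                },
            }],
        },
    )
    return resp["ChangeInfo"]["Id"]
```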

Data Replication Model

Asynchronous Replication

  • All critical data is replicated using asynchronous methods
  • RPO is non-zero and defined by replication interval and network conditions
  • Acceptable for most business-critical but not mission-critical workloads
  • May be based on:
    • Database-level asynchronous log shipping
    • Storage system asynchronous volume replication
    • Hypervisor or platform replication mechanisms
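
For example, if the database-level option above is PostgreSQL log shipping (chosen here purely for illustration), the standby's replication lag, and therefore the instantaneous RPO, can be sampled directly; the connection string is a placeholder:

```python
import psycopg2  # PostgreSQL driver; assumes log-shipping replication

STANDBY_DSN = "host=standby.example.com dbname=app user=monitor"

def replication_lag_seconds() -> float:
    """Seconds between now and the last transaction replayed on the
    standby; this is the data at risk (the instantaneous RPO) if the
    primary failed right now."""
    with psycopg2.connect(STANDBY_DSN) as conn:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT EXTRACT(EPOCH FROM "
                "now() - pg_last_xact_replay_timestamp())"
            )
            (lag,) = cur.fetchone()
            # NULL means nothing has been replayed yet; treat as unknown.
            return float(lag) if lag is not None else float("nan")

print(f"Current replication lag: {replication_lag_seconds():.1f}s")
```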

Application Consistency

  • Application-aware snapshots or replication should be used when possible
  • Systems relying solely on crash-consistent replication require additional validation after failover
  • Out-of-region data must be periodically tested to ensure recoverability
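
To make the distinction concrete, here is a minimal sketch of an application-aware snapshot, assuming a database that can be quiesced and a storage array with a snapshot API (flush_and_pause_writes, resume_writes, and snapshot are hypothetical names, not calls from any real library):

```python
from contextlib import contextmanager

@contextmanager
def quiesced(db):
    """Flush and pause writes so the snapshot captures an
    application-consistent state rather than a crash-consistent one."""
    db.flush_and_pause_writes()   # hypothetical call
    try:
        yield
    finally:
        db.resume_writes()        # hypothetical call

def app_consistent_snapshot(db, array, volume: str) -> str:
    # Taken outside the quiesced() block, this snapshot would be merely
    # crash-consistent and would need extra recovery validation after
    # failover.
    with quiesced(db):
        return array.snapshot(volume)  # hypothetical storage API
```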

Failure Scenarios and Outcomes

Primary Site Outage

  • DR invocation is required
  • Out-of-region standby is brought online following runbook procedures
  • DNS or GSLB is updated to direct traffic to the standby site
  • RTO commonly ranges from 30 minutes to several hours depending on automation maturity
  • RPO equals replication lag at the moment of failure

Partial Failure or Service Degradation

  • Individual components can be failed over if replication supports application-level recovery
  • More commonly, full site failover is used for simplicity

Network Partition

  • Production remains active at the primary
  • Standby remains unavailable for promotion until link restoration
  • Split-brain is avoided because the standby is never promoted while the primary remains active; DR invocation requires positive confirmation that the primary is down

Operational Considerations

Replication and Bandwidth Planning

  • Replication traffic must be isolated or rate-limited to protect production workloads
  • Bandwidth sizing must accommodate peak replication bursts
  • Replication lag must be monitored continuously against the agreed RPO
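
A back-of-the-envelope sizing check, using illustrative numbers (the daily change rate and burst factor below are placeholders to be replaced with measured values):

```python
def required_mbps(daily_change_gb: float,
                  replication_window_hours: float = 24.0,
                  burst_factor: float = 2.0) -> float:
    """Average link rate needed to keep pace with the data change
    rate, with a multiplier as headroom for peak replication bursts."""
    megabits = daily_change_gb * 8 * 1000        # GB -> Mb (decimal)
    seconds = replication_window_hours * 3600
    return megabits / seconds * burst_factor

# Illustration: 500 GB of changed data per day, replicated
# continuously, with 2x burst headroom -> roughly 93 Mbps.
print(f"{required_mbps(500):.0f} Mbps")
```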

Runbooks and Automation

DR runbooks should document:

  • Failover steps
  • DNS updates
  • Application promotion procedures
  • Storage or database recovery actions
  • Validation steps before declaring DR stable

Automation is recommended for:

  • Bootstrapping application clusters
  • Runbook execution
  • DNS updates
  • Health verification after failover
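
A skeleton of what that orchestration might look like; every step function here is a hypothetical placeholder for site-specific tooling, and the ordered steps plus the verification gate are the point, not the function names:

```python
import sys

def promote_database_replica() -> None: ...
def bootstrap_application_clusters() -> None: ...
def promote_network_services() -> None: ...   # firewalls, proxies, identity
def update_dns_to_standby() -> None: ...

def verify_service_health() -> bool:
    """Placeholder: probe application endpoints at the standby site."""
    return True  # replace with real post-failover checks

RUNBOOK = [
    ("Promote replicated database to primary role", promote_database_replica),
    ("Bootstrap application clusters", bootstrap_application_clusters),
    ("Promote network and identity services", promote_network_services),
    ("Cut DNS over to the standby site", update_dns_to_standby),
]

def invoke_dr() -> None:
    for description, step in RUNBOOK:
        print(f"-> {description}")
        step()
    # Do not declare DR stable until validation passes.
    if not verify_service_health():
        sys.exit("Post-failover validation failed; escalate before go-live")
    print("DR invocation complete; standby is now serving production")

if __name__ == "__main__":
    invoke_dr()
```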

Testing Requirements

Scheduled DR testing validates:

  • Replication integrity
  • Application recovery behavior
  • DNS cutover processes
  • Ability to sustain expected production load

Annual full failover tests are recommended for Tier 2 environments.


Appropriate Workloads

Tier 2 is suitable for:

  • Business-critical applications with moderate continuity requirements
  • Systems requiring regional recovery but not real-time metro redundancy
  • Applications where a small amount of data loss is acceptable
  • Platforms with robust restart behavior
  • Systems that support warm start operations

Unsuitable Workloads

Avoid Tier 2 for:

  • Mission-critical platforms requiring zero or near-zero RPO
  • Low-tolerance transactional systems where asynchronous lag is unacceptable
  • Applications that cannot be restarted cleanly from replicated data
  • Systems requiring always-active multi-site operation
  • Solutions with strict regulatory uptime requirements (depending on sector)

Risks and Tradeoffs

  • RPO is non-zero and must be understood by business stakeholders
  • Failover is not automatic and typically requires runbook validation
  • Standby capacity requires ongoing maintenance despite being unused in steady state
  • Application compatibility with asynchronous replication must be validated
  • DNS cutover contributes measurable delays to RTO

Summary

Tier 2 provides a balanced resiliency model that supports regional recovery without the cost and complexity of active-active architectures. It is well suited for organizations seeking robust continuity capabilities within budget and operational constraints.

The pattern relies on asynchronous replication, warm standby infrastructure, and structured failover procedures. This tier is ideal when organizations require dependable recovery capabilities but do not need continuous multi-site availability.