Resilient Architecture Patterns: Overview
This reference defines a standardized set of multi-site resiliency patterns applicable to enterprise platforms, spanning the application, database, network, and data center layers. These patterns serve as baseline architectural options for organizations with varying recovery objectives, regulatory requirements, and cost constraints.
Each tier represents a distinct level of availability, operational complexity, and protection against site-level failures. The intent is to provide consistent terminology, predictable behaviors during failure events, and clear guidance for selecting the appropriate model based on business requirements.
Purpose
This repository entry provides:
- A unified classification system for multi-site resiliency patterns
- Consistent terminology and behavior expectations across all tiers
- RTO and RPO characteristics used to select the correct architecture
- A mapping between business requirements and architectural impact
- A reference index for the Tier 0 through Tier 4 patterns
These patterns abstract away vendor-specific dependencies and describe platform-agnostic behaviors.
Scope and Applicability
These patterns apply to:
- On-premises data centers
- Hybrid environments
- Private cloud and virtualized platforms
- Critical enterprise services requiring predictable failover behavior
They do not prescribe specific tooling. Examples such as SQL Server Always On availability groups, Oracle Data Guard, hypervisor replication, or DNS failover solutions can be layered onto these patterns as needed.
Tier Model Summary
Tier 0
Fully active-active, synchronous replication within short-distance metro regions
Primary Use Case: Zero or near-zero RPO workloads with strict continuity requirements
Tier 1
Active-active metro region with synchronous replication, plus a warm standby out-of-region with asynchronous replication
Primary Use Case: High availability with controlled RPO and a secondary out-of-region site
Tier 2
Primary site with warm out-of-region standby using asynchronous replication
Primary Use Case: Cost-optimized failover without metro dependencies
Tier 3
Primary site with hypervisor-level replication and manual DNS failover
Primary Use Case: Workloads tolerating longer RTO with simplified operations
Tier 4
Primary site with rehydration-only recovery at a remote location
Primary Use Case: Lowest-cost pattern suitable for non-critical workloads
Selection Criteria
Architectural patterns should be chosen based on measurable objectives:
Recovery Time Objective (RTO)
Defines how long a service may remain unavailable following an unplanned outage.
Recovery Point Objective (RPO)
Defines how much data loss is acceptable between the last protection event and the failure.
Distance and Latency Constraints
- Synchronous replication is limited by round-trip latency between sites, which in practice restricts it to metro-area distances
- Asynchronous mechanisms are required for regional or national replication
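A rough estimate of propagation delay shows why synchronous replication stays within metro distances. The sketch below assumes signals travel through optical fiber at about 200,000 km/s (200 km per millisecond) and that each synchronous write waits for one full round trip. It ignores switching, protocol, and storage delays, and real fiber routes are longer than straight-line distance, so actual figures will be higher. The function name and sample distances are illustrative only.

```python
# Rough propagation-delay estimate for synchronous replication.
# ASSUMPTION: signal speed in optical fiber is ~200,000 km/s (200 km per ms).
FIBER_KM_PER_MS = 200.0


def sync_round_trip_ms(fiber_distance_km: float) -> float:
    """Return the minimum latency added to each synchronous write, in ms.

    A synchronous write is acknowledged only after the remote site confirms
    it, so the write waits for at least one round trip (there and back).
    """
    if fiber_distance_km < 0:
        raise ValueError("distance must be non-negative")
    return 2 * fiber_distance_km / FIBER_KM_PER_MS


if __name__ == "__main__":
    for km in (50, 100, 1000, 4000):  # illustrative distances only
        print(f"{km:>5} km: at least {sync_round_trip_ms(km):.1f} ms per write")
```

Under these assumptions a 50 km metro link adds at least 0.5 ms to every write, while a 1,000 km link adds at least 10 ms, which is the kind of penalty that pushes regional and national replication toward asynchronous mechanisms.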
Operational Maturity
Patterns differ in automation, failover orchestration, and operational overhead.
Cost and Complexity
More resilient tiers (those with lower tier numbers) generally require more infrastructure and operational rigor.
RTO and RPO Expectations
Tier 0
- Expected RTO: Seconds to minutes
- Expected RPO: Zero or near zero
Tier 1
- Expected RTO: Minutes
- Expected RPO: Zero in metro, seconds to minutes out-of-region
Tier 2
- Expected RTO: 30–90 minutes
- Expected RPO: Seconds to minutes
Tier 3
- Expected RTO: Several hours
- Expected RPO: Minutes to hours
Tier 4
- Expected RTO: Hours to days
- Expected RPO: Hours to 24+ hours
Values are generalized and non-prescriptive. Actual outcomes depend on tooling, data volumes, and operational execution.
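As one way to apply these expectations, the sketch below filters tiers against a workload's RTO and RPO targets. The numeric upper bounds are illustrative assumptions chosen to make the qualitative ranges comparable; only Tier 2's 90-minute RTO comes directly from the figures above. The names `TierProfile`, `TIER_PROFILES`, and `candidate_tiers` are hypothetical. The function returns every tier that fits rather than a single answer, because the final choice also depends on distance, operational maturity, and cost.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TierProfile:
    tier: int
    worst_rto_minutes: float  # assumed upper bound on recovery time
    worst_rpo_minutes: float  # assumed upper bound on data loss


# ASSUMPTION: every bound below is illustrative, except Tier 2's 90-minute RTO.
TIER_PROFILES = [
    TierProfile(0, worst_rto_minutes=5, worst_rpo_minutes=0),
    # Tier 1 RPO is for a single metro-site failure; losing the whole metro
    # region falls back to the asynchronous copy (seconds to minutes of loss).
    TierProfile(1, worst_rto_minutes=15, worst_rpo_minutes=0),
    TierProfile(2, worst_rto_minutes=90, worst_rpo_minutes=5),
    TierProfile(3, worst_rto_minutes=8 * 60, worst_rpo_minutes=4 * 60),
    TierProfile(4, worst_rto_minutes=72 * 60, worst_rpo_minutes=24 * 60),
]


def candidate_tiers(rto_minutes: float, rpo_minutes: float) -> List[int]:
    """Return the tiers whose assumed worst-case RTO and RPO meet the targets."""
    return [
        p.tier
        for p in TIER_PROFILES
        if p.worst_rto_minutes <= rto_minutes
        and p.worst_rpo_minutes <= rpo_minutes
    ]


if __name__ == "__main__":
    print(candidate_tiers(rto_minutes=120, rpo_minutes=10))  # [0, 1, 2]
    print(candidate_tiers(rto_minutes=1, rpo_minutes=0))     # []
```

An empty result means no tier meets the targets under these assumed bounds, which signals that either the targets or the bounds need to be revisited.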
Behavioral Characteristics by Tier
Traffic Distribution
- Tier 0–1: Active-active across metro sites; Tier 1 adds an active-standby relationship with the out-of-region site
- Tier 2–4: DNS-based redirection with increasing amounts of manual activation
Data Replication
- Tier 0: Synchronous, metro-only
- Tier 1: Mixed synchronous and asynchronous
- Tier 2–4: Asynchronous or snapshot-based
Application State and Restart Behavior
- Tier 0–1: Continuous state maintenance
- Tier 2: Warm standby with partial state alignment
- Tier 3–4: Restart and rebuild patterns
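The characteristics above are stated per tier range, so a per-tier lookup can be handy when generating documentation or runbooks. The following sketch transcribes the lists into a dictionary keyed by inclusive tier range and resolves them for a single tier. The names `BEHAVIOR` and `describe_tier` are hypothetical, and the strings are paraphrases of the entries above rather than an authoritative schema.

```python
from typing import Dict, Tuple

# Behaviors keyed by inclusive (low, high) tier range, paraphrased from the
# lists above. ASSUMPTION: the key names are illustrative only.
BEHAVIOR: Dict[str, Dict[Tuple[int, int], str]] = {
    "traffic_distribution": {
        (0, 1): "active-active across metro sites (Tier 1 adds an out-of-region standby)",
        (2, 4): "DNS-based redirection with increasing manual activation",
    },
    "data_replication": {
        (0, 0): "synchronous, metro-only",
        (1, 1): "mixed synchronous and asynchronous",
        (2, 4): "asynchronous or snapshot-based",
    },
    "application_state": {
        (0, 1): "continuous state maintenance",
        (2, 2): "warm standby with partial state alignment",
        (3, 4): "restart and rebuild",
    },
}


def describe_tier(tier: int) -> Dict[str, str]:
    """Return the behavior of one tier for each characteristic."""
    if not 0 <= tier <= 4:
        raise ValueError("tier must be between 0 and 4")
    result: Dict[str, str] = {}
    for characteristic, ranges in BEHAVIOR.items():
        for (low, high), behavior in ranges.items():
            if low <= tier <= high:
                result[characteristic] = behavior
    return result


if __name__ == "__main__":
    for name, behavior in describe_tier(2).items():
        print(f"{name}: {behavior}")
```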
Document Structure
This overview serves as the entry point to the full set of tiered patterns:
- Tier 0 Resiliency Pattern - Metro active-active with synchronous replication
- Tier 1 Resiliency Pattern - Metro active-active plus out-of-region standby
- Tier 2 Resiliency Pattern - Primary with warm out-of-region standby
- Tier 3 Resiliency Pattern - Primary with hypervisor replication and manual failover
- Tier 4 Resiliency Pattern - Primary with rehydration-only recovery
Each tier document includes:
- Architecture summary
- Component layout and traffic flow
- Replication behavior
- DNS and load balancer expectations
- Operational considerations
- Suitable and unsuitable workload types
This resilient architecture patterns framework provides a structured approach to designing multi-site resiliency that balances business requirements with operational complexity and cost constraints.