Resilient Architecture Pattern: Tier 3
Tier 3 provides a pragmatic resiliency model based on hypervisor or platform-level replication, with manual or semi-automated DNS failover. It assumes a single active primary site and a secondary recovery site that remains offline or partially online until a failover event is initiated.
This tier trades continuous availability for reduced cost and complexity, while still delivering structured recovery capabilities within defined RTO and RPO targets.
The architecture diagram above illustrates the Tier 3 configuration with VM/platform-level replication from the primary site to powered-off VM replicas at the recovery site.
Purpose and Positioning
Tier 3 is designed for workloads that can tolerate several hours of downtime and some level of data loss, but still require more than simple backups. It is typically implemented using hypervisor replication, VM-level snapshots, or platform-native replication mechanisms.
Compared to Tier 2:
- Replication focus is at the VM or platform layer, not necessarily at the database or storage array layer
- Application awareness may be limited, leading to crash-consistent recovery unless additional measures are used
- Failover processes are more manual, with heavier reliance on operational runbooks
Architecture Summary
Primary Site
- Hosts the full production environment
- Runs all user-facing services
- Acts as the source for hypervisor or platform replication
- Maintains authoritative state for the platform
Recovery Site
- Receives VM or platform-level replication from the primary
- May host:
  - Powered-off replicas
  - Periodic VM snapshots
  - Incremental image-level backups within the hypervisor or platform
- Normally does not run production workloads
- Activated during defined DR events or planned failover tests
Traffic Flow and DNS Behavior
Normal Operations
- DNS records point solely to the primary site's endpoints
- No active load balancing between sites
- Recovery site endpoints are not advertised to clients
Failover Operations
After recovery procedures at the secondary site are completed:
- DNS records are manually or semi-automatically updated to point to the recovery site
- Reverse proxies and firewall rules are adjusted to reflect the new active location
- DNS and caching behavior directly influence RTO and must be considered in design
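The cutover step above is, at its core, a record update plus a TTL decision. The sketch below models that logic only; the zone data, record names, and the `failover` helper are hypothetical, and a real implementation would drive the DNS provider's API (or `nsupdate`) rather than mutate a dict:

```python
# Sketch of the semi-automated DNS cutover step (hypothetical zone data).
# The dict stands in for the zone so the cutover logic is visible.

FAILOVER_TTL = 60  # short TTL so clients re-resolve quickly after cutover


def failover(zone: dict, name: str, recovery_ip: str) -> dict:
    """Repoint `name` at the recovery site and shorten its TTL."""
    record = zone[name]
    updated = dict(zone)
    updated[name] = {"type": record["type"], "ip": recovery_ip, "ttl": FAILOVER_TTL}
    return updated


zone = {"app.example.com": {"type": "A", "ip": "198.51.100.10", "ttl": 3600}}
zone = failover(zone, "app.example.com", "203.0.113.20")
print(zone["app.example.com"]["ip"])  # 203.0.113.20
```

Note the TTL: records cached at the old value delay the cutover, which is why many teams pre-emptively lower TTLs ahead of planned failover tests.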
Replication Model
Hypervisor or Platform Replication
- VM-level replication between primary and recovery sites
- Frequency of replication defines RPO:
  - Near-continuous for some hypervisor technologies
  - Periodic intervals (e.g., every 15, 30, or 60 minutes) for others
- Typically produces crash-consistent images, unless combined with application-aware tools
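The relationship between replication interval and RPO is simple worst-case arithmetic: a failure just before the next cycle completes loses one full interval plus any in-flight transfer time. The interval values below are illustrative:

```python
def worst_case_rpo_minutes(interval_min, transfer_min=0):
    """Worst-case data loss for interval-based replication:
    one full interval plus the time spent shipping the replica."""
    return interval_min + transfer_min


# A 30-minute replication cycle whose transfer takes 5 minutes:
print(worst_case_rpo_minutes(30, 5))  # 35
```

This is why "replicates every 30 minutes" should be stated to the business as "up to ~35 minutes of data loss", not 30.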
Application and Data Consistency
For database or transactional workloads, additional mechanisms are recommended:
- Application-aware snapshots
- Pre- and post-freeze scripts
- Database-level log shipping in addition to VM replication
For less critical workloads, crash consistency may be acceptable.
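Pre- and post-freeze scripts only help if the thaw step is guaranteed to run even when the snapshot fails mid-way. A minimal sketch of that ordering, with `freeze` and `thaw` as hypothetical stand-ins for the real application hooks:

```python
from contextlib import contextmanager


@contextmanager
def quiesced(freeze_io, thaw_io):
    """Freeze application I/O for the duration of a snapshot and always
    thaw afterwards, even if the snapshot raises (hooks are hypothetical)."""
    freeze_io()
    try:
        yield
    finally:
        thaw_io()  # runs on success AND on failure


events = []
with quiesced(lambda: events.append("freeze"), lambda: events.append("thaw")):
    events.append("snapshot")
print(events)  # ['freeze', 'snapshot', 'thaw']
```

A freeze hook that can be left engaged after a failed snapshot is itself an availability risk, which is why the thaw belongs in a `finally`-style guarantee rather than a plain script step.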
Failure Scenarios and Outcomes
Primary Site Loss
- Replication stops at the moment of failure; the last completed replica defines the achievable RPO
- Administrators perform the following at the recovery site:
  - Promote or restore replicated VMs
  - Reconfigure IPs, routes, or use pre-designed overlay networks
  - Validate application health
- Once validated, DNS is updated to direct traffic to the recovery site
- RTO is normally several hours, depending on:
  - Replication coverage
  - Number of systems to promote
  - Level of automation
Service-Level Failures
- Selected VMs or services can be failed over individually, but operational complexity increases
- Many organizations prefer full application stack or site-level failover to reduce coordination risk
Operational Considerations
Runbooks and Procedures
DR runbooks must define:
- Which VMs replicate and in what order they are recovered
- Network transformations required at the recovery site
- Integration points with identity, DNS, and security services
- Acceptance tests before making the site live
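The recovery order a runbook prescribes is, in effect, a topological sort of the application's dependency graph. A sketch using Python's standard-library `graphlib`, with an illustrative (hypothetical) dependency map:

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Each VM lists the VMs it depends on; dependencies must boot first.
depends_on = {
    "app-vm":  ["db-vm", "auth-vm"],
    "db-vm":   [],
    "auth-vm": ["db-vm"],
    "web-vm":  ["app-vm"],
}

boot_order = list(TopologicalSorter(depends_on).static_order())
print(boot_order)  # db-vm boots first, web-vm last
```

Encoding the order as a graph rather than a flat list keeps the runbook valid when services are added, and `TopologicalSorter` raises `CycleError` if someone introduces a circular dependency.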
Testing and Validation
Regular DR tests should validate:
- VM promotion and boot sequence
- Application dependency ordering
- Connectivity from external and internal clients
- Ability to sustain expected load in the recovery site
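Before DNS is cut over, the validation results above should act as a single go/no-go gate: every acceptance check passes or the site is not declared live. A sketch with illustrative check names:

```python
def ready_for_cutover(results):
    """Go/no-go gate: the recovery site is live only when every
    acceptance test passes (check names below are illustrative)."""
    return bool(results) and all(results.values())


checks = {"vm-boot": True, "db-replay": True, "external-https": False}
print(ready_for_cutover(checks))  # False: external connectivity still failing
```

Treating an empty result set as "not ready" avoids the classic failure where a broken test harness reports no failures and the cutover proceeds blind.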
Annual or semi-annual full-scale tests are recommended for Tier 3.
Network and Security Alignment
- Firewall policies, VLANs, and zones must be pre-defined at the recovery site
- Identity systems and logging endpoints need connectivity to support operational visibility
- Any dependency still anchored to the primary site reduces actual recoverability
Appropriate Workloads
Tier 3 is suitable for:
- Business services where several hours of downtime is acceptable
- Internal tools and portals with documented workarounds
- Non-real-time analytics or reporting platforms
- Batch processing systems with restartable jobs
- Development, test, and lower-criticality staging environments that still require DR posture
Unsuitable Workloads
Avoid Tier 3 for:
- High-frequency transactional systems with strict data integrity requirements
- Platforms where crash-consistent recovery is unacceptable
- Customer-facing systems with contractual uptime or narrow SLA windows
- Critical identity or authentication services that underpin higher-tier platforms
Risks and Tradeoffs
- RTO is longer due to manual orchestration and DNS propagation
- RPO can range from minutes to hours, depending on replication frequency
- Crash-consistent recovery may require application-level repair or reconciliation
- Incomplete replication coverage (e.g., missing support services) can cause failures during DR tests
- Misaligned networking or firewall rules at the recovery site can significantly delay failover
Summary
Tier 3 offers a practical, infrastructure-centric resiliency pattern that leverages hypervisor or platform replication rather than fully integrated, application-aware architectures. It is a common choice for workloads that are important but not mission-critical, where hours of downtime and some data loss are acceptable tradeoffs for reduced cost and complexity.
It provides a structured step up from pure backup-based recovery, without the substantial investment required for active-active or synchronous multi-site designs.