Resilient Architecture Pattern: Tier 3
Tier 3 provides a pragmatic resiliency model based on hypervisor or platform-level replication, with manual or semi-automated DNS failover. It assumes a single active primary site and a secondary recovery site that remains offline or partially online until a failover event is initiated.
This tier trades continuous availability for reduced cost and complexity, while still delivering structured recovery capabilities within defined RTO and RPO targets.
The architecture diagram above illustrates the Tier 3 configuration with VM/platform-level replication from the primary site to powered-off VM replicas at the recovery site.
Purpose and Positioning
Tier 3 is designed for workloads that can tolerate several hours of downtime and some level of data loss, but still require more than simple backups. It is typically implemented using hypervisor replication, VM-level snapshots, or platform-native replication mechanisms.
Compared to Tier 2:
- Replication focus is at the VM or platform layer, not necessarily at the database or storage array layer
- Application awareness may be limited, leading to crash-consistent recovery unless additional measures are used
- Failover processes are more manual, with heavier reliance on operational runbooks
Architecture Summary
Primary Site
- Hosts the full production environment
- Runs all user-facing services
- Acts as the source for hypervisor or platform replication
- Maintains authoritative state for the platform
Recovery Site
- Receives VM or platform-level replication from the primary
- May host:
  - Powered-off replicas
  - Periodic VM snapshots
  - Incremental image-level backups within the hypervisor or platform
- Normally does not run production workloads
- Activated during defined DR events or planned failover tests
Traffic Flow and DNS Behavior
Normal Operations
- DNS records point solely to the primary site's endpoints
- No active load balancing between sites
- Recovery site endpoints are not advertised to clients
Failover Operations
After recovery procedures at the secondary site are completed:
- DNS records are manually or semi-automatically updated to point to the recovery site
- Reverse proxies and firewall rules are adjusted to reflect the new active location
- DNS and caching behavior directly influence RTO and must be considered in design
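The cutover step above is, at its core, a record update plus a TTL decision. The sketch below models that logic only; the zone data, record names, and the `failover` helper are hypothetical, and a real implementation would drive the DNS provider's API (or `nsupdate`) rather than mutate a dict:

```python
# Sketch of the semi-automated DNS cutover step (hypothetical zone data).
# The dict stands in for the zone so the cutover logic is visible.

FAILOVER_TTL = 60  # short TTL so clients re-resolve quickly after cutover


def failover(zone: dict, name: str, recovery_ip: str) -> dict:
    """Repoint `name` at the recovery site and shorten its TTL."""
    record = zone[name]
    updated = dict(zone)
    updated[name] = {"type": record["type"], "ip": recovery_ip, "ttl": FAILOVER_TTL}
    return updated


zone = {"app.example.com": {"type": "A", "ip": "198.51.100.10", "ttl": 3600}}
zone = failover(zone, "app.example.com", "203.0.113.20")
print(zone["app.example.com"]["ip"])  # 203.0.113.20
```

Note the TTL: records cached at the old value delay the cutover, which is why many teams pre-emptively lower TTLs ahead of planned failover tests.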
Replication Model
Hypervisor or Platform Replication
- VM-level replication between primary and recovery sites
- Frequency of replication defines RPO:
  - Near-continuous for some hypervisor technologies
  - Periodic intervals (e.g., every 15, 30, or 60 minutes) for others
- Typically produces crash-consistent images, unless combined with application-aware tools
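The relationship between replication interval and RPO is simple worst-case arithmetic: a failure just before the next cycle completes loses one full interval plus any in-flight transfer time. The interval values below are illustrative:

```python
def worst_case_rpo_minutes(interval_min, transfer_min=0):
    """Worst-case data loss for interval-based replication:
    one full interval plus the time spent shipping the replica."""
    return interval_min + transfer_min


# A 30-minute replication cycle whose transfer takes 5 minutes:
print(worst_case_rpo_minutes(30, 5))  # 35
```

This is why "replicates every 30 minutes" should be stated to the business as "up to ~35 minutes of data loss", not 30.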
Application and Data Consistency
For database or transactional workloads, additional mechanisms are recommended:
- Application-aware snapshots
- Pre- and post-freeze scripts
- Database-level log shipping in addition to VM replication
For less critical workloads, crash consistency may be acceptable.
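Pre- and post-freeze scripts only help if the thaw step is guaranteed to run even when the snapshot fails mid-way. A minimal sketch of that ordering, with `freeze` and `thaw` as hypothetical stand-ins for the real application hooks:

```python
from contextlib import contextmanager


@contextmanager
def quiesced(freeze_io, thaw_io):
    """Freeze application I/O for the duration of a snapshot and always
    thaw afterwards, even if the snapshot raises (hooks are hypothetical)."""
    freeze_io()
    try:
        yield
    finally:
        thaw_io()  # runs on success AND on failure


events = []
with quiesced(lambda: events.append("freeze"), lambda: events.append("thaw")):
    events.append("snapshot")
print(events)  # ['freeze', 'snapshot', 'thaw']
```

A freeze hook that can be left engaged after a failed snapshot is itself an availability risk, which is why the thaw belongs in a `finally`-style guarantee rather than a plain script step.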
Failure Scenarios and Outcomes
Primary Site Loss
- Replication stops at the moment of failure; the last completed replica defines the achievable RPO
- Administrators perform the following at the recovery site:
  - Promote or restore replicated VMs
  - Reconfigure IPs, routes, or use pre-designed overlay networks
  - Validate application health
- Once validated, DNS is updated to direct traffic to the recovery site
- RTO is normally several hours, depending on:
  - Replication coverage
  - Number of systems to promote
  - Level of automation
Service-Level Failures
- Selected VMs or services can be failed over individually, but operational complexity increases
- Many organizations prefer full application stack or site-level failover to reduce coordination risk
Operational Considerations
Runbooks and Procedures
DR runbooks must define:
- Which VMs replicate and in what order they are recovered
- Network transformations required at the recovery site
- Integration points with identity, DNS, and security services
- Acceptance tests before making the site live
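The recovery order a runbook prescribes is, in effect, a topological sort of the application's dependency graph. A sketch using Python's standard-library `graphlib`, with an illustrative (hypothetical) dependency map:

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Each VM lists the VMs it depends on; dependencies must boot first.
depends_on = {
    "app-vm":  ["db-vm", "auth-vm"],
    "db-vm":   [],
    "auth-vm": ["db-vm"],
    "web-vm":  ["app-vm"],
}

boot_order = list(TopologicalSorter(depends_on).static_order())
print(boot_order)  # db-vm boots first, web-vm last
```

Encoding the order as a graph rather than a flat list keeps the runbook valid when services are added, and `TopologicalSorter` raises `CycleError` if someone introduces a circular dependency.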
Testing and Validation
Regular DR tests should validate:
- VM promotion and boot sequence
- Application dependency ordering
- Connectivity from external and internal clients
- Ability to sustain expected load in the recovery site
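Before DNS is cut over, the validation results above should act as a single go/no-go gate: every acceptance check passes or the site is not declared live. A sketch with illustrative check names:

```python
def ready_for_cutover(results):
    """Go/no-go gate: the recovery site is live only when every
    acceptance test passes (check names below are illustrative)."""
    return bool(results) and all(results.values())


checks = {"vm-boot": True, "db-replay": True, "external-https": False}
print(ready_for_cutover(checks))  # False: external connectivity still failing
```

Treating an empty result set as "not ready" avoids the classic failure where a broken test harness reports no failures and the cutover proceeds blind.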
Annual or semi-annual full-scale tests are recommended for Tier 3.
Network and Security Alignment
- Firewall policies, VLANs, and zones must be pre-defined at the recovery site
- Identity systems and logging endpoints need connectivity to support operational visibility
- Any dependency still anchored to the primary site reduces actual recoverability
Appropriate Workloads
Tier 3 is suitable for:
- Business services where several hours of downtime is acceptable
- Internal tools and portals with documented workarounds
- Non-real-time analytics or reporting platforms
- Batch processing systems with restartable jobs
- Development, test, and lower-criticality staging environments that still require DR posture
Unsuitable Workloads
Avoid Tier 3 for:
- High-frequency transactional systems with strict data integrity requirements
- Platforms where crash-consistent recovery is unacceptable
- Customer-facing systems with contractual uptime or narrow SLA windows
- Critical identity or authentication services that underpin higher-tier platforms
Risks and Tradeoffs
- RTO is longer due to manual orchestration and DNS propagation
- RPO can range from minutes to hours, depending on replication frequency
- Crash-consistent recovery may require application-level repair or reconciliation
- Incomplete replication coverage (e.g., missing support services) can cause failures during DR tests
- Misaligned networking or firewall rules at the recovery site can significantly delay failover
Summary
Tier 3 offers a practical, infrastructure-centric resiliency pattern that leverages hypervisor or platform replication rather than fully integrated, application-aware architectures. It is a common choice for workloads that are important but not mission-critical, where hours of downtime and some data loss are acceptable tradeoffs for reduced cost and complexity.
It provides a structured step up from pure backup-based recovery, without the substantial investment required for active-active or synchronous multi-site designs.