Enterprise DR Architecture: RPO/RTO Tiers and What Actually Matters

Disaster recovery planning in most organizations starts with a vendor demo. Someone shows a dashboard with green checkmarks, recovery times measured in minutes, and a seamless failover button. The contract gets signed. The product gets deployed. And when the actual disaster happens, nothing works the way the demo showed, because nobody aligned the technology to the business requirements.

DR architecture doesn't start with technology. It starts with two numbers and one question: given your RPO and your RTO, can you afford the infrastructure that achieves them?

RPO and RTO: what they actually mean

Recovery Point Objective (RPO): How much data can you afford to lose, measured in time. An RPO of 1 hour means you accept losing up to 1 hour of data. An RPO of zero means you accept losing nothing -- every transaction must be preserved.

Recovery Time Objective (RTO): How long can the business tolerate the system being down, measured in time. An RTO of 4 hours means the application must be operational within 4 hours of a declared disaster. An RTO of zero means no downtime -- ever.

What people get wrong: RPO and RTO are business decisions, not technical ones. IT doesn't decide what they should be. The business decides based on financial impact, regulatory requirements, and risk tolerance. IT implements the architecture that achieves those targets. If the business wants a zero RPO and a 15-minute RTO but won't fund the infrastructure for it, that's a risk acceptance decision, not a technical problem.
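To make the RPO definition concrete, here is a minimal sketch of the arithmetic: data lost in a disaster is simply the time since the last recoverable copy, compared against the business's stated RPO. Function names and timestamps are illustrative, not from any particular tool.

```python
from datetime import datetime, timedelta

def data_loss_window(last_recovery_point: datetime, failure_time: datetime) -> timedelta:
    """Data lost in a disaster = time elapsed since the last recoverable copy."""
    return failure_time - last_recovery_point

def meets_rpo(last_recovery_point: datetime, failure_time: datetime,
              rpo: timedelta) -> bool:
    """True if the loss window fits inside the business's stated RPO."""
    return data_loss_window(last_recovery_point, failure_time) <= rpo

# Hourly log shipping, with a failure 90 minutes after the last shipped log:
last_ship = datetime(2024, 5, 1, 12, 0)
failure = datetime(2024, 5, 1, 13, 30)
print(meets_rpo(last_ship, failure, rpo=timedelta(hours=1)))  # 90 min of loss breaches a 1-hour RPO
print(meets_rpo(last_ship, failure, rpo=timedelta(hours=4)))  # but fits a 4-hour RPO
```

The point of the sketch: the RPO is a budget the business sets, and the replication or backup interval is what consumes it.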

The tier model

Not every system deserves the same recovery treatment. A tiered model assigns recovery requirements based on business criticality.

Tier 0: Mission-critical / zero downtime

RPO: Zero. RTO: Near-zero (minutes).

These are systems where any data loss or downtime directly impacts revenue, safety, or regulatory compliance.

Examples:

  • Financial transaction processing
  • Healthcare patient records (during active care)
  • Industrial control systems
  • Authentication/identity infrastructure

Architecture:

  • Active-active or synchronous replication across sites
  • Automated failover with no human intervention
  • Continuous data protection (CDP) or synchronous database replication
  • Tested failover at least quarterly

Cost: Very high. Synchronous replication requires low-latency connectivity between sites (dedicated fiber, metro Ethernet). Active-active application architecture requires the application to be designed for it.

Reality check: Very few workloads genuinely need Tier 0. When someone says "everything is critical," nothing is. Push back and make them quantify the per-hour cost of downtime. The CFO's answer usually moves things to Tier 1 or 2.

Tier 1: Business-critical / rapid recovery

RPO: 1-4 hours. RTO: 1-4 hours.

Systems that cause significant business disruption when down but can tolerate a short outage and minimal data loss.

Examples:

  • Email and collaboration (Exchange, Teams)
  • ERP systems
  • Core business databases
  • CI/CD pipelines (in organizations where deployment velocity is revenue-critical)

Architecture:

  • Asynchronous replication to a secondary site
  • Warm standby that can be activated within the RTO window
  • Database log shipping or streaming replication with a lag window matching the RPO
  • Automated monitoring and alerting with documented runbooks

Cost: Moderate. Asynchronous replication doesn't require low-latency site connectivity. Warm standby infrastructure can be sized smaller than production (scale up after failover).
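One way to operationalize "a lag window matching the RPO" is to alert when measured replication lag starts consuming the RPO budget, well before it breaches it. A hedged sketch; the warning threshold and status strings are illustrative choices, not a standard:

```python
def lag_status(lag_seconds: float, rpo_seconds: float,
               warn_fraction: float = 0.5) -> str:
    """Classify measured replication lag against the RPO budget.

    warn_fraction: alert early, once lag exceeds this fraction of the RPO,
    so operators act before the target is actually breached.
    """
    if lag_seconds >= rpo_seconds:
        return "CRITICAL: replication lag exceeds RPO"
    if lag_seconds >= rpo_seconds * warn_fraction:
        return "WARNING: replication lag consuming RPO budget"
    return "OK"

# A Tier 1 system with a 1-hour RPO:
print(lag_status(300, 3600))    # OK
print(lag_status(2400, 3600))   # WARNING
print(lag_status(4000, 3600))   # CRITICAL
```

Feed it the lag your replication layer already reports (e.g., streaming-replication delay in seconds) and wire the non-OK states into the alerting mentioned above.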

Tier 2: Important / scheduled recovery

RPO: 24 hours. RTO: 24-48 hours.

Systems that the business can function without for a day or two, using manual workarounds if necessary.

Examples:

  • Internal wikis and documentation
  • Development and staging environments
  • Reporting and analytics platforms
  • Non-revenue-facing web applications

Architecture:

  • Daily backups to offsite storage
  • Documented rebuild procedures
  • Infrastructure-as-code that can recreate the environment
  • Recovery tested semi-annually

Cost: Low. Daily backups to object storage are inexpensive. The "infrastructure" at the DR site is a recovery procedure, not running hardware.

Tier 3: Non-critical / rebuild from scratch

RPO: Best effort -- loss of recent data is acceptable. RTO: Days to weeks.

Systems that can be entirely rebuilt or that contain data which is reproducible.

Examples:

  • Build caches and artifact repositories
  • Training and lab environments
  • Historical archives (with copies in other systems)
  • Monitoring data (metrics history)

Architecture:

  • No dedicated DR infrastructure
  • Rebuild procedures documented
  • Data is either reproducible or loss is accepted

Cost: Minimal. The "DR plan" is "rebuild it when we need it."
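The tier boundaries above can be captured in a small classification helper, useful when triaging a large system inventory. A sketch only: the cutoffs mirror this article's tier definitions, and a real program should treat them as policy inputs rather than constants.

```python
def classify_tier(rpo_hours: float, rto_hours: float) -> int:
    """Map business-stated RPO/RTO targets onto the four-tier model.

    Boundaries follow the tier definitions in this document:
    Tier 0 = zero RPO, minutes of RTO; Tier 1 = up to 4 h each;
    Tier 2 = up to 24 h RPO / 48 h RTO; Tier 3 = everything else.
    """
    if rpo_hours == 0 and rto_hours <= 0.25:  # zero loss, minutes to recover
        return 0
    if rpo_hours <= 4 and rto_hours <= 4:
        return 1
    if rpo_hours <= 24 and rto_hours <= 48:
        return 2
    return 3

print(classify_tier(0, 0.1))    # 0
print(classify_tier(2, 4))      # 1
print(classify_tier(24, 36))    # 2
print(classify_tier(48, 168))   # 3
```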

The real architecture decisions

Site selection

DR requires geographic separation. The question is how much.

Same metro area (10-50 miles): Protects against building-level failures (fire, flood, power outage). Low-latency connectivity for synchronous replication. Doesn't protect against regional disasters (hurricanes, earthquakes, widespread power grid failure).

Cross-region (200+ miles): Protects against regional disasters. Higher latency makes synchronous replication impractical for most workloads. Asynchronous replication with defined RPO.

The hybrid approach: Synchronous replication to a metro DR site for Tier 0/1, asynchronous replication to a cross-region site for Tier 1/2. This gives you rapid failover for critical workloads and regional protection for everything.

Network architecture

The DR site needs network connectivity for:

  • Replication traffic (continuous, bandwidth-intensive)
  • Failover traffic (when the DR site becomes primary)
  • Management traffic (monitoring, administration)

Common mistake: sizing the DR network for replication only. When you fail over, the DR site handles production traffic. If your production site has 10 Gbps of user-facing bandwidth, your DR site needs comparable capacity -- not the 1 Gbps link you provisioned for replication.

DNS is the failover mechanism for most architectures. When you declare a disaster, DNS records are updated to point to the DR site's IP addresses. TTL values on critical DNS records should be low (60-300 seconds) so changes propagate quickly.
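Low TTLs are easy to state and easy to let drift. A small audit sketch that flags records whose TTLs would slow a DNS-based failover; the record names and the 300-second ceiling (taken from the 60-300 second guidance above) are illustrative:

```python
FAILOVER_TTL_MAX = 300  # seconds; upper bound of the 60-300 s guidance

def audit_ttls(records: dict[str, int], ttl_max: int = FAILOVER_TTL_MAX) -> list[str]:
    """Return the DNS names whose TTL is too high for rapid failover."""
    return [name for name, ttl in records.items() if ttl > ttl_max]

zone = {
    "app.example.com": 60,
    "api.example.com": 3600,  # cached answers could delay failover by up to an hour
    "db.example.com": 120,
}
print(audit_ttls(zone))  # ['api.example.com']
```

In practice you would populate `records` from your DNS provider's API or a zone file export rather than a hardcoded dict.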

Storage replication

Synchronous replication: Every write is committed to both sites before the application receives an acknowledgment. Zero data loss. Performance impact proportional to inter-site latency. Use for Tier 0 only.

Asynchronous replication: Writes are committed locally, then replicated to the DR site in the background. Data loss equals the replication lag. Use for Tier 1 and 2.

Snapshot-based replication: Periodic snapshots are sent to the DR site. Data loss equals the snapshot interval. Use for Tier 2 and 3.

For Ceph environments: Ceph supports RBD mirroring (image-level, asynchronous replication between clusters) in journal-based and snapshot-based modes. CephFS has snapshot-based replication via cephfs-mirror. Both are production-ready for most DR scenarios, but note that neither provides synchronous replication -- a Tier 0 zero-RPO requirement needs a different mechanism, such as a stretch cluster.

The organizational failures

The technology is the easy part. The organizational failures are what actually kill DR.

Failure 1: The plan exists but hasn't been tested

A DR plan that hasn't been tested is a hypothesis, not a plan. Testing means actually failing over to the DR site, running production workloads on it, and verifying that recovery meets the RPO/RTO targets.

"We did a tabletop exercise" is not testing. Tabletop exercises identify procedural gaps. Actual failover tests identify technical gaps. You need both.

Minimum testing cadence:

  • Quarterly: Tier 0 and Tier 1 failover tests
  • Semi-annually: Full site failover
  • After every major change: Architecture changes, storage migration, network reconfiguration

Failure 2: The plan is outdated

Infrastructure changes. Applications get added, decommissioned, or migrated. The DR plan written 18 months ago doesn't reflect the current environment.

Tie DR plan updates to your change management process. Every significant infrastructure change should include a line item: "Update DR documentation and recovery procedures."

Failure 3: Nobody knows their role

During a real disaster, stress is high and communication is chaotic. If the DR plan says "the infrastructure team initiates failover" but doesn't specify who, what tools they use, and what order they do it in, the failover will be delayed by confusion.

Name names. Specify tools. Write runbooks with step-by-step commands. The DR plan should be executable by someone who wasn't involved in designing it.

Failure 4: DR is treated as an IT project

DR is a business continuity function. It requires executive sponsorship, cross-department coordination (IT, security, compliance, operations), and budget allocation that reflects actual business risk.

When DR is "just an IT thing," it gets funded with leftovers, tested on weekends by whoever volunteers, and forgotten until the disaster happens. By then, it's too late to discover that the backup site's SSL certificates expired six months ago.

Building the DR architecture document

Every environment should have a DR architecture document that covers:

  1. System inventory -- Every system, its tier classification, and its RPO/RTO
  2. Replication architecture -- How data gets to the DR site, the replication mode, and the measured lag
  3. Failover procedure -- Step-by-step, per-system, with named owners
  4. Failback procedure -- How to return to the primary site after the disaster is resolved
  5. Communication plan -- Who gets notified, in what order, via what channel
  6. Test schedule and results -- When the last test was performed and what was learned

This document is a living artifact. If it's in a binder on a shelf, it's already out of date.
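Item 6 can be enforced rather than merely recorded: compare each system's last successful failover test against its tier's cadence and flag what's overdue. A sketch with hypothetical system names; the cadences are taken from the testing section above (quarterly for Tiers 0-1, semi-annual for Tier 2).

```python
from datetime import date, timedelta

# Maximum days between failover tests per tier, per the cadence above.
TEST_CADENCE_DAYS = {0: 90, 1: 90, 2: 180}

def overdue_tests(inventory: list[dict], today: date) -> list[str]:
    """Return systems whose last failover test is older than their tier's cadence."""
    overdue = []
    for system in inventory:
        max_age = TEST_CADENCE_DAYS.get(system["tier"])
        if max_age is None:
            continue  # Tier 3: rebuild on demand, no scheduled failover test
        if today - system["last_test"] > timedelta(days=max_age):
            overdue.append(system["name"])
    return overdue

inventory = [
    {"name": "payments-db", "tier": 0, "last_test": date(2024, 1, 10)},
    {"name": "wiki", "tier": 2, "last_test": date(2024, 3, 1)},
]
print(overdue_tests(inventory, today=date(2024, 6, 1)))  # ['payments-db']
```

Run something like this on a schedule and the test-cadence section of the document stops being aspirational.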

The cost conversation

DR costs money. The question is whether the cost of DR infrastructure is less than the cost of the disaster it prevents.

A simple framework:

  • Calculate hourly downtime cost -- Lost revenue + employee idle cost + contractual penalties + reputational damage
  • Multiply by the RTO -- That's the cost of a single incident
  • Compare to DR infrastructure cost -- Annual cost of replication, DR site, testing, and staffing

If a 4-hour outage costs $200,000 and your DR infrastructure costs $50,000/year, the investment pays for itself after a single incident. If a 48-hour outage costs $5,000 and the DR infrastructure costs $100,000/year, the investment doesn't make financial sense -- accept the risk and invest elsewhere.
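The framework reduces to a single comparison. A minimal sketch reproducing the worked numbers above; the dollar figures are this article's examples, not real data, and a thorough model would also weight by incident probability.

```python
def dr_is_worth_it(hourly_downtime_cost: float, rto_hours: float,
                   annual_dr_cost: float,
                   expected_incidents_per_year: float = 1.0) -> bool:
    """True if expected annual incident cost exceeds annual DR spend."""
    incident_cost = hourly_downtime_cost * rto_hours
    return incident_cost * expected_incidents_per_year > annual_dr_cost

# $50k/hour x 4 h = $200k per incident vs $50k/year of DR infrastructure:
print(dr_is_worth_it(50_000, 4, 50_000))        # True
# A low-impact system: 48 h outage costing $5k total vs $100k/year of DR:
print(dr_is_worth_it(5_000 / 48, 48, 100_000))  # False
```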

Not everything needs DR. But the things that do need it should have DR that actually works. That's the whole point.