Disaster Recovery Glossary
Active-Passive: One site serves production traffic while another is kept updated but idle or partially warm; failover moves traffic to the passive site when needed.
Air-Gapped Backup: Backups stored on systems or media that are physically or logically disconnected from the production network, making them resistant to malware and mass deletion.
Always-On DR: A DR setup where a fully sized secondary environment runs all the time (hot site or active–active), providing the best RTO and RPO at the highest recurring cost.
Asynchronous Replication: Replication where writes are acknowledged on the primary first and shipped to the secondary after; it works well over distance but allows a small window of potential data loss.
Audit Logs: Logs that show who performed DR tests, failovers, and restores and when, often required by auditors and security teams.
Availability Zone (AZ): A physically separate data center or group of data centers within a cloud region, connected with low-latency links; spreading workloads across AZs protects against single-facility failures.
Backup: A copy of data (and sometimes system configuration) taken at a point in time so it can be restored after loss, corruption, or ransomware.
Backup and Restore: A DR approach where you regularly back up data and infrastructure definitions and only provision the DR environment after a disaster; lowest cost but typically the longest RTO.
Backup as a Service (BaaS): A managed service that handles backup scheduling, retention, encryption, and storage for your data and workloads, often integrated into a wider DR strategy.
Backup Window: The time period during which backups are taken; in legacy systems this is often a low-activity night window, while modern systems favor online or continuous backup.
Business Continuity: The organization’s ability to keep delivering critical services during and after a disruption, using workarounds, alternate sites, and DR capabilities.
Business Continuity Plan (BCP): A documented plan describing how business operations will continue during an incident, including people, process, communication, and IT recovery steps.
Business Impact Analysis (BIA): An assessment that identifies critical processes, their dependencies, and the financial/operational impact of downtime, used to prioritize apps and set RPO/RTO targets.
Blue-Green Deployment: Running the new version of an application (green) alongside the old (blue) and switching traffic over in one move once validation is complete.
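To make the cutover idea concrete, here is a minimal sketch in Python; the environment URLs and the router abstraction are hypothetical, not any particular load balancer's API.

```python
# Minimal blue-green cutover sketch (hypothetical router abstraction).
environments = {
    "blue": "https://app-blue.internal.example",    # current production
    "green": "https://app-green.internal.example",  # new version under validation
}

active = "blue"  # all traffic currently routed to blue

def route_request(path: str) -> str:
    """Return the backend URL that should serve this request."""
    return f"{environments[active]}{path}"

def cut_over(target: str) -> None:
    """Switch all traffic in one move once the target has been validated."""
    global active
    assert target in environments, "unknown environment"
    active = target  # single atomic change; rollback is the reverse switch

# After green passes validation:
cut_over("green")
print(route_request("/orders"))  # now served by the green environment
```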
Cold Site: A recovery location that has power and space (and sometimes network) but little or no hardware or data; cheapest option but longest RTO because most components are provisioned during the disaster.
Continuous Data Protection (CDP): A protection method that continuously records changes to data, creating a fine-grained history so systems can be restored to almost any moment with near-zero data loss.
Critical Application: An application whose loss would cause major financial, operational, safety, or compliance impact; typically gets the most aggressive DR objectives and budget.
Cross-Region DR: DR designs that can fail workloads over to an entirely different cloud region, protecting against large-scale outages and regional disasters.
Cross-AZ DR: DR designs that replicate or load balance across AZs in a single region, providing strong resilience with relatively low latency.
Data Center: A physical facility housing servers, storage, and networking for IT workloads; in DR design, data centers are treated as failure domains.
Data Replication: Ongoing copying of data from a primary system to one or more secondary systems for DR, analytics, or scaling reads. It can be synchronous or asynchronous.
Data Residency / Data Sovereignty: The requirement that replicated data in DR sites remains subject to the laws of specific countries or regions, which can restrict which regions or providers can be used.
Disaster Recovery (DR): The set of processes and technologies used to restore IT systems and data after a disruptive event such as a data center outage, cyberattack, or regional failure. In cloud, this usually means recovering to another AZ, region, or provider.
Disaster Recovery as a Service (DRaaS): A cloud-based service where a provider replicates and hosts your workloads and data, and provides failover capabilities, so you can resume operations quickly without running your own secondary site.
Disaster Recovery Plan (DRP): A detailed technical playbook for restoring applications, data, and infrastructure after a disaster, including RPO/RTO targets, runbooks, contacts, and test schedules.
DNS Failover: A DR mechanism where a global DNS service monitors endpoints and changes DNS records to route users to a healthy region or DR site if the primary fails.
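The control loop behind DNS failover can be sketched as below. The health endpoint, the DR address, and the update_dns_record helper are hypothetical placeholders; real DNS services expose their own APIs for the record change.

```python
# DNS failover control-loop sketch. The DNS update call is a placeholder,
# not a real provider API.
import time
import urllib.request

PRIMARY = "https://primary.example.com/health"   # assumed health endpoint
SECONDARY_IP = "203.0.113.10"                    # assumed DR site address
FAILURES_BEFORE_FAILOVER = 3

def is_healthy(url: str, timeout: float = 3.0) -> bool:
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False

def update_dns_record(name: str, ip: str) -> None:
    """Placeholder for a provider-specific DNS API call."""
    print(f"pointing {name} at {ip}")

failures = 0
while True:
    if is_healthy(PRIMARY):
        failures = 0
    else:
        failures += 1
        if failures >= FAILURES_BEFORE_FAILOVER:
            # Persistent failure: route users to the DR site.
            update_dns_record("app.example.com", SECONDARY_IP)
            break
    time.sleep(30)  # probe interval; low record TTLs keep client caches short-lived
```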
Downtime: The period during which a system or service is unavailable to users, whether planned (maintenance) or unplanned (incident).
DR Site: The secondary environment (another data center, region, or cloud) where workloads are restored and run when the primary site is unavailable.
DR Strategy: The chosen mix of architectures and tools (e.g., backup & restore, pilot light, warm standby, multi-site active/active) used to meet defined RPO/RTO and cost constraints.
DR Test (DR Drill): A planned exercise in which teams practice executing the DR plan, sometimes including live failover, to verify that RPO/RTO and procedural expectations can actually be met.
Failback: The process of moving workloads and data from the DR site back to the original or a new primary environment once it is repaired and validated.
Failover: The controlled or automated switch of workloads from a primary site to a secondary site when the primary becomes unavailable or unhealthy.
Fault Tolerance: The ability of a system to continue operating correctly even when one or more components fail, often using redundancy and automatic failover within a site.
Global Load Balancer: A service that routes user traffic across regions or sites based on health, latency, or geography, and is often central to active-active or active-passive DR designs.
Graceful Degradation: A system behavior where functionality or quality is reduced under failure instead of completely stopping, buying time while DR steps are taken.
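As an illustration only, a small Python sketch of the pattern with a hypothetical recommendation service: when the dependency fails, the service returns a reduced, precomputed response instead of an error.

```python
# Graceful degradation sketch: fall back to a reduced response when a
# dependency fails, instead of failing the whole request.
FALLBACK_RECOMMENDATIONS = ["bestseller-1", "bestseller-2"]  # static, precomputed

def fetch_personalized(user_id: str) -> list[str]:
    raise TimeoutError("recommendation service unreachable")  # simulated outage

def recommendations(user_id: str) -> list[str]:
    try:
        return fetch_personalized(user_id)
    except Exception:
        # Degraded mode: still return something useful while DR steps run.
        return FALLBACK_RECOMMENDATIONS

print(recommendations("u-42"))  # ['bestseller-1', 'bestseller-2']
```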
Health Check: A probe used by load balancers, clusters, or monitoring tools to determine if a node, service, or site is healthy; persistent failures often trigger automated failover.
High Availability (HA): The capability of a system to remain accessible despite component failures, typically achieved by redundancy within a region or data center; DR extends this across sites or regions.
Hot Site: A fully equipped, up-to-date secondary site that can take over operations almost immediately during a disaster, typically via real-time replication.
Immutable Backup: A backup that cannot be altered or deleted for a defined retention period, protecting against ransomware and malicious or accidental deletion.
Maximum Tolerable Downtime (MTD): The longest outage a business function can endure before the damage becomes unacceptable or irrecoverable; RTO must be less than or equal to this.
Multi-Site Active/Active: A pattern where two or more sites or regions actively serve traffic at the same time; if one fails, traffic shifts to the others with minimal disruption but at the highest cost and complexity.
Offsite Backup: Backups stored in a different physical location or region from production, used to survive site-level disasters.
On-Demand DR: A DR approach that keeps only minimal resources running in the DR environment and scales up when a disaster or test occurs, charging mainly for storage plus short-lived compute.
Pilot Light: A pattern where a minimal but critical subset of infrastructure (e.g., databases, core services) is always running in the DR region, and the rest is scaled out during a disaster; faster than backup & restore at moderate cost.
Point-in-Time Recovery (PITR): The ability to restore data or a database to an exact time in the past by combining a base backup with logs or journals, useful for undoing accidental changes.
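A toy Python illustration of the idea, using an in-memory "database" and journal rather than a real engine: restore applies the base backup and then replays journaled changes up to the chosen point in time.

```python
# Point-in-time recovery sketch: base backup + journal replay up to a target
# time. The data structures are toy stand-ins for a real backup and WAL.
from datetime import datetime

base_backup = {"balance": 100}                       # full backup taken at 09:00
journal = [                                          # changes recorded afterwards
    (datetime(2024, 5, 1, 9, 15), ("balance", 120)),
    (datetime(2024, 5, 1, 9, 40), ("balance", 80)),
    (datetime(2024, 5, 1, 9, 55), ("balance", 0)),   # accidental change to undo
]

def restore_to(target_time: datetime) -> dict:
    """Rebuild state as of target_time: base backup + journal entries up to it."""
    state = dict(base_backup)
    for ts, (key, value) in journal:
        if ts <= target_time:
            state[key] = value
    return state

# Undo the 09:55 mistake by restoring to just before it.
print(restore_to(datetime(2024, 5, 1, 9, 50)))  # {'balance': 80}
```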
Primary Site: The main production environment where applications normally run and from which data and configuration are replicated or backed up.
Quorum: The minimum number of nodes or votes that must agree before a clustered system can make changes, used to maintain consistency during failures and failover.
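The arithmetic is a strict majority of votes; a minimal sketch:

```python
# Quorum sketch: a cluster accepts writes only if a strict majority of
# voting nodes is reachable, which prevents two partitions from both
# acting as primary.

def quorum_size(total_votes: int) -> int:
    """Smallest strict majority, e.g. 3 of 5, 2 of 3."""
    return total_votes // 2 + 1

def has_quorum(reachable_votes: int, total_votes: int) -> bool:
    return reachable_votes >= quorum_size(total_votes)

# A 5-node cluster split 3/2 by a network partition:
print(has_quorum(3, 5))  # True  -> this side may keep serving writes
print(has_quorum(2, 5))  # False -> this side must stop to avoid divergence
```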
Ransomware Recovery: DR capabilities focused on recovering from malicious encryption or deletion, relying heavily on immutable, versioned, or air-gapped backups.
Recovery Point Objective (RPO): The maximum acceptable amount of data that can be lost, expressed as time between the last valid recovery point and the disaster (e.g., “≤ 15 minutes of data loss”).
Recovery Time Objective (RTO): The maximum acceptable time to restore a service after an outage (e.g., “service must be back within 1 hour”).
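As a worked example with made-up timestamps, the two objectives can be checked after a failover like this: RPO is measured backwards from the disaster to the last usable recovery point, RTO forwards from the disaster to service restoration.

```python
# RPO/RTO check sketch with made-up timestamps.
from datetime import datetime, timedelta

disaster            = datetime(2024, 5, 1, 10, 0)
last_recovery_point = datetime(2024, 5, 1, 9, 52)   # last replicated/backed-up data
service_restored    = datetime(2024, 5, 1, 10, 47)

rpo_target = timedelta(minutes=15)
rto_target = timedelta(hours=1)

data_loss = disaster - last_recovery_point   # 8 minutes of lost writes
outage    = service_restored - disaster      # 47 minutes of downtime

print("RPO met:", data_loss <= rpo_target)   # True
print("RTO met:", outage <= rto_target)      # True
```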
Region: A set of AZs in a specific geographic area; cross-region DR protects against region-wide failures and can help with data residency requirements.
Regulatory Compliance: Ensuring DR solutions still meet regulations such as GDPR, HIPAA, PCI DSS, or SOC 2—for example, by controlling where replicas live and how they’re encrypted.
Replication Lag: The time delay between a change on the primary system and its application on the replica, directly affecting effective RPO in asynchronous scenarios.
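A rough way to reason about it, with illustrative numbers only: if the primary is lost, the writes not yet applied on the replica are gone, so the effective RPO at that moment is roughly the current lag.

```python
# Replication lag sketch: effective RPO in an asynchronous setup is roughly
# the lag at the moment the primary is lost. Timestamps are illustrative.
from datetime import datetime

last_commit_on_primary  = datetime(2024, 5, 1, 10, 0, 0)
last_applied_on_replica = datetime(2024, 5, 1, 9, 59, 42)

replication_lag = last_commit_on_primary - last_applied_on_replica
print(f"current lag: {replication_lag.total_seconds():.0f}s")  # 18s

# If the primary fails right now, roughly 18 seconds of writes never reached
# the replica, so the effective RPO for this failure is about 18 seconds.
```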
Resilience: The ability of a system to withstand and recover from failures, errors, and unexpected load while continuing to provide acceptable service.
Retention Policy: Rules defining how long backups, snapshots, or journals are kept before being expired or archived, driven by DR and compliance needs.
Risk Assessment: A structured analysis of likely threats (power loss, hardware failure, ransomware, natural disasters) and their impact on systems, used to decide which workloads need strong DR.
Runbook: A step-by-step operational guide that explains exactly how to execute failover, validate systems, and communicate status during a disaster or drill.
Service Level Agreement (SLA): A contractual commitment, often from a provider, about uptime or availability (e.g., 99.99%) and sometimes response times during incidents.
Service Level Objective (SLO): An internal target for reliability or recovery (e.g., “99.9% monthly availability” or “RTO ≤ 30 minutes”) that engineers design and operate against.
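As a worked example of what such percentages imply, the downtime allowed by an availability target over a 30-day month can be computed directly:

```python
# Downtime budget implied by an availability target over a 30-day month.
MONTH_MINUTES = 30 * 24 * 60  # 43,200 minutes

for target in (0.999, 0.9999):
    budget = MONTH_MINUTES * (1 - target)
    print(f"{target:.2%} availability -> {budget:.1f} min/month of downtime")

# 99.90% availability -> 43.2 min/month of downtime
# 99.99% availability -> 4.3 min/month of downtime
```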
Single Point of Failure (SPOF): Any component (server, DB, region, network link) whose failure can stop a service; DR and high-availability designs try to eliminate or mitigate SPOFs.
Snapshot: A point-in-time copy of a volume, VM, or database, typically crash-consistent, used for fast restore, cloning, or seeding a DR region.
Split-Brain: A failure mode where segments of a distributed system lose communication and each assumes it is primary, risking data divergence; DR and clustering designs use quorum and fencing to avoid this.
Synchronous Replication: Replication in which writes are committed to both primary and secondary storage before being acknowledged, delivering near-zero RPO but requiring low-latency links and potentially higher write latency.
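A toy contrast of the two acknowledgement models, with plain Python lists standing in for real storage: synchronous replication acknowledges only after both copies are written, while asynchronous acknowledges after the primary alone.

```python
# Toy contrast of synchronous vs asynchronous write acknowledgement.
primary, replica = [], []
pending = []  # writes shipped to the replica later (async only)

def write_sync(value) -> str:
    primary.append(value)
    replica.append(value)   # replica write happens before the ack
    return "ack"            # both copies exist -> near-zero RPO

def write_async(value) -> str:
    primary.append(value)
    pending.append(value)   # shipped after the ack, e.g. every few seconds
    return "ack"            # fast, but pending writes are lost if the
                            # primary dies before they are applied

write_sync("order-1")
write_async("order-2")
print(len(primary), len(replica), len(pending))  # 2 1 1 -> 'order-2' is at risk
```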
Total Cost of Ownership (TCO): The combined cost of infrastructure, licenses, bandwidth, staffing, tests, and management tools required to implement and maintain DR over time.
Warm Site: A partially configured recovery site with infrastructure in place but needing some data sync or configuration before it can run full production load.
Warm Standby: A scaled-down but fully functional copy of the production stack running in a DR site, ready to take traffic after scaling up; offers lower RTO/RPO than pilot light at higher ongoing cost.