Disaster Recovery Glossary
Active-Passive: One site serves production traffic while another is kept updated but idle or partially warm; failover moves traffic to the passive site when needed.
Air-Gapped Backup: Backups stored on systems or media that are physically or logically disconnected from the production network, making them resistant to malware and mass deletion.
Always-On DR: A DR setup where a fully sized secondary environment runs all the time (hot site or active–active), providing the best RTO and RPO at the highest recurring cost.
Asynchronous Replication: Replication where writes are acknowledged on the primary first and shipped to the secondary after; it works well over distance but allows a small window of potential data loss.
Audit Logs: Logs that show who performed DR tests, failovers, and restores and when, often required by auditors and security teams.
Availability Zone (AZ): A physically separate data center or group of data centers within a cloud region, connected with low-latency links; spreading workloads across AZs protects against single-facility failures.
Backup: A copy of data (and sometimes system configuration) taken at a point in time so it can be restored after loss, corruption, or ransomware.
Backup and Restore: A DR approach where you regularly back up data and infrastructure definitions and only provision the DR environment after a disaster; lowest cost but typically the longest RTO.
Backup as a Service (BaaS): A managed service that handles backup scheduling, retention, encryption, and storage for your data and workloads, often integrated into a wider DR strategy.
Backup Window: The time period during which backups are taken; in legacy systems this is often a low-activity night window, while modern systems favor online or continuous backup.
Business Continuity: The organization’s ability to keep delivering critical services during and after a disruption, using workarounds, alternate sites, and DR capabilities.
Business Continuity Plan (BCP): A documented plan describing how business operations will continue during an incident, including people, process, communication, and IT recovery steps.
Business Impact Analysis (BIA): An assessment that identifies critical processes, their dependencies, and the financial/operational impact of downtime, used to prioritize apps and set RPO/RTO targets.
Blue-Green Deployment: Running the new version of an application (green) alongside the old (blue) and switching traffic over in one move once validation is complete.
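To make the cutover idea concrete, here is a minimal sketch in Python; the environment URLs and the router abstraction are hypothetical, not any particular load balancer's API.

```python
# Minimal blue-green cutover sketch (hypothetical router abstraction).
environments = {
    "blue": "https://app-blue.internal.example",    # current production
    "green": "https://app-green.internal.example",  # new version under validation
}

active = "blue"  # all traffic currently routed to blue

def route_request(path: str) -> str:
    """Return the backend URL that should serve this request."""
    return f"{environments[active]}{path}"

def cut_over(target: str) -> None:
    """Switch all traffic in one move once the target has been validated."""
    global active
    assert target in environments, "unknown environment"
    active = target  # single atomic change; rollback is the reverse switch

# After green passes validation:
cut_over("green")
print(route_request("/orders"))  # now served by the green environment
```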
Cold Site: A recovery location that has power and space (and sometimes network) but little or no hardware or data; cheapest option but longest RTO because most components are provisioned during the disaster.
Continuous Data Protection (CDP): A protection method that continuously records changes to data, creating a fine-grained history so systems can be restored to almost any moment with near-zero data loss.
Critical Application: An application whose loss would cause major financial, operational, safety, or compliance impact; typically gets the most aggressive DR objectives and budget.
Cross-Region DR: DR designs that can fail workloads over to an entirely different cloud region, protecting against large-scale outages and regional disasters.
Cross-AZ DR: DR designs that replicate or load balance across AZs in a single region, providing strong resilience with relatively low latency.
Data Center: A physical facility housing servers, storage, and networking for IT workloads; in DR design, data centers are treated as failure domains.
Data Replication: Ongoing copying of data from a primary system to one or more secondary systems for DR, analytics, or scaling reads. It can be synchronous or asynchronous.
Data Residency / Data Sovereignty: The requirement that replicated data in DR sites remains subject to the laws of specific countries or regions, which can restrict which regions or providers can be used.
Disaster Recovery (DR): The set of processes and technologies used to restore IT systems and data after a disruptive event such as a data center outage, cyberattack, or regional failure. In cloud, this usually means recovering to another AZ, region, or provider.
Disaster Recovery as a Service (DRaaS): A cloud-based service where a provider replicates and hosts your workloads and data, and provides failover capabilities, so you can resume operations quickly without running your own secondary site.
Disaster Recovery Plan (DRP): A detailed technical playbook for restoring applications, data, and infrastructure after a disaster, including RPO/RTO targets, runbooks, contacts, and test schedules.
DNS Failover: A DR mechanism where a global DNS service monitors endpoints and changes DNS records to route users to a healthy region or DR site if the primary fails.
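The control loop behind DNS failover can be sketched as below. The health endpoint, the DR address, and the update_dns_record helper are hypothetical placeholders; real DNS services expose their own APIs for the record change.

```python
# DNS failover control-loop sketch. The DNS update call is a placeholder,
# not a real provider API.
import time
import urllib.request

PRIMARY = "https://primary.example.com/health"   # assumed health endpoint
SECONDARY_IP = "203.0.113.10"                    # assumed DR site address
FAILURES_BEFORE_FAILOVER = 3

def is_healthy(url: str, timeout: float = 3.0) -> bool:
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False

def update_dns_record(name: str, ip: str) -> None:
    """Placeholder for a provider-specific DNS API call."""
    print(f"pointing {name} at {ip}")

failures = 0
while True:
    if is_healthy(PRIMARY):
        failures = 0
    else:
        failures += 1
        if failures >= FAILURES_BEFORE_FAILOVER:
            # Persistent failure: route users to the DR site.
            update_dns_record("app.example.com", SECONDARY_IP)
            break
    time.sleep(30)  # probe interval; low record TTLs keep client caches short-lived
```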
Downtime: The period during which a system or service is unavailable to users, whether planned (maintenance) or unplanned (incident).
DR Site: The secondary environment (another data center, region, or cloud) where workloads are restored and run when the primary site is unavailable.
DR Strategy: The chosen mix of architectures and tools (e.g., backup & restore, pilot light, warm standby, multi-site active/active) used to meet defined RPO/RTO and cost constraints.
DR Test (DR Drill): A planned exercise in which teams practice executing the DR plan, sometimes including live failover, to verify that RPO/RTO and procedural expectations can actually be met.
Failback: The process of moving workloads and data from the DR site back to the original or a new primary environment once it is repaired and validated.
Failover: The controlled or automated switch of workloads from a primary site to a secondary site when the primary becomes unavailable or unhealthy.
Fault Tolerance: The ability of a system to continue operating correctly even when one or more components fail, often using redundancy and automatic failover within a site.
Global Load Balancer: A service that routes user traffic across regions or sites based on health, latency, or geography, and is often central to active-active or active-passive DR designs.
Graceful Degradation: A system behavior where functionality or quality is reduced under failure instead of completely stopping, buying time while DR steps are taken.
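As an illustration only, a small Python sketch of the pattern with a hypothetical recommendation service: when the dependency fails, the service returns a reduced, precomputed response instead of an error.

```python
# Graceful degradation sketch: fall back to a reduced response when a
# dependency fails, instead of failing the whole request.
FALLBACK_RECOMMENDATIONS = ["bestseller-1", "bestseller-2"]  # static, precomputed

def fetch_personalized(user_id: str) -> list[str]:
    raise TimeoutError("recommendation service unreachable")  # simulated outage

def recommendations(user_id: str) -> list[str]:
    try:
        return fetch_personalized(user_id)
    except Exception:
        # Degraded mode: still return something useful while DR steps run.
        return FALLBACK_RECOMMENDATIONS

print(recommendations("u-42"))  # ['bestseller-1', 'bestseller-2']
```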
Health Check: A probe used by load balancers, clusters, or monitoring tools to determine if a node, service, or site is healthy; persistent failures often trigger automated failover.
High Availability (HA): The capability of a system to remain accessible despite component failures, typically achieved by redundancy within a region or data center; DR extends this across sites or regions.
Hot Site: A fully equipped, up-to-date secondary site that can take over operations almost immediately during a disaster, typically via real-time replication.
Immutable Backup: A backup that cannot be altered or deleted for a defined retention period, protecting against ransomware and malicious or accidental deletion.
Maximum Tolerable Downtime (MTD): The longest outage a business function can endure before the damage becomes unacceptable or irrecoverable; RTO must be less than or equal to this.
Multi-Site Active/Active: A pattern where two or more sites or regions actively serve traffic at the same time; if one fails, traffic shifts to the others with minimal disruption but at the highest cost and complexity.
Offsite Backup: Backups stored in a different physical location or region from production, used to survive site-level disasters.
On-Demand DR: A DR approach that keeps only minimal resources running in the DR environment and scales up when a disaster or test occurs, charging mainly for storage plus short-lived compute.
Pilot Light: A pattern where a minimal but critical subset of infrastructure (e.g., databases, core services) is always running in the DR region, and the rest is scaled out during a disaster; faster than backup & restore at moderate cost.
Point-in-Time Recovery (PITR): The ability to restore data or a database to an exact time in the past by combining a base backup with logs or journals, useful for undoing accidental changes.
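A toy Python illustration of the idea, using an in-memory "database" and journal rather than a real engine: restore applies the base backup and then replays journaled changes up to the chosen point in time.

```python
# Point-in-time recovery sketch: base backup + journal replay up to a target
# time. The data structures are toy stand-ins for a real backup and WAL.
from datetime import datetime

base_backup = {"balance": 100}                       # full backup taken at 09:00
journal = [                                          # changes recorded afterwards
    (datetime(2024, 5, 1, 9, 15), ("balance", 120)),
    (datetime(2024, 5, 1, 9, 40), ("balance", 80)),
    (datetime(2024, 5, 1, 9, 55), ("balance", 0)),   # accidental change to undo
]

def restore_to(target_time: datetime) -> dict:
    """Rebuild state as of target_time: base backup + journal entries up to it."""
    state = dict(base_backup)
    for ts, (key, value) in journal:
        if ts <= target_time:
            state[key] = value
    return state

# Undo the 09:55 mistake by restoring to just before it.
print(restore_to(datetime(2024, 5, 1, 9, 50)))  # {'balance': 80}
```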
Primary Site: The main production environment where applications normally run and from which data and configuration are replicated or backed up.
Quorum: The minimum number of nodes or votes that must agree before a clustered system can make changes, used to maintain consistency during failures and failover.
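The arithmetic is a strict majority of votes; a minimal sketch:

```python
# Quorum sketch: a cluster accepts writes only if a strict majority of
# voting nodes is reachable, which prevents two partitions from both
# acting as primary.

def quorum_size(total_votes: int) -> int:
    """Smallest strict majority, e.g. 3 of 5, 2 of 3."""
    return total_votes // 2 + 1

def has_quorum(reachable_votes: int, total_votes: int) -> bool:
    return reachable_votes >= quorum_size(total_votes)

# A 5-node cluster split 3/2 by a network partition:
print(has_quorum(3, 5))  # True  -> this side may keep serving writes
print(has_quorum(2, 5))  # False -> this side must stop to avoid divergence
```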
Ransomware Recovery: DR capabilities focused on recovering from malicious encryption or deletion, relying heavily on immutable, versioned, or air-gapped backups.
Recovery Point Objective (RPO): The maximum acceptable amount of data that can be lost, expressed as time between the last valid recovery point and the disaster (e.g., “≤ 15 minutes of data loss”).
Recovery Time Objective (RTO): The maximum acceptable time to restore a service after an outage (e.g., “service must be back within 1 hour”).
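As a worked example with made-up timestamps, the two objectives can be checked after a failover like this: RPO is measured backwards from the disaster to the last usable recovery point, RTO forwards from the disaster to service restoration.

```python
# RPO/RTO check sketch with made-up timestamps.
from datetime import datetime, timedelta

disaster            = datetime(2024, 5, 1, 10, 0)
last_recovery_point = datetime(2024, 5, 1, 9, 52)   # last replicated/backed-up data
service_restored    = datetime(2024, 5, 1, 10, 47)

rpo_target = timedelta(minutes=15)
rto_target = timedelta(hours=1)

data_loss = disaster - last_recovery_point   # 8 minutes of lost writes
outage    = service_restored - disaster      # 47 minutes of downtime

print("RPO met:", data_loss <= rpo_target)   # True
print("RTO met:", outage <= rto_target)      # True
```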
Region: A set of AZs in a specific geographic area; cross-region DR protects against region-wide failures and can help with data residency requirements.
Regulatory Compliance: Ensuring DR solutions still meet regulations such as GDPR, HIPAA, PCI DSS, or SOC 2—for example, by controlling where replicas live and how they’re encrypted.
Replication Lag: The time delay between a change on the primary system and its application on the replica, directly affecting effective RPO in asynchronous scenarios.
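A rough way to reason about it, with illustrative numbers only: if the primary is lost, the writes not yet applied on the replica are gone, so the effective RPO at that moment is roughly the current lag.

```python
# Replication lag sketch: effective RPO in an asynchronous setup is roughly
# the lag at the moment the primary is lost. Timestamps are illustrative.
from datetime import datetime

last_commit_on_primary  = datetime(2024, 5, 1, 10, 0, 0)
last_applied_on_replica = datetime(2024, 5, 1, 9, 59, 42)

replication_lag = last_commit_on_primary - last_applied_on_replica
print(f"current lag: {replication_lag.total_seconds():.0f}s")  # 18s

# If the primary fails right now, roughly 18 seconds of writes never reached
# the replica, so the effective RPO for this failure is about 18 seconds.
```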
Resilience: The ability of a system to withstand and recover from failures, errors, and unexpected load while continuing to provide acceptable service.
Retention Policy: Rules defining how long backups, snapshots, or journals are kept before being expired or archived, driven by DR and compliance needs.
Risk Assessment: A structured analysis of likely threats (power loss, hardware failure, ransomware, natural disasters) and their impact on systems, used to decide which workloads need strong DR.
Runbook: A step-by-step operational guide that explains exactly how to execute failover, validate systems, and communicate status during a disaster or drill.
Service Level Agreement (SLA): A contractual commitment, often from a provider, about uptime or availability (e.g., 99.99%) and sometimes response times during incidents.
Service Level Objective (SLO): An internal target for reliability or recovery (e.g., “99.9% monthly availability” or “RTO ≤ 30 minutes”) that engineers design and operate against.
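As a worked example of what such percentages imply, the downtime allowed by an availability target over a 30-day month can be computed directly:

```python
# Downtime budget implied by an availability target over a 30-day month.
MONTH_MINUTES = 30 * 24 * 60  # 43,200 minutes

for target in (0.999, 0.9999):
    budget = MONTH_MINUTES * (1 - target)
    print(f"{target:.2%} availability -> {budget:.1f} min/month of downtime")

# 99.90% availability -> 43.2 min/month of downtime
# 99.99% availability -> 4.3 min/month of downtime
```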
Single Point of Failure (SPOF): Any component (server, DB, region, network link) whose failure can stop a service; DR and high-availability designs try to eliminate or mitigate SPOFs.
Snapshot: A point-in-time copy of a volume, VM, or database, typically crash-consistent, used for fast restore, cloning, or seeding a DR region.
Split-Brain: A failure mode where segments of a distributed system lose communication and each assumes it is primary, risking data divergence; DR and clustering designs use quorum and fencing to avoid this.
Synchronous Replication: Replication in which writes are committed to both primary and secondary storage before being acknowledged, delivering near-zero RPO but requiring low-latency links and potentially higher write latency.
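A toy contrast of the two acknowledgement models, with plain Python lists standing in for real storage: synchronous replication acknowledges only after both copies are written, while asynchronous acknowledges after the primary alone.

```python
# Toy contrast of synchronous vs asynchronous write acknowledgement.
primary, replica = [], []
pending = []  # writes shipped to the replica later (async only)

def write_sync(value) -> str:
    primary.append(value)
    replica.append(value)   # replica write happens before the ack
    return "ack"            # both copies exist -> near-zero RPO

def write_async(value) -> str:
    primary.append(value)
    pending.append(value)   # shipped after the ack, e.g. every few seconds
    return "ack"            # fast, but pending writes are lost if the
                            # primary dies before they are applied

write_sync("order-1")
write_async("order-2")
print(len(primary), len(replica), len(pending))  # 2 1 1 -> 'order-2' is at risk
```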
Total Cost of Ownership (TCO): The combined cost of infrastructure, licenses, bandwidth, staffing, tests, and management tools required to implement and maintain DR over time.
Warm Site: A partially configured recovery site with infrastructure in place but needing some data sync or configuration before it can run full production load.
Warm Standby: A scaled-down but fully functional copy of the production stack running in a DR site, ready to take traffic after scaling up; offers lower RTO/RPO than pilot light at higher ongoing cost.