Kubernetes CSI Volume Snapshots: How They Work, How to Restore, and How to Test

Carolyn Weitz

Last Updated: Dec 22, 2025

CSI snapshots are storage-level, point-in-time copies exposed through Kubernetes APIs. They work only for PersistentVolumes provisioned by CSI drivers; in-tree volume plugins cannot use the CSI snapshot APIs.

Reliable recovery, however, also needs workload consistency and repeatable restore testing. According to the CNCF Annual Survey, Kubernetes production use reached 80% in 2024, which means snapshot mistakes now affect many real customer workloads.

Before you automate anything, inventory your StatefulSets and map every PVC that must be protected together within each application boundary. Also record the owner, criticality tier and data dependency order for each StatefulSet, because restore steps usually follow those relationships.

What Is a CSI Volume Snapshot, and How Is It Different from a Backup?

A CSI snapshot gives you a consistent API surface for asking storage systems to capture volume content at a specific moment. In Kubernetes, a VolumeSnapshot is a namespaced custom resource that represents a request for a snapshot of a CSI-backed PersistentVolume. The actual backend snapshot reference lives in a cluster-scoped VolumeSnapshotContent object. The snapshot API standardizes the shape of the request, not the storage implementation behind it.
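
The relationship between the two objects can be checked mechanically. The following is a minimal Python sketch, assuming both objects have already been fetched as dictionaries (for example with `kubectl get ... -o json`). The helper name `is_bound_pair` is hypothetical; the field names are the ones used by the `snapshot.storage.k8s.io/v1` API. It treats a pair as bound only when the references agree in both directions.

```python
def is_bound_pair(snapshot: dict, content: dict) -> bool:
    """True when a VolumeSnapshot and a VolumeSnapshotContent point at each other."""
    snap_meta = snapshot.get("metadata") or {}
    bound_name = (snapshot.get("status") or {}).get("boundVolumeSnapshotContentName")
    ref = (content.get("spec") or {}).get("volumeSnapshotRef") or {}
    return (
        bound_name is not None
        and bound_name == (content.get("metadata") or {}).get("name")
        and ref.get("name") == snap_meta.get("name")
        and ref.get("namespace") == snap_meta.get("namespace")
    )


# Example objects, trimmed to the fields the check reads.
snapshot = {
    "metadata": {"name": "pgdata-snap-2025-12-15", "namespace": "myapp"},
    "status": {"boundVolumeSnapshotContentName": "snapcontent-1234abcd"},
}
content = {
    "metadata": {"name": "snapcontent-1234abcd"},
    "spec": {
        "volumeSnapshotRef": {"name": "pgdata-snap-2025-12-15", "namespace": "myapp"}
    },
}
print(is_bound_pair(snapshot, content))
```

A one-directional reference is not enough for this check, which is why a content object that names a different snapshot is rejected.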

However, snapshots are not full backups because they usually lack portability guarantees, long-term retention workflows and immutability governance outside the storage backend.

Additionally, most snapshot implementations only cover the volume blocks, while application objects, secrets and cluster state require separate protection and restore plans. Therefore, you should treat snapshots as one building block in a recovery strategy that also includes metadata backup, access control and routine restore drills.

Uptime Institute reports that 54% of operators said their most recent significant outage cost more than $100,000, which raises the stakes for recoverability decisions.

Action step: Write down per-application RPO and RTO targets, then tie snapshot frequency and restore procedures directly to those targets.
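
One way to tie snapshot frequency to an RPO is sketched below in Python. The `safety_factor` of 0.5 and the application names are illustrative assumptions, not values from any standard. The only idea shown is that the interval between snapshots has to be shorter than the RPO, with headroom for the time a snapshot takes to become usable.

```python
from datetime import timedelta


def snapshot_interval(rpo: timedelta, safety_factor: float = 0.5) -> timedelta:
    """Return a snapshot interval that leaves headroom below the RPO.

    safety_factor is a hypothetical tuning knob: 0.5 means "snapshot twice
    as often as the RPO strictly requires".
    """
    if rpo <= timedelta(0):
        raise ValueError("RPO must be positive")
    if not 0 < safety_factor <= 1:
        raise ValueError("safety_factor must be in (0, 1]")
    return rpo * safety_factor


# Hypothetical per-application RPO targets.
targets = {
    "orders-db": timedelta(minutes=15),
    "reporting-db": timedelta(hours=4),
}
schedule = {app: snapshot_interval(rpo) for app, rpo in targets.items()}
for app, interval in schedule.items():
    print(f"{app}: snapshot every {interval}")
```

RTO needs a different kind of evidence: it can only be confirmed by timing real restores, not by calculation.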

How Do CSI Snapshots Work in Kubernetes?

CSI snapshotting works well when you understand the Kubernetes objects involved and the controller sidecars that reconcile them.

VolumeSnapshot is the namespaced request
VolumeSnapshotContent is the cluster-scoped backing record
VolumeSnapshotClass defines the administrator policy

Kubernetes documents that these snapshot API objects are CRDs rather than core APIs, which means your cluster must include the snapshot CRDs and controllers.

Snapshot support is available only for CSI drivers. The snapshot controller reconciles the snapshot objects, and the csi-snapshotter sidecar, deployed alongside the CSI driver, issues the CreateSnapshot and DeleteSnapshot calls to it.
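
A quick way to confirm the CRD prerequisite is to compare the cluster's CRD list against the three snapshot CRDs. The Python sketch below assumes you pass it the text printed by `kubectl get crd -o name`; the helper name `missing_snapshot_crds` is hypothetical. It says nothing about whether the snapshot controller or the sidecar is actually running, so treat it as a first check only.

```python
REQUIRED_SNAPSHOT_CRDS = (
    "volumesnapshots.snapshot.storage.k8s.io",
    "volumesnapshotcontents.snapshot.storage.k8s.io",
    "volumesnapshotclasses.snapshot.storage.k8s.io",
)


def missing_snapshot_crds(crd_listing: str) -> list[str]:
    """Return required snapshot CRDs absent from `kubectl get crd -o name` output."""
    installed = {
        line.strip().rsplit("/", 1)[-1]  # drop the "customresourcedefinition.../" prefix
        for line in crd_listing.splitlines()
        if line.strip()
    }
    return [crd for crd in REQUIRED_SNAPSHOT_CRDS if crd not in installed]


# Sample listing with one CRD deliberately missing.
sample = """\
customresourcedefinition.apiextensions.k8s.io/volumesnapshots.snapshot.storage.k8s.io
customresourcedefinition.apiextensions.k8s.io/volumesnapshotclasses.snapshot.storage.k8s.io
"""
print(missing_snapshot_crds(sample))
```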

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: csi-snapshots-retain
  annotations:
    snapshot.storage.kubernetes.io/is-default-class: "true"
driver:   # set this to the name of your CSI driver
deletionPolicy: Retain
parameters:
  # Driver-specific parameters go here
  # e.g. snapshotType: "crash-consistent"

The CNCF Annual Survey reports Helm is preferred by 75% of respondents, which matters because snapshot CRDs and controllers are often installed through packaged deployments.

Pro Tip: List your CSI drivers per cluster and confirm snapshot support, then document the matching VolumeSnapshotClass for each driver.

What Are Snapshot Lifecycle States like readyToUse and restoreSize?

Snapshot creation is asynchronous, so you should gate every restore and promotion workflow on observable status fields. A snapshot request can exist before the backend snapshot is actually created, which is why you should monitor status before relying on the snapshot for recovery.

The readyToUse field indicates whether the snapshot is ready to be used for a restore, while restoreSize reports the minimum volume size required to create a volume from the snapshot. You should also capture any reported errors in your runbook, because a failed snapshot can look like a success until you check the bound content and events.

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: pgdata-snap-2025-12-15
  namespace: myapp
spec:
  volumeSnapshotClassName: csi-snapshots-retain
  source:
    persistentVolumeClaimName: pgdata
status:
  readyToUse: true
  creationTime: "2025-12-15T10:12:35Z"
  restoreSize: 20Gi
  boundVolumeSnapshotContentName: snapcontent-1234abcd

The CNCF Annual Survey reports 71% of organizations check in code multiple times per day, which increases the chance snapshots occur during changes and migrations.

Action Step: Add a runbook rule that no restore proceeds unless readyToUse is true and the bound content exists.
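
The runbook rule can be expressed as a small gate. This is a minimal Python sketch, assuming the VolumeSnapshot has been fetched as a dictionary; the function name `restore_allowed` is hypothetical. Note that it only checks that a bound content name is recorded in the status. Confirming that the VolumeSnapshotContent object itself exists requires a separate lookup.

```python
def restore_allowed(snapshot: dict) -> tuple[bool, str]:
    """Decide whether a restore may proceed, with a reason for the runbook log."""
    status = snapshot.get("status") or {}
    error = status.get("error")
    if error:
        return False, f"snapshot reports an error: {error.get('message', 'unknown')}"
    if not status.get("boundVolumeSnapshotContentName"):
        return False, "no bound VolumeSnapshotContent"
    if status.get("readyToUse") is not True:
        return False, "readyToUse is not true"
    return True, "ok"


ready = {
    "status": {
        "readyToUse": True,
        "boundVolumeSnapshotContentName": "snapcontent-1234abcd",
    }
}
pending = {"status": {"readyToUse": False}}

print(restore_allowed(ready))
print(restore_allowed(pending))
```

Returning a reason alongside the boolean keeps the refusal explainable during an incident, when the person running the restore may not be the person who wrote the gate.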

How Do CSI Snapshots Provide Consistency for StatefulSets and Databases?

CSI snapshots usually reflect whatever the storage system captured, so you should plan for crash consistency unless you actively coordinate database I/O.

Crash-consistent vs Application-consistent

Many storage systems aim for crash-consistent snapshots, which means data resembles an abrupt power loss rather than a clean application checkpoint.

Kubernetes does not automatically pause your database, which means you should quiesce, flush or checkpoint using database-native commands before taking snapshots.

Additionally, you should document post-restore validation queries because crash-consistent recovery can succeed technically while leaving application-level corruption undetected.

Use a short-lived Job to run a safe checkpoint or flush command before snapshot creation, since each database family requires different steps. In practice, you should trigger the VolumeSnapshot only after this Job has completed successfully (for example, via an operator or CI pipeline), otherwise the snapshot may be taken before the checkpoint reaches disk.

apiVersion: batch/v1
kind: Job
metadata:
  name: postgres-checkpoint
  namespace: myapp
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: psql
          image: postgres:16
          env:
            - name: PGHOST
              value: postgres.myapp.svc.cluster.local
            - name: PGUSER
              valueFrom:
                secretKeyRef:
                  name: pg-secret
                  key: username
            - name: PGPASSWORD
              valueFrom:
                secretKeyRef:
                  name: pg-secret
                  key: password
          command: ["bash", "-lc"]
          args:
            - |
              psql -c "CHECKPOINT;"
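
The ordering rule (checkpoint first, snapshot second) can be enforced in whatever tool creates the snapshot. The Python sketch below assumes the Job object has been fetched as a dictionary and relies on the Job's standard `Complete` condition. The function names are hypothetical, and submitting the returned manifest to the cluster is left to your operator or pipeline.

```python
def job_succeeded(job: dict) -> bool:
    """True when the Job carries a Complete condition with status "True"."""
    conditions = (job.get("status") or {}).get("conditions") or []
    return any(
        c.get("type") == "Complete" and c.get("status") == "True"
        for c in conditions
    )


def snapshot_after_checkpoint(job: dict, pvc_name: str, snapshot_name: str,
                              snapshot_class: str) -> dict:
    """Build a VolumeSnapshot manifest, refusing if the checkpoint Job did not finish."""
    if not job_succeeded(job):
        raise RuntimeError("checkpoint Job has not completed; not snapshotting")
    return {
        "apiVersion": "snapshot.storage.k8s.io/v1",
        "kind": "VolumeSnapshot",
        "metadata": {
            "name": snapshot_name,
            "namespace": job["metadata"]["namespace"],
        },
        "spec": {
            "volumeSnapshotClassName": snapshot_class,
            "source": {"persistentVolumeClaimName": pvc_name},
        },
    }


finished_job = {
    "metadata": {"name": "postgres-checkpoint", "namespace": "myapp"},
    "status": {"conditions": [{"type": "Complete", "status": "True"}]},
}
manifest = snapshot_after_checkpoint(
    finished_job, "pgdata", "pgdata-snap-2025-12-15", "csi-snapshots-retain"
)
print(manifest["metadata"])
```

Writes that arrive after the checkpoint are still captured crash-consistently, so this ordering reduces recovery work rather than guaranteeing an application-consistent snapshot.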

Veeam reports roughly seven out of ten organizations experienced a cyber-attack and among those attacked only 10% recovered more than 90% of their data.

Pro Tip: For each database, document the quiesce command and at least one validation query that proves business tables and indexes are usable.

How Do You Snapshot Multi-PVC Apps Safely with VolumeGroupSnapshot?

Multi-PVC applications need coordinated recovery points, and group snapshots can reduce the risk of cross-volume inconsistency for stateful designs.

Independent per-PVC snapshots are taken at slightly different moments, so each volume can reflect a different point in the application's write sequence. That is risky when you split data, WAL and logs across multiple volumes.

Additionally, an operator can restore “the right snapshot” for one PVC and still boot a broken system because related PVCs may be from different moments.
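
A simple screening check for this problem is to compare the creation times of the snapshots you intend to restore together. The Python sketch below assumes each VolumeSnapshot is available as a dictionary with `status.creationTime` set; the 5-second tolerance and the snapshot names are illustrative assumptions. Close timestamps do not prove write-order consistency, so a pass here is a necessary condition at best.

```python
from datetime import datetime, timedelta


def creation_time(snapshot: dict) -> datetime:
    """Parse status.creationTime (RFC 3339, e.g. 2025-12-15T10:12:35Z)."""
    raw = snapshot["status"]["creationTime"]
    return datetime.fromisoformat(raw.replace("Z", "+00:00"))


def recovery_point_skew(snapshots: list[dict]) -> timedelta:
    """Return the gap between the earliest and latest snapshot in the set."""
    times = [creation_time(s) for s in snapshots]
    return max(times) - min(times)


def same_recovery_point(snapshots: list[dict],
                        tolerance: timedelta = timedelta(seconds=5)) -> bool:
    """True when all snapshots were created within the tolerance of each other."""
    return recovery_point_skew(snapshots) <= tolerance


# Hypothetical per-PVC snapshots of one database.
data_snap = {"metadata": {"name": "pgdata-snap"},
             "status": {"creationTime": "2025-12-15T10:12:35Z"}}
wal_snap = {"metadata": {"name": "pgwal-snap"},
            "status": {"creationTime": "2025-12-15T10:14:02Z"}}

print(recovery_point_skew([data_snap, wal_snap]))
print(same_recovery_point([data_snap, wal_snap]))
```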

Introduce Group Snapshots

Kubernetes v1.32 moved volume group snapshots to beta and the design uses a label selector to group multiple PVCs for snapshotting.

However, you should confirm driver support because group snapshots are only supported for CSI volume drivers that implement the group snapshot extension APIs; having the CRDs installed is not sufficient on its own.

Label your PVCs with a shared selector that represents an application-consistency group.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pgdata
  namespace: myapp
  labels:
    snapshot-group: myapp-db
spec:
  # …
---
apiVersion: groupsnapshot.storage.k8s.io/v1beta1
kind: VolumeGroupSnapshotClass
metadata:
  name: csi-groupsnap-retain
driver:   # set this to the name of your CSI driver
deletionPolicy: Retain
---
apiVersion: groupsnapshot.storage.k8s.io/v1beta1
kind: VolumeGroupSnapshot
metadata:
  name: myapp-db-groupsnap-2025-12-15
  namespace: myapp
spec:
  volumeGroupSnapshotClassName: csi-groupsnap-retain
  source:
    selector:
      matchLabels:
        snapshot-group: myapp-db
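
Before creating a group snapshot, it helps to preview which PVCs the selector would actually pick up. The Python sketch below mimics `matchLabels` semantics (every key and value must match) over PVC objects held as dictionaries. The function names and the `pgwal` and `app-cache` claims are hypothetical.

```python
def matches_labels(labels: dict, match_labels: dict) -> bool:
    """True when every key/value pair in match_labels is present in labels."""
    return all(labels.get(key) == value for key, value in match_labels.items())


def pvcs_in_group(pvcs: list[dict], match_labels: dict) -> list[str]:
    """Return the names of PVCs that a matchLabels selector would select."""
    return [
        pvc["metadata"]["name"]
        for pvc in pvcs
        if matches_labels(pvc["metadata"].get("labels") or {}, match_labels)
    ]


pvcs = [
    {"metadata": {"name": "pgdata", "labels": {"snapshot-group": "myapp-db"}}},
    {"metadata": {"name": "pgwal", "labels": {"snapshot-group": "myapp-db"}}},
    {"metadata": {"name": "app-cache"}},  # unlabeled, so not part of the group
]
print(pvcs_in_group(pvcs, {"snapshot-group": "myapp-db"}))
```

An empty selector matches every claim in this sketch, which mirrors how Kubernetes label selectors behave and is a good reason to review the preview before snapshotting.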

The CNCF Annual Survey reports that 60% of organizations use CI/CD for most or all applications, which increases the value of coordinated snapshot automation.

Pro Tip: Identify every StatefulSet with more than one PVC, then decide whether it needs group snapshots based on recovery dependencies.

How Do You Restore from Snapshots and Prove Recovery Works?

Restoring from snapshots is only trustworthy when you can repeat it, validate it and measure the time required under realistic constraints.

Restore mechanics

Kubernetes restore typically means creating a new PVC that references a VolumeSnapshot through spec.dataSource, then mounting it into verification workloads.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pgdata-restore
  namespace: myapp
spec:
  storageClassName:   # set this to a StorageClass served by the same CSI driver
  dataSource:
    name: pgdata-snap-2025-12-15
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
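
When restores are scripted, the PVC can be generated from the snapshot itself so the requested size never falls below restoreSize. The following Python sketch is one way to do that under stated assumptions: the helper names are hypothetical, `example-csi-sc` is a placeholder StorageClass name, and the quantity parser handles only whole numbers with the common suffixes listed in `_UNITS`, not the full Kubernetes quantity grammar.

```python
import re

# Subset of Kubernetes quantity suffixes; anything else is rejected.
_UNITS = {"": 1, "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
          "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def to_bytes(quantity: str) -> int:
    """Convert a quantity such as '20Gi' to bytes."""
    match = re.fullmatch(r"(\d+)(Ki|Mi|Gi|Ti|k|M|G|T)?", quantity.strip())
    if not match:
        raise ValueError(f"unsupported quantity: {quantity!r}")
    return int(match.group(1)) * _UNITS[match.group(2) or ""]


def restore_pvc(snapshot: dict, pvc_name: str, storage_class: str, size: str) -> dict:
    """Build a PVC that restores from the snapshot, checking size against restoreSize."""
    restore_size = (snapshot.get("status") or {}).get("restoreSize")
    if restore_size and to_bytes(size) < to_bytes(restore_size):
        raise ValueError(f"requested {size} is smaller than restoreSize {restore_size}")
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": pvc_name, "namespace": snapshot["metadata"]["namespace"]},
        "spec": {
            "storageClassName": storage_class,
            "dataSource": {
                "name": snapshot["metadata"]["name"],
                "kind": "VolumeSnapshot",
                "apiGroup": "snapshot.storage.k8s.io",
            },
            "accessModes": ["ReadWriteOnce"],
            "resources": {"requests": {"storage": size}},
        },
    }


snap = {
    "metadata": {"name": "pgdata-snap-2025-12-15", "namespace": "myapp"},
    "status": {"readyToUse": True, "restoreSize": "20Gi"},
}
pvc = restore_pvc(snap, "pgdata-restore", "example-csi-sc", "20Gi")
print(pvc["spec"]["dataSource"])
```

The PVC is created in the snapshot's namespace because a VolumeSnapshot data source is resolved within the claim's own namespace.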

You should verify the restored data before promotion, since validation isolates storage restore issues from application rollout mistakes during incidents.

apiVersion: v1
kind: Pod
metadata:
  name: restore-verify
  namespace: myapp
spec:
  restartPolicy: Never
  containers:
    - name: verify
      image: busybox:1.36
      command: ["sh", "-c", "ls -lah /data && sleep 3600"]
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: pgdata-restore

Uptime Institute reports four in five respondents said their most recent serious outage could have been prevented with better management, processes and configuration.

Action Tip: Schedule restore drills, then track time-to-restore and validation outcomes per tier because those metrics drive practical improvements.
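
Drill results are easier to act on when they are summarized per tier. The Python sketch below is one possible shape for that record keeping. The `DrillResult` fields, tier names, timings and RTO values are all illustrative assumptions. It compares the slowest restore in each tier against that tier's RTO, since the worst case is what an incident will expose.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class DrillResult:
    app: str
    tier: str
    time_to_restore: timedelta
    validation_passed: bool


def summarize_drills(drills: list[DrillResult],
                     rto_by_tier: dict[str, timedelta]) -> dict[str, dict]:
    """Group drill results by tier and compare the slowest restore to the RTO."""
    summary: dict[str, dict] = {}
    for drill in drills:
        entry = summary.setdefault(
            drill.tier, {"drills": 0, "passed": 0, "slowest": timedelta(0)}
        )
        entry["drills"] += 1
        entry["passed"] += int(drill.validation_passed)
        entry["slowest"] = max(entry["slowest"], drill.time_to_restore)
    for tier, entry in summary.items():
        # Raises KeyError if a tier has drills but no RTO recorded.
        entry["within_rto"] = entry["slowest"] <= rto_by_tier[tier]
    return summary


drills = [
    DrillResult("orders-db", "tier1", timedelta(minutes=12), True),
    DrillResult("billing-db", "tier1", timedelta(minutes=25), False),
    DrillResult("reporting-db", "tier2", timedelta(minutes=40), True),
]
rto = {"tier1": timedelta(minutes=20), "tier2": timedelta(hours=2)}
for tier, entry in summarize_drills(drills, rto).items():
    print(tier, entry)
```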

What Policies Keep Snapshots Safe, Cheap and Compliant?

Snapshot safety depends on policy, because the same API can create recoverable history or delete the only usable restore point.

Keep deletion and retention intentional

deletionPolicy controls whether backend snapshots are preserved when Kubernetes snapshot objects are deleted, which directly affects retention and incident survivability.
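
Retention decisions are safer when the pruning logic is explicit about what it will never delete. The Python sketch below selects deletion candidates from a list of VolumeSnapshot dictionaries while always protecting the newest ready snapshots. The `max_age` and `keep_last` values and the snapshot names are illustrative assumptions. Snapshots that are not ready are left alone for a human to investigate. What happens to the backend snapshot when a candidate is deleted still depends on the class's deletionPolicy: with Delete the backend snapshot is removed, and with Retain it remains and needs separate cleanup.

```python
from datetime import datetime, timedelta, timezone


def creation_time(snapshot: dict) -> datetime:
    """Parse status.creationTime (RFC 3339, e.g. 2025-12-15T10:12:35Z)."""
    raw = snapshot["status"]["creationTime"]
    return datetime.fromisoformat(raw.replace("Z", "+00:00"))


def prune_candidates(snapshots: list[dict], now: datetime,
                     max_age: timedelta, keep_last: int = 3) -> list[str]:
    """Names of ready snapshots older than max_age, never touching the newest keep_last."""
    ready = sorted(
        (s for s in snapshots if (s.get("status") or {}).get("readyToUse") is True),
        key=creation_time,
        reverse=True,  # newest first
    )
    protected = {s["metadata"]["name"] for s in ready[:keep_last]}
    return [
        s["metadata"]["name"]
        for s in ready
        if s["metadata"]["name"] not in protected
        and now - creation_time(s) > max_age
    ]


def snap(name: str, created: str) -> dict:
    """Build a trimmed, ready VolumeSnapshot dictionary for the example."""
    return {"metadata": {"name": name},
            "status": {"readyToUse": True, "creationTime": created}}


snapshots = [
    snap("snap-a", "2025-12-01T00:00:00Z"),
    snap("snap-b", "2025-12-10T00:00:00Z"),
    snap("snap-c", "2025-12-20T00:00:00Z"),
    snap("snap-d", "2025-12-21T00:00:00Z"),
]
now = datetime(2025, 12, 22, tzinfo=timezone.utc)
print(prune_candidates(snapshots, now, max_age=timedelta(days=7), keep_last=2))
```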

Additionally, you should scope permissions tightly because snapshot create, delete and restore operations can expose sensitive data or destroy recovery options.

apiVersion: v1
kind: ServiceAccount
metadata:
  name: snapshot-operator
  namespace: myapp
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: snapshot-operator
  namespace: myapp
rules:
  - apiGroups: ["snapshot.storage.k8s.io"]
    resources: ["volumesnapshots"]
    verbs: ["create", "get", "list", "watch", "delete"]
  - apiGroups: ["groupsnapshot.storage.k8s.io"]
    resources: ["volumegroupsnapshots"]
    verbs: ["create", "get", "list", "watch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: snapshot-operator
  namespace: myapp
subjects:
  - kind: ServiceAccount
    name: snapshot-operator
    namespace: myapp   # stated explicitly so the binding is unambiguous
roleRef:
  kind: Role
  name: snapshot-operator
  apiGroup: rbac.authorization.k8s.io

Uptime Institute reports that 54% of operators said their most recent significant outage cost more than $100,000, which supports least-privilege controls and audit trails for snapshot operations.

Action Step: Lock down snapshot create, delete and restore permissions per namespace, then audit usage events as part of your regular resilience review.

Key Takeaways

CSI snapshots provide a standardized API, yet recovery reliability comes from consistency planning, permissions hygiene and frequent restore drills. If an application spans multiple volumes, you should prefer group snapshots when supported by your CSI driver and cluster version.

Frequently Asked Questions

What does a VolumeSnapshot object represent and what storage prerequisite must already exist?

It is a namespaced request for a point-in-time volume snapshot and it requires a Bound CSI-backed PVC on a StorageClass whose CSI driver supports snapshots.

How should you explain the difference between VolumeSnapshot and VolumeSnapshotContent to a platform team reviewing risk?

VolumeSnapshot is the user request, while VolumeSnapshotContent is the cluster-scoped record that binds to the underlying snapshot.

What additional components must your cluster have before CSI snapshots work reliably across namespaces and teams?

You need the snapshot CRDs and the snapshot controller, and the CSI driver must be deployed with the csi-snapshotter sidecar.

Are CSI snapshots application-consistent by default for databases and what should database owners do to reduce risk?

No. They are usually crash-consistent, so you should run database-native flush or checkpoint steps before the snapshot and validate with queries after the restore.

What is the standard restore workflow in Kubernetes when you want a safe validation step before promoting data?

You create a new PVC from the snapshot using spec.dataSource, then mount it into a verification Pod or Job.

When should teams consider VolumeGroupSnapshot and what Kubernetes capability makes coordinated selection possible?

You should use it when multiple PVCs need a single recovery point. Kubernetes selects the claims that belong to the group with a label selector.
