
Kubernetes CSI Volume Snapshots: How They Work, How to Restore, and How to Test

Carolyn Weitz
Last Updated: Dec 22, 2025

CSI snapshots are storage-level, point-in-time copies exposed through Kubernetes APIs. They work only for PersistentVolumes provisioned by CSI drivers; in-tree volume plugins cannot use the CSI snapshot APIs.

Reliable recovery, however, also needs workload consistency and repeatable restore testing. According to the CNCF annual survey, Kubernetes production use reached 80% in 2024, which means snapshot mistakes now affect many real customer workloads.

Before you automate anything, inventory your StatefulSets and map every PVC that must be protected together within each application boundary. You should also record the owner, criticality tier and data dependency order for each StatefulSet, because restore steps usually follow those relationships.

What is a CSI Volume Snapshot, and how is it Different from a Backup?

A CSI snapshot gives you a consistent API surface for asking storage systems to capture volume content at a specific moment. In Kubernetes, a VolumeSnapshot is a namespaced CRD that represents a request for a snapshot of a CSI-backed PersistentVolume. The actual backend snapshot reference lives in a cluster-scoped VolumeSnapshotContent object, and the snapshot API standardizes the request shape, not the storage implementation.

However, snapshots are not full backups because they usually lack portability guarantees, long-term retention workflows and immutability governance outside the storage backend.

Additionally, most snapshot implementations only cover the volume blocks, while application objects, secrets and cluster state require separate protection and restore plans. Therefore, you should treat snapshots as one building block in a recovery strategy that also includes metadata backup, access control and routine restore drills.

apiVersion: snapshot.storage.k8s.io/v1 
kind: VolumeSnapshot 
metadata: 
  name: pgdata-snap-2025-12-15 
  namespace: myapp 
spec: 
  volumeSnapshotClassName: csi-snapshots-retain 
  source: 
    persistentVolumeClaimName: pgdata 

Uptime Institute reports that 54% of operators said their most recent significant outage cost more than $100,000, which raises the stakes for recoverability decisions.

Action step: Write down per-application RPO and RTO targets, then tie snapshot frequency and restore procedures directly to those targets.
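As a minimal sketch of tying snapshot frequency to an RPO, the CronJob below creates a new VolumeSnapshot of the pgdata claim once an hour, which would suit an RPO of roughly one hour. The hourly schedule, the CronJob name and the `<your-kubectl-image>` placeholder are assumptions for illustration. The sketch also assumes a ServiceAccount named snapshot-operator that is allowed to create VolumeSnapshots in the namespace, and that the csi-snapshots-retain class exists.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: pgdata-snapshot-hourly        # hypothetical name
  namespace: myapp
spec:
  schedule: "0 * * * *"               # assumption: RPO of about one hour
  concurrencyPolicy: Forbid           # never run two snapshot Jobs at once
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: snapshot-operator
          restartPolicy: Never
          containers:
            - name: create-snapshot
              image: <your-kubectl-image>   # placeholder: any image that ships kubectl
              command: ["sh", "-c"]
              args:
                - |
                  # generateName gives every run a unique snapshot name;
                  # it works with "kubectl create", not "kubectl apply".
                  kubectl create -f - <<EOF
                  apiVersion: snapshot.storage.k8s.io/v1
                  kind: VolumeSnapshot
                  metadata:
                    generateName: pgdata-snap-
                    namespace: myapp
                  spec:
                    volumeSnapshotClassName: csi-snapshots-retain
                    source:
                      persistentVolumeClaimName: pgdata
                  EOF
```

This sketch only creates snapshots. Pruning old ones to match your retention target is a separate step that it does not cover.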

How Do CSI Snapshots Work in Kubernetes?

CSI snapshotting works well when you understand the Kubernetes objects involved and the controller sidecars that reconcile them.

  • VolumeSnapshot is the namespaced request
  • VolumeSnapshotContent is the cluster-scoped backing record
  • VolumeSnapshotClass defines the administrator policy

Kubernetes documents that these snapshot API objects are CRDs rather than core APIs, which means your cluster must include the snapshot CRDs and controllers.

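Snapshot support is available only for CSI drivers. The snapshot controller watches VolumeSnapshot and VolumeSnapshotContent objects, while the csi-snapshotter sidecar, deployed alongside the driver, issues the CreateSnapshot and DeleteSnapshot calls to the storage backend.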

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: csi-snapshots-retain
  annotations:
    snapshot.storage.kubernetes.io/is-default-class: "true"
driver: <your.csi.driver.example.com>
deletionPolicy: Retain
parameters:
  # Driver-specific parameters go here
  # e.g. snapshotType: "crash-consistent"

The CNCF Annual Survey reports Helm is preferred by 75% of respondents, which matters because snapshot CRDs and controllers are often installed through packaged deployments.

Pro Tip: List your CSI drivers per cluster and confirm snapshot support, then document the matching VolumeSnapshotClass for each driver.

What are Snapshot Lifecycle States like readyToUse and restoreSize?

Snapshot creation is asynchronous and you should gate every restore and promotion workflow on observable status fields. A snapshot request can exist before the backend snapshot is actually created, which is why you should monitor status before relying on the snapshot for recovery.

The readyToUse field indicates whether the snapshot is ready to be used for a restore, while restoreSize is the minimum volume size required to create a volume from the snapshot. You should also have your runbook check status.error, the bound content name and events, because a failed snapshot can look like a success if you only confirm that the VolumeSnapshot object exists.

apiVersion: snapshot.storage.k8s.io/v1 
kind: VolumeSnapshot 
metadata: 
  name: pgdata-snap-2025-12-15 
  namespace: myapp 
spec: 
  volumeSnapshotClassName: csi-snapshots-retain 
  source: 
    persistentVolumeClaimName: pgdata 
# status is written by the snapshot controller; it is shown here as observed output
status:
  readyToUse: true
  creationTime: "2025-12-15T10:12:35Z"
  restoreSize: 20Gi
  boundVolumeSnapshotContentName: snapcontent-1234abcd

The CNCF Annual Survey reports 71% of organizations check in code multiple times per day, which increases the chance snapshots occur during changes and migrations.

Action Step: Add a runbook rule that no restore proceeds unless readyToUse is true and the bound content exists.
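One way to enforce that runbook rule is a short Job that blocks until the snapshot reports readyToUse and then confirms a bound content name is present. This is a sketch under assumptions: the Job name and the `<your-kubectl-image>` placeholder are hypothetical, the ServiceAccount is assumed to have get, list and watch on VolumeSnapshots, and `kubectl wait --for=jsonpath` requires kubectl v1.23 or later.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: pgdata-snap-ready-gate        # hypothetical name
  namespace: myapp
spec:
  backoffLimit: 0                     # fail fast instead of retrying
  template:
    spec:
      serviceAccountName: snapshot-operator
      restartPolicy: Never
      containers:
        - name: gate
          image: <your-kubectl-image> # placeholder: any image that ships kubectl
          command: ["sh", "-c"]
          args:
            - |
              set -e
              # Block until the controller reports the snapshot as usable
              kubectl wait volumesnapshot/pgdata-snap-2025-12-15 \
                --namespace myapp \
                --for=jsonpath='{.status.readyToUse}'=true \
                --timeout=10m
              # Fail if no VolumeSnapshotContent has been bound
              content="$(kubectl get volumesnapshot pgdata-snap-2025-12-15 \
                --namespace myapp \
                -o jsonpath='{.status.boundVolumeSnapshotContentName}')"
              test -n "$content"
              echo "bound content: $content"
```

A restore pipeline can then depend on this Job completing successfully before it creates the restored PVC.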

How Do CSI Snapshots Provide Consistency for StatefulSets and Databases?

CSI snapshots reflect whatever the storage system captured at that instant, so you should plan for crash consistency unless you actively coordinate database I/O.

Crash-consistent vs Application-consistent

Many storage systems aim for crash-consistent snapshots, which means data resembles an abrupt power loss rather than a clean application checkpoint.

Kubernetes does not automatically pause your database, which means you should quiesce, flush or checkpoint using database-native commands before taking snapshots.

Additionally, you should document post-restore validation queries because crash-consistent recovery can succeed technically while leaving application-level corruption undetected.

Use a short-lived Job to run a safe checkpoint or flush command before snapshot creation, since each database family requires different steps. In practice, you should trigger the VolumeSnapshot only after this Job has completed successfully (for example, via an operator or CI pipeline). Otherwise, the snapshot may be taken before the checkpoint reaches disk.

apiVersion: batch/v1
kind: Job
metadata:
  name: postgres-checkpoint
  namespace: myapp
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: psql
          image: postgres:16
          env:
            - name: PGHOST
              value: postgres.myapp.svc.cluster.local
            - name: PGUSER
              valueFrom:
                secretKeyRef:
                  name: pg-secret
                  key: username
            - name: PGPASSWORD
              valueFrom:
                secretKeyRef:
                  name: pg-secret
                  key: password
          command: ["bash", "-lc"]
          args:
            - |
              psql -c "CHECKPOINT;"

Veeam reports that roughly seven out of ten organizations experienced a cyber-attack and that, among those attacked, only 10% recovered more than 90% of their data.

Pro Tip: For each database, document the quiesce command and at least one validation query that proves business tables and indexes are usable.
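A validation query can be run the same way as the checkpoint, as a short-lived Job pointed at the restored instance. The sketch below is illustrative only: the Service name postgres-restore, the Job name and the orders table are hypothetical, and a real check should query the tables and indexes that matter to your application.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: postgres-restore-validate     # hypothetical name
  namespace: myapp
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: psql
          image: postgres:16
          env:
            - name: PGHOST
              # Hypothetical Service in front of the restored database
              value: postgres-restore.myapp.svc.cluster.local
            - name: PGUSER
              valueFrom:
                secretKeyRef:
                  name: pg-secret
                  key: username
            - name: PGPASSWORD
              valueFrom:
                secretKeyRef:
                  name: pg-secret
                  key: password
          command: ["bash", "-lc"]
          args:
            - |
              # ON_ERROR_STOP=1 makes psql stop at the first failing statement.
              # "orders" is a hypothetical business table.
              psql -v ON_ERROR_STOP=1 -c "SELECT count(*) FROM orders;"
```

If the Job fails, the restore should be treated as unverified even though the PVC itself was created successfully.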

How to Snapshot Multi-PVC Apps Safely with VolumeGroupSnapshot?

Multi-PVC applications need coordinated recovery points and group snapshots can reduce the risk of cross-volume inconsistency for stateful designs.

Independent per-PVC snapshots capture each volume at a slightly different moment, which is risky when you split data, WAL and logs across multiple volumes because the restored volumes may disagree about which writes have happened.

Additionally, an operator can restore “the right snapshot” for one PVC and still boot a broken system because related PVCs may be from different moments.

Introduce Group Snapshots

Kubernetes v1.32 moved volume group snapshots to beta and the design uses a label selector to group multiple PVCs for snapshotting.

However, you should confirm driver support because group snapshots are only supported for CSI volume drivers that implement the group snapshot extension APIs; having the CRDs installed is not sufficient on its own.

Label your PVCs with a shared selector that represents an application-consistency group.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pgdata
  namespace: myapp
  labels:
    snapshot-group: myapp-db
spec:
  # …
---
apiVersion: groupsnapshot.storage.k8s.io/v1beta1
kind: VolumeGroupSnapshotClass
metadata:
  name: csi-groupsnap-retain
driver: <your.csi.driver.example.com>
deletionPolicy: Retain
---
apiVersion: groupsnapshot.storage.k8s.io/v1beta1
kind: VolumeGroupSnapshot
metadata:
  name: myapp-db-groupsnap-2025-12-15
  namespace: myapp
spec:
  volumeGroupSnapshotClassName: csi-groupsnap-retain
  source:
    selector:
      matchLabels:
        snapshot-group: myapp-db

The CNCF Annual Survey reports that 60% of organizations use CI/CD for most or all applications, which increases the value of coordinated snapshot automation.

Pro Tip: Identify every StatefulSet with more than one PVC, then decide whether it needs group snapshots based on recovery dependencies.
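A group only helps if every related claim carries the shared label. As a sketch, a second claim for the WAL volume might look like the following. The claim name pgwal, its size and the storage class placeholder are assumptions for illustration, and the claim is assumed to be provisioned by the same CSI driver as pgdata.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pgwal                         # hypothetical second claim for WAL
  namespace: myapp
  labels:
    snapshot-group: myapp-db          # same label the VolumeGroupSnapshot selects on
spec:
  storageClassName: <same-csi-storage-class-as-pgdata>
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi                   # assumption: size is illustrative
```

With both claims labeled, a VolumeGroupSnapshot that selects on snapshot-group: myapp-db covers them together.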

How to Restore from Snapshots and Prove Recovery Works?

Restoring from snapshots is only trustworthy when you can repeat it, validate it and measure the time required under realistic constraints.

Restore mechanics

Kubernetes restore typically means creating a new PVC that references a VolumeSnapshot through spec.dataSource, then mounting it into verification workloads.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pgdata-restore
  namespace: myapp
spec:
  storageClassName: <same-or-compatible-storage-class>
  dataSource:
    name: pgdata-snap-2025-12-15
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi

You should verify the restored data before promotion, since validation isolates storage restore issues from application rollout mistakes during incidents.

apiVersion: v1
kind: Pod
metadata:
  name: restore-verify
  namespace: myapp
spec:
  restartPolicy: Never
  containers:
    - name: verify
      image: busybox:1.36
      command: ["sh", "-c", "ls -lah /data && sleep 3600"]
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: pgdata-restore

Uptime Institute reports four in five respondents said their most recent serious outage could have been prevented with better management, processes and configuration.

Action Tip: Schedule restore drills, then track time-to-restore and validation outcomes per tier because those metrics drive practical improvements.

What Policies Keep Snapshots Safe, Cheap and Compliant?

Snapshot safety depends on policy, because the same API can create recoverable history or delete the only usable restore point.

Keep deletion and retention intentional

deletionPolicy controls what happens to the backend snapshot when the Kubernetes snapshot objects are deleted: with Retain the backend snapshot is preserved, and with Delete it is removed as well. This directly affects retention and incident survivability.
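For contrast with the Retain class used elsewhere in this article, a class with the Delete policy might be kept for disposable snapshots such as restore drills. This is a sketch; the class name csi-snapshots-delete is hypothetical.

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: csi-snapshots-delete          # hypothetical name for short-lived snapshots
driver: <your.csi.driver.example.com>
# Deleting the VolumeSnapshot also deletes the backend snapshot
deletionPolicy: Delete
```

Keeping the two policies in separately named classes makes the intent visible in every VolumeSnapshot that references them.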

Additionally, you should scope permissions tightly because snapshot create, delete and restore operations can expose sensitive data or destroy recovery options.

apiVersion: v1
kind: ServiceAccount
metadata:
  name: snapshot-operator
  namespace: myapp
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: snapshot-operator
  namespace: myapp
rules:
  - apiGroups: ["snapshot.storage.k8s.io"]
    resources: ["volumesnapshots"]
    verbs: ["create", "get", "list", "watch", "delete"]
  - apiGroups: ["groupsnapshot.storage.k8s.io"]
    resources: ["volumegroupsnapshots"]
    verbs: ["create", "get", "list", "watch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: snapshot-operator
  namespace: myapp
subjects:
  - kind: ServiceAccount
    name: snapshot-operator
    namespace: myapp
roleRef:
  kind: Role
  name: snapshot-operator
  apiGroup: rbac.authorization.k8s.io
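The Role above covers creating and deleting snapshots but not restoring from them, which needs permission to create PersistentVolumeClaims. A separate Role for restores could look like this sketch; the name snapshot-restore is hypothetical, and it would be bound to whichever identity runs your restore procedure.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: snapshot-restore              # hypothetical name
  namespace: myapp
rules:
  # Restoring means creating a new PVC that references a snapshot
  - apiGroups: [""]
    resources: ["persistentvolumeclaims"]
    verbs: ["create", "get", "list", "watch"]
  # Read-only access to snapshots; no create or delete
  - apiGroups: ["snapshot.storage.k8s.io"]
    resources: ["volumesnapshots"]
    verbs: ["get", "list", "watch"]
```

Splitting snapshot management from restore rights means an identity that can restore data cannot also delete the restore points.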

As noted earlier, Uptime Institute reports that 54% of operators said their most recent significant outage cost more than $100,000, which supports least-privilege controls and audit trails for snapshot operations.

Action Step: Lock down snapshot create, delete and restore permissions per namespace, then audit usage events as part of your regular resilience review.
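If you control the API server configuration, audit policy rules are one way to record who creates or deletes snapshots. The fragment below is a sketch of rules to merge into an existing policy file rather than a complete policy, and managed Kubernetes services may not let you set a custom audit policy at all.

```yaml
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  # Record who created or deleted snapshot objects (request metadata only)
  - level: Metadata
    verbs: ["create", "delete"]
    resources:
      - group: "snapshot.storage.k8s.io"
        resources: ["volumesnapshots", "volumesnapshotcontents"]
      - group: "groupsnapshot.storage.k8s.io"
        resources: ["volumegroupsnapshots"]
  # Record PVC creation, which is how restores from snapshots happen
  - level: Metadata
    verbs: ["create"]
    resources:
      - group: ""
        resources: ["persistentvolumeclaims"]
```

The Metadata level logs the user, verb and resource without request bodies, which is usually enough for a periodic resilience review.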

Key Takeaways

CSI snapshots provide a standardized API, yet recovery reliability comes from consistency planning, permissions hygiene and frequent restore drills. If an application spans multiple volumes, you should prefer group snapshots when supported by your CSI driver and cluster version.


Frequently Asked Questions

A VolumeSnapshot is a namespaced request for a point-in-time volume snapshot. It requires a Bound, CSI-backed PVC on a StorageClass whose CSI driver supports snapshots.

VolumeSnapshot is the user request, while VolumeSnapshotContent is the cluster-scoped record that binds to the underlying snapshot.

To use CSI snapshots, you need the snapshot CRDs and the snapshot controller, and the CSI driver must ship the csi-snapshotter sidecar integration.

CSI snapshots are often crash-consistent. Therefore, you should run DB-native flush or checkpoint steps before the snapshot and validate with queries after restore.

To restore, you create a new PVC from the snapshot using spec.dataSource, then mount it into a verification Pod or Job.

You should use VolumeGroupSnapshot when multiple PVCs need one recovery point; Kubernetes groups the claims using a label selector.

Carolyn Weitz
author
Carolyn began her cloud career at a fast-growing SaaS company, where she led the migration from on-prem infrastructure to a fully containerized, cloud-native architecture using Kubernetes. Since then, she has worked with a range of companies, from early-stage startups to global enterprises, helping them implement best practices in cloud operations, infrastructure automation, and container orchestration. Her technical expertise spans AWS, Azure, and GCP, with a focus on building scalable IaaS environments and streamlining CI/CD pipelines. Carolyn is also a frequent contributor to cloud-native open-source communities and enjoys mentoring aspiring engineers in the Kubernetes ecosystem.
