
Using Preemptible Compute Resources for Batch Data Labeling Jobs

Carolyn Weitz
Last Updated: Aug 4, 2025

Organizations that manage large labeling workloads need to balance throughput, cost and reliability. Preemptible or spot compute offers a practical way to expand capacity at a lower price point, provided the pipeline is designed for interruption. 

This article explains how preemptible capacity works, why batch data labeling is well suited to it and how to implement a resilient approach that maintains quality while reducing spend. 

Together, we’ll focus on simple patterns that teams can adopt without major architectural change. 

What is Preemptible Compute? 

Preemptible compute refers to discounted virtual machines that a cloud provider may stop when capacity tightens. At AceCloud, we call them Spot Instances, similar to AWS; on Google Cloud and Azure they are called Spot VMs. 

Most providers offer a short interruption notice, often a minute or two, after which the instance is terminated. 

The discount reflects the risk of interruption. Workloads that use preemptible capacity should expect instances to stop at any time and should keep progress outside the worker.  

In practice this means stateless workers, small units of work and outputs that are safe to write more than once. When these conditions hold, preemptible capacity becomes a reliable tool for batch processing at scale. 

Why are labeling and preemptible compute a good match? 

Batch data labeling is naturally parallel. Each item or small batch goes through the same steps. If one worker disappears, another can pick up the same shard and finish it. 

Most teams care about total items labeled per day, not whether one particular worker lived for an hour or a week. That mindset lines up perfectly with preemptible fleets. 

A quick gut check you can use with your team: can you say “it is fine if this task restarts” without crossing your fingers? If yes, you are ready. 

How to get started with Preemptible Compute? 

You can make interruptions boring with a few choices that do not add much code. Here are the steps you can follow: 

Step 1: Add a queue and a task table 

Give every unit of work a row in a simple task table and a message in a managed queue. A minimal schema uses task_id, run_id, status (queued, running, done, failed), items, checkpoint_uri, updated_at and an optional fail_reason. 

Store artifacts in object storage and keep only pointers and status in the table. Use a service like SQS, Pub/Sub or RabbitMQ so workers pull tasks and you get natural backpressure. 

Set the queue’s visibility timeout a little longer than your shard time so if a worker dies the task reappears for another worker. 
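As a minimal sketch of this step, here is the task table and enqueue path using SQLite and Python's in-process queue.Queue as stand-ins for the managed database and queue; table and column names follow the schema above:

```python
import sqlite3
import queue

# Minimal task table; in production this lives in a managed database.
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE tasks (
        task_id        TEXT PRIMARY KEY,
        run_id         TEXT NOT NULL,
        status         TEXT NOT NULL DEFAULT 'queued',  -- queued|running|done|failed
        items          INTEGER NOT NULL,
        checkpoint_uri TEXT,
        updated_at     TEXT DEFAULT (datetime('now')),
        fail_reason    TEXT
    )
""")

# queue.Queue stands in for a managed queue such as SQS, Pub/Sub or RabbitMQ.
task_queue = queue.Queue()

def enqueue_task(task_id, run_id, items):
    # One row in the table, one message in the queue, per unit of work.
    db.execute(
        "INSERT INTO tasks (task_id, run_id, items) VALUES (?, ?, ?)",
        (task_id, run_id, items),
    )
    task_queue.put(task_id)

enqueue_task("t-0001", "run-42", 225)
status = db.execute(
    "SELECT status FROM tasks WHERE task_id='t-0001'"
).fetchone()[0]
print(status)  # queued
```

Workers pull from the queue, flip the row to running, and the visibility timeout handles requeueing if they die.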

Step 2: Pick a shard size you can lose without pain 

Aim for tasks that finish in two to five minutes. Measure how long one item takes, then compute items per task so a late preemption wastes very little work. 

import math

target_task_seconds = 180
avg_seconds_per_item = 0.8
items_per_task = math.floor(target_task_seconds / avg_seconds_per_item)  # 225

Start with that value, run a small batch, then adjust until retries are cheap and throughput is smooth. 

Step 3: Use deterministic paths for every write 

Make reruns safe by writing to fixed locations that include the run and task IDs. A simple layout keeps inputs, outputs, the checkpoint and logs under one folder. 

If the same item runs twice it either overwrites with the same bytes or skips because a checksum matches. 

s3://bucket/runs/<run_id>/tasks/<task_id>/ 
  input.list 
  outputs/<item_id>.json 
  ckpt.json 
  logs/<timestamp>.txt 
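A small helper makes this layout hard to get wrong. The bucket name and IDs below are illustrative; the path shape matches the layout above:

```python
def task_prefix(bucket, run_id, task_id):
    # Deterministic prefix: the same run and task always map to the same location.
    return f"s3://{bucket}/runs/{run_id}/tasks/{task_id}"

def output_uri(bucket, run_id, task_id, item_id):
    return f"{task_prefix(bucket, run_id, task_id)}/outputs/{item_id}.json"

print(output_uri("labels", "run-42", "t-0001", "img001"))
# s3://labels/runs/run-42/tasks/t-0001/outputs/img001.json
```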
 

Step 4: Save progress often with a tiny checkpoint 

Write a compact JSON file that records the list of completed item IDs and the last safe save time. 

Flush it every 20 items or every 60 seconds, whichever comes first. On resume the worker loads the file and skips what is already marked done. 

{ "done": ["img001","img002","img003"], "last_saved_at": "2025-07-28T09:20:00Z" }

Step 5: Mark “done” only after everything is durable 

Close each task with two clear steps. First write all outputs and the latest checkpoint to object storage. 

Then flip status=done in a single database transaction and record a checksum of the outputs list. 

If a worker stops before the transaction the task remains visible and another worker can finish it safely. 
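A sketch of that second step, again using SQLite as a stand-in for the task database; the checksum helper and column name are illustrative:

```python
import hashlib
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE tasks (task_id TEXT PRIMARY KEY, status TEXT, outputs_checksum TEXT)")
db.execute("INSERT INTO tasks VALUES ('t-0001', 'running', NULL)")

def outputs_checksum(output_uris):
    # Checksum of the sorted output list, recorded alongside the done flag.
    return hashlib.sha256("\n".join(sorted(output_uris)).encode()).hexdigest()

def mark_done(task_id, output_uris):
    # Step one (not shown): all outputs and the latest checkpoint are already
    # durable in object storage before this function is called.
    # Step two: flip the status in a single transaction.
    with db:
        db.execute(
            "UPDATE tasks SET status='done', outputs_checksum=? WHERE task_id=?",
            (outputs_checksum(output_uris), task_id),
        )

mark_done("t-0001", ["outputs/img001.json", "outputs/img002.json"])
status = db.execute("SELECT status FROM tasks WHERE task_id='t-0001'").fetchone()[0]
print(status)  # done
```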

Step 6: Retry the right things with calm backoff 

Treat timeouts, transient 5xx and short capacity dips as retryable. Treat bad schema, missing files and unsupported formats as fatal. 

Use exponential backoff with jitter so retries spread out, for example 2, 4, 8, 16 then cap near 30 seconds. 

Limit attempts to a small number, then mark failed with a clear reason. This keeps the system quiet during regional scarcity and makes root cause analysis easier. 
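The backoff schedule above can be sketched in a few lines; this uses full jitter (a uniform draw up to the capped exponential delay), one common variant:

```python
import random

BASE_SECONDS = 2
CAP_SECONDS = 30
MAX_ATTEMPTS = 5

def backoff_delay(attempt):
    # Exponential backoff: 2, 4, 8, 16, then capped near 30 seconds,
    # with full jitter so retries from many workers spread out.
    delay = min(CAP_SECONDS, BASE_SECONDS * (2 ** attempt))
    return random.uniform(0, delay)

for attempt in range(MAX_ATTEMPTS):
    cap = min(CAP_SECONDS, BASE_SECONDS * 2 ** attempt)
    print(f"attempt {attempt}: sleep up to {cap}s")
```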

Step 7: Mix on demand and preemptible capacity 

Keep progress steady with a small baseline of on demand instances, then burst with preemptible as the queue grows. 

As a starting point size the floor at 10 to 30 percent of peak throughput. Diversify the preemptible pool across instance types and zones to avoid correlated evictions. 

Drive autoscaling from queue depth or oldest message age rather than CPU so capacity follows real backlog. 
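A simple way to turn queue depth into a fleet size is to target a drain window: size the pool so the current backlog clears within, say, 30 minutes, bounded by the on demand floor and a ceiling. The function below is an illustrative sketch, not any provider's autoscaler API:

```python
def desired_workers(queue_depth, avg_task_seconds, target_drain_seconds,
                    floor_workers, max_workers):
    # Size the fleet so the current backlog drains within the target window,
    # never dropping below the on demand floor.
    needed = -(-queue_depth * avg_task_seconds // target_drain_seconds)  # ceil division
    return max(floor_workers, min(max_workers, int(needed)))

# 10,000 queued tasks at ~180s each, drained within 30 minutes:
print(desired_workers(10_000, 180, 1_800, floor_workers=20, max_workers=2_000))  # 1000
```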

Step 8: Watch a few numbers that tell the truth 

Create lightweight alerts and review them daily. 

Track a handful of signals: 

  • Throughput: tasks completed per minute. 
  • Backlog health: queue depth and oldest message age. 
  • Reliability: retry rate by reason, preemption count and overall failure rate. 
  • Efficiency: cost per labeled item and the share of spot versus on demand. 

Set simple SLOs such as 95 percent of tasks completing under 30 minutes, failure rate under 1 percent and cost per item below a target. These signals guide shard size and capacity mix. 

Your week-long rollout plan should look something like this: 

Day 1: Add the task table and switch writers to deterministic paths. 
Day 2: Implement checkpoints and the two step done marker. 
Day 3: Add the retry policy with jitter and clear error buckets. 
Day 4: Connect autoscaling to queue depth or oldest age. 
Day 5: Launch a ten percent pilot on preemptible and keep logs tight. 
Days 6–7: Review metrics, tune shard size and the on demand floor, then raise the preemptible share if the numbers look healthy. 

Follow this path and preemptions become routine. The pipeline stays simple, costs drop and the backlog stops being a fire drill. 


Sample Layout: Making Workers Disposable 

You do not need a complex architecture. You need clear ownership of state. 

Your orchestrator creates tasks and pushes them to the queue. 

A worker pulls a task, reads inputs, writes outputs and checkpoints back to storage, then marks the task done in the database. 

Because the state lives in storage and the DB, any worker can resume a task after a shutdown. Workers can come and go without drama. 

How to Make Batch Labeling Tasks Safe to Retry? 

Give each task a stable ID and stick to a clean folder layout: 

  • Inputs at …/runs/<run_id>/tasks/<task_id>/input.list 
  • Outputs at …/runs/<run_id>/tasks/<task_id>/outputs/<item_id>.json 
  • Checkpoint at …/runs/<run_id>/tasks/<task_id>/ckpt.json 
  • Done flag in the DB only after outputs and checkpoint exist 

Add basic provenance to each output so you can trace issues later. Include task ID, worker ID, code version, time and an input checksum. 

Make every write safe to run twice. If a rerun happens, it should write the same bytes or skip because the file already matches. 

We highly recommend you skip these traps: 

  • Tasks that run for tens of minutes or hours. Split them. 
  • Keeping the only copy of work in a temp folder. Always flush to object storage. 

What Should Retry and What Should Fail? 

Not every error deserves another try. Therefore, we recommend you sort them into two buckets. 

  • Retryable 

Timeouts, transient network errors, short capacity dips. Use exponential backoff with jitter. Cap the number of attempts. 

  • Fatal 

Bad inputs, missing fields, schema mismatches. Mark the task failed with a clear reason so you can fix the source. 

Log retry reasons and counts. A quick look at this metric in your dashboard will tell you if you need to tune shard size, network timeouts or the on demand floor. 
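The two buckets can be encoded in a small classifier. The mapping below is a sketch of the rules above; real pipelines will have more cases:

```python
def classify(status_code=None, exc=None):
    # Retryable: timeouts, transient network errors, transient 5xx.
    if exc is not None and isinstance(exc, (TimeoutError, ConnectionError)):
        return "retryable"
    if status_code is not None and 500 <= status_code < 600:
        return "retryable"
    # Everything else (bad inputs, missing fields, schema mismatches) is fatal.
    return "fatal"

print(classify(status_code=503))     # retryable
print(classify(exc=TimeoutError()))  # retryable
print(classify(status_code=422))     # fatal
```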

Cost Model for Preemptible Compute Resources 

Here is a back of the envelope example you can adapt to your numbers. 

  • 10 million images 
  • 200 images per task 
  • 50,000 tasks 
  • 3 minutes per task 
  • On demand price 1.00 unit per vCPU hour 
  • Spot price 0.25 units per vCPU hour 
  • Each task uses 1 vCPU for 3 minutes 

On demand only: 

50,000 tasks × 3 minutes = 150,000 minutes = 2,500 hours, i.e., 2,500 units 

All preemptible: 

2,500 hours × 0.25 = 625 units 

Hybrid: 

On demand floor 20 percent = 500 hours, i.e., 500 units 

Preemptible 80 percent = 2,000 hours × 0.25, i.e., 500 units 

Total: 1,000 units 

That is a 60 percent saving for hybrid and 75 percent for all preemptible if capacity holds. Model storage, data transfer and human review separately since those can be large in some setups. 
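The arithmetic above drops straight into a few lines you can rerun with your own prices and task times:

```python
TASKS = 50_000
MINUTES_PER_TASK = 3
ON_DEMAND_PRICE = 1.00   # units per vCPU hour
SPOT_PRICE = 0.25        # units per vCPU hour
FLOOR_SHARE = 0.20       # on demand share of the hybrid fleet

total_hours = TASKS * MINUTES_PER_TASK / 60            # 2,500 hours

on_demand_only = total_hours * ON_DEMAND_PRICE         # 2,500 units
all_spot = total_hours * SPOT_PRICE                    # 625 units
hybrid = (total_hours * FLOOR_SHARE * ON_DEMAND_PRICE
          + total_hours * (1 - FLOOR_SHARE) * SPOT_PRICE)  # 1,000 units

print(on_demand_only, all_spot, hybrid)
```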

Common Mistakes and Quick Fixes 

Here are some of the mistakes businesses make when deploying preemptible compute resources for data labeling tasks: 

  • Huge shards: split into 2 to 5 minutes of work per task. 
  • Local state: push state into object storage and a simple task table. 
  • Outputs that drift: pin model versions and set random seeds. 
  • Retry storms: add jitter, backoff caps and a short cool down on the queue. 
  • Single zone clusters: spread across zones and mix instance families. 

Run Your Next Labeling Batch on AceCloud Spot 

Preemptible compute is a practical way to raise labeling throughput while cutting spend. When workers are stateless, tasks are small and progress is saved often, interruptions become routine. Most teams see savings after a short pilot with no loss in quality or pace. 

Run your next batch on AceCloud spot instances to get low cost capacity with simple autoscaling, mixed instance types and multi‑zone spread so work keeps moving when capacity is tight. Start with a small on‑demand floor, burst on spot and track cost per item to prove the result in days. 

Ready to try it? Spin up AceCloud spot for a guided pilot or talk to our engineers for a labeling blueprint tailored to your stack. We will help you size the on‑demand floor, tune shard size and hit a clear cost target.

Carolyn Weitz
author
Carolyn began her cloud career at a fast-growing SaaS company, where she led the migration from on-prem infrastructure to a fully containerized, cloud-native architecture using Kubernetes. Since then, she has worked with a range of companies from early-stage startups to global enterprises helping them implement best practices in cloud operations, infrastructure automation, and container orchestration. Her technical expertise spans across AWS, Azure, and GCP, with a focus on building scalable IaaS environments and streamlining CI/CD pipelines. Carolyn is also a frequent contributor to cloud-native open-source communities and enjoys mentoring aspiring engineers in the Kubernetes ecosystem.
