Organizations that manage large labeling workloads need to balance throughput, cost and reliability. Preemptible or spot compute offers a practical way to expand capacity at a lower price point, provided the pipeline is designed for interruption.
This article explains how preemptible capacity works, why batch data labeling is well suited to it and how to implement a resilient approach that maintains quality while reducing spend.
Together, we’ll focus on simple patterns that teams can adopt without major architectural change.
What is Preemptible Compute?
Preemptible compute refers to discounted virtual machines that a cloud provider may stop when capacity tightens. At AceCloud we call them Spot Instances, as AWS does; on Google Cloud and Azure they are Spot VMs.
Most providers give a short interruption notice, often a minute or two, after which the instance is terminated.
The discount reflects the risk of interruption. Workloads that use preemptible capacity should expect instances to stop at any time and should keep progress outside the worker.
In practice this means stateless workers, small units of work and outputs that are safe to write more than once. When these conditions hold, preemptible capacity becomes a reliable tool for batch processing at scale.
Why are labeling and preemptible compute a good match?
Batch data labeling is naturally parallel. Each item or small batch goes through the same steps. If one worker disappears, another can pick up the same shard and finish it.
Most teams care about total items labeled per day, not whether one particular worker lived for an hour or a week. That mindset lines up perfectly with preemptible fleets.
A quick gut check you can use with your team: can you say “it is fine if this task restarts” without crossing your fingers? If yes, you are ready.
How to get started with Preemptible Compute?
You can make interruptions boring with a few choices that do not add much code. Here are the steps you can follow:
Step 1: Add a queue and a task table
Give every unit of work a row in a simple task table and a message in a managed queue. A minimal schema uses task_id, run_id, status (queued, running, done, failed), items, checkpoint_uri, updated_at and an optional fail_reason.
Store artifacts in object storage and keep only pointers and status in the table. Use a service like SQS, Pub/Sub or RabbitMQ so workers pull tasks and you get natural backpressure.
Set the queue’s visibility timeout a little longer than your shard time so if a worker dies the task reappears for another worker.
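As a sketch of the task table above, here is the schema expressed in SQL, using an in-memory SQLite database as a stand-in for your real managed database (the IDs are placeholder values):

```python
import sqlite3

# In production this table lives in your managed database and the queue
# message carries only the task_id; SQLite here is just for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE tasks (
        task_id        TEXT PRIMARY KEY,
        run_id         TEXT NOT NULL,
        status         TEXT NOT NULL DEFAULT 'queued',  -- queued|running|done|failed
        items          INTEGER NOT NULL,
        checkpoint_uri TEXT,
        updated_at     TEXT NOT NULL,
        fail_reason    TEXT
    )
""")
conn.execute(
    "INSERT INTO tasks (task_id, run_id, items, updated_at) VALUES (?, ?, ?, ?)",
    ("t-0001", "run-42", 200, "2025-07-28T09:00:00Z"),
)
row = conn.execute("SELECT status FROM tasks WHERE task_id = 't-0001'").fetchone()
print(row[0])  # queued
```

New tasks default to queued; only status, pointers and timestamps live here, while the artifacts stay in object storage.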
Step 2: Pick a shard size you can lose without pain
Aim for tasks that finish in two to five minutes. Measure how long one item takes, then compute items per task so a late preemption wastes very little work.
target_task_seconds = 180
avg_seconds_per_item = 0.8
items_per_task = floor(180 / 0.8) # ≈ 225
Start with that value, run a small batch, then adjust until retries are cheap and throughput is smooth.
Step 3: Use deterministic paths for every write
Make reruns safe by writing to fixed locations that include the run and task IDs. A simple layout keeps inputs, outputs, the checkpoint and logs under one folder.
If the same item runs twice it either overwrites with the same bytes or skips because a checksum matches.
s3://bucket/runs/<run_id>/tasks/<task_id>/
input.list
outputs/<item_id>.json
ckpt.json
logs/<timestamp>.txt
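A minimal sketch of a path builder for this layout (the bucket and ID values are hypothetical):

```python
def task_prefix(bucket: str, run_id: str, task_id: str) -> str:
    # Deterministic prefix: rerunning the same task always targets the
    # same location, so retries overwrite rather than duplicate.
    return f"s3://{bucket}/runs/{run_id}/tasks/{task_id}"

def output_key(bucket: str, run_id: str, task_id: str, item_id: str) -> str:
    return f"{task_prefix(bucket, run_id, task_id)}/outputs/{item_id}.json"

print(output_key("labels", "run-42", "t-0001", "img001"))
# s3://labels/runs/run-42/tasks/t-0001/outputs/img001.json
```

Because every path is a pure function of run, task and item IDs, no worker ever needs to remember where a previous worker left files.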
Step 4: Save progress often with a tiny checkpoint
Write a compact JSON file that records the list of completed item IDs and the last safe save time.
Flush it every 20 items or every 60 seconds, whichever comes first. On resume the worker loads the file and skips what is already marked done.
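A minimal checkpoint writer along these lines, as a sketch (the `Checkpoint` class is illustrative, with the flush thresholds named above; a local file stands in for the object-storage upload):

```python
import json
import os
import tempfile
import time

class Checkpoint:
    """Flush the done-list every N items or T seconds, whichever comes first."""
    def __init__(self, path, every_items=20, every_seconds=60):
        self.path = path
        self.every_items = every_items
        self.every_seconds = every_seconds
        self.done = []
        self._since_flush = 0
        self._last_flush = time.monotonic()

    def mark_done(self, item_id):
        self.done.append(item_id)
        self._since_flush += 1
        if (self._since_flush >= self.every_items
                or time.monotonic() - self._last_flush >= self.every_seconds):
            self.flush()

    def flush(self):
        payload = {"done": self.done,
                   "last_saved_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())}
        with open(self.path, "w") as f:
            json.dump(payload, f)
        self._since_flush = 0
        self._last_flush = time.monotonic()

# Demo: flush after every 2 items so the effect is visible immediately.
path = os.path.join(tempfile.mkdtemp(), "ckpt.json")
ckpt = Checkpoint(path, every_items=2)
ckpt.mark_done("img001")
ckpt.mark_done("img002")
with open(path) as f:
    print(json.load(f)["done"])  # ['img001', 'img002']
```

On resume, the worker loads this file and skips any item ID already in `done`.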
{ "done": ["img001","img002","img003"], "last_saved_at": "2025-07-28T09:20:00Z" }
Step 5: Mark “done” only after everything is durable
Close each task with two clear steps. First write all outputs and the latest checkpoint to object storage.
Then flip status=done in a single database transaction and record a checksum of the outputs list.
If a worker stops before the transaction the task remains visible and another worker can finish it safely.
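The second step can be sketched as a single transaction; SQLite stands in for your real database, and `outputs_checksum` is a hypothetical helper:

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (task_id TEXT PRIMARY KEY, status TEXT, outputs_checksum TEXT)")
conn.execute("INSERT INTO tasks VALUES ('t-0001', 'running', NULL)")

def outputs_checksum(item_ids):
    # Checksum over the sorted output list, recorded alongside the status flip.
    return hashlib.sha256("\n".join(sorted(item_ids)).encode()).hexdigest()

def mark_done(conn, task_id, item_ids):
    # Runs only after outputs and checkpoint are durable in object storage.
    # The sqlite3 connection context manager commits or rolls back atomically.
    with conn:
        conn.execute(
            "UPDATE tasks SET status='done', outputs_checksum=? WHERE task_id=?",
            (outputs_checksum(item_ids), task_id))

mark_done(conn, "t-0001", ["img001", "img002"])
print(conn.execute("SELECT status FROM tasks WHERE task_id='t-0001'").fetchone()[0])  # done
```

If the process dies before the transaction commits, the row stays in running and the queue's visibility timeout returns the task to another worker.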
Step 6: Retry the right things with calm backoff
Treat timeouts, transient 5xx and short capacity dips as retryable. Treat bad schema, missing files and unsupported formats as fatal.
Use exponential backoff with jitter so retries spread out, for example 2, 4, 8, 16 then cap near 30 seconds.
Limit attempts to a small number, then mark failed with a clear reason. This keeps the system quiet during regional scarcity and makes root cause analysis easier.
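The backoff schedule above can be sketched in a few lines, using full jitter (the function name and defaults are illustrative):

```python
import random

def backoff_seconds(attempt: int, base: float = 2.0, cap: float = 30.0) -> float:
    # Exponential ceiling 2, 4, 8, 16... capped near 30s, with full jitter
    # so many workers do not retry in lockstep after a shared failure.
    return random.uniform(0, min(cap, base * (2 ** attempt)))

# The deterministic ceilings for the first five attempts:
print([round(min(30.0, 2.0 * 2 ** a)) for a in range(5)])  # [2, 4, 8, 16, 30]
```

Each actual sleep is a random draw between zero and the ceiling, which is what spreads retries out.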
Step 7: Mix on demand and preemptible capacity
Keep progress steady with a small baseline of on demand instances, then burst with preemptible as the queue grows.
As a starting point, size the floor at 10 to 30 percent of peak throughput. Diversify the preemptible pool across instance types and zones to avoid correlated evictions.
Drive autoscaling from queue depth or oldest message age rather than CPU so capacity follows real backlog.
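One simple way to turn queue depth into a worker count, sketched under assumed numbers (the function, the 15-minute drain target and the floor/ceiling values are all illustrative):

```python
def desired_workers(queue_depth: int, seconds_per_task: int = 180,
                    drain_target_seconds: int = 900,
                    floor: int = 4, ceiling: int = 200) -> int:
    # Size the fleet so the current backlog drains within the target window,
    # clamped between the on-demand floor and a safety ceiling.
    total_work_seconds = queue_depth * seconds_per_task
    need = (total_work_seconds + drain_target_seconds - 1) // drain_target_seconds
    return max(floor, min(ceiling, need))

print(desired_workers(0))    # 4  (the floor holds when the queue is empty)
print(desired_workers(50))   # 10 (50 tasks x 180s drained in 15 minutes)
```

Because the input is backlog rather than CPU, the fleet shrinks to the floor as soon as the queue empties.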
Step 8: Watch a few numbers that tell the truth
Create lightweight alerts and review them daily.
Track a small set of signals:
- Throughput: tasks completed per minute.
- Backlog health: queue depth and oldest message age.
- Reliability: retry rate by reason, preemption count and overall failure rate.
- Efficiency: cost per labeled item and the share of spot versus on demand.
Set simple SLOs such as 95 percent of tasks completing under 30 minutes, failure rate under 1 percent and cost per item below a target. These signals guide shard size and capacity mix.
Your week-long rollout plan should look something like this:
| Day | Action |
| --- | --- |
| 1 | Add the task table and switch writers to deterministic paths. |
| 2 | Implement checkpoints and the two step done marker. |
| 3 | Add the retry policy with jitter and clear error buckets. |
| 4 | Connect autoscaling to queue depth or oldest age. |
| 5 | Launch a ten percent pilot on preemptible and keep logs tight. |
| 6–7 | Review metrics, tune shard size and the on demand floor, then raise the preemptible share if the numbers look healthy. |
Follow this path and preemptions become routine. The pipeline stays simple, costs drop and the backlog stops being a fire drill.
Sample Layout: Making Workers Disposable
You do not need a complex architecture. You need clear ownership of state.
Your orchestrator creates tasks and pushes them to the queue.
A worker pulls a task, reads inputs, writes outputs and checkpoints back to storage, then marks the task done in the database.
Because the state lives in storage and the DB, any worker can resume a task after a shutdown. Workers can come and go without drama.
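The worker loop can be sketched as follows, with an in-memory queue and a stub labeling step standing in for the real queue client and model (`label_item` and the task IDs are placeholders):

```python
# In production the queue would be SQS, Pub/Sub or RabbitMQ, and `completed`
# would be object storage plus the task table; dicts stand in here.
queue = [{"task_id": "t-0001", "items": ["img001", "img002"]}]
completed = {}

def label_item(item_id):
    # Placeholder for the real labeling step.
    return {"item": item_id, "label": "ok"}

def run_worker():
    while queue:
        task = queue.pop(0)                               # pull a task
        outputs = [label_item(i) for i in task["items"]]  # read inputs, label
        completed[task["task_id"]] = outputs              # write outputs durably
        # ...then flip status=done in the database.

run_worker()
print(len(completed["t-0001"]))  # 2
```

Nothing in the loop depends on the worker's identity or history, which is exactly what makes the fleet disposable.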
How to Make Batch Labeling Tasks Safe to Retry?
Give each task a stable ID and stick to a clean folder layout:
- Inputs at …/runs/<run_id>/tasks/<task_id>/input.list
- Outputs at …/runs/<run_id>/tasks/<task_id>/outputs/<item_id>.json
- Checkpoint at …/runs/<run_id>/tasks/<task_id>/ckpt.json
- Done flag in the DB only after outputs and checkpoint exist
Add basic provenance to each output so you can trace issues later. Include task ID, worker ID, code version, time and an input checksum.
Make every write safe to run twice. If a rerun happens, it should write the same bytes or skip because the file already matches.
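A sketch of that checksum-then-skip write, using a local file in place of object storage (the function name is illustrative):

```python
import hashlib
import os
import tempfile

def write_idempotent(path: str, data: bytes) -> str:
    # If an identical file already exists, skip; otherwise write the bytes.
    # Either way a rerun leaves exactly the same content on disk.
    if os.path.exists(path):
        with open(path, "rb") as f:
            if hashlib.sha256(f.read()).digest() == hashlib.sha256(data).digest():
                return "skipped"
    with open(path, "wb") as f:
        f.write(data)
    return "written"

out = os.path.join(tempfile.mkdtemp(), "img001.json")
print(write_idempotent(out, b'{"label": "cat"}'))  # written
print(write_idempotent(out, b'{"label": "cat"}'))  # skipped
```

With object storage you would compare against the stored object's checksum (for example an ETag) instead of re-reading the file.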
We highly recommend you skip these traps:
- Tasks that run for tens of minutes or hours. Split them.
- Keeping the only copy of work in a temp folder. Always flush to object storage.
What Should Retry and What Should Fail?
Not every error deserves another try. Therefore, we recommend you sort them into two buckets.
- Retryable
Timeouts, transient network errors, short capacity dips. Use exponential backoff with jitter. Cap the number of attempts.
- Fatal
Bad inputs, missing fields, schema mismatches. Mark the task failed with a clear reason so you can fix the source.
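The two buckets can be sketched with Python exception types standing in for your pipeline's real error classes (the groupings here are illustrative):

```python
RETRYABLE = (TimeoutError, ConnectionError)        # transient: back off, retry
FATAL = (ValueError, KeyError, FileNotFoundError)  # bad input: fail fast

def classify(exc: BaseException) -> str:
    if isinstance(exc, RETRYABLE):
        return "retryable"
    # Default unknown errors to fatal so they surface quickly
    # instead of burning retry budget.
    return "fatal"

print(classify(TimeoutError()))            # retryable
print(classify(ValueError("bad schema")))  # fatal
```

The worker consults this once per failure: retryable errors go back to the queue with backoff, fatal ones flip status=failed with the reason.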
Log retry reasons and counts. A quick look at this metric in your dashboard will tell you if you need to tune shard size, network timeouts or the on demand floor.
Cost Model for Preemptible Compute Resources
Here is a back of the envelope example you can adapt to your numbers.
- 10 million images
- 200 images per task
- 50,000 tasks
- 3 minutes per task
- On demand price 1.00 unit per vCPU hour
- Spot price 0.25 units per vCPU hour
- Each task uses 1 vCPU for 3 minutes
On demand only:
50,000 tasks × 3 minutes = 150,000 minutes = 2,500 hours, i.e., 2,500 units
All preemptible:
2,500 hours × 0.25 = 625 units
Hybrid:
On demand floor 20 percent = 500 hours, i.e., 500 units
Preemptible 80 percent = 2,000 hours × 0.25, i.e., 500 units
Total: 1,000 units
That is a 60 percent saving for hybrid and 75 percent for all preemptible if capacity holds. Model storage, data transfer and human review separately since those can be large in some setups.
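The arithmetic above, checked in a few lines:

```python
tasks = 50_000
minutes_per_task = 3
vcpu_hours = tasks * minutes_per_task / 60            # 2,500 vCPU-hours

on_demand_only = vcpu_hours * 1.00                    # 2,500 units
all_spot = vcpu_hours * 0.25                          # 625 units
hybrid = vcpu_hours * 0.20 * 1.00 + vcpu_hours * 0.80 * 0.25  # 500 + 500

print(on_demand_only, all_spot, hybrid)               # 2500.0 625.0 1000.0
print(round(1 - hybrid / on_demand_only, 2),          # 0.6  -> 60 percent saving
      round(1 - all_spot / on_demand_only, 2))        # 0.75 -> 75 percent saving
```

Swap in your own task count, shard time and prices to get a first estimate before running a pilot.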
Common Mistakes and Quick Fixes
Here are some of the mistakes businesses make when deploying preemptible compute resources for data labeling tasks:
| Common Mistakes | Quick Fixes |
| --- | --- |
| Huge shards | Fix by splitting to 2 to 5 minutes of work per task |
| Local state | Fix by pushing state into object storage and a simple task table |
| Outputs that drift | Fix by pinning model versions and setting random seeds |
| Retry storms | Fix with jitter, backoff caps and a short cool down on the queue |
| Single zone clusters | Fix with multi zone spread and mixed instance families |
Run Your Next Labeling Batch on AceCloud Spot
Preemptible compute is a practical way to raise labeling throughput while cutting spend. When workers are stateless, tasks are small and progress is saved often, interruptions become routine. Most teams see savings after a short pilot with no loss in quality or pace.
Run your next batch on AceCloud spot instances to get low cost capacity with simple autoscaling, mixed instance types and multi‑zone spread so work keeps moving when capacity is tight. Start with a small on‑demand floor, burst on spot and track cost per item to prove the result in days.
Ready to try it? Spin up AceCloud spot for a guided pilot or talk to our engineers for a labeling blueprint tailored to your stack. We will help you size the on‑demand floor, tune shard size and hit a clear cost target.