
Cloud Virtual Machines Comparison: How to Benchmark CPU, Storage and Network Fairly

Carolyn Weitz
Last Updated: Jan 14, 2026

If you have ever compared cloud VMs across providers, you already know the results can feel inconsistent or unfair. Even when two instances look identical on paper, their real performance can shift based on placement, neighbors and hidden limits.

Forrester projects the public cloud market will reach $1.03 trillion in 2026, underscoring why fair and repeatable cloud VM benchmarking matters.

To help you benchmark cloud VMs, this guide walks you through a comparison-oriented benchmarking method for CPU, storage and network across providers. We suggest you use it to validate price-performance claims before you commit to a long-term platform decision.

Want to get it right? Download our eBook: How to Choose a Cloud Provider?

What Makes Most Cloud VM Benchmarks Unfair?

Benchmarking sounds simple, yet cloud infrastructure adds variability that you cannot completely eliminate with good intentions. If you want credible results, you should first understand where noise enters your measurement pipeline and how it skews conclusions.

1. Cloud factors that create noisy results

  • Multi-tenancy and contention can affect CPU time slices, storage queues and network bandwidth when neighbors compete for shared resources. In addition, contention can vary by hour, meaning an early test and a later test can look like different VM products.
  • Hardware heterogeneity can change your baseline even when the instance label stays the same. For example, two “same-size” instances can map to different CPU models or different storage controllers, depending on provider inventory.
  • Background maintenance and throttling can introduce brief pauses or limits that look like application problems. Moreover, providers may move workloads, apply patches or throttle aggressive usage patterns that resemble abuse detection.

2. Hidden factors that distort CPU, disk and network tests

  • Storage burst and caching effects can inflate short tests, particularly if you measure before steady state is reached. Therefore, you should treat “first-minute” disk performance as an observation, not as your sustained capacity.
  • Short-run CPU boosts versus sustained performance can make a VM look fast during a quick run and slow during real workloads. Likewise, CPU credit models or turbo behavior can lead to unrealistic averages if the run ends before throttling begins.
  • Network locality can be the biggest hidden variable because a same-zone path behaves differently than a cross-zone path. Accordingly, you should label paths clearly and avoid comparing same-zone results against cross-zone results.

How to Design a Cloud VM Benchmark Plan Across Providers?

A like-for-like benchmarking plan is your go-to guide because it prevents accidental differences from becoming the story of your comparison. In our opinion, you should treat the plan as a checklist that you apply to each provider before you run any benchmarks.

1. Factors to keep identical across providers

  • Region and zone strategy should be chosen upfront and you should label same-zone and cross-zone tests as separate categories. In addition, you should avoid mixing regions between providers as geography changes latency and can influence throughput.
  • OS image, kernel, drivers, filesystem and mount options should match as closely as possible across providers. For example, an older kernel might use different I/O schedulers and that difference can show up as “cloud performance.”
  • VM sizing assumptions should be explicit, including vCPU count, memory size, CPU architecture, virtualization features and CPU frequency behavior (for example, turbo enabled/disabled and the active CPU governor such as performance or ondemand). Likewise, you should record whether the VM uses SMT, NUMA and any provider-specific performance settings.

2. Structuring runs to capture real-world variance

  • We suggest you first conduct warm-up runs, then run multiple timed executions that capture sustained behavior under stable conditions.
  • Next, you should repeat runs across different time windows, because multi-tenant contention is often time-dependent.
  • You should randomize run order across providers as the first environment often benefits from fresh caches and fresh attention.
  • Moreover, random order reduces the chance that a morning provider always looks better than an evening provider.
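As a minimal sketch, the run structure above can look like the following loop. The benchmark command and provider labels here are hypothetical placeholders; you would substitute real benchmark invocations and real environments.

```shell
#!/bin/sh
# Sketch: warm-up runs, then repeated timed runs, in randomized provider
# order. BENCH_CMD is a hypothetical placeholder for a real benchmark
# invocation (sysbench, fio, iperf3, ...).
BENCH_CMD="echo simulated-score"

PROVIDERS="provider-a provider-b provider-c"   # hypothetical labels
WARMUPS=2
RUNS=5
: > results.txt

# shuf randomizes order so no environment always benefits from going first.
for p in $(printf '%s\n' $PROVIDERS | shuf); do
  for i in $(seq 1 "$WARMUPS"); do
    $BENCH_CMD >/dev/null              # warm-up: run and discard
  done
  for i in $(seq 1 "$RUNS"); do
    echo "$p run=$i score=$($BENCH_CMD)" >> results.txt
  done
done
cat results.txt
```

Repeating this whole script in a second time window then gives you the time-dependent variance the bullets above describe.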

3. Approach to reduce manual drift across clouds

An automated benchmark harness reduces drift because it standardizes provisioning, tool installation, configs and result capture. Google Cloud notes PerfKit Benchmarker “wraps over 100 industry-standard benchmark testing tools,” which helps consistency across providers.

You can pair a harness with infrastructure-as-code, which makes your VM definitions reviewable and repeatable during audits. In addition, a single repo for scripts and configs makes it easier for teammates to reproduce results and challenge assumptions.

How to Benchmark Cloud CPU Performance for Comparable Results?

Cloud CPU benchmarking is where many teams accidentally test burst behavior instead of sustained throughput, then overfit decisions to the wrong metric. You can avoid that trap by deciding which CPU questions matter to your workload before selecting a tool.

1. Ask the right CPU-related questions

  • Single-core latency matters for control planes, schedulers and latency-sensitive services and it can vary with turbo behavior. Therefore, you should measure both a short burst run and a longer sustained run to observe any performance decay.
  • Multi-core throughput matters for batch ETL, training preprocess pipelines and parallel builds, where scaling efficiency is critical. Also, you should test scaling at multiple thread counts, because linear scaling often fails after a certain core count.
  • Sustained performance under load matters when your workload runs for hours, which is common for training and data processing. Meanwhile, sustained tests reveal throttling, credit exhaustion and thermal limits that short tests will not expose.
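To make the burst-versus-sustained distinction concrete, here is a small sketch that emits sysbench CPU commands for both run lengths at several thread counts. The durations are illustrative assumptions; printing the commands rather than executing them keeps the sketch safe to review, and you can pipe the output to sh on a machine where sysbench is installed.

```shell
#!/bin/sh
# Sketch: print sysbench CPU commands for a short burst run and a longer
# sustained run at several thread counts. Durations are assumptions.
BURST=10          # seconds: mostly shows turbo/burst behavior
SUSTAINED=600     # seconds: long enough to expose throttling or credit decay

for t in 1 2 4 8; do
  echo "sysbench cpu --threads=$t --time=$BURST run"
  echo "sysbench cpu --threads=$t --time=$SUSTAINED run"
done | tee cpu-cmds.txt
```

Comparing the per-thread-count scores from the short and long runs is what reveals both scaling efficiency and sustained decay.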

2. Determine CPU benchmarks that work well for cloud comparisons

You can combine microbenchmarks with application-like tests as each category reveals a different risk.

  • Microbenchmarks give repeatable baselines, while application-like tests reveal scheduler overhead, memory behavior and real instruction mixes. For microbenchmarks, you can use tools like sysbench cpu, stress-ng CPU methods or openssl speed for crypto-heavy workloads.
  • For application-like tests, you can compile a fixed codebase, run compression benchmarks or use pgbench for database-like CPU paths.

Likewise, you should pin versions as benchmark results can shift after tool updates and compiler changes.

3. Metadata to record for credibility

In our opinion, you should always record the CPU model and flags, the effective governor behavior and the NUMA layout visible in the guest. In addition, you should capture the VM type, kernel version and hypervisor hints, because readers need context for interpretation.

This level of rigor matters more in hybrid environments, where teams compare on-premise baselines to multiple clouds. Gartner predicts 90% of organizations will adopt a hybrid cloud approach through 2027, which increases the need for apples-to-apples comparisons.

A simple metadata capture block like the following is often enough for most benchmark reports.

uname -a 
cat /etc/os-release 
lscpu 
cat /proc/cpuinfo | head -n 40 
numactl --hardware 2>/dev/null || true 

How to Benchmark Storage for IOPS, Throughput and Latency?

Storage benchmarking is easy to get wrong because caches and burst policies can disguise the disk you think you are testing. You can still get fair results if you choose representative profiles, control caching and measure steady-state performance.

1. Storage profiles to test to avoid cherry-picking

You should test 4K random read and write for IOPS and latency as many databases and metadata workloads look like this. Furthermore, you should test 128K or 1M sequential read and write for throughput, because data pipelines often stream large blocks.

We also recommend you test mixed read-write ratios, such as 70/30, as real systems rarely read or write exclusively. Moreover, you should test multiple queue depths, since latency often worsens quickly after a certain outstanding I/O level.
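The mixed-ratio and queue-depth ideas above can be captured in a single fio job file. The sketch below is one way to do it; the file path, size and runtimes are assumptions you should adapt to your volume.

```shell
#!/bin/sh
# Sketch: write a fio job file for a 70/30 random mixed profile at two
# queue depths. Path, size and runtimes are assumptions.
cat > mixed-70-30.fio <<'EOF'
; 70/30 random mixed profile at two queue depths.
; size should be 2-3x RAM so the page cache cannot serve the working set.
[global]
filename=/mnt/vol/testfile
size=64g
direct=1
ioengine=libaio
rw=randrw
rwmixread=70
bs=4k
time_based
ramp_time=10
runtime=120
group_reporting

[mixed-qd1]
iodepth=1

[mixed-qd32]
stonewall
iodepth=32
EOF

echo "Run with: fio mixed-70-30.fio"
```

The stonewall option serializes the second job so the two queue depths are measured one after the other instead of competing for the same disk.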

2. Run fio so results reflect the disk (not the RAM cache)

You should use fio for storage tests because it is configurable, repeatable and widely understood in performance communities. However, you should ensure your job settings bypass page cache when your goal is disk performance rather than cache behavior.

For many Linux environments, direct=1 and invalidate=1 are common starting points and you should document these choices. In addition, you should make the test file significantly larger than available RAM (for example, 2–3× total memory), because otherwise you will accidentally benchmark RAM and page cache instead of the underlying disk.

You should precondition the volume, then run time-based workloads long enough to reach steady state. Next, you should report latency percentiles from fio output as averages hide tail latency that often determines application stability.
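The preconditioning step can be saved as a small script you run once per fresh volume before any timed measurement. The sketch below assumes a hypothetical path and size; size should exceed RAM by 2-3x, as discussed above.

```shell
#!/bin/sh
# Sketch: save a preconditioning script that writes the whole test file once,
# so first-write/allocation effects are excluded from timed runs.
# Path and size are assumptions.
cat > precondition.sh <<'EOF'
#!/bin/sh
fio --name=precondition --filename=/mnt/vol/testfile \
  --rw=write --bs=1m --iodepth=16 --direct=1 \
  --ioengine=libaio --size=64g --group_reporting
EOF
chmod +x precondition.sh
echo "Wrote precondition.sh; run it once per fresh volume before timed tests."
```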

Here is an example fio job for 4K random read, tuned to measure steady behavior rather than first-minute bursts.

fio --name=randread4k --filename=/mnt/vol/testfile \
  --rw=randread --bs=4k --iodepth=32 --numjobs=4 \
  --direct=1 --invalidate=1 --ioengine=libaio \
  --time_based --ramp_time=10 --runtime=60 \
  --group_reporting

3. Prevent “cold volume” and configuration artifacts

For this, you should initialize new volumes before measuring, since first reads can trigger initialization behavior on some backends. Also, you should verify filesystem alignment, mount options and I/O scheduler settings, since defaults differ across distros.
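A quick way to catch those configuration differences is to record them alongside every result. The sketch below collects mount options and I/O scheduler settings; the /mnt/vol path is an assumption, with a fallback to the root filesystem.

```shell
#!/bin/sh
# Sketch: record filesystem, mount and scheduler configuration that commonly
# differs across distro defaults. /mnt/vol is an assumed mount point.
{
  echo "== mount options =="
  findmnt -no SOURCE,FSTYPE,OPTIONS /mnt/vol 2>/dev/null \
    || findmnt -no SOURCE,FSTYPE,OPTIONS /
  echo "== I/O schedulers =="
  for q in /sys/block/*/queue/scheduler; do
    [ -r "$q" ] && echo "$q: $(cat "$q")"
  done
} > disk-config.txt
cat disk-config.txt
```

Attaching disk-config.txt to each benchmark report makes scheduler or mount-option drift visible during review instead of after a dispute.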

You should match disk class and provisioned performance settings across providers as “general purpose SSD” is not a single tier. AWS documents that EBS gp3 includes baseline 3,000 IOPS and 125 MiB/s and can be provisioned up to 80,000 IOPS and 2,000 MiB/s.

That detail matters because a baseline disk on one provider might compete against a provisioned disk on another provider. Therefore, you should write down provisioned IOPS and throughput explicitly and you should include those costs in price-performance math.

How to Benchmark Network Throughput and Latency Without Misleading Results?

Network benchmarking can mislead quickly because you might be testing placement, routing and flow limits instead of raw NIC capability. You can keep it honest by stating the exact path, using consistent tools and verifying that CPU does not become your bottleneck.

1. Determine the network path you are testing

You should define the path as a first-class benchmark variable, since different paths represent different production realities. Common paths include same-zone VM-to-VM, cross-zone, cross-region and public internet paths through VPN or interconnect.

You should keep same-zone and cross-zone results separate as they represent different latency budgets and different failure domains. You should also record whether traffic stays within a VPC, crosses VPC boundaries or traverses public IP space.
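Labeling can be automated so it is applied before any result is recorded. The sketch below uses hypothetical GCP-style zone names, where stripping the final dash-suffix yields the region; other providers name zones differently, so treat the heuristic as an assumption.

```shell
#!/bin/sh
# Sketch: label a test by its path category before recording results.
# Zone names are hypothetical GCP-style strings; the region heuristic
# (strip the final "-suffix") is an assumption that fits that style.
label_path() {
  src_zone="$1"; dst_zone="$2"
  src_region="${src_zone%-*}"; dst_region="${dst_zone%-*}"
  if [ "$src_zone" = "$dst_zone" ]; then
    echo "same-zone"
  elif [ "$src_region" = "$dst_region" ]; then
    echo "cross-zone"
  else
    echo "cross-region"
  fi
}

label_path "us-central1-a" "us-central1-a"   # same-zone
label_path "us-central1-a" "us-central1-b"   # cross-zone
label_path "us-central1-a" "europe-west1-b"  # cross-region
```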

2. Measuring throughput fairly

You can use iperf3 with multiple parallel streams, because one TCP flow often fails to fill modern links reliably. In addition, you should run long enough to avoid startup optimism and you should repeat the run across multiple time windows.

You should also watch CPU usage during throughput tests, as encryption, checksumming and interrupt handling can cap results. If the sender becomes CPU-bound, your throughput result describes the sender CPU limit rather than the network path capacity.

A practical starting command for a same-zone throughput test is shown below and you can adjust -P for your link speed.

# On server 
iperf3 -s 
 
# On client 
iperf3 -c <server_ip> -t 60 -P 4 
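To verify the sender is not CPU-bound, you can sample /proc/stat on the client while iperf3 runs. This is a minimal whole-VM utilization sketch; per-core breakdowns (for example via mpstat) give a sharper picture, since a single saturated core can cap one flow.

```shell
#!/bin/sh
# Sketch: estimate whole-VM CPU utilization over an interval by sampling
# /proc/stat twice; run it beside iperf3 to spot a CPU-bound sender.
cpu_busy_pct() {
  interval="${1:-5}"
  read -r _ u1 n1 s1 i1 w1 _ < /proc/stat   # user nice system idle iowait
  sleep "$interval"
  read -r _ u2 n2 s2 i2 w2 _ < /proc/stat
  busy=$(( (u2 + n2 + s2) - (u1 + n1 + s1) ))
  idle=$(( (i2 + w2) - (i1 + w1) ))
  total=$(( busy + idle )); [ "$total" -gt 0 ] || total=1
  echo $(( 100 * busy / total ))
}

cpu_busy_pct 2    # prints utilization percent for a 2-second window
```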

3. Measuring latency and jitter credibly

You can start with ICMP ping to capture baseline RTT, then expand into application-relevant probes if your workload needs them. For example, you can measure TCP handshake latency, small-packet RTT and retransmissions when you are designing low-latency services.
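Extracting comparable numbers from ping output can be scripted. In the sketch below, SUMMARY is a hypothetical sample of the summary line Linux ping prints; in practice you would capture the real line, for example with `ping -c 100 <target_ip> | tail -n 1`.

```shell
#!/bin/sh
# Sketch: pull average RTT and jitter (mdev) out of a ping summary line.
# SUMMARY is a hypothetical sample, not a measured result.
SUMMARY="rtt min/avg/max/mdev = 0.412/0.523/1.118/0.094 ms"

STATS=$(echo "$SUMMARY" | awk -F' = ' '{
  split($2, v, "/")          # v[1]=min v[2]=avg v[3]=max v[4]="mdev ms"
  sub(/ ms$/, "", v[4])
  printf "avg_rtt_ms=%s jitter_ms=%s", v[2], v[4]
}')
echo "$STATS"
```

Recording avg and mdev per path category keeps latency results comparable across providers and time windows.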

We also suggest you state whether you used a placement policy, since placement can change single-flow performance. AWS documents that single-flow bandwidth can be limited to 5 Gbps when instances are not in the same cluster placement group and suggests cluster placement groups to reach up to 10 Gbps within the group.

That limit can materially change iperf results, particularly when you run with -P 1 and interpret it as “network is slow.” Therefore, you should document placement decisions and parallel stream counts in every published benchmark.

How to Turn Raw Benchmarks into a Fair Price-Performance Comparison?

Raw numbers are useful, yet they become decision-grade only after you normalize performance against total cost. You can do this without overcomplication if you define a few normalized metrics and apply them consistently.

1. Normalized metrics that make cross-provider comparisons helpful

  • For CPU, you can compute cost per unit of CPU score (for example, ₹ or $ per benchmark point per hour) and you should show both median and p95 to reflect variability.
  • For storage, you can compute $ per sustained IOPS at a target latency as IOPS without latency constraints is misleading.
  • For throughput-focused disks, you can compute $ per achieved GB/s and you should note the block size used.
  • For network, you can compute $ per achieved Gbps and you should separate same-zone results from cross-zone results.
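The normalization itself is simple arithmetic once the inputs are pinned down. The sketch below uses entirely hypothetical pricing and benchmark numbers; substitute your own hourly cost and measured medians.

```shell
#!/bin/sh
# Sketch: fold measured results into normalized price-performance numbers.
# All inputs are hypothetical; replace with your own pricing and medians.
HOURLY_COST=0.48       # $/hour: instance plus provisioned storage
MEDIAN_IOPS=15800      # sustained 4K random IOPS within the latency target
MEDIAN_GBPS=9.4        # same-zone iperf3 throughput, 4 parallel streams

RATIOS=$(awk -v c="$HOURLY_COST" -v iops="$MEDIAN_IOPS" -v gbps="$MEDIAN_GBPS" \
  'BEGIN {
    printf "cost_per_1k_iops_hr=%.4f\n", c / (iops / 1000)
    printf "cost_per_gbps_hr=%.4f", c / gbps
  }')
echo "$RATIOS"
```

Computing the same ratios from both median and p95 results shows whether a provider's price-performance holds up under tail behavior, not just on a good run.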

2. Include costs for achieving the best value

You should include storage performance provisioning costs, since performance knobs often add significant monthly charges. In addition, you should include data transfer and egress, since cheap compute can be offset by expensive outbound traffic.

You should also disclose commitment models separately, such as reserved and spot models, because they change risk and price. Flexera reports cloud spend is expected to increase 28% and that organizations are exceeding budgets by 17%, which makes benchmarking a budgeting tool.

Start a Free Trial on AceCloud Cloud Platform

Fair cloud VM benchmarking is the fastest way to avoid buying capacity that looks great on paper but falls short under real load. This is why we allow our customers to try our cloud computing resources for free before making a commitment.

You can use the CPU, storage and network steps in this guide to measure median and p95 results, then translate them into true price-performance. However, the details matter, including disk tier matching, network path labeling and repeatable automation across providers.

If you want a second set of eyes on your plan, you can connect with AceCloud experts to design a like-for-like benchmark, interpret variability and map results to your SLOs. Book a free consultation session to ask everything you want to know about cloud computing.

Frequently Asked Questions

Yes, and you can start with common open-source tools, then scale into a harness when you need repeatability across environments. In practice, many teams use fio for storage, iperf3 for network, and CPU tests like sysbench, stress-ng or application-level benchmarks (for example, compiling a fixed codebase).

Many teams test too late, then rush decisions based on unrealistic traffic patterns and insufficient repetition across time windows. You can reduce risk by defining realistic profiles first, then repeating runs and reporting the median together with p95 or p99, so you see both typical and tail behavior.

It can reduce failure risk by exposing bottlenecks early, which gives you time to fix limits before users hit them at peak demand. You should still run production-like canaries, since lab tests cannot cover every dependency and failure mode.

A single run is rarely enough in multi-tenant systems as placement and contention can change across hours and days. You can start smaller, then increase runs until the median and p95 stabilize within a tolerance you consider acceptable.
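That stabilization check can be sketched as a small script: compute the median and a nearest-rank p95 from one score per line, rerun after each batch of runs, and stop adding runs once both values stop moving. The sample scores below are hypothetical stand-ins for real results.

```shell
#!/bin/sh
# Sketch: median and nearest-rank p95 from one benchmark score per line.
# Sample scores are hypothetical; note the single 1500 outlier.
printf '%s\n' 980 1010 995 1003 972 1500 990 1001 968 997 > scores.txt

RES=$(sort -n scores.txt | awk '
  { a[NR] = $1 }
  END {
    median = (NR % 2) ? a[(NR + 1) / 2] : (a[NR / 2] + a[NR / 2 + 1]) / 2
    p95 = a[int((NR * 95 + 99) / 100)]   # nearest-rank: ceil(0.95 * n)
    printf "n=%d median=%s p95=%s", NR, median, p95
  }')
echo "$RES"
```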

Yes and you can reduce drift by using a harness that provisions identical infrastructure and executes the same benchmark configs.

Carolyn Weitz
Carolyn began her cloud career at a fast-growing SaaS company, where she led the migration from on-prem infrastructure to a fully containerized, cloud-native architecture using Kubernetes. Since then, she has worked with a range of companies, from early-stage startups to global enterprises, helping them implement best practices in cloud operations, infrastructure automation, and container orchestration. Her technical expertise spans AWS, Azure, and GCP, with a focus on building scalable IaaS environments and streamlining CI/CD pipelines. Carolyn is also a frequent contributor to cloud-native open-source communities and enjoys mentoring aspiring engineers in the Kubernetes ecosystem.
