Posts Tagged 'Throughput'

July 16, 2013

Riak Performance Analysis: Bare Metal v. Virtual

In December, I posted a MongoDB performance analysis that showed the quantitative benefits of using bare metal servers for MongoDB workloads. It should come as no surprise that in the wake of SoftLayer's Riak launch, we've got some similar data to share about running Riak on bare metal.

To run this test, we started by creating five-node clusters with Riak 1.3.1 on SoftLayer bare metal servers and on a popular competitor's public cloud instances. For the SoftLayer environment, we created these clusters using the Riak Solution Designer, so the nodes were all provisioned, configured and clustered for us automatically when we ordered them. For the public cloud virtual instance Riak cluster, each node was provisioned individually using a Riak image template and manually configured into a cluster after all had come online. To optimize for Riak performance, I made a few tweaks at the OS level of our servers (running CentOS 64-bit):

noatime
nodiratime
barrier=0
data=writeback
ulimit -n 65536

The common noatime and nodiratime settings eliminate the access-time writes that otherwise accompany reads, which helps performance and reduces disk wear. The barrier and writeback settings are a little less common and may not be what you'd normally set. Although those settings present a very slight risk for loss of data on disk failure, remember that the Riak solution is deployed in five-node rings with data redundantly available across multiple nodes in the ring. With that in mind, and considering that each node is also deployed with a RAID10 storage array, you can see that the minor risk for data loss on the failure of a single disk would have no impact on the data set as a whole (as there are plenty of redundant copies of that data available). Given the minor risk involved, the performance increases of those two settings justify their use.
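If you want to confirm that a node actually picked up the mount options listed above, a quick check against the mount table can help. This is only a minimal sketch: it assumes a Linux host whose mount table uses the usual /proc/mounts line layout, and the helper name missing_mount_options and the sample line are made up for illustration.

```python
# Hypothetical helper: report which desired mount options are absent from
# one line of /proc/mounts (device, mount point, fs type, options, dump, pass).
DESIRED = ("noatime", "nodiratime", "barrier=0", "data=writeback")


def missing_mount_options(mounts_line, desired=DESIRED):
    fields = mounts_line.split()
    if len(fields) < 4:
        raise ValueError("not a valid mounts line: %r" % mounts_line)
    active = set(fields[3].split(","))
    # Preserve the order of `desired` so the report is easy to read.
    return [opt for opt in desired if opt not in active]


# Illustrative line only; real output depends on the filesystem and kernel.
sample = "/dev/sda3 /var/lib/riak ext4 rw,noatime,nodiratime,data=writeback 0 0"
print(missing_mount_options(sample))
```

Some kernels report an option under a different spelling than the one you typed (barrier=0 may show up as nobarrier, for example), so treat a "missing" result as a prompt to look at the mount table yourself rather than as a verdict.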

With all of the nodes tweaked and configured into clusters, we set up Basho's test harness — Basho Bench — to remotely simulate load on the deployments. Basho Bench allows you to create a configurable test plan for a Riak cluster by configuring a number of workers to utilize a driver type to generate load. It comes packaged as an Erlang application with a config file example that you can alter to create the specifics for the concurrency, data set size, and duration of your tests. The results can be viewed as CSV data, and there is an optional graphics package that allows you to generate the graphs that I am posting in this blog. A simplified graphic of our test environment would look like this:

Riak Test Environment

The following Basho Bench config is what we used for our testing:

{mode, max}.
{duration, 120}.
{concurrent, 8}.
{driver, basho_bench_driver_riakc_pb}.
{key_generator,{int_to_bin,{uniform_int,1000000}}}.
{value_generator,{exponential_bin,4098,50000}}.
{riakc_pb_ips, [{10,60,68,9},{10,40,117,89},{10,80,64,4},{10,80,64,8},{10,60,68,7}]}.
{riakc_pb_replies, 2}.
{operations, [{get, 10},{put, 1}]}.

To spell it out a little more simply:

Tests Performed

Data Set: 400GB
10:1 Query-to-Update Operations
8 Concurrent Client Connections
Test Duration: 2 Hours
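To make the 10:1 ratio concrete: the numbers in the operations list act as relative weights, so the expected share of each operation falls out of simple arithmetic. Here is a minimal sketch of that calculation (the function name operation_shares is hypothetical and not part of Basho Bench):

```python
from fractions import Fraction


def operation_shares(weights):
    """Turn relative operation weights into exact expected shares."""
    total = sum(weights.values())
    return {op: Fraction(weight, total) for op, weight in weights.items()}


# Weights taken from the {operations, [{get, 10},{put, 1}]} line above.
shares = operation_shares({"get": 10, "put": 1})
for op, share in shares.items():
    print(f"{op}: {share} of operations ({float(share):.1%})")
```

In other words, roughly ten out of every eleven requests in these tests are reads.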

You may notice that in the test cases that use SoftLayer "Medium" Servers, the virtual provider nodes are running 26 virtual compute units against our dual proc hex-core servers (12 cores total). In testing with Riak, memory is more important to the operations than CPU resources, so we provisioned the virtual instances to align with the 36GB of memory in each of the "Medium" SoftLayer servers. In the public cloud environment, the higher level of RAM was restricted to packages with higher CPU, so while the CPU counts differ, the RAM amounts are as close to even as we could make them.

One final "housekeeping" note before we dive into the results: The graphs below are pulled directly from the optional graphics package that displays Basho Bench results. You'll notice that the scale on the left-hand side of the graphs differs dramatically between the two environments, so a cursory look at the results might not tell the whole story. At the end of each test case, we'll share a few observations about the operations per second and latency results from each test. When we talk about latency in the "key observation" sections, we'll talk about the 99th percentile line: 99% of the results had latency below this line. More simply, you could say, "Setting aside the slowest 1% of requests, this is the highest latency we saw on this platform in this test." The primary reason we're focusing on this line is that it's much easier to read on the graphs than the mean/median lines in the bottom graphs.
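If you'd like to compute a 99th percentile from your own latency samples, here is a minimal sketch using the nearest-rank definition. The latency values are made up for illustration, and the Basho Bench graphing package may well calculate its percentile lines differently.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of all samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up latencies in milliseconds: 98 fast requests and two slow outliers.
latencies_ms = [5] * 98 + [180, 900]

print("median:", percentile(latencies_ms, 50))
print("99th percentile:", percentile(latencies_ms, 99))
print("maximum:", max(latencies_ms))
```

With this sample, the 99th percentile lands on the second-slowest request rather than the single worst one, which is why the line is a good stand-in for "worst case" without being thrown off by one freak outlier.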

Riak Test 1: "Small" Bare Metal 5-Node Cluster vs Virtual 5-Node Cluster

Servers

SoftLayer Small Riak Server Node
Single 4-core Intel 1270 CPU
64-bit CentOS
8GB RAM
4 x 500GB SATAII – RAID10
1Gb Bonded Network
Virtual Provider Node
4 Virtual Compute Units
64-bit CentOS
7.5GB RAM
4 x 500GB Network Storage – RAID10
1Gb Network
 

Results

Riak Performance Analysis

Riak Performance Analysis

Key Observations

The SoftLayer environment showed much more consistency in operations per second with an average throughput around 450 Op/sec. The virtual environment throughput varied significantly, from about 50 operations per second to more than 600 operations per second, with the trend line fluctuating slightly between about 220 Op/sec and 350 Op/sec.

Comparing the latency of get and put requests, the 99th percentile of results in the SoftLayer environment stayed around 50ms for gets and under 200ms for puts while the same metric for the virtual environment hovered around 800ms in gets and 4000ms in puts. The scale of the graphs is drastically different, so if you aren't looking closely, you don't see how significantly the performance varies between the two.

Riak Test 2: "Medium" Bare Metal 5-Node Cluster vs Virtual 5-Node Cluster

Servers

SoftLayer Medium Riak Server Node
Dual 6-core Intel 5670 CPUs
64-bit CentOS
36GB RAM
4 x 300GB 15K SAS – RAID10
1Gb Network – Bonded
Virtual Provider Node
26 Virtual Compute Units
64-bit CentOS
30GB RAM
4 x 300GB Network Storage
1Gb Network
 

Results

Riak Performance Analysis

Riak Performance Analysis

Key Observations

Similar to the results of Test 1, the throughput numbers from the bare metal environment are more consistent (and are consistently higher) than the throughput results from the virtual instance environment. The SoftLayer environment performed between 1500 and 1750 operations per second on average while the virtual provider environment averaged around 1200 operations per second throughout the test.

The latency of get and put requests in Test 2 also paints a similar picture to Test 1. The 99th percentile of results in the SoftLayer environment stayed below 50ms for gets and under 400ms for puts while the same metric for the virtual environment averaged about 250ms in gets and over 1000ms in puts. Latency in a big data application can be a killer, so the results from the virtual provider might be setting off alarm bells in your head.

Riak Test 3: "Medium" Bare Metal 5-Node Cluster vs Virtual 5-Node Cluster

Servers

SoftLayer Medium Riak Server Node
Dual 6-core Intel 5670 CPUs
64-bit CentOS
36GB RAM
4 x 128GB SSD – RAID10
1Gb Network – Bonded
Virtual Provider Node
26 Virtual Compute Units
64-bit CentOS
30GB RAM
4 x 300GB Network Storage
1Gb Network
 

Results

Riak Performance Analysis

Riak Performance Analysis

Key Observations

In Test 3, we're using the same specs in our virtual provider nodes, so the results for the virtual node environment are the same in Test 3 as they are in Test 2. In this test, the SoftLayer environment substitutes SSDs for the 15K SAS drives used in Test 2, and the throughput numbers show the impact of that improved I/O. The average throughput of the bare metal environment with SSDs is between 1750 and 2000 operations per second. Those numbers are slightly higher than the SoftLayer environment in Test 2, further distancing the bare metal results from the virtual provider results.

The latency of gets for the SoftLayer environment is very difficult to see in this graph because the latency was so low throughout the test. The 99th percentile of puts in the SoftLayer environment settled between 500ms and 625ms, which was a little higher than the bare metal results from Test 2 but still well below the latency from the virtual environment.

Summary

The results show that — similar to the majority of data-centric applications that we have tested — Riak has more consistent, better performing, and lower latency results when deployed onto bare metal instead of a cluster of public cloud instances. The stark differences in consistency of the results and the latency are noteworthy for developers looking to host their big data applications. We compared the 99th percentile of latency, but the mean/median results are worth checking out as well. Look at the mean and median results from the SoftLayer SSD Node environment: For gets, the mean latency was 2.5ms and the median was somewhere around 1ms. For puts, the mean was between 7.5ms and 11ms and the median was around 5ms. Those kinds of results are almost unbelievable (and that's why I've shared everything involved in completing this test so that you can try it yourself and see that there's no funny business going on).

It's commonly understood that the local single-tenant resources of bare metal will always perform better than network storage resources, but once you put some concrete numbers on paper, the difference in performance is pretty amazing. Virtualizing on multi-tenant solutions with network attached storage often introduces latency issues, and performance will vary significantly depending on host load. These results may seem obvious, but sometimes the promise of quick and easy deployments on public cloud environments can lure even the sanest and most rational developer. Some applications are suited for public cloud, but big data isn't one of them. When you have data-centric apps that require extreme I/O traffic to your storage medium, nothing can beat local high performance resources.

-Harold

December 29, 2011

Using iPerf to Troubleshoot Speed/Throughput Issues

Two of the most common network characteristics we look at when investigating network-related concerns in the NOC are speed and throughput. You may have experienced the following scenario yourself: You just provisioned a new bad-boy server with a gigabit connection in a data center on the opposite side of the globe. You begin to upload your data and to your shock, you see "Time Remaining: 10 Hours." "What's wrong with the network?" you wonder. The traceroute and MTR look fine, but where's the performance and bandwidth I'm paying for?

This issue is all too common, and it has nothing to do with the network. In fact, the culprits are none other than TCP and the laws of physics.

In data transmission, TCP sends a certain amount of data then pauses. To ensure proper delivery of data, it doesn't send more until it receives an acknowledgement from the remote host that the data was received. The amount it can send before waiting is called the "TCP Window." Data travels at close to the speed of light, and typically, most hosts are fairly close together, so this "windowing" happens so fast we don't even notice it. But as the distance between two hosts increases, the speed of light remains constant. Thus, the further apart the two hosts, the longer it takes for the sender to receive the acknowledgement from the remote host, reducing overall throughput. The amount of data that has to be in flight to keep a link full is its bandwidth multiplied by the round-trip time, known as the "Bandwidth Delay Product," or BDP. When the TCP window is smaller than the BDP, a single stream can't fill the pipe.
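The relationship between window size, round-trip time and throughput is easy to put into numbers. The sketch below is a rough model only: it assumes one full window is delivered per round trip and ignores slow start, loss and protocol overhead. The 70 ms round-trip time is an assumed value for a long-haul path, not something measured for this article, and the function names are made up.

```python
def bdp_bytes(bandwidth_bits_per_sec, rtt_sec):
    """Bandwidth Delay Product: bytes that must be in flight to keep the link full."""
    return bandwidth_bits_per_sec * rtt_sec / 8


def max_throughput_bits_per_sec(window_bytes, rtt_sec):
    """Rough ceiling for one TCP stream: one window per round trip."""
    return window_bytes * 8 / rtt_sec


RTT = 0.070  # ASSUMPTION: 70 ms round trip, chosen for illustration only

print(f"BDP of a 1 Gbit/s link: {bdp_bytes(1e9, RTT) / 1e6:.2f} MB")
for window in (16 * 1024, 64 * 1024, 1024 * 1024):
    ceiling = max_throughput_bits_per_sec(window, RTT) / 1e6
    print(f"{window // 1024:>5} KiB window -> at most {ceiling:.1f} Mbit/s")
```

The takeaway is that, for a fixed round-trip time, the ceiling on a single stream scales directly with the window, which is exactly the knob the rest of this post turns.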

We can overcome BDP to some degree by sending more data at a time. We do this by adjusting the "TCP Window" – telling TCP to send more data per flow than the default parameters. Each OS is different and the default values will vary, but almost all operating systems allow tweaking of the TCP stack and/or using parallel data streams. So what is iPerf and how does it fit into all of this?

What is iPerf?

iPerf is a simple, open-source, command-line network diagnostic tool that runs on Linux, BSD, or Windows platforms and that you install on two endpoints. One side runs in a 'server' mode listening for requests; the other end runs in a 'client' mode that sends data. When activated, it tries to send as much data down your pipe as it can, spitting out transfer statistics as it does. What's so cool about iPerf is you can test in real time any number of TCP window settings, even using parallel streams. There's even a Java based GUI you can install that runs on top of it called JPerf (JPerf is beyond the scope of this article, but I recommend looking into it). What's even cooler is that because iPerf resides in memory, there are no files to clean up.

How do I use iPerf?

iPerf can be quickly downloaded from SourceForge to be installed. It uses port 5001 by default, and the bandwidth it displays is from the client to the server. Each test runs for 10 seconds by default, but virtually every setting is adjustable. Once installed, simply bring up the command line on both of the hosts and run these commands.

On the server side:
iperf -s

On the client side:
iperf -c [server_ip]

The output on the client side will look like this:

#iperf -c 10.10.10.5
------------------------------------------------------------
Client connecting to 10.10.10.5, TCP port 5001
TCP window size: 16.0 KByte (default)
------------------------------------------------------------
[  3] local 0.0.0.0 port 46956 connected with 10.10.10.5 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  10.0 MBytes  8.39 Mbits/sec

There are a lot of things we can do to make this output better with more meaningful data. For example, let's say we want the test to run for 20 seconds instead of 10 (-t 20), and we want to display transfer data every 2 seconds instead of the default of 10 (-i 2), and we want to test on port 8000 instead of 5001 (-p 8000). For the purposes of this exercise, let's use those customizations as our baseline. This is what the command string would look like on both ends:

Client Side:

#iperf -c 10.10.10.5 -p 8000 -t 20 -i 2
------------------------------------------------------------
Client connecting to 10.10.10.5, TCP port 8000
TCP window size: 16.0 KByte (default)
------------------------------------------------------------
[  3] local 10.10.10.10 port 46956 connected with 10.10.10.5 port 8000
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 2.0 sec  6.00 MBytes  25.2 Mbits/sec
[  3]  2.0- 4.0 sec  7.12 MBytes  29.9 Mbits/sec
[  3]  4.0- 6.0 sec  7.00 MBytes  29.4 Mbits/sec
[  3]  6.0- 8.0 sec  7.12 MBytes  29.9 Mbits/sec
[  3]  8.0-10.0 sec  7.25 MBytes  30.4 Mbits/sec
[  3] 10.0-12.0 sec  7.00 MBytes  29.4 Mbits/sec
[  3] 12.0-14.0 sec  7.12 MBytes  29.9 Mbits/sec
[  3] 14.0-16.0 sec  7.25 MBytes  30.4 Mbits/sec
[  3] 16.0-18.0 sec  6.88 MBytes  28.8 Mbits/sec
[  3] 18.0-20.0 sec  7.25 MBytes  30.4 Mbits/sec
[  3]  0.0-20.0 sec  70.1 MBytes  29.4 Mbits/sec

Server Side:

#iperf -s -p 8000 -i 2
------------------------------------------------------------
Server listening on TCP port 8000
TCP window size: 8.00 KByte (default)
------------------------------------------------------------
[852] local 10.10.10.5 port 8000 connected with 10.10.10.10 port 58316
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0- 2.0 sec  6.05 MBytes  25.4 Mbits/sec
[  4]  2.0- 4.0 sec  7.19 MBytes  30.1 Mbits/sec
[  4]  4.0- 6.0 sec  6.94 MBytes  29.1 Mbits/sec
[  4]  6.0- 8.0 sec  7.19 MBytes  30.2 Mbits/sec
[  4]  8.0-10.0 sec  7.19 MBytes  30.1 Mbits/sec
[  4] 10.0-12.0 sec  6.95 MBytes  29.1 Mbits/sec
[  4] 12.0-14.0 sec  7.19 MBytes  30.2 Mbits/sec
[  4] 14.0-16.0 sec  7.19 MBytes  30.2 Mbits/sec
[  4] 16.0-18.0 sec  6.95 MBytes  29.1 Mbits/sec
[  4] 18.0-20.0 sec  7.19 MBytes  30.1 Mbits/sec
[  4]  0.0-20.0 sec  70.1 MBytes  29.4 Mbits/sec
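If the Transfer and Bandwidth columns look like they don't quite agree, it's because they use different unit bases. The sketch below assumes iPerf's default formatting, where an MByte is 2^20 bytes and an Mbit is 10^6 bits; the function name is made up, and the inputs are taken from the client output above.

```python
def mbits_per_sec(mbytes, seconds):
    """Convert a Transfer figure (MBytes of 2**20 bytes) into Mbits/sec (10**6 bits)."""
    return mbytes * 2**20 * 8 / seconds / 1e6


# Whole 20-second baseline run, then the 2.0-4.0 sec interval.
print(round(mbits_per_sec(70.1, 20.0), 1))
print(round(mbits_per_sec(7.12, 2.0), 1))
```

Both values round to the figures iPerf printed in the Bandwidth column (29.4 and 29.9 Mbits/sec), so the same conversion is a handy sanity check on any line of output.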

There are many, many other parameters you can set that are beyond the scope of this article, but for our purposes, the main use is to prove out our bandwidth. This is where we'll use the TCP window options and parallel streams. To set a new TCP window you use the -w switch and you can set the parallel streams by using -P.
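If you plan to run the same test with several window sizes or stream counts, it can help to build the client command from parameters instead of retyping it. This is a minimal sketch that uses only the switches covered in this article (-c, -p, -t, -i, -w, -P); the function name iperf_client_args is made up, and the code only assembles the argument list without running anything.

```python
def iperf_client_args(server_ip, port=None, seconds=None, interval=None,
                      window=None, streams=None):
    """Assemble an iperf client command line as a list of arguments."""
    args = ["iperf", "-c", server_ip]
    # Each optional switch is added only when a value was supplied.
    for flag, value in (("-p", port), ("-t", seconds), ("-i", interval),
                        ("-w", window), ("-P", streams)):
        if value is not None:
            args += [flag, str(value)]
    return args


baseline = iperf_client_args("10.10.10.5", port=8000, seconds=20, interval=2)
tuned = iperf_client_args("10.10.10.5", port=8000, seconds=20, interval=2,
                          window="1024k", streams=7)
print(" ".join(baseline))
print(" ".join(tuned))
```

The resulting list can be handed to Python's subprocess.run on a host where iPerf is installed, with the server side already listening on the same port.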

Increased TCP window commands:

Server side:
#iperf -s -w 1024k -i 2

Client side:
#iperf -i 2 -t 20 -c 10.10.10.5 -w 1024k

And here are the iPerf results from two SoftLayer file servers – one in Washington, D.C., acting as Client, the other in Seattle acting as Server:

Client Side:

# iperf -i 2 -t 20 -c 10.10.10.5 -p 8000 -w 1024k
------------------------------------------------------------
Client connecting to 10.10.10.5, TCP port 8000
TCP window size: 1.00 MByte (WARNING: requested 1.00 MByte)
------------------------------------------------------------
[  3] local 10.10.10.10 port 53903 connected with 10.10.10.5 port 8000
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 2.0 sec  25.9 MBytes   109 Mbits/sec
[  3]  2.0- 4.0 sec  28.5 MBytes   120 Mbits/sec
[  3]  4.0- 6.0 sec  28.4 MBytes   119 Mbits/sec
[  3]  6.0- 8.0 sec  28.9 MBytes   121 Mbits/sec
[  3]  8.0-10.0 sec  28.0 MBytes   117 Mbits/sec
[  3] 10.0-12.0 sec  29.0 MBytes   122 Mbits/sec
[  3] 12.0-14.0 sec  28.0 MBytes   117 Mbits/sec
[  3] 14.0-16.0 sec  29.0 MBytes   122 Mbits/sec
[  3] 16.0-18.0 sec  27.9 MBytes   117 Mbits/sec
[  3] 18.0-20.0 sec  29.0 MBytes   122 Mbits/sec
[  3]  0.0-20.0 sec   283 MBytes   118 Mbits/sec

Server Side:

#iperf -s -w 1024k -i 2 -p 8000
------------------------------------------------------------
Server listening on TCP port 8000
TCP window size: 1.00 MByte
------------------------------------------------------------
[  4] local 10.10.10.5 port 8000 connected with 10.10.10.10 port 53903
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0- 2.0 sec  25.9 MBytes   109 Mbits/sec
[  4]  2.0- 4.0 sec  28.6 MBytes   120 Mbits/sec
[  4]  4.0- 6.0 sec  28.3 MBytes   119 Mbits/sec
[  4]  6.0- 8.0 sec  28.9 MBytes   121 Mbits/sec
[  4]  8.0-10.0 sec  28.0 MBytes   117 Mbits/sec
[  4] 10.0-12.0 sec  29.0 MBytes   121 Mbits/sec
[  4] 12.0-14.0 sec  28.0 MBytes   117 Mbits/sec
[  4] 14.0-16.0 sec  29.0 MBytes   122 Mbits/sec
[  4] 16.0-18.0 sec  28.0 MBytes   117 Mbits/sec
[  4] 18.0-20.0 sec  29.0 MBytes   121 Mbits/sec
[  4]  0.0-20.0 sec   283 MBytes   118 Mbits/sec

We can see here that by increasing the TCP window from the default value to 1MB (1024k), we achieved roughly four times the throughput of our baseline (about a 300% increase). Unfortunately, this is the limit of this OS in terms of Window size. So what more can we do? Parallel streams! With multiple simultaneous streams we can fill the pipe close to its maximum usable amount.

Parallel Stream Command:
#iperf -i 2 -t 20 -c 10.10.10.5 -p 8000 -w 1024k -P 7

Client Side:

#iperf -i 2 -t 20 -c 10.10.10.5 -p 8000 -w 1024k -P 7
------------------------------------------------------------
Client connecting to 10.10.10.5, TCP port 8000
TCP window size: 1.00 MByte (WARNING: requested 1.00 MByte)
------------------------------------------------------------
[ ID] Interval       Transfer     Bandwidth
[  9]  0.0- 2.0 sec  24.9 MBytes   104 Mbits/sec
[  4]  0.0- 2.0 sec  24.9 MBytes   104 Mbits/sec
[  7]  0.0- 2.0 sec  25.6 MBytes   107 Mbits/sec
[  8]  0.0- 2.0 sec  24.9 MBytes   104 Mbits/sec
[  5]  0.0- 2.0 sec  25.8 MBytes   108 Mbits/sec
[  3]  0.0- 2.0 sec  25.9 MBytes   109 Mbits/sec
[  6]  0.0- 2.0 sec  25.9 MBytes   109 Mbits/sec
[SUM]  0.0- 2.0 sec   178 MBytes   746 Mbits/sec
 
(output omitted for brevity on server & client)
 
[  7] 18.0-20.0 sec  28.2 MBytes   118 Mbits/sec
[  8] 18.0-20.0 sec  28.8 MBytes   121 Mbits/sec
[  5] 18.0-20.0 sec  28.0 MBytes   117 Mbits/sec
[  4] 18.0-20.0 sec  28.0 MBytes   117 Mbits/sec
[  3] 18.0-20.0 sec  28.9 MBytes   121 Mbits/sec
[  9] 18.0-20.0 sec  28.8 MBytes   121 Mbits/sec
[  6] 18.0-20.0 sec  28.9 MBytes   121 Mbits/sec
[SUM] 18.0-20.0 sec   200 MBytes   837 Mbits/sec
[SUM]  0.0-20.0 sec  1.93 GBytes   826 Mbits/sec 

Server Side:

#iperf -s -w 1024k -i 2 -p 8000
------------------------------------------------------------
Server listening on TCP port 8000
TCP window size: 1.00 MByte
------------------------------------------------------------
[  4] local 10.10.10.5 port 8000 connected with 10.10.10.10 port 53903
[ ID] Interval       Transfer     Bandwidth
[  5]  0.0- 2.0 sec  25.7 MBytes   108 Mbits/sec
[  8]  0.0- 2.0 sec  24.9 MBytes   104 Mbits/sec
[  4]  0.0- 2.0 sec  24.9 MBytes   104 Mbits/sec
[  9]  0.0- 2.0 sec  24.9 MBytes   104 Mbits/sec
[ 10]  0.0- 2.0 sec  25.9 MBytes   108 Mbits/sec
[  7]  0.0- 2.0 sec  25.9 MBytes   109 Mbits/sec
[  6]  0.0- 2.0 sec  25.9 MBytes   109 Mbits/sec
[SUM]  0.0- 2.0 sec   178 MBytes   747 Mbits/sec
 
[  4] 18.0-20.0 sec  28.8 MBytes   121 Mbits/sec
[  5] 18.0-20.0 sec  28.3 MBytes   119 Mbits/sec
[  7] 18.0-20.0 sec  28.8 MBytes   121 Mbits/sec
[ 10] 18.0-20.0 sec  28.1 MBytes   118 Mbits/sec
[  9] 18.0-20.0 sec  28.0 MBytes   118 Mbits/sec
[  8] 18.0-20.0 sec  28.8 MBytes   121 Mbits/sec
[  6] 18.0-20.0 sec  29.0 MBytes   121 Mbits/sec
[SUM] 18.0-20.0 sec   200 MBytes   838 Mbits/sec
[SUM]  0.0-20.1 sec  1.93 GBytes   825 Mbits/sec
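With seven streams reporting every two seconds, the output gets long quickly, so a few lines of parsing can save some squinting. The sketch below adds up the per-stream Bandwidth figures for one interval and compares them with the [SUM] line. The regular expression is written against the output format shown in this article and may need adjusting for other iPerf versions; the function name is made up, and the sample text is the first interval of the client output above.

```python
import re

# Matches lines such as "[  9]  0.0- 2.0 sec  24.9 MBytes   104 Mbits/sec".
LINE = re.compile(
    r"\[\s*(?P<id>\d+|SUM)\]\s+"
    r"(?P<start>[\d.]+)-\s*(?P<end>[\d.]+) sec\s+"
    r"(?P<transfer>[\d.]+) (?P<unit>[KMG]Bytes)\s+"
    r"(?P<rate>[\d.]+) Mbits/sec"
)


def stream_totals(report):
    """Return (sum of per-stream Mbits/sec, reported [SUM] Mbits/sec or None)."""
    per_stream, reported = 0.0, None
    for line in report.splitlines():
        match = LINE.search(line)
        if not match:
            continue  # skip headers, separators and blank lines
        if match.group("id") == "SUM":
            reported = float(match.group("rate"))
        else:
            per_stream += float(match.group("rate"))
    return per_stream, reported


report = """\
[ ID] Interval       Transfer     Bandwidth
[  9]  0.0- 2.0 sec  24.9 MBytes   104 Mbits/sec
[  4]  0.0- 2.0 sec  24.9 MBytes   104 Mbits/sec
[  7]  0.0- 2.0 sec  25.6 MBytes   107 Mbits/sec
[  8]  0.0- 2.0 sec  24.9 MBytes   104 Mbits/sec
[  5]  0.0- 2.0 sec  25.8 MBytes   108 Mbits/sec
[  3]  0.0- 2.0 sec  25.9 MBytes   109 Mbits/sec
[  6]  0.0- 2.0 sec  25.9 MBytes   109 Mbits/sec
[SUM]  0.0- 2.0 sec   178 MBytes   746 Mbits/sec
"""

added_up, reported = stream_totals(report)
print(f"streams add up to {added_up:.0f} Mbits/sec, [SUM] reports {reported:.0f} Mbits/sec")
```

The two totals differ by 1 Mbit/sec here, which is just rounding in the per-stream figures.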

As you can see from the tests above, we were able to increase throughput from 29Mb/s with a single stream and the default TCP Window to about 825Mb/s using a larger window and parallel streams. On a Gigabit link, this is about the maximum throughput one could hope to achieve before saturating the link and causing packet loss. The bottom line is, I was able to prove out the network and verify bandwidth capacity was not an issue. From that conclusion, I could focus on tweaking TCP to get the most out of my network.
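For anyone who likes to see the arithmetic, here is a small sketch that turns the three client-side totals above into speedups and a link utilization figure. The function names are made up, and the 1000 Mbits/sec link speed is simply the nominal rate of a Gigabit port.

```python
def speedup(before_mbps, after_mbps):
    """How many times faster the second measurement is than the first."""
    return after_mbps / before_mbps


def utilization(measured_mbps, link_mbps=1000.0):
    """Fraction of the nominal link rate that was actually used."""
    return measured_mbps / link_mbps


BASELINE, TUNED_WINDOW, PARALLEL = 29.4, 118.0, 826.0  # Mbits/sec, client totals

print(f"larger window:           {speedup(BASELINE, TUNED_WINDOW):.1f}x")
print(f"window plus 7 streams:   {speedup(BASELINE, PARALLEL):.1f}x")
print(f"Gigabit link utilization: {utilization(PARALLEL):.1%}")
```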

I'd like to point out that we will never get 100% out of any link. Typically, 90% utilization is about the real world maximum anyone will achieve. If you get any more, you'll begin to saturate the link and incur packet loss. I should also point out that SoftLayer doesn't directly support iPerf, so it's up to you to install and play around with. It's such a versatile and easy to use little piece of software that it's become invaluable to me, and I think it will become invaluable to you as well!

-Andrew
