Posts Tagged 'Latency'

July 16, 2013

Riak Performance Analysis: Bare Metal v. Virtual

In December, I posted a MongoDB performance analysis that showed the quantitative benefits of using bare metal servers for MongoDB workloads. It should come as no surprise that in the wake of SoftLayer's Riak launch, we've got some similar data to share about running Riak on bare metal.

To run this test, we started by creating five-node clusters with Riak 1.3.1 on SoftLayer bare metal servers and on a popular competitor's public cloud instances. For the SoftLayer environment, we created these clusters using the Riak Solution Designer, so the nodes were all provisioned, configured and clustered for us automatically when we ordered them. For the public cloud virtual instance Riak cluster, each node was provisioned individually using a Riak image template and manually configured into a cluster after all had come online. To optimize for Riak performance, I made a few tweaks at the OS level of our servers (running CentOS 64-bit):

noatime
nodiratime
barrier=0
data=writeback
ulimit -n 65536

The common noatime and nodiratime settings eliminate the need for writes during reads, which helps both performance and disk wear. The barrier and writeback settings are a little less common and may not be what you'd normally set. Although those settings present a very slight risk of data loss on disk failure, remember that the Riak solution is deployed in five-node rings with data redundantly available across multiple nodes in the ring. With that in mind, and considering that each node is also deployed with a RAID10 storage array, the minor risk of data loss from the failure of a single disk would have no impact on the data set as a whole (there are plenty of redundant copies of that data available). Given the minor risk involved, the performance increases from those two settings justify their use.
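For reference, the first four tweaks are ext filesystem mount options and the last raises the per-process open-file limit; a sketch of where they would live on a CentOS node (the device, mount point, and user name here are illustrative, not taken from the actual test servers):

```
# /etc/fstab -- mount options for the Riak data volume (illustrative device/path)
/dev/sda3  /var/lib/riak  ext4  noatime,nodiratime,barrier=0,data=writeback  0 0

# /etc/security/limits.conf -- make the open-file limit persistent for the riak user
riak  soft  nofile  65536
riak  hard  nofile  65536
```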

With all of the nodes tweaked and configured into clusters, we set up Basho's test harness — Basho Bench — to remotely simulate load on the deployments. Basho Bench allows you to create a configurable test plan for a Riak cluster by configuring a number of workers to utilize a driver type to generate load. It comes packaged as an Erlang application with a config file example that you can alter to create the specifics for the concurrency, data set size, and duration of your tests. The results can be viewed as CSV data, and there is an optional graphics package that allows you to generate the graphs that I am posting in this blog. A simplified graphic of our test environment would look like this:

Riak Test Environment

The following Basho Bench config is what we used for our testing:

{mode, max}.
{duration, 120}.
{concurrent, 8}.
{driver, basho_bench_driver_riakc_pb}.
{key_generator,{int_to_bin,{uniform_int,1000000}}}.
{value_generator,{exponential_bin,4098,50000}}.
{riakc_pb_ips, [{10,60,68,9},{10,40,117,89},{10,80,64,4},{10,80,64,8},{10,60,68,7}]}.
{riakc_pb_replies, 2}.
{operations, [{get, 10},{put, 1}]}.

To spell it out a little more simply:

Tests Performed

Data Set: 400GB
10:1 Query-to-Update Operations
8 Concurrent Client Connections
Test Duration: 2 Hours
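The operation weights in the config above can be read as probabilities; a quick sketch of the arithmetic (the variable names are ours, not Basho Bench's):

```python
# The {get, 10}, {put, 1} weights in the Basho Bench config translate into a
# probability mix over operations; this just spells out the arithmetic.
weights = {"get": 10, "put": 1}
total = sum(weights.values())
mix = {op: weight / total for op, weight in weights.items()}
# gets are 10/11 of requests (~90.9%) and puts 1/11 (~9.1%),
# matching the "10:1 Query-to-Update Operations" line above
```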

You may notice that in the test cases that use SoftLayer "Medium" Servers, the virtual provider nodes are running 26 virtual compute units against our dual-proc hex-core servers (12 cores total). In testing with Riak, memory is more important to the operations than CPU resources, so we provisioned the virtual instances to align with the 36GB of memory in each of the "Medium" SoftLayer servers. In the public cloud environment, the higher level of RAM was restricted to packages with higher CPU counts, so while the CPU counts differ, the RAM amounts are as close to even as we could make them.

One final "housekeeping" note before we dive into the results: The graphs below are pulled directly from the optional graphics package that displays Basho Bench results. You'll notice that the scale on the left-hand side of the graphs differs dramatically between the two environments, so a cursory look at the results might not tell the whole story. Click any of the graphs below for a larger version. At the end of each test case, we'll share a few observations about the operations-per-second and latency results. When we talk about latency in the "Key Observations" sections, we'll focus on the 99th percentile line — 99% of the results had latency below this line. Put more simply: "This is the highest latency we saw on this platform in this test." The primary reason we're focusing on this line is that it's much easier to read on the graphs than the mean/median lines in the bottom graphs.
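For readers who want to compute the same percentile line from their own Basho Bench CSV output, a minimal sketch (the sample latencies are hypothetical, and this uses the simple nearest-rank method, which may differ slightly from whatever interpolation the graphing package applies):

```python
# A sketch of how a 99th-percentile line is computed from raw latency samples.
def percentile(samples, p):
    """Return the value below which roughly p percent of samples fall."""
    ordered = sorted(samples)
    # Nearest-rank: round p% of n to the closest rank, then convert to 0-based.
    rank = max(0, int(round(p / 100.0 * len(ordered))) - 1)
    return ordered[rank]

# Hypothetical get latencies in milliseconds: mostly fast, one slow outlier
latencies_ms = [12, 15, 11, 14, 13, 16, 12, 15, 14, 200]
p99 = percentile(latencies_ms, 99)  # dominated by the slowest samples: 200
```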

Riak Test 1: "Small" Bare Metal 5-Node Cluster vs Virtual 5-Node Cluster

Servers

SoftLayer Small Riak Server Node
Single 4-core Intel 1270 CPU
64-bit CentOS
8GB RAM
4 x 500GB SATAII – RAID10
1Gb Bonded Network
Virtual Provider Node
4 Virtual Compute Units
64-bit CentOS
7.5GB RAM
4 x 500GB Network Storage – RAID10
1Gb Network
 

Results

Riak Performance Analysis

Riak Performance Analysis

Key Observations

The SoftLayer environment showed much more consistency in operations per second, with an average throughput around 450 Op/sec. The virtual environment's throughput varied significantly, from about 50 operations per second to more than 600, with the trend line fluctuating between about 220 Op/sec and 350 Op/sec.

Comparing the latency of get and put requests, the 99th percentile of results in the SoftLayer environment stayed around 50ms for gets and under 200ms for puts while the same metric for the virtual environment hovered around 800ms in gets and 4000ms in puts. The scale of the graphs is drastically different, so if you aren't looking closely, you don't see how significantly the performance varies between the two.

Riak Test 2: "Medium" Bare Metal 5-Node Cluster vs Virtual 5-Node Cluster

Servers

SoftLayer Medium Riak Server Node
Dual 6-core Intel 5670 CPUs
64-bit CentOS
36GB RAM
4 x 300GB 15K SAS – RAID10
1Gb Network – Bonded
Virtual Provider Node
26 Virtual Compute Units
64-bit CentOS
30GB RAM
4 x 300GB Network Storage
1Gb Network
 

Results

Riak Performance Analysis

Riak Performance Analysis

Key Observations

Similar to the results of Test 1, the throughput numbers from the bare metal environment are more consistent (and are consistently higher) than the throughput results from the virtual instance environment. The SoftLayer environment performed between 1500 and 1750 operations per second on average while the virtual provider environment averaged around 1200 operations per second throughout the test.

The latency of get and put requests in Test 2 also paints a similar picture to Test 1. The 99th percentile of results in the SoftLayer environment stayed below 50ms for gets and under 400ms for puts, while the same metric for the virtual environment averaged about 250ms for gets and over 1000ms for puts. Latency in a big data application can be a killer, so the results from the virtual provider might be setting off alarm bells in your head.

Riak Test 3: "Medium" Bare Metal 5-Node Cluster vs Virtual 5-Node Cluster

Servers

SoftLayer Medium Riak Server Node
Dual 6-core Intel 5670 CPUs
64-bit CentOS
36GB RAM
4 x 128GB SSD – RAID10
1Gb Network – Bonded
Virtual Provider Node
26 Virtual Compute Units
64-bit CentOS
30GB RAM
4 x 300GB Network Storage
1Gb Network
 

Results

Riak Performance Analysis

Riak Performance Analysis

Key Observations

In Test 3, we're using the same specs in our virtual provider nodes, so the results for the virtual node environment are the same in Test 3 as they are in Test 2. In this test, the SoftLayer environment substitutes SSDs for the 15K SAS drives used in Test 2, and the throughput numbers show the impact of that improved I/O. The average throughput of the bare metal environment with SSDs is between 1750 and 2000 operations per second. Those numbers are slightly higher than the SoftLayer environment's in Test 2, further distancing the bare metal results from the virtual provider results.

The latency of gets for the SoftLayer environment is very difficult to see in this graph because the latency was so low throughout the test. The 99th percentile of puts in the SoftLayer environment settled between 500ms and 625ms, which was a little higher than the bare metal results from Test 2 but still well below the latency from the virtual environment.

Summary

The results show that — similar to the majority of data-centric applications we have tested — Riak delivers more consistent, better-performing, lower-latency results when deployed on bare metal instead of on a cluster of public cloud instances. The stark differences in consistency and latency are noteworthy for developers looking to host their big data applications. We compared the 99th percentile of latency, but the mean/median results are worth checking out as well. Look at the mean and median results from the SoftLayer SSD node environment: For gets, the mean latency was 2.5ms and the median was somewhere around 1ms. For puts, the mean was between 7.5ms and 11ms and the median was around 5ms. Those kinds of results are almost unbelievable (and that's why I've shared everything involved in completing this test, so that you can try it yourself and see that there's no funny business going on).

It's commonly understood that local single-tenant resources like bare metal will always outperform network storage resources, but putting some concrete numbers on paper shows that the difference in performance is pretty amazing. Virtualizing on multi-tenant solutions with network-attached storage often introduces latency issues, and performance will vary significantly depending on host load. These results may seem obvious, but sometimes the promise of quick and easy deployments on public cloud environments can lure even the sanest and most rational developer. Some applications are well suited for public cloud, but big data isn't one of them: When you have data-centric apps that require extreme I/O traffic to your storage medium, nothing beats local high-performance resources.

-Harold

July 25, 2012

ServerDensity: Tech Partner Spotlight

We invite each of our featured SoftLayer Tech Marketplace Partners to contribute a guest post to the SoftLayer Blog, and this week, we're happy to welcome David Mytton, Founder of Server Density. Server Density is a hosted server and website monitoring service that alerts you when your website is slow, down or back up.

5 Ways to Minimize Downtime During Summer Vacation

It's a fact of life that everything runs smoothly until you're out of contact, away from the Internet or on holiday. However, you can't be available 24/7 on the chance that something breaks; instead, there are several things you can do to ensure that when things go wrong, the problem can be managed and resolved quickly. To help you set up your own "get back up" plan, we've come up with a checklist of the top five things you can do to prepare for an ill-timed issue.

1. Monitoring

How will you know when things break? Using a tool like Server Density — which combines availability monitoring from locations around the world with internal server metrics like disk usage, Apache and MySQL — means that you can be alerted if your site goes down, and have the data to find out why.

Surprisingly, the most common problems we see are some that are the easiest to fix. One problem that happens all too often is when a customer simply runs out of disk space in a volume! If you've ever had it happen to you, you know that running out of space will break things in strange ways — whether it prevents the database from accepting writes or fails to store web sessions on disk. By doing something as simple as setting an alert to monitor used disk space for all important volumes (not just root) at around 75%, you'll have proactive visibility into your server to avoid hitting volume capacity.
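That kind of threshold check is simple to automate; here's a minimal sketch along the lines of what a monitoring agent does (the 75% threshold comes from the post, while the function names and volume list are illustrative):

```python
import shutil

def volume_usage_percent(path):
    """Percentage of a volume's capacity currently used."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total

def volumes_over_threshold(paths, threshold=75.0):
    """Return the volumes whose used space exceeds the alert threshold."""
    return [p for p in paths if volume_usage_percent(p) > threshold]

# In production you'd list every important volume, not just root --
# e.g. wherever the database files and web sessions live.
alerts = volumes_over_threshold(["/"])
```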

Additionally, you should define triggers for unusual values that will set off a red flag for you. For example, if your Apache requests per second suddenly drop significantly, that change could indicate a problem somewhere else in your infrastructure, and if you're not monitoring those indirect triggers, you may not learn about those other problems as quickly as you'd like. Find measurable direct and indirect relationships that can give you this kind of early warning, and find a way to measure them and alert yourself when something changes.

2. Dealing with Alerts

It's no good having alerts sent to someone who isn't responding (or who can't at a given time). Using a service like Pagerduty allows you to define on-call rotations for different types of alerts. Nobody wants to be on-call every hour of every day, so differentiating and channeling alerts in an automated way could save you a lot of hassle. Another huge benefit of a platform like Pagerduty is that it also handles escalations: If the first contact in the path doesn't wake up or is out of service, someone else gets notified quickly.

3. Tracking Incidents

Whether you're the only person responsible or you have a team of engineers, you'll want to track the status of alerts/issues, particularly if they require escalation to different vendors. If an incident lasts a long time, you'll want to be able to hand it off to another person in your organization with all of the information they need. By tracking incidents with detailed notes, you can avoid fatigue and prevent unnecessary repetition of troubleshooting steps.

We use JIRA for this because it allows you to define workflows an issue can progress along as you work on it. It also includes easy access to custom fields (e.g. specifying a vendor ticket ID), and issues can be assigned to different people.

4. Understanding What Happened

After you have received an alert, acknowledged it and started tracking the incident, it's time to start investigating. Often, this involves looking at logs, and if you only have one or two servers, it's relatively easy, but as soon as you add more, the process can get exponentially more difficult.

We recommend piping them all into a log search tool like (fellow Tech Partners Marketplace participant) Papertrail or Loggly. Those platforms afford you access to all of your logs from a single interface with the ability to see incoming lines in real-time or the functionality to search back to when the incident began (since you've clearly monitored and tracked all of that information in the first three steps).

5. Getting Access to Your Servers

If you're traveling internationally, access to the Internet via a free hotspot like the ones you find in Starbucks isn't always possible. It's always a great idea to order a portable 3G hotspot in advance of a trip. You can usually pick one up from the airport to get basic Internet access without paying ridiculous roaming charges. Once you have your connection, the next step is to make sure you can access your servers.

Both iPhone and Android have SSH and remote desktop apps available which allow you to quickly log into your servers to fix easy problems. Having those tools often saves a lot of time if you don't have access to your laptop, but they also introduce a security concern: If you open server logins to the world so you can login from the dynamic IPs that change when you use mobile connectivity, then it's worth considering a multi-factor authentication layer. We use Duo Security for several reasons, with one major differentiator being the modules they have available for all major server operating systems to lock down our logins even further.

You're never going to escape the reality of system administration: If your server has a problem, you need to fix it. What you can get away from is the uncertainty of not having a clearly defined process for responding to issues when they arise.

-David Mytton, ServerDensity

This guest blog series highlights companies in SoftLayer's Technology Partners Marketplace.
These Partners have built their businesses on the SoftLayer Platform, and we're excited for them to tell their stories. New Partners will be added to the Marketplace each month, so stay tuned for many more to come.

October 15, 2011

Lower Latency: Neutrino Network?

SoftLayer is on the "bleeding edge" of technology, and that's right where I'm comfortable. I love being a part of something new and relevant. I also love science fiction and find that it's mixing together with reality more and more these days. Yay for me and my nerdiness! Beam me up, Luke Skywalker! (I wonder how many nerds cringed at that statement!)

In a recent post from New Scientist, a test showed neutrino particles being clocked faster than the speed of light, and a dimension-hop might be the reason. Rather than go into the nerdy parts of the article that I'm sure you read before continuing to this sentence, I want to imagine how SoftLayer could use this to our (and more importantly our customers') advantage: A neutrino network! We could have the fastest network in the world, and we could use the technology for faster motherboards and components too. Because that's how we roll.

Enter science fiction. Let's say neutrinos were indeed using another dimension to travel. Like, say, the 8th dimension referred to in "The Adventures of Buckaroo Banzai Across the 8th Dimension." This dimension also happens to be a prison used by the Lectroids of Planet 10 to store criminals. Go figure, right? Obstacles always come up, so if our neutrino network was targeted by those Lectroids, Dody Lira and the abuse team would have no problems taking them down ... After all, Lectroids' fiddling with data can be bad for business (not to mention the possibility of Lectroids using our network to come back to this dimension, wreak havoc, and eat all our junk food). Dody would have to upgrade some of the tools his team uses, like a Jet Car with an "Oscillation Overthruster" (which looks eerily similar to the Flux Capacitor) to travel in and out of the 8th dimension to hunt down those pesky Lectroids that won't comply.

Then, after Dody and crew wrangle the Lectroids (as I'm sure they would), we could offer the Lectroids email and Internet service. Bam! More customers on top of a supernatural network!

Coming back to reality (a bit), we have an interesting world ahead of us. Technologies we have only seen in movies and some we haven't even imagined yet are becoming reality! If they fall into the usable realm of SoftLayer, you can bet we'll be one of the first to share them with the world. But not before we get all the bugs (and Lectroids) out.

-Brad

October 11, 2011

Building a True Real-Time Multiplayer Gaming Platform

Some of the most innovative developments on the Internet are coming from online game developers looking to push the boundaries of realism and interactivity. Developing an online gaming platform that can support a wide range of applications, including private chat, avatar chats, turn-based multiplayer games, first-person shooters, and MMORPGs, is no small feat.

Our high-speed, global network significantly minimizes the reliability, access, latency, lag and bandwidth issues that commonly challenge online gaming. Once users begin to experience issues with latency or reliability, they're gone and likely never to return. Our cloud, dedicated, and managed hosting solutions enable game developers to rapidly test, deploy and manage rich interactive media on a secure platform.

Consider the success of one of our partners — Electrotank Inc. They’ve been able to support as many as 6,500 concurrent users on just ONE server in a realistic simulation of a first-person shooter game, and up to 330,000 concurrent users for a turn-based multiplayer game. Talk about server density.

This is just scratching the surface, because we're continuing to build our global footprint to reduce latency for users around the world. This means no awkward pauses or jumping around, but rather a smooth, seamless, interactive online gaming experience. The combined efforts of SoftLayer's infrastructure and Electrotank's performant software have produced a high-performance networking platform that delivers a highly scalable, low-latency user experience to both gamers and game developers.

Electrotank

You can read more about how Electrotank is leveraging SoftLayer’s unique network platform in today's press release or in the fantastic white paper they published with details about their load testing methodology and results.

We always like to hear our customers' opinions, so let us know what you think.

-@nday91

July 26, 2011

Globalization and Hosting: The World Wide Web is Flat

Christopher Columbus set sail from Palos, Spain, on August 3, 1492, with the goal of reaching the East Indies by traveling West. He fortuitously failed by stumbling across the New World and the discovery that the world was round – a globe. In The World is Flat, Thomas Friedman calls this discovery "Globalization 1.0," or an era of "countries globalizing." As transportation and technology grew and evolved in the nineteenth and twentieth centuries, "Globalization 2.0" brought an era of "companies globalizing," and around the year 2000, we moved into "Globalization 3.0":

The dynamic force in Globalization 3.0 – the force that gives it its unique character – is the newfound power for individuals to collaborate and compete globally. And the phenomenon that is enabling, empowering, and enjoining individuals and small groups to go global so easily and so seamlessly is what I call the flat-world platform.

Columbus discovered the world wasn't flat, we learned how to traverse that round world, and we keep making that world more and more accessible. He found out that the world was a lot bigger than everyone thought, and since his discovery, the smartest people on the planet have worked to make that huge world smaller and smaller.

The most traditional measure of globalization is how far "out" political, economical and technological changes extend. Look at the ARPANET network infrastructure in 1971 and a map of the Internet as it is today.

With every step Columbus took away from the Old World, he was one step closer to the New World. If you look at the growth of the Internet through that lens, you see that every additional node and connection added to the Internet brings connectivity closer to end-users who haven't had it before. Those users gain access to the rest of the Internet, and the rest of the Internet gains access to the information and innovation those users will provide.

Globalization in Hosting

As technology and high speed connectivity become more available to users around the world, the hosting industry has new markets to reach and serve. As Lance explained in a keynote session, "50% of the people in the world are not on the Internet today. They will be on the Internet in the next 5-10 years."

Understanding this global shift, SoftLayer can choose from a few different courses of action. Today, 40+% of our customers reside outside the United States, and we reach them via 2,000+ Gbps of network connectivity from transit and peering relationships with other networks around the world. That approach has been successful, but if the Internet is flattening the world, a USA-centric infrastructure may be limiting.

Before we go any further, let's take a step back and look at a map of the United States with a few important overlays:

US Latency

The three orange circles show the rough equivalents of the areas around our data centers in Seattle, Dallas and Washington, D.C., that have less than 40 milliseconds of latency directly to that facility. The blue circle on the left shows the same 40ms ring around our new San Jose facility (in blue to help avoid a little confusion). If a customer can access their host's data center directly with less than 40ms of latency, that customer will be pretty happy with their experience.

When you consider that each of the stars on the map represents a point of presence (PoP) on the SoftLayer private network, you can draw similar circles around those locations to represent the area within 40ms of the first on-ramp to our private network. While Winnipeg, Manitoba, isn't in one of our data center's 40ms rings, a user there would be covered by the Chicago PoP's coverage, and once the user is on the SoftLayer network, he or she has a direct, dedicated path to all of our data centers, and we're able to provide a stellar network experience.

If in the next 5-10 years, the half of the world that isn't on the Internet joins the Internet, we can't rely solely on our peering and transit providers to get those users to the SoftLayer network, so we will need to bring the SoftLayer network closer to them:

Global Network

This map gives you an idea of what the first steps of SoftLayer's international expansion will look like. As you've probably heard, we will have a data center location in Singapore and in Amsterdam by the end of the year, and those locations will be instrumental in helping us build our global network.

Each of the points of presence we add in Asia and Europe effectively wraps our 40ms ring around millions of users who may have previously relied on several hops across several providers to get to the SoftLayer network, and as a result, we're able to power a faster and more consistent network experience for those users. As SoftLayer grows, our goal is to maintain the quality of service our customers expect while we extend the availability of that service quality to users around the globe.

If you're not within 40ms of our network yet, don't worry ... We're globalizing, and we'll be in your neighborhood soon.

-@gkdog

August 28, 2008

The Speed of Light is Your Enemy

One of my favorite sites is highscalability.com. As someone with an engineering background, reading about the ways other people solve a variety of problems is really quite interesting.

A recent article talks about the impact of latency on web site viewers. It sounds like common sense that the slower a site is, the more viewers you lose, but what is amazing is that even a latency measured in milliseconds can cost a web site viewers.

The article focuses mainly on application specific solutions to latency, and briefly mentions how to deliver static content like images, videos, documents, etc. There are a couple ways to solve the static content delivery problem such as making your web server as efficient as you can. But that can only help so much. Physics - the speed of light - starts to be your enemy. If you are truly worried about shaving milliseconds off your content delivery time, you have to get your content closer to your viewers.

You can do this yourself by getting servers in datacenters in multiple sites in different geographic locations. This isn't the easiest solution for everyone but does have its advantages such as keeping you in absolute control of your content. The much easier option is to use a CDN (Content Delivery Network).

CDNs are getting more popular, and the price is dropping rapidly. Akamai isn't the only game in town anymore, and you don't have to pay dollars per GB of traffic or sign a contract with a large commit for a multi-year time frame. CDN traffic can be very competitive, costing only a few pennies more per GB compared with traffic from a shared or dedicated server. Plus, CDNs optimize their servers for delivering content quickly.

Just to throw some math into the discussion, let's see how long it would take light to travel from New York to San Francisco (4,125,910 meters / 299,792,458 meters per second = 13.8 milliseconds). That's 13.8 milliseconds one way; double it for the request to go there and the response to return, and we're up to 27.5 milliseconds. And that assumes a straight shot with no routers slowing things down. Now let's look at Melbourne to London: 16,891,360 meters / 299,792,458 meters per second = 56.3 milliseconds. Double that, throw in some router overhead, and you can see that the delays start to be noticeable.
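That back-of-the-envelope arithmetic is easy to reproduce; a small sketch using the straight-line distances above (this ignores routing and the fact that light travels roughly a third slower in fiber than in a vacuum, so real-world numbers will be worse):

```python
C = 299_792_458.0  # speed of light in a vacuum, meters per second

def round_trip_ms(distance_m):
    """Idealized round-trip propagation delay: straight line at c, no routers."""
    return 2.0 * distance_m / C * 1000.0

ny_sf = round_trip_ms(4_125_910)       # New York <-> San Francisco, ~27.5 ms
mel_lon = round_trip_ms(16_891_360)    # Melbourne <-> London, ~112.7 ms
```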

The moral of the story is that for most everybody, distributing static content geographically using a CDN is the right thing to do. That problem has been solved. The harder problem is how to get your application running as efficiently as possible. I'll leave that topic for another time.

-@nday91
