Posts Tagged 'Latency'

August 12, 2015

Network Performance 101: What is latency, and why does it matter?

We've all been there. Waiting for a web page to load can be so frustrating that we end up just closing the tab. You might ask yourself, "Hey, I have high-speed Internet. Why is this happening to me?" Well, there are a lot of factors outside your control that … control page loads. Whether you have an online store, run big data solutions, or have employees around the world accessing files over your network, you never want to hear that slow data transfer is costing you a sale or dragging down employee productivity.

So why are some pages so much slower to load than others?
It could be that poorly written code or large images are slowing the load on the backend, but slow page loads can also be caused by network latency. This might sound elementary, but data is not just floating out there in some non-physical Internet space. In reality, data is stored on hard drives … somewhere. Network connectivity provides a path for that data to travel to end users around the world, and that connectivity can vary significantly—depending on how far it’s going, how many times the data has to hop between service providers, how much bandwidth is available along the way, the other data traveling across the same path, and a number of other variables.

That variability is captured in a measurement called network latency: the amount of time it takes a packet of data to travel from one point on a network to another.

Understanding Network Latency
Theoretically, data can travel at the speed of light across optical fiber network cables, but in practice, data typically travels slower than that due to the variables we referenced in the previous section. If a network connection doesn't have any available bandwidth capacity, data might temporarily queue up and wait its turn to travel across the line. If a service provider doesn't route a network path optimally, data could be sent hundreds or thousands of miles out of its way en route to its destination. These kinds of delays and detours add up to higher network latency, which leads to slower page loads and download speeds.

We express network latency in milliseconds (that’s 1,000 milliseconds per second), and while a few thousandths of a second may not mean much to us as we’re living our daily lives, those milliseconds are often the deciding factors for whether we stay on a webpage or give up and try another site. As consumers of high-speed Internet, we like what we like, and we want what we want when we want it. In the financial sector, milliseconds can mean billions of dollars in gains or losses from trade transactions on a day-to-day basis.

Logical conclusion: Everyone wants the lowest network latency to the greatest number of users.

Common Approaches to Minimize Network Latency
Since our shared goal is to minimize latency, the most common approaches to addressing it involve limiting the number of variables that can slow down data's movement. While we don't have complete control over how our data travels across the Internet, we can do a few things to keep our network latency in line (the sketch after this list shows a quick way to sample latency yourself):

  • Distribute data around the world: Users in different locations can pull data from a location that’s geographically close to them. Because the data is closer to the users, it is handed off fewer times, it has a shorter distance to travel, and inefficient routing is less likely to cause a significant performance impact.
  • Provision servers with high-capacity network ports: Huge volumes of data can travel to and from the server every second. If packets are delayed due to fully saturated ports, milliseconds of time pass, pages load slower, download speeds drop, and users get unhappy.
  • Understand how your providers route traffic: When you know how your data is transferred to users around the world, you can make better decisions about where you host your data.
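
If you want a rough feel for the latency between your machine and a few locations, here's a minimal Python sketch that approximates round-trip latency by timing TCP handshakes. The hostnames are placeholders (swap in servers you actually care about), and handshake time is only a stand-in for a proper ping or speed test:

import socket
import time

HOSTS = ["example-dallas-host.com", "example-singapore-host.com"]  # placeholders

def tcp_connect_ms(host, port=80, samples=5):
    """Approximate round-trip latency by timing TCP handshakes."""
    timings = []
    for _ in range(samples):
        start = time.time()
        conn = socket.create_connection((host, port), timeout=5)
        timings.append((time.time() - start) * 1000.0)
        conn.close()
    return sum(timings) / len(timings)

for host in HOSTS:
    print("%s: ~%.1f ms" % (host, tcp_connect_ms(host)))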

How SoftLayer Minimizes Network Latency
To minimize latency, we took a unique approach to building our network. All of our data centers are connected to network points of presence. All of our network points of presence are connected to each other via our global backbone network. And by maintaining our own global backbone network, our network operations team is able to control network paths and data handoffs much more granularly than if we relied on other providers to move data between geographies.

SoftLayer Private Network

For example, if a user in Berlin wants to watch a cat video hosted on a SoftLayer server in Dallas, the packets of data that make up that cat video will travel across our backbone network (which is exclusively used by SoftLayer traffic) to Frankfurt, where the packets would be handed off to one of our peering or transit public network partners to get to the user in Berlin.

Without a global backbone network, the packets would be handed off to a peering or transit public network provider in Dallas, and that provider would route the packets across its network and/or hand the packets off to another provider at a network hop, and the packets would bounce their way to Germany. It’s entirely possible that the packets could get from Dallas to Berlin with the same network latency with or without the global backbone network, but without the global backbone network, there are a lot more variables.

In addition to building a global backbone network, we also segment public, private, and management traffic onto different network ports so that different types of traffic can be transferred without interfering with each other.

SoftLayer Private Network

But at the end of the day, all of that network planning and forethought doesn’t amount to a hill of beans if you can’t see the results for yourself. That’s why we put speed tests on our website so you can check out our network yourself (for more on speed tests, check out this blog post).

TL;DR: Network Latency
Your users want your data as quickly as you can get it to them. The time it takes for your data to get to them across the Internet is called network latency. The more control you (or your provider) have over your data’s network path, the more consistent (and lower) your network latency will be.

Stay tuned. Next month we'll continue with Network Performance 101: Security, where we'll cover all things cloud security and answer your burning questions: Can other people see or access my data in a public cloud? Is my data more prone to hackers? And what safeguards does SoftLayer have in place to protect data?

-JRL

March 30, 2015

The Importance of Data's Physical Location in the Cloud

If top-tier cloud providers use similar network hardware in their data centers and connect to the same transit and peering bandwidth providers, how can SoftLayer claim to provide the best network performance in the cloud computing industry?

Over the years, I've heard variations of that question asked dozens of times, and it's fairly easy to answer with impressive facts and figures. All SoftLayer data centers and network points of presence (PoPs) are connected to our unique global network backbone, which carries public, private, and management traffic to and from servers. Using our network connectivity table, some back-of-the-envelope calculations reveal that we have more than 2,500Gbps of bandwidth connectivity with some of the largest transit and peering bandwidth providers in the world (and that total doesn't even include the private peering relationships we have with other providers in various regional markets). Additionally, customers may order servers with up to 10Gbps network ports in our data centers.

For the most part, those stats explain our differentiation, but part of the bigger network performance story is still missing, and to a certain extent it has been untold—until today.

The 2,500+Gbps of bandwidth connectivity we break out in the network connectivity table only accounts for the on-ramps and off-ramps of our network. Our global network backbone is actually made up of an additional 2,600+Gbps of bandwidth connectivity ... and all of that backbone connectivity transports SoftLayer-related traffic.

This robust network architecture streamlines the access to and delivery of data on SoftLayer servers. When you access a SoftLayer server, the network is designed to bring you onto our global backbone as quickly as possible at one of our network PoPs, and when you're on our global backbone, you'll experience fewer hops (and a more direct route that we control). When one of your users requests data from your SoftLayer server, that data travels across the global backbone to the nearest network PoP, where it is handed off to another provider to carry the data the "last mile."

With this controlled environment, I decided to undertake an impromptu science experiment to demonstrate how location and physical distance affect network performance in the cloud.

Speed Testing on the SoftLayer Global Network Backbone

I work in the SoftLayer office in downtown Houston, Texas. In network-speak, this location is HOU04. You won't find that location on any data center or network tables because it's just an office, but it's connected to the same global backbone as our data centers and network points of presence. From my office, the "last mile" doesn't exist; when I access a SoftLayer server, my bits and bytes only travel across the SoftLayer network, so we're effectively cutting out a number of uncontrollable variables in the process of running network speed tests.

For better or worse, I didn't tell any network engineers that I planned to run speed tests to every available data center and share the results I found, so you're seeing exactly what I saw with no tomfoolery. I just fired up my browser, headed to our Data Centers page, and made my way down the list using the SpeedTest option for each facility. Customers often go through this process when trying to determine the latency, speeds, and network path that they can expect from servers in each data center, but if we look at the results collectively, we can learn a lot more about network performance in general.

With the results, we'll discuss how network speed tests work, what the results mean, and why some might be surprising. If you're feeling scientific and want to run the tests yourself, you're more than welcome to do so.

The Ookla SpeedTests we link to from the data centers table measured the latency (ping time), jitter (variation in latency), download speeds, and upload speeds between the user's computer and the data center's test server. To run this experiment, I connected my MacBook Pro via Ethernet to a 100Mbps wired connection. At the end of each speed test, I took a screenshot of the performance stats:

SoftLayer Network Speed Test

To save you the trouble of trying to read all of the stats on each data center as they cycle through that animated GIF, I also put them into a table (click the data center name to see its results screenshot in a new window):

Data Center Latency (ms) Download Speed (Mbps) Upload Speed (Mbps) Jitter (ms)
AMS01 121 77.69 82.18 1
DAL01 9 93.16 87.43 0
DAL05 7 93.16 83.77 0
DAL06 7 93.11 83.50 0
DAL07 8 93.08 83.60 0
DAL09 11 93.05 82.54 0
FRA02 128 78.11 85.08 0
HKG02 184 50.75 78.93 2
HOU02 2 93.12 83.45 1
LON02 114 77.41 83.74 2
MEL01 186 63.40 78.73 1
MEX01 27 92.32 83.29 1
MON01 52 89.65 85.94 3
PAR01 127 82.40 83.38 0
SJC01 44 90.43 83.60 1
SEA01 50 90.33 83.23 2
SNG01 195 40.35 72.35 1
SYD01 196 61.04 75.82 4
TOK02 135 75.63 82.20 2
TOR01 40 90.37 82.90 1
WDC01 43 89.68 84.35 0
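
A quick note on the jitter column before we move on: jitter is simply the variation between successive latency measurements. As a rough illustration in Python (using one common definition, the average difference between consecutive ping samples; the sample values here are made up):

samples = [9.8, 10.1, 9.9, 12.3, 10.0]  # made-up ping times in milliseconds

mean_latency = sum(samples) / len(samples)
# One common jitter definition: mean difference between consecutive samples.
jitter = sum(abs(a - b) for a, b in zip(samples, samples[1:])) / (len(samples) - 1)
print(mean_latency, jitter)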

By performing these speed tests on the SoftLayer network, we can actually learn a lot about how speed tests work and how physical location affects network performance. But before we get into that, let's take note of a few interesting results from the table above:

  • The lowest latency from my office is to the HOU02 (Houston, Texas) data center. That data center is about 14.2 miles away as the crow flies.
  • The highest latency results from my office are to the SYD01 (Sydney, Australia) and SNG01 (Singapore) data centers. Those data centers are at least 8,600 and 10,000 miles away, respectively.
  • The fastest download speed observed is 93.16Mbps, and that number was seen from two data centers: DAL01 and DAL05.
  • The slowest download speed observed is 40.35Mbps from SNG01.
  • The fastest upload speed observed is 87.43Mbps to DAL01.
  • The slowest upload speed observed is 72.35Mbps to SNG01.
  • The upload speeds observed are faster than the download speeds from every data center outside of North America.

Are you surprised that we didn't see any results closer to 100Mbps? Is our server in Singapore underperforming? Are servers outside of North America quicker to accept data than they are to give it back?

Those are great questions, and they actually jumpstart an explanation of how the network tests work and what they're telling us.

Maximum Download Speed on 100Mbps Connection

If my office is 2 milliseconds from the test server in HOU02, why is my download speed only 93.12Mbps? To answer this question, we need to understand that to perform these tests, a connection is made using Transmission Control Protocol (TCP) to move the data, and TCP does a lot of work in the background. The download is broken into a number of tiny chunks called packets and sent from the sender to the receiver. TCP wants to ensure that each packet that is sent is received, so the receiver sends an acknowledgement back to the sender to confirm that the packet arrived. If the sender is unable to verify that a given packet was successfully delivered to the receiver, the sender will resend the packet.

This system is pretty simple, but in actuality, it's very dynamic. TCP wants to be as efficient as possible: to deliver the entire message with the fewest possible round trips. To accomplish this, TCP limits how much unacknowledged data can be "in flight" at any moment. The receiver dictates that limit by advertising a receive window, and it continually analyzes and adjusts that window to keep as much data in flight as possible without the connection becoming unstable. Some operating systems are better than others when it comes to tweaking and optimizing TCP transfer rates, but the work TCP does to ensure that packets are sent and received without error takes overhead, and that overhead limits the maximum speed we can achieve.

Understanding the SNG01 Results

Why did my SNG01 speed test max out at a meager 40.35Mbps on my 100Mbps connection? Well, now that we understand how TCP works behind the scenes, we can see why our download speeds from Singapore are lower than we'd expect. The latency between sending a packet and receiving its acknowledgement factors into TCP's view of a stable connection. With higher ping times, the sender spends more time waiting on acknowledgements before it can push more data into the window, and TCP keeps the amount of data in flight conservative to avoid losing a large chunk that would have to be reproduced and resent.

With our global backbone optimizing the network path of the packets between Houston and Singapore, the more than 10,000-mile journey, the nature of TCP, and my computer's TCP receive window adjustments all factor into the download speeds recorded from SNG01. Looking at the results in the context of the distance the data has to travel, our results are actually well within the expected performance.
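
We can even put rough numbers on that ceiling: TCP can't have more unacknowledged data in flight than the receive window allows, so throughput is capped at roughly the window size divided by the round-trip time. A back-of-the-envelope sketch in Python (the window sizes are illustrative; I didn't inspect my laptop's actual window):

def max_tcp_throughput_mbps(window_bytes, rtt_ms):
    """TCP can deliver at most one receive window per round trip."""
    return (window_bytes * 8) / (rtt_ms / 1000.0) / 1000000.0

for rtt_ms in (2, 44, 195):  # HOU02, SJC01, and SNG01 ping times from the table
    print(rtt_ms, max_tcp_throughput_mbps(65535, rtt_ms), max_tcp_throughput_mbps(1048576, rtt_ms))

With a classic 64KB window, a 195ms round trip caps throughput below 3Mbps; with a scaled 1MB window, the ceiling is about 43Mbps, which is consistent with the roughly 40Mbps we observed from SNG01.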

Because the default behavior of TCP is partially to blame for the results, we could actually tweak the test and tune our configurations to deliver faster speeds. To confirm that improvements can be made relatively easily, we can actually just look at the answer to our third question...

Upload > Download?

Why are the upload speeds faster than the download speeds after latency jumps from 50ms to 114ms? Every location in North America is within 2,000 miles of Houston, while the closest location outside of North America is about 5,000 miles away. With what we've learned about how TCP and physical distance play into download speeds, that jump in distance explains why the download speeds drop from 90.33Mbps to 77.41Mbps as soon as we cross an ocean, but how can the upload speeds to Europe (and even APAC) stay on par with their North American counterparts? The only difference between our download path and upload path is which side is sending and which side is receiving. And if the receiver determines the size of the TCP receive window, the most likely culprit in the discrepancy between download and upload speeds is TCP windowing.

A Linux server is built and optimized to be a server, whereas my Mac OS X laptop has a lot of other responsibilities, so it shouldn't come as a surprise that the default TCP receive window handling is better on the server side. With changes to the way my laptop handles TCP, download speeds would likely improve significantly. And if we wanted to push the envelope even further, we might consider using a different transfer protocol to take advantage of the consistent, controlled network environment.
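
As a concrete illustration of receiver-side tuning, most operating systems let an application request a larger receive buffer on a per-socket basis (the kernel may round or cap the value, and system-wide limits may also need to be raised). A minimal Python sketch:

import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Request a 4MB receive buffer; the OS may adjust or cap this value.
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 4 * 1024 * 1024)
print(sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))  # what the OS actually granted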

The Importance of Physical Location in Cloud Computing

These real-world test results under controlled conditions demonstrate how significantly data's geographic proximity to its user affects the user's perceived network performance. We know that the network latency in a 14-mile trip will be lower than the latency in a 10,000-mile trip, but we often don't think about the ripple effect latency has on other network performance indicators. And this experiment actually controls a lot of the other variables that can exacerbate the performance impact of geographic distance. The tests were run on a 100Mbps connection because that's a pretty common maximum port speed, but if we ran the same tests on a GigE line, the differences would be even more dramatic. Proof: HOU02 @ 1Gbps v. SNG01 @ 1Gbps

Let's apply our experiment to a real-world example: Half of our site's user base is in Paris and the other half is in Singapore. If we chose to host our cloud infrastructure exclusively from Paris, our users would see dramatically different results. Users in Paris would have sub-10ms latency while users in Singapore have about 300ms of latency. Obviously, operating cloud servers in both markets would be the best way to ensure peak performance in both locations, but what if you can only afford to provision your cloud infrastructure in one location? Where would you choose to provision that infrastructure to provide a consistent user experience for your audience in both markets?

Given what we've learned, we should probably choose a location with roughly the same latency to both markets. We can use the SoftLayer Looking Glass to see that San Jose, California (SJC01) would be a logical midpoint ... At this moment, the latency between SJC and PAR on the SoftLayer backbone is 149ms, and the latency between SJC and SNG is 162ms, so both markets would experience very similar performance (all else being equal). Our users in the two markets won't experience mind-blowing speeds, but they won't experience mind-numbing speeds either.
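
That midpoint logic is easy to script if you have latency estimates from each candidate site to each market. A tiny Python sketch that picks the site with the lowest worst-case latency (the SJC numbers are the Looking Glass readings above; the others are invented for illustration):

candidates = {
    "SJC01": {"Paris": 149, "Singapore": 162},  # from the Looking Glass above
    "AMS01": {"Paris": 12, "Singapore": 300},   # hypothetical figures
    "WDC01": {"Paris": 85, "Singapore": 230},   # hypothetical figures
}

best = min(candidates, key=lambda site: max(candidates[site].values()))
print(best)  # the site whose worst-case market latency is lowest: SJC01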

The network performance implications of physical distance apply to all cloud providers, but because of the SoftLayer global network backbone, we're able to control many of the variables that lead to higher (or inconsistent) latency to and from a given data center. The longer a single provider controls traffic's route, the more efficiently that traffic will move. You might see the same latency to another provider's cloud infrastructure from a given location at a given time across the public Internet, but you certainly won't see the same consistency from all locations at all times. SoftLayer has spent millions of dollars to build, maintain, and grow our global network backbone to transport public and private network traffic, and as a result, we feel pretty good about claiming to provide the best network performance in cloud computing.

-@khazard

July 16, 2013

Riak Performance Analysis: Bare Metal v. Virtual

In December, I posted a MongoDB performance analysis that showed the quantitative benefits of using bare metal servers for MongoDB workloads. It should come as no surprise that in the wake of SoftLayer's Riak launch, we've got some similar data to share about running Riak on bare metal.

To run this test, we started by creating five-node clusters with Riak 1.3.1 on SoftLayer bare metal servers and on a popular competitor's public cloud instances. For the SoftLayer environment, we created these clusters using the Riak Solution Designer, so the nodes were all provisioned, configured, and clustered for us automatically when we ordered them. For the public cloud virtual instance Riak cluster, each node was provisioned individually using a Riak image template and manually configured into a cluster after all of the nodes had come online. To optimize for Riak performance, I made a few tweaks at the OS level of our servers (running 64-bit CentOS):

noatime
nodiratime
barrier=0
data=writeback
ulimit -n 65536

The common noatime and nodiratime settings eliminate the need for writes during reads, which helps both performance and disk wear. The barrier and writeback settings are a little less common and may not be what you'd normally set. Although those settings present a very slight risk of data loss on disk failure, remember that the Riak solution is deployed in five-node rings with data redundantly available across multiple nodes in the ring. With each node also deployed on a RAID10 storage array, the minor risk of data loss from the failure of a single disk would have no impact on the overall data set (there are plenty of redundant copies of that data available). Given the minor risk involved, the performance increases from those two settings justify their use.
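
For reference, here's roughly how those settings would be applied on each node. The device name and mount point below are illustrative, and the file descriptor limit can be made permanent via the equivalent nofile entry in /etc/security/limits.conf rather than set per shell:

/dev/md2  /var/lib/riak  ext4  noatime,nodiratime,barrier=0,data=writeback  0 0   (an /etc/fstab entry)

ulimit -n 65536   (run in the Riak user's shell, or set the equivalent limit in /etc/security/limits.conf)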

With all of the nodes tweaked and configured into clusters, we set up Basho's test harness — Basho Bench — to remotely simulate load on the deployments. Basho Bench allows you to create a configurable test plan for a Riak cluster by configuring a number of workers to utilize a driver type to generate load. It comes packaged as an Erlang application with a config file example that you can alter to create the specifics for the concurrency, data set size, and duration of your tests. The results can be viewed as CSV data, and there is an optional graphics package that allows you to generate the graphs that I am posting in this blog. A simplified graphic of our test environment would look like this:

Riak Test Environment

The following Basho Bench config is what we used for our testing:

{mode, max}.                                        % generate as much load as possible
{duration, 120}.                                    % test duration in minutes (2 hours)
{concurrent, 8}.                                    % 8 concurrent client workers
{driver, basho_bench_driver_riakc_pb}.              % Riak protocol buffers client driver
{key_generator,{int_to_bin,{uniform_int,1000000}}}. % keys drawn uniformly from a 1M keyspace
{value_generator,{exponential_bin,4098,50000}}.     % value sizes: ~4KB floor, exponentially distributed
{riakc_pb_ips, [{10,60,68,9},{10,40,117,89},{10,80,64,4},{10,80,64,8},{10,60,68,7}]}. % the five nodes' private IPs
{riakc_pb_replies, 2}.                              % wait for two replicas to respond
{operations, [{get, 10},{put, 1}]}.                 % 10:1 get-to-put mix

To spell it out a little more simply:

Tests Performed

Data Set: 400GB
10:1 Query-to-Update Operations
8 Concurrent Client Connections
Test Duration: 2 Hours

You may notice that in the test cases that use SoftLayer "Medium" Servers, the virtual provider nodes are running 26 virtual compute units against our dual-proc, hex-core servers (12 cores total). In testing with Riak, memory is more important to the operations than CPU resources, so we provisioned the virtual instances to align with the 36GB of memory in each of the "Medium" SoftLayer servers. In the public cloud environment, the higher level of RAM was restricted to packages with higher CPU counts, so while the CPU counts differ, the RAM amounts are as close to even as we could make them.

One final "housekeeping" note before we dive into the results: The graphs below are pulled directly from the optional graphics package that displays Basho Bench results. You'll notice that the scale on the left-hand side of the graphs differs dramatically between the two environments, so a cursory look at the results might not tell the whole story. Click any of the graphs below for a larger version. At the end of each test case, we'll share a few observations about the operations per second and latency results from each test. When we talk about latency in the "Key Observations" sections, we'll talk about the 99th percentile line — 99% of the results had latency below this line. More simply, you could say, "This is the highest latency we saw on this platform in this test." The primary reason we're focusing on this line is that it's much easier to read on the graphs than the mean/median lines in the bottom graphs.

Riak Test 1: "Small" Bare Metal 5-Node Cluster vs Virtual 5-Node Cluster

Servers

SoftLayer Small Riak Server Node
Single 4-core Intel 1270 CPU
64-bit CentOS
8GB RAM
4 x 500GB SATAII – RAID10
1Gb Bonded Network
Virtual Provider Node
4 Virtual Compute Units
64-bit CentOS
7.5GB RAM
4 x 500GB Network Storage – RAID10
1Gb Network
 

Results

Riak Performance Analysis

Riak Performance Analysis

Key Observations

The SoftLayer environment showed much more consistency in operations per second, with an average throughput of around 450 Op/sec. The virtual environment's throughput varied significantly, between about 50 and more than 600 operations per second, with the trend line fluctuating between about 220 Op/sec and 350 Op/sec.

Comparing the latency of get and put requests, the 99th percentile of results in the SoftLayer environment stayed around 50ms for gets and under 200ms for puts, while the same metrics for the virtual environment hovered around 800ms for gets and 4,000ms for puts. The scale of the graphs is drastically different, so if you aren't looking closely, you might miss how significantly the performance differs between the two.

Riak Test 2: "Medium" Bare Metal 5-Node Cluster vs Virtual 5-Node Cluster

Servers

SoftLayer Medium Riak Server Node
Dual 6-core Intel 5670 CPUs
64-bit CentOS
36GB RAM
4 x 300GB 15K SAS – RAID10
1Gb Network – Bonded
Virtual Provider Node
26 Virtual Compute Units
64-bit CentOS
30GB RAM
4 x 300GB Network Storage
1Gb Network
 

Results

Riak Performance Analysis

Riak Performance Analysis

Key Observations

Similar to the results of Test 1, the throughput numbers from the bare metal environment are more consistent (and are consistently higher) than the throughput results from the virtual instance environment. The SoftLayer environment performed between 1500 and 1750 operations per second on average while the virtual provider environment averaged around 1200 operations per second throughout the test.

The latency of get and put requests in Test 2 also paints a similar picture to Test 1. The 99th percentile of results in the SoftLayer environment stayed below 50ms for gets and under 400ms for puts, while the same metrics for the virtual environment averaged about 250ms for gets and over 1,000ms for puts. Latency in a big data application can be a killer, so the results from the virtual provider might be setting off alarm bells in your head.

Riak Test 3: "Medium" Bare Metal 5-Node Cluster vs Virtual 5-Node Cluster

Servers

SoftLayer Medium Riak Server Node
Dual 6-core Intel 5670 CPUs
64-bit CentOS
36GB RAM
4 x 128GB SSD – RAID10
1Gb Network – Bonded
Virtual Provider Node
26 Virtual Compute Units
64-bit CentOS
30GB RAM
4 x 300GB Network Storage
1Gb Network
 

Results

Riak Performance Analysis

Riak Performance Analysis

Key Observations

In Test 3, we used the same specs for the virtual provider nodes, so the results for the virtual environment are the same as they were in Test 2. In this test, the SoftLayer environment substitutes SSDs for the 15K SAS drives used in Test 2, and the throughput numbers show the impact of that improved I/O. The average throughput of the bare metal environment with SSDs is between 1750 and 2000 operations per second. Those numbers are slightly higher than the SoftLayer environment's in Test 2, further distancing the bare metal results from the virtual provider results.

The latency of gets for the SoftLayer environment is very difficult to see in this graph because the latency was so low throughout the test. The 99th percentile of puts in the SoftLayer environment settled between 500ms and 625ms, which was a little higher than the bare metal results from Test 2 but still well below the latency from the virtual environment.

Summary

The results show that — similar to the majority of data-centric applications that we have tested — Riak has more consistent, better performing, and lower latency results when deployed onto bare metal instead of a cluster of public cloud instances. The stark differences in consistency of the results and the latency are noteworthy for developers looking to host their big data applications. We compared the 99th percentile of latency, but the mean/median results are worth checking out as well. Look at the mean and median results from the SoftLayer SSD Node environment: For gets, the mean latency was 2.5ms and the median was somewhere around 1ms. For puts, the mean was between 7.5ms and 11ms and the median was around 5ms. Those kinds of results are almost unbelievable (and that's why I've shared everything involved in completing this test so that you can try it yourself and see that there's no funny business going on).

It's commonly understood that local, single-tenant resources like bare metal will always perform better than shared network storage resources, but putting some concrete numbers on paper shows just how dramatic the difference in performance can be. Virtualizing on multi-tenant solutions with network-attached storage often introduces latency issues, and performance will vary significantly depending on host load. These results may seem obvious, but sometimes the promise of quick and easy deployments on public cloud environments can lure even the sanest and most rational developer. Some applications are well suited for public cloud, but when you have data-centric apps that require extreme I/O traffic to your storage medium, nothing beats local high-performance resources.

-Harold

July 25, 2012

ServerDensity: Tech Partner Spotlight

We invite each of our featured SoftLayer Tech Marketplace Partners to contribute a guest post to the SoftLayer Blog, and this week, we're happy to welcome David Mytton, founder of Server Density, a hosted server and website monitoring service that alerts you when your website is slow, down, or back up.

5 Ways to Minimize Downtime During Summer Vacation

It's a fact of life that everything runs smoothly until you're out of contact, away from the Internet or on holiday. However, you can't be available 24/7 on the chance that something breaks; instead, there are several things you can do to ensure that when things go wrong, the problem can be managed and resolved quickly. To help you set up your own "get back up" plan, we've come up with a checklist of the top five things you can do to prepare for an ill-timed issue.

1. Monitoring

How will you know when things break? Using a tool like Server Density — which combines availability monitoring from locations around the world with internal server metrics like disk usage, Apache and MySQL — means that you can be alerted if your site goes down, and have the data to find out why.

Surprisingly, the most common problems we see are some that are the easiest to fix. One problem that happens all too often is when a customer simply runs out of disk space in a volume! If you've ever had it happen to you, you know that running out of space will break things in strange ways — whether it prevents the database from accepting writes or fails to store web sessions on disk. By doing something as simple as setting an alert to monitor used disk space for all important volumes (not just root) at around 75%, you'll have proactive visibility into your server to avoid hitting volume capacity.
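
A dedicated monitoring agent handles this kind of check for you, but even a simple script run from cron provides a basic safety net. A minimal Python sketch (the volume list and threshold are examples to adapt):

import shutil

VOLUMES = ["/", "/var", "/home"]  # the volumes that matter to your application
THRESHOLD = 0.75                  # alert at 75% used, per the suggestion above

for path in VOLUMES:
    usage = shutil.disk_usage(path)
    used_fraction = usage.used / float(usage.total)
    if used_fraction >= THRESHOLD:
        # In a real setup, this would page someone or call an alerting API.
        print("ALERT: %s is %.0f%% full" % (path, used_fraction * 100))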

Additionally, you should define triggers for unusual values that should raise a red flag for you. For example, if your Apache requests per second suddenly drop significantly, that change could indicate a problem somewhere else in your infrastructure, and if you're not monitoring those indirect indicators, you may not learn about the underlying problems as quickly as you'd like. Identify direct and indirect relationships that can give you this kind of early warning, then measure them and alert yourself when something changes (a toy version of such a trigger is sketched below).
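
Here's that toy indirect trigger in Python: it keeps a rolling window of request-rate samples and flags a sudden drop below half the recent average (the window size and drop factor are arbitrary starting points, not recommendations):

import collections

WINDOW = 60         # number of recent per-second samples to average
DROP_FACTOR = 0.5   # alert when the current rate falls below half the average

recent = collections.deque(maxlen=WINDOW)

def check(requests_per_sec):
    """Feed one sample per interval; returns True when a sudden drop is seen."""
    alert = False
    if len(recent) == WINDOW:
        average = sum(recent) / float(len(recent))
        alert = requests_per_sec < average * DROP_FACTOR
    recent.append(requests_per_sec)
    return alert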

2. Dealing with Alerts

It's no good having alerts sent to someone who isn't responding (or who can't respond at a given time). Using a service like PagerDuty allows you to define on-call rotations for different types of alerts. Nobody wants to be on call every hour of every day, so differentiating and channeling alerts in an automated way can save you a lot of hassle. Another huge benefit of a platform like PagerDuty is that it also handles escalations: If the first contact in the path doesn't wake up or is out of service, someone else gets notified quickly.

3. Tracking Incidents

Whether you're the only person responsible or you have a team of engineers, you'll want to track the status of alerts/issues, particularly if they require escalation to different vendors. If an incident lasts a long time, you'll want to be able to hand it off to another person in your organization with all of the information they need. By tracking incidents with detailed notes, you can avoid fatigue and prevent unnecessary repetition of troubleshooting steps.

We use JIRA for this because it allows you to define workflows that an issue can progress along as you work on it. It also includes easy access to custom fields (e.g., specifying a vendor ticket ID), and issues can be assigned to different people.

4. Understanding What Happened

After you have received an alert, acknowledged it and started tracking the incident, it's time to start investigating. Often, this involves looking at logs, and if you only have one or two servers, it's relatively easy, but as soon as you add more, the process can get exponentially more difficult.

We recommend piping all of your logs into a log search tool like (fellow Tech Partners Marketplace participant) Papertrail or Loggly. Those platforms give you access to all of your logs from a single interface, with the ability to watch incoming lines in real time or to search back to when the incident began (since you've clearly monitored and tracked all of that information in the first three steps).

5. Getting Access to Your Servers

If you're traveling internationally, access to the Internet via a free hotspot like the ones you find in Starbucks isn't always possible. It's always a great idea to order a portable 3G hotspot in advance of a trip. You can usually pick one up from the airport to get basic Internet access without paying ridiculous roaming charges. Once you have your connection, the next step is to make sure you can access your servers.

Both iPhone and Android have SSH and remote desktop apps available that allow you to quickly log into your servers to fix easy problems. Having those tools often saves a lot of time if you don't have access to your laptop, but they also introduce a security concern: If you open server logins to the world so you can log in from the dynamic IPs that change when you use mobile connectivity, then it's worth considering a multi-factor authentication layer. We use Duo Security for several reasons, with one major differentiator being the modules they have available for all major server operating systems to lock down our logins even further.

You're never going to escape the reality of system administration: If your server has a problem, you need to fix it. What you can get away from is the uncertainty of not having a clearly defined process for responding to issues when they arise.

-David Mytton, ServerDensity

This guest blog series highlights companies in SoftLayer's Technology Partners Marketplace.
These Partners have built their businesses on the SoftLayer Platform, and we're excited for them to tell their stories. New Partners will be added to the Marketplace each month, so stay tuned for many more to come.

October 15, 2011

Lower Latency: Neutrino Network?

SoftLayer is on the "bleeding edge" of technology, and that's right where I'm comfortable. I love being a part of something new and relevant. I also love science fiction and find that it's mixing with reality more and more these days. Yay for me and my nerdiness! Beam me up, Luke Skywalker! (I wonder how many nerds cringed at that statement!)

In a recent post from New Scientist, a test showed neutrino particles being clocked faster than the speed of light, and a dimension-hop might be the reason. Rather than go into the nerdy parts of the article that I'm sure you read before continuing to this sentence, I want to compare how SoftLayer would use this to our (and more importantly our customers') advantage: A neutrino network! We could have the fastest network in the world, and we could use the technology for faster motherboards and components too. Because that's how we roll.

Enter science fiction. Let's say neutrinos were indeed using another dimension to travel. Like, say, the 8th dimension as referred to in "The Adventures of Buckaroo Banzai Across the 8th Dimension." This dimension also happens to be a prison used by the Lectroids of Planet 10 to store criminals. Go figure, right? Obstacles always come up, so if our neutrino network were targeted by those Lectroids, Dody Lira and the abuse team would have no problem taking them down ... After all, Lectroids fiddling with data can be bad for business (not to mention the possibility of Lectroids using our network to come back to this dimension, wreak havoc, and eat all our junk food). Dody would have to upgrade some of the tools his team uses, like a Jet Car with an "Oscillation Overthruster" (which looks eerily similar to the Flux Capacitor) to travel in and out of the 8th dimension to hunt down those pesky Lectroids that won't comply.

Then, after Dody and crew wrangle the Lectroids (as I'm sure they would), we could offer the Lectroids email and Internet service. Bam! More customers on top of a supernatural network!

Coming back to reality (a bit), we have an interesting world ahead of us. Technologies we have only seen in movies and some we haven't even imagined yet are becoming reality! If they fall into the usable realm of SoftLayer, you can bet we'll be one of the first to share them with the world. But not before we get all the bugs (and Lectroids) out.

-Brad

October 11, 2011

Building a True Real-Time Multiplayer Gaming Platform

Some of the most innovative developments on the Internet are coming from online game developers looking to push the boundaries of realism and interactivity. Developing an online gaming platform that can support a wide range of applications, including private chat, avatar chats, turn-based multiplayer games, first-person shooters, and MMORPGs, is no small feat.

Our high-speed global network significantly minimizes the reliability, access, latency, lag, and bandwidth issues that commonly challenge online gaming. Once users begin to experience latency or reliability issues, they're gone and likely never to return. Our cloud, dedicated, and managed hosting solutions enable game developers to rapidly test, deploy, and manage rich interactive media on a secure platform.

Consider the success of one of our partners — Electrotank Inc. They’ve been able to support as many as 6,500 concurrent users on just ONE server in a realistic simulation of a first-person shooter game, and up to 330,000 concurrent users for a turn-based multiplayer game. Talk about server density.

And this is just scratching the surface, because we're continuing to build our global footprint to reduce latency for users around the world. That means no awkward pauses or jumping around, but rather a smooth, seamless, interactive online gaming experience. The combined efforts of SoftLayer's infrastructure and Electrotank's performant software have produced a high-performance networking platform that delivers a highly scalable, low-latency user experience to both gamers and game developers.

Electrotank

You can read more about how Electrotank is leveraging SoftLayer’s unique network platform in today's press release or in the fantastic white paper they published with details about their load testing methodology and results.

We always like to hear our customers' opinions, so let us know what you think.

-@nday91

July 26, 2011

Globalization and Hosting: The World Wide Web is Flat

Christopher Columbus set sail from Palos, Spain, on August 3, 1492, with the goal of reaching the East Indies by traveling West. He fortuitously failed by stumbling across the New World and the discovery that the world was round – a globe. In The World is Flat, Thomas Friedman calls this discovery "Globalization 1.0," or an era of "countries globalizing." As transportation and technology grew and evolved in the nineteenth and twentieth centuries, "Globalization 2.0" brought an era of "companies globalizing," and around the year 2000, we moved into "Globalization 3.0":

The dynamic force in Globalization 3.0 – the force that gives it its unique character – is the newfound power for individuals to collaborate and compete globally. And the phenomenon that is enabling, empowering, and enjoining individuals and small groups to go global so easily and so seamlessly is what I call the flat-world platform.

Columbus discovered the world wasn't flat, we learned how to traverse that round world, and we keep making that world more and more accessible. He found out that the world was a lot bigger than everyone thought, and since his discovery, the smartest people on the planet have worked to make that huge world smaller and smaller.

The most traditional measure of globalization is how far "out" political, economical and technological changes extend. Look at the ARPANET network infrastructure in 1971 and a map of the Internet as it is today.

With every step Columbus took away from the Old World, he was one step closer to the New World. If you look at the growth of the Internet through that lens, you see that every additional node and connection added to the Internet brings connectivity closer to end-users who haven't had it before. Those users gain access to the rest of the Internet, and the rest of the Internet gains access to the information and innovation those users will provide.

Globalization in Hosting

As technology and high speed connectivity become more available to users around the world, the hosting industry has new markets to reach and serve. As Lance explained in a keynote session, "50% of the people in the world are not on the Internet today. They will be on the Internet in the next 5-10 years."

Understanding this global shift, SoftLayer can choose from a few different courses of action. Today, 40+% of our customers reside outside the United States, and we've been successful reaching those customers via 2,000+ Gbps of network connectivity from transit and peering relationships with other networks around the world. If the Internet is flattening the world, though, a USA-centric infrastructure may be limiting.

Before we go any further, let's take a step back and look at a map of the United States with a few important overlays:

US Latency

The three orange circles roughly outline the areas around our data centers in Seattle, Dallas, and Washington, D.C., that have less than 40 milliseconds of latency directly to that facility. The blue circle on the left shows the same 40ms ring around our new San Jose facility (in a different color to help avoid a little confusion). If a customer can reach their host's data center directly with less than 40ms of latency, that customer will be pretty happy with the experience.
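
As a sanity check on those rings, light in optical fiber travels at roughly two-thirds the speed of light in a vacuum, so you can estimate the theoretical maximum radius of a 40ms ring. A quick Python sketch (assuming the 40ms figure is a round trip and ignoring router hops and indirect fiber paths, which shrink the real-world radius considerably):

FIBER_SPEED_M_PER_S = 2.0e8  # roughly two-thirds of c

def max_one_way_km(rtt_ms):
    """Upper bound on fiber distance reachable within a given round trip."""
    one_way_seconds = (rtt_ms / 1000.0) / 2
    return FIBER_SPEED_M_PER_S * one_way_seconds / 1000.0

print(max_one_way_km(40))  # ~4,000 km in the ideal case; real rings are smaller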

When you consider that each of the stars on the map represents a point of presence (PoP) on the SoftLayer private network, you can draw similar circles around those locations to represent the area within 40ms of the first on-ramp to our private network. While Winnipeg, Manitoba, isn't in any of our data centers' 40ms rings, a user there falls within the Chicago PoP's coverage, and once that user is on the SoftLayer network, he or she has a direct, dedicated path to all of our data centers, and we're able to provide a stellar network experience.

If in the next 5-10 years, the half of the world that isn't on the Internet joins the Internet, we can't rely solely on our peering and transit providers to get those users to the SoftLayer network, so we will need to bring the SoftLayer network closer to them:

Global Network

This map gives you an idea of what the first steps of SoftLayer's international expansion will look like. As you've probably heard, we will have a data center location in Singapore and in Amsterdam by the end of the year, and those locations will be instrumental in helping us build our global network.

Each of the points of presence we add in Asia and Europe effectively wraps our 40ms ring around millions of users who may have previously relied on several hops across several providers to get to the SoftLayer network, and as a result, we're able to power a faster and more consistent network experience for those users. As SoftLayer grows, our goal is to maintain the quality of service our customers expect while we extend the availability of that service quality to users around the globe.

If you're not within 40ms of our network yet, don't worry ... We're globalizing, and we'll be in your neighborhood soon.

-@gkdog

August 28, 2008

The Speed of Light is Your Enemy

One of my favorite sites is highscalability.com. As someone with an engineering background, reading about the ways other people solve a variety of problems is really quite interesting.

A recent article talks about the impact of latency on web site viewers. It sounds like common sense that the slower a site is, the more viewers you lose, but what is amazing is that even latency measured in milliseconds can cost a web site viewers.

The article focuses mainly on application-specific solutions to latency and briefly mentions how to deliver static content like images, videos, and documents. There are a couple of ways to attack the static content delivery problem, such as making your web server as efficient as possible, but that only helps so much. Physics - the speed of light - starts to be your enemy. If you are truly worried about shaving milliseconds off your content delivery time, you have to get your content closer to your viewers.

You can do this yourself by getting servers in data centers at multiple sites in different geographic locations. This isn't the easiest solution for everyone, but it does have advantages, such as keeping you in absolute control of your content. The much easier option is to use a CDN (content delivery network).

CDNs are getting more popular, and prices are dropping rapidly. Akamai isn't the only game in town anymore, and you don't have to pay dollars per GB of traffic or sign a contract with a large commit for a multi-year term. CDN traffic can be very competitive, costing only a few pennies more per GB than traffic from a shared or dedicated server. Plus, CDNs optimize their servers for delivering content quickly.

Just to throw some math into the discussion, let's see how long it would take light to travel from New York to San Francisco (4,125,910 meters / 299,792,458 meters per second = 13.7 milliseconds). That's 13.7 milliseconds one way; double it for the request to go there and the response to return, and we're up to 27.4 milliseconds. And that assumes a straight shot with no routers slowing things down. Now let's look at Melbourne to London (16,891,360 meters / 299,792,458 meters per second = 56.3 milliseconds). Double that, throw in some router overhead, and the delays start to be noticeable.
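
For the curious, here's the same back-of-the-envelope math in Python (straight-line distances as given above; real fiber routes are longer, and light in fiber travels about a third slower than in a vacuum, so actual latencies are higher):

C = 299792458.0  # speed of light in a vacuum, meters per second

def ideal_rtt_ms(distance_m):
    """Idealized round trip: a straight-line path at the speed of light."""
    return 2 * distance_m / C * 1000.0

print(ideal_rtt_ms(4125910))   # New York <-> San Francisco: ~27.5 ms
print(ideal_rtt_ms(16891360))  # Melbourne <-> London: ~112.7 ms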

The moral of the story is that for most everybody, distributing static content geographically using a CDN is the right thing to do. That problem has been solved. The harder problem is how to get your application running as efficiently as possible. I'll leave that topic for another time.

-@nday91
