Infrastructure Posts

October 8, 2014

An Insider’s Look at Our Data Centers

I’ve been with SoftLayer for over four years now. It’s been a journey that has taken me around the world—from Dallas to Singapore to Washington, D.C., and back again. Along the way, I’ve met amazingly brilliant people who have helped me sharpen the tools in my ‘data center toolbox’ and enhance the customer experience in a complex compute environment.

I like to think of our data centers as masterpieces of elegant design. We currently have 14 of these works of art, with many more on the way. Here’s an insider’s look at the design:

Keeping It Cool
Our POD layouts use a raised-floor system. The air conditioning units deliver chilled air up through the floor at the front of the servers on the ‘cold rows’; that air passes through the servers and exits into the ‘warm rows.’ The warm rows have ceiling vents to rapidly clear the warm air from the backs of the servers.

Jackets are recommended for this arctic environment.

Pumping up the POWER
Nothing is as important to us as keeping the lights on. Every data center has a three-tiered approach to keeping your servers and services running. The first tier is street power. Each rack has two power strips to distribute the load and offer true redundancy for redundant servers and switches, along with the remote ability to power down an individual port on either power strip.

The second tier is our battery backup for each POD, which provides a seamless failover the moment street power is lost.

This leads to the third tier in our model: generators. We have generators in place to sustain continuity of power until street power returns. Check out the 2-megawatt diesel generator installation at the DAL05 data center here.

The Ultimate Social Network
Neither power nor cooling matters if you can’t connect to your server, which is where our proprietary network topology comes into play. Each bare metal server and each virtual server resides in a rack that connects to three switches. Each of those switches connects to an aggregate switch for the row, and the aggregate switch connects to a router.

The first switch, our private backend network, allows for SSL and VPN connectivity to manage your server. It also gives you server-to-server communication without incurring bandwidth overages.

The second switch, our public network, provides public Internet access to your device, which is perfect for shopping, gaming, coding, or whatever else you want to use it for. With 20TB of bandwidth coming standard on this network, the possibilities are endless.

The third and final switch, management, allows you to connect to the Intelligent Platform Management Interface (IPMI), which provides tools such as KVM, hardware monitoring, and even virtual CDs for installing an image of your choosing. The cables from the switches to your devices are color-coded, labeled port-number-to-rack-unit, and masterfully arranged to maximize identification and airflow.
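
If you prefer scripting to a point-and-click console, that same IPMI data is reachable from the command line. Below is a minimal Python sketch that shells out to ipmitool over the management network; the address and credentials are placeholders, and it assumes ipmitool is installed and IPMI-over-LAN is enabled for your server.

# Minimal sketch: querying IPMI over the management network with ipmitool.
# The address and credentials below are placeholders for your server's
# out-of-band management details.
import subprocess

IPMI_HOST = "10.0.0.10"   # placeholder management-network address
IPMI_USER = "admin"       # placeholder IPMI username
IPMI_PASS = "changeme"    # placeholder IPMI password

def ipmi(*args):
    """Run an ipmitool command over the lanplus interface and return its output."""
    cmd = ["ipmitool", "-I", "lanplus", "-H", IPMI_HOST,
           "-U", IPMI_USER, "-P", IPMI_PASS] + list(args)
    return subprocess.check_output(cmd).decode()

print(ipmi("chassis", "power", "status"))   # chassis power state
print(ipmi("sdr", "list"))                  # temperature, fan, and voltage sensors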

A Soft Place for Hardware
The heart and soul of our business is the computing hardware. We use enterprise-grade hardware from the ground up. Offerings range from our smallest virtual server (1 core, 1GB RAM, 25GB HDD) to one of our largest bare metal servers (quad 10-core, 512GB RAM, multiple 4TB HDDs). With excellent hardware comes excellent options: unless you already have the top of the line, there is almost always a path to improvement, whether that means additional drives, RAM, or even processors.

I hope you enjoyed the view from the inside. If you want to see the data centers up close and personal, I’m sorry to say they’re closed to the public. But you can take a virtual tour of some of our data centers on YouTube: AMS01 and DAL05.

-Joshua Fox

October 1, 2014

Virtual Server Update

Good morning, afternoon, evening, or night, SoftLayer nation.

We want to give you an update and some more information on the maintenance taking place right now on SoftLayer public and private node virtual servers.

As the world is becoming aware today, over the past week a security risk associated with Xen was identified by the Xen community and published as Xen Security Advisory 108 (XSA-108).

And as many are aware, Xen plays a role in our delivery of SoftLayer virtual servers.

Eliminating the vulnerability requires updating software on host nodes, and that requires downtime for the virtual servers running on those nodes.

Yeah, that’s not something anyone likes to hear. But customer security is of the utmost importance to us, so not doing it was not an option.

As soon as the risk was identified, our systems engineers and technology partners began working nonstop to prepare the update.

On Sunday, we notified every customer account that would be affected that emergency maintenance would take place in the middle of this week, and we updated that notice each day.

And then yesterday we announced that the maintenance would begin today at 3pm UTC, with a preliminary order of how it would roll out across all of our data centers.

We are updating host nodes data center by data center to complete the emergency maintenance as quickly as possible. This approach will minimize disruption for customers with failover infrastructure in multiple data centers.

The maintenance is under way and SoftLayer customers can follow it, live, on our forum at http://sftlyr.com/xs101.

-@SoftLayer

August 26, 2014

Bare Metal Power. By the Hour.

Think quickly. You hear that your new app will be featured on the front page of TechCrunch in less than two hours. Because it’s a resource-intensive application, you know that a flood of new users will bog down its current cloud infrastructure, and you’ll need to scale out.

What do you do? Choose virtual servers to guarantee quick deployment and more flexibility? Opt for bare metal servers to deliver the best user experience (while crossing your fingers that the servers are online in time for the flood of traffic)? In times like these, you shouldn’t have to choose between flexibility and power.

You need hourly bare metal servers.

We’ve streamlined the deployment of four of our most popular bare metal configurations, and with that speed, we’re able to offer them with hourly billing! With the hardware pre-configured, you tell us where you want the server to be provisioned—Dallas, San Jose, Washington, D.C., London, Toronto, Amsterdam, Singapore, and Hong Kong—and which operating system you’d like us to install—CentOS, Red Hat, FreeBSD, or Ubuntu. And in less than 30 minutes, your server will be online, fully integrated with your other SoftLayer servers and services, and ready for you.

Use the server for as long as you need it. Spin it down when you’re done. Pay for the hours you had it on your account. It’s that easy. No virtualization. No noisy neighbors. Just your compute-intensive workload, the hardware configuration you need, and a commitment even the commitment-phobic can handle.
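
If you’d rather order from code than from a catalog page, the same thing can be done through the API. Here’s a minimal sketch using the SoftLayer Python client and the Product_Order service; the package ID and price IDs are placeholders you’d look up in the product catalog, and verifyOrder only validates the order template (swap in placeOrder when you’re ready to buy).

# Minimal sketch: verifying an hourly bare metal order through the API.
# The package ID and price IDs are placeholders: look up the real values
# for the configuration and data center you want.
import SoftLayer

client = SoftLayer.create_client_from_env(username='myuser', api_key='my_api_key')

order = {
    'packageId': 200,                                   # placeholder package ID
    'location': 'DALLAS05',                             # data center
    'useHourlyPricing': True,
    'quantity': 1,
    'hardware': [{'hostname': 'burst01', 'domain': 'example.com'}],
    'prices': [{'id': price_id} for price_id in (111, 222, 333)],  # placeholder price IDs
}

# verifyOrder validates the order container without placing it;
# use placeOrder with the same container to actually provision the server.
print(client['Product_Order'].verifyOrder(order))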

Why you need hourly bare metal servers in your cloud life:

  • Processing Power: You have short-term workloads that require significant amounts of processing power. To get the same performance from virtual servers, you might have to provision twice as many nodes or run them for twice as long.
    • Example: a business intelligence ELT (Extract/Load/Transform) application.
  • Schedule-based Workloads: You have a number of applications that require compute and storage resources on a set schedule (i.e., once every month), and you don’t want to deploy (and pay for) high-end machines that will sit idle at all other times.
    • Example: payroll processing or claims payment processing.
  • Performance Testing: Certify or validate how an application performs on a specific hardware configuration.
    • Example: Software or mobile application companies can validate performance on specific hardware platforms.

With bare metal performance available on demand and on hourly terms, you don’t have to compromise performance for flexibility. When TechCrunch comes calling, you have peace of mind that your app’s success and popularity won’t bring it down.

-RJ

June 9, 2014

Visualizing a SoftLayer Billing Order

In my time spent as a data and object modeler, I’ve dealt with both good and bad examples of model visualization. As an IBMer through the Rational acquisition, I have been using modeling tools for a long time. I can appreciate a nice diagram shining a ray of light on an object structure, and abhor a behemoth spaghetti diagram.

When I started studying SoftLayer’s API documentation, I saw both the relational and hierarchical nature of SoftLayer’s concept model. The naming convention of API services and data types embodies their hierarchical structure. While reading about “relational properties” in data types, I thought it would be helpful to see diagrams showing relationships between services and data types versus clicking through reference pages. After all, diagramming data models is a valuable complement to verbal descriptions.

One way people can deal with complex data models is to digest them a little at a time. I can’t imagine a complete data model diagram of SoftLayer’s cloud offering, but I can try to visualize small portions of it. In this spirit, after reviewing articles and blog entries on creating product orders using SoftLayer’s API, I drew an E-R diagram of the basic order elements using IBM Rational Software Architect.

The diagram, Figure 1, should help people understand data entities involved in creating SoftLayer product orders and the relationships among the entities. In particular, IBM Business Partners implementing custom re-branded portals to support the ordering of SoftLayer resources will benefit from visualization of the data model. Picture this!

Figure 1. Diagram of the SoftLayer Billing Order

A user account can have many associated billing orders, which are composed of billing order items. Billing order items can contain multiple order containers that hold a product package. Each package can have several configurations, including product item categories, and is composed of product items, each of which has several possible prices.
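
To see those relationships in live data rather than in a diagram, you can walk them through the API with an object mask. Below is a minimal sketch using the SoftLayer Python client; it assumes the Account service’s orders relation and the billing order’s items relation shown in Figure 1, and the mask can be extended with the deeper relations (packages, items, prices) in the same way.

# Minimal sketch: walking part of the billing order hierarchy with an object mask.
# The mask names relational properties from the diagram (orders -> items);
# deeper relations can be added to the mask in the same way.
import SoftLayer

client = SoftLayer.create_client_from_env(username='myuser', api_key='my_api_key')

orders = client['Account'].getOrders(mask='mask[id, items]')
for order in orders:
    print('Billing order', order['id'], 'has', len(order.get('items', [])), 'items')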

-Andrew

Andrew Hoppe, Ph.D., is a Worldwide Channel Solutions Architect for SoftLayer, an IBM Company.

March 7, 2014

Why the Cloud Scares Traditional IT

My background is "traditional IT." I've been architecting and promoting enterprise virtualization solutions since 2002, and over the past few years, public and hybrid cloud solutions have become a serious topic of discussion ... and in many cases, contention. The customers who gasped with excitement when VMware rolled out a new feature for their on-premises virtualized environments would dismiss any recommendations of taking a public cloud or a hybrid cloud approach. Off-premises cloud environments were surrounded by marketing hype, and the IT departments considering them had legitimate concerns, especially around security and compliance.

I completely understood their concerns, and until recently, I often agreed with them. The cloud model is intimidating. If you've had control over every aspect of your IT environment for a few decades, you don't want to give up access to your infrastructure, much less have to trust another company to protect your business-critical information. But now, I think about those concerns as the start of a conversation about cloud, rather than a "no-go" zone. The cloud is different, but a company's approach to it should still be the same.

What do I mean by that? Enterprise developers and engineers still have to serve as architects to determine the functional and operational requirements for their services. In that process, they need to determine the suitability of a given platform for the computing workload and the company's business objectives and core competencies. Unfortunately, many of the IT decision-makers don't consider the bigger business context, and they choose to build their own "public" IaaS offerings to accommodate internal workloads, and in many cases, their own external clients.

This approach might make sense for service providers, integrators and telcos because infrastructure resources are core components of their businesses, but I've seen the same thing happen at financial institutions, rental companies, and even an airline. Over time, internal IT departments carved out infrastructure-services revenue streams that are totally unrelated to the company's core business. The success of enterprise virtualization often empowered IT departments through cost savings and automation — making the promise of delivering public cloud “in-house” a natural extension and seemingly attractive proposition. Reshaping their perspectives around information security and compliance in that way is often a functional approach, but is it money well spent?

Instead of spending hundreds of thousands or millions of dollars in capital to build out (often commoditized) infrastructure, these businesses could be investing those resources in developing and marketing their core business areas. To give you an example of how a traditional IT task is performed in the cloud, I can share my experience from when I first accessed my SoftLayer account: I deployed a physical ESX host alongside a virtual compute instance, fully pre-configured with OS and vCenter, and I connected it via VPN to my existing (on-prem) vCenter environment. In the old model, that process would have probably taken a couple of days to complete, and I got it done in 3 hours.

Now more than ever, it is the responsibility of the core business line to validate internal IT strategies and evaluate alternatives. Public cloud is not always the right answer for all workloads, but driven by the rapidly evolving maturity and proliferation of IaaS, PaaS and SaaS offerings, most organizations will see significant benefits from it. Ultimately, the best way to understand the potential value is just to give it a try.

-Andy

Andreas Groth is an IBM worldwide channel solutions architect, focusing primarily on SoftLayer. Follow him on Twitter: @andreasgroth

February 6, 2014

Building a Bridge to the OpenStack API

OpenStack is experiencing explosive growth in the cloud market. With more than 200 companies contributing code to the source and new installations coming online every day, OpenStack is pushing hard to become a global standard for cloud computing. Dozens of useful tools and software products have been developed using the OpenStack API, so a growing community of administrators, developers and IT organizations have access to easy-to-use, powerful cloud resources. This kind of OpenStack integration is great for users on a full OpenStack cloud, but it introduces a challenge to providers and users on other cloud platforms: Should we consider deploying or moving to an OpenStack environment to take advantage of these tools?

If a cloud provider spends years developing a unique platform with a proprietary API, implementing native support for the OpenStack API or deploying a full OpenStack solution may be cost prohibitive, even with significant customer and market demand. The provider can either bite the bullet to implement OpenStack compatibility, hope that a third-party library like libcloud or fog is updated to support its API, or choose to go it alone and develop an ecosystem of products around its own API.

Introducing Jumpgate

When we were faced with this situation at SoftLayer, we chose a fourth option. We wanted to make the process of creating an OpenStack-compatible API simpler and more modular. That's how Jumpgate was born. Jumpgate is middleware that acts as a compatibility layer between the OpenStack API and a provider's proprietary API. Externally, it exposes endpoints that adhere to OpenStack's published and accepted API specification, which it then translates into the provider's API using a series of drivers. Think of it as a mechanism to enable passing from one realm/space into another — like the jumpgates featured in science fiction works.

Connection

How Jumpgate Works
Let's take a look at a high-level example: When you want to create a new virtual instance on OpenStack, you might use the Horizon dashboard or the Nova command line client. When you issue the request, the tool first makes a REST call to a Keystone endpoint for authentication, which returns an authorization token. The client then makes another REST call to a Nova endpoint, which manages the computing instances, to create the actual virtual instance. Nova may then make calls to other tools within the cluster for networking (Quantum), image information (Glance), block storage (Cinder), or more. In addition, your client may also send requests directly to some of these endpoints to query for status updates, information about available resources, and so on.
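
In code, the first two legs of that flow look roughly like the sketch below: authenticate against a Keystone v2.0 endpoint, pull the token out of the response, then call Nova with it. The endpoint URLs, tenant, and credentials are placeholders; a real client would read the Nova URL from the service catalog that Keystone returns.

# Minimal sketch of the request flow described above: a Keystone v2.0 token
# request, then a Nova call using that token. URLs and credentials are placeholders.
import requests

KEYSTONE = 'http://keystone.example.com:5000/v2.0'
auth_body = {
    'auth': {
        'tenantName': 'demo',
        'passwordCredentials': {'username': 'demo', 'password': 'secret'},
    }
}

access = requests.post(KEYSTONE + '/tokens', json=auth_body).json()['access']
token = access['token']['id']
tenant_id = access['token']['tenant']['id']

# In practice the Nova endpoint comes from access['serviceCatalog'].
NOVA = 'http://nova.example.com:8774/v2/' + tenant_id
servers = requests.get(NOVA + '/servers', headers={'X-Auth-Token': token}).json()
for server in servers['servers']:
    print(server['id'], server['name'])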

With Jumpgate, your tool first hits the Jumpgate middleware, which exposes a Keystone endpoint. Jumpgate takes the request, breaks it apart into its relevant pieces, then loads up your provider's appropriate API driver. Next, Jumpgate reformats your request into a form that the driver supports and sends it to the provider's API endpoint. Once the response comes back, Jumpgate again uses the driver to break apart the proprietary API response, reformats it into an OpenStack compatible JSON payload, and sends it back to your client. The result is that you interact with an OpenStack-compatible API, and your cloud provider processes those interactions on their own backend infrastructure.

Internally, Jumpgate is a lightweight middleware built in Python using the Falcon Framework. It provides endpoints for nearly every documented OpenStack API call and allows drivers to attach handlers to these endpoints. This modular approach allows providers to implement only the endpoints that are of the highest importance, rolling out OpenStack API compatibility in stages rather than in one monumental effort. Since it sits alongside the provider's existing API, Jumpgate provides a new API interface without risking the stability already provided by the existing API. It's a value-add service that increases customer satisfaction without a huge increase in cost. Once full implementation is finished, a provider with a proprietary cloud platform can benefit from and offer all the tools that are developed to work with the OpenStack API.
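
To make that pattern concrete, here’s a toy Falcon resource in the same spirit: an OpenStack-shaped servers endpoint backed by a stand-in driver for a proprietary API. This is only an illustration of the translation idea, not Jumpgate’s actual driver interface; ProviderDriver and its method are invented for the example.

# Illustrative sketch of the translation pattern (not Jumpgate's real driver API).
# An OpenStack-style endpoint reformats responses from a stand-in proprietary client.
import json
import falcon

class ProviderDriver(object):
    """Stand-in for a provider's proprietary API client (invented for this example)."""
    def list_instances(self, account_id):
        return [{'uuid': 'abc123', 'label': 'web01'}]  # would really call the provider API

class ServersResource(object):
    def __init__(self, driver):
        self.driver = driver

    def on_get(self, req, resp, tenant_id):
        instances = self.driver.list_instances(tenant_id)
        # Reformat the proprietary response into an OpenStack-compatible payload.
        servers = [{'id': i['uuid'], 'name': i['label']} for i in instances]
        resp.body = json.dumps({'servers': servers})
        resp.status = falcon.HTTP_200

# Falcon 0.x-era API, matching the framework Jumpgate was built on.
api = falcon.API()
api.add_route('/v2/{tenant_id}/servers', ServersResource(ProviderDriver()))
# Serve with any WSGI server, e.g.: gunicorn mymodule:api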

Jumpgate allows providers to test the OpenStack compatibility of their drivers by leveraging the OpenStack Tempest test suite. With these tests, developers run the full suite of calls used by OpenStack itself, highlighting edge cases or gaps in functionality. We've even included a helper script that allows Tempest to run only a subset of tests rather than the entire suite, to assist with a staged rollout.

Current Development
Jumpgate is currently in an early alpha stage. We've built the compatibility framework itself and started on the SoftLayer drivers as a reference. So far, we've implemented key endpoints within Nova (computing instances), Keystone (identification and authorization), and Glance (image management) to get most of the basic functionality within Horizon (the web dashboard) working. We've heard that several groups outside SoftLayer are successfully using Jumpgate to drive products like Trove and Heat directly on SoftLayer, which is exciting and shows that we're well beyond the "proof of concept" stage. That being said, there's still a lot of work to be done.

We chose to develop Jumpgate in the open with a tool set that would be familiar to developers working with OpenStack. We're excited to debut this project for the broader OpenStack community, and we're accepting pull requests if you're interested in contributing. Making more clouds compatible with the OpenStack API is important and shouldn’t be an individual undertaking. If you're interested in learning more or contributing, head over to our in-flight project page on GitHub: SoftLayer Jumpgate. There, you'll find everything you need to get started along with the updates to our repository. We encourage everyone to contribute code or drivers ... or even just open issues with feature requests. The more community involvement we get, the better.

-Nathan

January 31, 2014

Simplified OpenStack Deployment on SoftLayer

"What is SoftLayer doing with OpenStack?" I can't even begin to count the number of times I've been asked that question over the last few years. In response, I'll usually explain how we've built our object storage platform on top of OpenStack Swift, or I'll give a few examples of how our customers have used SoftLayer infrastructure to build and scale their own OpenStack environments. Our virtual and bare metal cloud servers provide a powerful and flexible foundation for any OpenStack deployment, and our unique three-tiered network integrates perfectly with OpenStack's Compute and Network node architecture, so it's high time we make it easier to build an OpenStack environment on SoftLayer infrastructure.

To streamline and simplify OpenStack deployment for the open source community, we've published Opscode Chef recipes for both OpenStack Grizzly and OpenStack Havana on GitHub: SoftLayer Chef-Openstack. With Chef and SoftLayer, your own OpenStack cloud is a cookbook away. These recipes were designed with the needs of growth and scalability in mind. Let's take a deeper look into what exactly that means.

OpenStack has adopted a three-node design whereby a controller, compute, and network node make up its architecture:

OpenStack Architecture on SoftLayer

Looking more closely at any one node reveals the services it provides. Scaling the infrastructure beyond a few dozen nodes with this model could create bottlenecks in services such as the block store (OpenStack Cinder) and the image store (OpenStack Glance), since they are traditionally located on the controller node. Infrastructure requirements also change from service to service. For example, OpenStack Neutron, the networking service, does not need much disk I/O, while the Cinder storage service might rely heavily on a node's hard disks. Our cookbook allows you to choose how and where to deploy the services, and it even lets you break apart the MySQL backend to further improve platform performance.

Quick Start: Local Demo Environment

To make it easy to get started, we've created a rapid prototype and sandbox script for use with Vagrant and VirtualBox. With Vagrant, you can easily spin up a demo environment of Chef Server and OpenStack in about 15 minutes on a moderately powerful laptop or desktop. Check it out here. This demo environment is an all-in-one installation of our Chef OpenStack deployment. It also installs a basic Chef server as a sandbox to help you see how the SoftLayer recipes were deployed.

Creating a Custom OpenStack Deployment

The three-node OpenStack model does well at small scale and meets the needs of many consumers; however, control and customizability are the tenets of the SoftLayer OpenStack Chef cookbook's design. In our model, you have full control over the configuration and location of eleven different components in your deployed environment:

Our Chef recipes take care of populating the configuration files with the necessary information, so you won't have to. When deploying, you merely add the role for the matching service to a hardware or virtual server node, and Chef deploys the service to it with all of the configuration handled automatically, including adding multiple Neutron, Nova, and Cinder nodes. This approach allows you to tailor each service to the hardware it will be deployed on: you might put your Neutron hardware node on a server with 10-gigabit network interfaces and configure your Cinder hardware node with RAID 1+0 15K SAS drives.

OpenStack is a fast-growing project for the implementation of IaaS in public and private clouds, but its deployment and configuration can be overwhelming. We created this cookbook to make the process of deploying a full OpenStack environment on SoftLayer quick and straightforward. With the simple configuration of eleven Chef roles, your OpenStack cloud can be deployed onto as few as one node and scaled up to hundreds (or even thousands).

To follow this project, visit SoftLayer on GitHub. Check out some of our other projects on GitHub, and let us know if you need any help or want to contribute.

-@marcalanjones

August 22, 2013

Network Cabling Controversy: Zip Ties v. Hook & Loop Ties

More than 210,000 users have watched a YouTube video of our data center operations team cabling a row of server racks in San Jose. More than 95 percent of the ratings left on the video are positive, and more than 160 comments have been posted in response. To some, those numbers probably seem unbelievable, but to anyone who has ever cabled a data center rack or dealt with a poorly cabled data center rack, the time-lapse video is enthralling, and it seems to have catalyzed a healthy debate: At least a dozen comments on the video question/criticize how we organize and secure the cables on each of our server racks. It's high time we addressed this "zip ties v. hook & loop (Velcro®)" cable bundling controversy.

The most widely recognized standards for network cabling have been published by the Telecommunications Industry Association and Electronics Industries Alliance (TIA/EIA). Unfortunately, those standards don't specify the physical method used to secure cables, but it's generally understood that if you tie cables too tightly, the cable's geometry will be affected, possibly deforming the copper, modifying the twisted pairs or otherwise physically causing performance degradation. That understanding raises the question of whether zip ties are inherently inferior to hook & loop ties for network cabling applications.

As you might have observed in the "Cabling a Data Center Rack" video, SoftLayer uses nylon zip ties when we bundle and secure the network cables on our data center server racks. The decision to use zip ties rather than hook & loop ties was made during SoftLayer's infancy. Our team had a vision for an automated data center that wouldn't require much server/cable movement after a rack is installed, and zip ties were much stronger and more "permanent" than hook & loop ties. Zip ties allow us to tighten our cable bundles easily so those bundles are more structurally solid (and prettier). In short, zip ties were better for SoftLayer data centers than hook & loop ties.

That conclusion is contrary to the prevailing opinion in the world of networking that zip ties are evil and that hook & loop ties are among only a few acceptable materials for "good" network cabling. We hear audible gasps from some network engineers when they see those little strips of nylon bundling our Ethernet cables. We know exactly what they're thinking: Zip ties negatively impact network performance because they're easily over-tightened, and cables in zip-tied bundles are more difficult to replace. After they pick their jaws up off the floor, we debunk those myths.

The first myth (that zip ties can negatively impact network performance) is entirely valid, but its significance is much greater in theory than it is in practice. While I couldn't track down any scientific experiments that demonstrate the maximum tension a cable tie can exert on a bundle of cables before the traffic through those cables is affected, I have a good amount of empirical evidence to fall back on from SoftLayer data centers. Since 2006, SoftLayer has installed more than 400,000 patch cables in data centers around the world (using zip ties), and we've *never* encountered a fault in a network cable that was the result of a zip tie being over-tightened ... And we're not shy about tightening those ties.

The fact that nylon zip ties are cheaper than most (all?) of the other more "acceptable" options is a fringe benefit. By securing our cable bundles tightly, we keep our server racks clean and uniform:

SoftLayer Cabling

The second myth (that cables in zip-tied bundles are more difficult to replace) is also somewhat flawed when it comes to SoftLayer's use case. Every rack is pre-wired to deliver five Ethernet cables — two public, two private and one out-of-band management — to each "rack U," which provides enough connections to support a full rack of 1U servers. If larger servers are installed in a rack, we won't need all of the network cables wired to the rack, but if those servers are ever replaced with smaller servers, we don't have to re-run network cabling. Network cables aren't exposed to the tension, pressure or environmental changes of being moved around (even when servers are moved), so external forces don't cause much wear. The most common physical "failures" of network cables are typically associated with RJ45 jack crimp issues, and those RJ45 ends are easily replaced.

Let's say a cable does need to be replaced, though. Servers in SoftLayer data centers have redundant public and private network connections, but in this theoretical example, we'll assume network traffic can only travel over one network connection and a data center technician has to physically replace the cable connecting the server to the network switch. With all of those zip ties around those cable bundles, how long do you think it would take to bring that connection back online? (Hint: That's kind of a trick question.) See for yourself:

The answer in practice is "less than one minute" ... The "trick" in that trick question is that the zip ties around the cable bundles are irrelevant when it comes to physically replacing a network connection. Data center technicians use temporary cables to make a direct server-to-switch connection, and they schedule an appropriate time to perform a permanent replacement (which actually involves removing and replacing zip ties). In the video above, we show a temporary cable being installed in about 45 seconds, and we also demonstrate the process of creating, installing and bundling a permanent network cable replacement. Even with all of those villainous zip ties, everything is done in less than 18 minutes.

Many of the comments on YouTube bemoan the idea of having to replace a single cable in one of these zip-tied bundles, but as you can see, the process isn't very laborious, and it doesn't vary significantly from the amount of time it would take to perform the same maintenance with a Velcro®-secured cable bundle.

Zip ties are inferior to hook & loop ties for network cabling? Myth(s): Busted.

-@khazard

P.S. Shout-out to Elijah Fleites at DAL05 for expertly replacing the network cable on an internal server for the purposes of this video!

July 24, 2013

Deconstructing SoftLayer's Three-Tiered Network

When Sun Microsystems VP John Gage coined the phrase, "The network is the computer," the idea was more wishful thinking than it was profound. At the time, personal computers were just starting to show up in homes around the country, and most users were getting used to the notion that "The computer is the computer." In the '80s, the only people talking about networks were the ones selling network-related gear, and the idea of "the network" was a little nebulous and vaguely understood. Fast-forward a few decades, and Gage's assertion has proven to be prophetic ... and it happens to explain one of SoftLayer's biggest differentiators.

SoftLayer's hosting platform features an innovative, three-tier network architecture: Every server in a SoftLayer data center is physically connected to public, private and out-of-band management networks. This "network within a network" topology provides customers the ability to build out and manage their own global infrastructure without overly complex configurations or significant costs, but the benefits of this setup are often overlooked. To best understand why this network architecture is such a game-changer, let's examine each of the network layers individually.

SoftLayer Private Network

Public Network

When someone visits your website, they are accessing content from your server over the public network. This network connection is standard issue from every hosting provider since your content needs to be accessed by your users. When SoftLayer was founded in 2005, we were the first hosting provider to offer multiple network connections by default. At the time, some of our competitors offered one-off private network connections between servers in a rack or a single data center phase, but those competitors built their legacy infrastructures with an all-purpose public network connection. SoftLayer offers public network connection speeds up to 10Gbps, and every bare metal server you order from us includes free inbound bandwidth and 5TB of outbound bandwidth on the public network.

Private Network

When you want to move data from one server to another in any of SoftLayer's data centers, you can do so quickly and easily over the private network. Bandwidth between servers on the private network is unmetered and free, so you don't incur any costs when you transfer files from one server to another. Having a dedicated private network allows you to move content between servers and facilities without fighting against or getting in the way of the users accessing your server over the public network.

It should come as no surprise to learn that all private network traffic stays on SoftLayer's network exclusively when it travels between our facilities. The blue lines in this image show how the private network connects all of our data centers and points of presence:

SoftLayer Private Network

To fully replicate the functionality provided by the SoftLayer private network, competitors with legacy single-network architecture would have to essentially double their networking gear installation and establish safeguards to guarantee that customers can only access information from their own servers via the private network. Because that process is pretty daunting (and expensive), many of our competitors have opted for "virtual" segmentation that logically links servers to each other. The traffic between servers in those "virtual" private networks still travels over the public network, so they usually charge you for "private network" bandwidth at the public bandwidth rate.

Out-of-Band Management Network

When it comes to managing your server, you want an unencumbered network connection that will give you direct, secure access when you need it. Splitting out the public and private networks into distinct physical layers provides significant flexibility when it comes to delivering content where it needs to go, but we saw a need for one more unique network layer. If your server is targeted for a denial of service attack or a particular ISP fails to route traffic to your server correctly, you're effectively locked out of your server if you don't have another way to access it. Our management-specific network layer uses bandwidth providers that aren't included in our public/private bandwidth mix, so you're taking a different route to your server, and you're accessing the server through a dedicated port.

If you've seen pictures or video from a SoftLayer data center (or if you've competed in the Server Challenge), you probably noticed the three different colors of Ethernet cables connected at the back of every server rack, and each of those colors carries one of these types of network traffic exclusively. The pink/red cables carry public network traffic, the blue cables carry private network traffic, and the green cables carry out-of-band management network traffic. All thirteen of our data centers have the same colored cables in the same configuration doing the same jobs, so we're able to train our operations staff consistently between all thirteen of our data centers. That consistency enables us to provide quicker service when you need it, and it lessens the chance of human error on the data center floor.

The most powerful server on the market can be sidelined by a poorly designed, inefficient network. If "the network is the computer," the network should be a primary concern when you select your next hosting provider.

-@khazard

July 16, 2013

Riak Performance Analysis: Bare Metal v. Virtual

In December, I posted a MongoDB performance analysis that showed the quantitative benefits of using bare metal servers for MongoDB workloads. It should come as no surprise that in the wake of SoftLayer's Riak launch, we've got some similar data to share about running Riak on bare metal.

To run this test, we started by creating five-node clusters with Riak 1.3.1 on SoftLayer bare metal servers and on a popular competitor's public cloud instances. For the SoftLayer environment, we created these clusters using the Riak Solution Designer, so the nodes were all provisioned, configured and clustered for us automatically when we ordered them. For the public cloud virtual instance Riak cluster, each node was provisioned individually using a Riak image template and manually configured into a cluster after all had come online. To optimize for Riak performance, I made a few tweaks at the OS level of our servers (running CentOS 64-bit):

noatime
nodiratime
barrier=0
data=writeback
ulimit -n 65536

The common noatime and nodiratime settings eliminate access-time writes during reads, which helps performance and reduces disk wear. The barrier and writeback settings are a little less common and may not be what you'd normally set. Although those settings present a very slight risk of data loss on disk failure, remember that the Riak solution is deployed in five-node rings with data redundantly available across multiple nodes in the ring. With that in mind, and considering that each node is also deployed with a RAID10 storage array, the failure of a single disk poses only a minor risk to the overall solution and has no impact on the data set as a whole (there are plenty of redundant copies of that data available). Given the minor risk involved, the performance increases from those two settings justify their use.

With all of the nodes tweaked and configured into clusters, we set up Basho's test harness — Basho Bench — to remotely simulate load on the deployments. Basho Bench allows you to create a configurable test plan for a Riak cluster by configuring a number of workers to utilize a driver type to generate load. It comes packaged as an Erlang application with a config file example that you can alter to create the specifics for the concurrency, data set size, and duration of your tests. The results can be viewed as CSV data, and there is an optional graphics package that allows you to generate the graphs that I am posting in this blog. A simplified graphic of our test environment would look like this:

Riak Test Environment

The following Basho Bench config is what we used for our testing:

{mode, max}.
{duration, 120}.
{concurrent, 8}.
{driver, basho_bench_driver_riakc_pb}.
{key_generator,{int_to_bin,{uniform_int,1000000}}}.
{value_generator,{exponential_bin,4098,50000}}.
{riakc_pb_ips, [{10,60,68,9},{10,40,117,89},{10,80,64,4},{10,80,64,8},{10,60,68,7}]}.
{riakc_pb_replies, 2}.
{operations, [{get, 10},{put, 1}]}.

To spell it out a little more simply:

Tests Performed

Data Set: 400GB
10:1 Query-to-Update Operations
8 Concurrent Client Connections
Test Duration: 2 Hours

You may notice that in the test cases that use SoftLayer "Medium" servers, the virtual provider nodes are running 26 virtual compute units against our dual-processor hex-core servers (12 cores total). In testing with Riak, memory is more important to the operations than CPU resources, so we provisioned the virtual instances to align with the 36GB of memory in each of the "Medium" SoftLayer servers. In the public cloud environment, the higher level of RAM was restricted to packages with higher CPU counts, so while the CPU counts differ, the RAM amounts are as close to even as we could make them.

One final "housekeeping" note before we dive into the results: The graphs below are pulled directly from the optional graphics package that displays Basho Bench results. You'll notice that the scale on the left-hand side of graphs differs dramatically between the two environments, so a cursory look at the results might not tell the whole story. Click any of the graphs below for a larger version. At the end of each test case, we'll share a few observations about the operations per second and latency results from each test. When we talk about latency in the "key observation" sections, we'll talk about the 99th percentile line — 99% of the results had latency below this line. More simply you could say, "This is the highest latency we saw on this platform in this test." The primary reason we're focusing on this line is because it's much easier to read on the graphs than the mean/median lines in the bottom graphs.

Riak Test 1: "Small" Bare Metal 5-Node Cluster vs Virtual 5-Node Cluster

Servers

SoftLayer Small Riak Server Node
Single 4-core Intel 1270 CPU
64-bit CentOS
8GB RAM
4 x 500GB SATAII – RAID10
1Gb Bonded Network
Virtual Provider Node
4 Virtual Compute Units
64-bit CentOS
7.5GB RAM
4 x 500GB Network Storage – RAID10
1Gb Network
 

Results

Riak Performance Analysis

Riak Performance Analysis

Key Observations

The SoftLayer environment showed much more consistency in operations per second, with an average throughput around 450 Op/sec. The virtual environment throughput varied significantly, from about 50 operations per second to more than 600 operations per second, with the trend line fluctuating between about 220 Op/sec and 350 Op/sec.

Comparing the latency of get and put requests, the 99th percentile of results in the SoftLayer environment stayed around 50ms for gets and under 200ms for puts while the same metric for the virtual environment hovered around 800ms in gets and 4000ms in puts. The scale of the graphs is drastically different, so if you aren't looking closely, you don't see how significantly the performance varies between the two.

Riak Test 2: "Medium" Bare Metal 5-Node Cluster vs Virtual 5-Node Cluster

Servers

SoftLayer Medium Riak Server Node
Dual 6-core Intel 5670 CPUs
64-bit CentOS
36GB RAM
4 x 300GB 15K SAS – RAID10
1Gb Network – Bonded
Virtual Provider Node
26 Virtual Compute Units
64-bit CentOS
30GB RAM
4 x 300GB Network Storage
1Gb Network
 

Results

Riak Performance Analysis

Riak Performance Analysis

Key Observations

Similar to the results of Test 1, the throughput numbers from the bare metal environment are more consistent (and are consistently higher) than the throughput results from the virtual instance environment. The SoftLayer environment performed between 1500 and 1750 operations per second on average while the virtual provider environment averaged around 1200 operations per second throughout the test.

The latency of get and put requests in Test 2 also paints a similar picture to Test 1. The 99th percentile of results in the SoftLayer environment stayed below 50ms for gets and under 400ms for puts, while the same metric for the virtual environment averaged about 250ms for gets and over 1000ms for puts. Latency in a big data application can be a killer, so the results from the virtual provider might be setting off alarm bells in your head.

Riak Test 3: "Medium" Bare Metal 5-Node Cluster vs Virtual 5-Node Cluster

Servers

SoftLayer Medium Riak Server Node
Dual 6-core Intel 5670 CPUs
64-bit CentOS
36GB RAM
4 x 128GB SSD – RAID10
1Gb Network – Bonded
Virtual Provider Node
26 Virtual Compute Units
64-bit CentOS
30GB RAM
4 x 300GB Network Storage
1Gb Network
 

Results

Riak Performance Analysis

Riak Performance Analysis

Key Observations

In Test 3, we're using the same specs in our virtual provider nodes, so the results for the virtual node environment are the same in Test 3 as they are in Test 2. In this test, the SoftLayer environment substitutes SSDs for the 15K SAS drives used in Test 2, and the throughput numbers show the impact of that improved I/O. The average throughput of the bare metal environment with SSDs is between 1750 and 2000 operations per second. Those numbers are slightly higher than the SoftLayer environment in Test 2, further distancing the bare metal results from the virtual provider results.

The latency of gets for the SoftLayer environment is very difficult to see in this graph because the latency was so low throughout the test. The 99th percentile of puts in the SoftLayer environment settled between 500ms and 625ms, which was a little higher than the bare metal results from Test 2 but still well below the latency from the virtual environment.

Summary

The results show that — similar to the majority of data-centric applications that we have tested — Riak has more consistent, better performing, and lower latency results when deployed onto bare metal instead of a cluster of public cloud instances. The stark differences in consistency of the results and the latency are noteworthy for developers looking to host their big data applications. We compared the 99th percentile of latency, but the mean/median results are worth checking out as well. Look at the mean and median results from the SoftLayer SSD Node environment: For gets, the mean latency was 2.5ms and the median was somewhere around 1ms. For puts, the mean was between 7.5ms and 11ms and the median was around 5ms. Those kinds of results are almost unbelievable (and that's why I've shared everything involved in completing this test so that you can try it yourself and see that there's no funny business going on).

It's commonly understood that local single-tenant resources like bare metal will always perform better than network storage resources, but putting some concrete numbers on paper makes the difference in performance pretty striking. Virtualizing on multi-tenant solutions with network-attached storage often introduces latency issues, and performance will vary significantly depending on host load. These results may seem obvious, but sometimes the promise of quick and easy deployments on public cloud environments can lure even the sanest and most rational developer. Some applications are suited to the public cloud, but big data isn't one of them. When you have data-centric apps that require extreme I/O traffic to your storage medium, nothing beats local high-performance resources.

-Harold
