When you think about all the things that have to go right all the time, where "all the time" means millions of times per second, for a user to get your content, it can be a little... daunting. The software, the network and the hardware all have to work for this bit of magic we call the Internet to actually occur.
There are points of failure all over the place. Take a server, for example: hard drives can fail, power supplies can fail, the OS can fail. The people running servers can fail too: maybe you try something new and it has unforeseen consequences. This is simply the way of things.
Mitigation comes in many forms. If your content is mostly images you could use something like a content delivery network to move your content into the "cloud" so that failure in one area might not take out everything. On the server itself you can do things like redundant power supplies and RAID arrays. Proper testing and staging of changes can help minimize the occurrence of software bugs and configuration errors impacting your production setup.
Even if nothing fails there will come a time when you have to shut down a service or reboot an entire server. Patches can't always update files that are in use, for example. One way to work around this problem is to have multiple servers working together in a server cluster. Clustering can be done in various ways, using Unix machines, Windows machines and even a combination of operating systems.
Since I've recently set up a Windows 2008 cluster, that is what we're going to discuss. First we need to define some terms. A node is a member of a cluster. Nodes are used to host resources, which are the things that a cluster provides. When a node in a cluster fails, another node takes over the job of offering that resource to the network. This can be done because resources (files, IPs, etc.) are not tied to a single machine: the data lives on shared storage, which is typically a set of SAN drives to which multiple machines can connect.
Windows clusters come in a couple of conceptual forms. Active/Passive clusters have the resources hosted on one node and have another node just sitting idle, waiting for the first to fail. Active/Active clusters, on the other hand, host some resources on each node. This puts each node to work. The key with clusters is that you need to size the nodes such that your workloads can still function even if there is a node failure.
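To make the sizing point concrete, here is a minimal Python sketch of the check. It is an illustration only and not part of any Windows tooling: the function name, the node names and the abstract "capacity units" are all assumptions, and it simplifies by comparing the total load against the combined capacity of the surviving nodes.

```python
def can_tolerate_node_failure(capacity, load):
    """Return True if any single node can fail and the rest can carry the load.

    capacity: dict of node name -> capacity in arbitrary units (hypothetical)
    load:     dict of node name -> load currently hosted on that node
    Simplification: only the total load is compared against the combined
    capacity of the surviving nodes.
    """
    total_load = sum(load.values())
    for failed in capacity:
        surviving_capacity = sum(
            units for node, units in capacity.items() if node != failed
        )
        if total_load > surviving_capacity:
            return False
    return True


nodes = {"NodeA": 100, "NodeB": 100}

# Active/Passive: everything runs on NodeA, NodeB sits idle.
active_passive = can_tolerate_node_failure(nodes, {"NodeA": 80, "NodeB": 0})

# Active/Active with headroom: either node can absorb the other's work.
active_active_ok = can_tolerate_node_failure(nodes, {"NodeA": 45, "NodeB": 45})

# Active/Active sized too tightly: one node cannot carry both workloads.
active_active_tight = can_tolerate_node_failure(nodes, {"NodeA": 70, "NodeB": 60})
```

In the last case both nodes look comfortable on a normal day, but the combined load of 130 units will not fit on a single 100-unit node, which is exactly the situation the sizing rule is meant to catch.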
OK, so you have multiple machines, a SAN between them, some IPs and something you wish to serve up in a highly available manner. How does this work? Once you create the cluster, you go about defining resources. In the case of the cluster I set up, my resource was a file share. I wanted these files to be available on the network even if I had to reboot one of the servers. The resource was actually a combination of an IP address that could be answered by either machine and the iSCSI drive mount which contained the actual files.
Once the resource was established, it was hosted on NodeA. When I rebooted NodeA, though, the resource automatically failed over to NodeB, so the total interruption in service was only a couple of seconds. NodeB took possession of the IP address and the iSCSI mount automatically once it determined that NodeA had gone away.
File serving is a really basic example, but you can use clustering with much more complicated things like the Microsoft Exchange e-mail server, Internet Information Services (IIS), virtual machines and even network services like DHCP, DNS and WINS.
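The failover described above can be modeled with a small Python toy. This is a sketch of the idea only, not how the Windows cluster service is implemented: the class names, the method names and the "pick the first surviving node" rule are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ResourceGroup:
    name: str
    resources: list  # things that move together, e.g. an IP and a disk
    owner: str       # node currently hosting the group


@dataclass
class Cluster:
    nodes: dict = field(default_factory=dict)   # node name -> is it up?
    groups: list = field(default_factory=list)

    def add_node(self, name):
        self.nodes[name] = True

    def node_up(self, name):
        # No automatic failback in this sketch: groups stay where they are.
        self.nodes[name] = True

    def node_down(self, name):
        """Mark a node as gone and move its groups to a surviving node."""
        self.nodes[name] = False
        survivors = [node for node, up in self.nodes.items() if up]
        for group in self.groups:
            if group.owner == name:
                if not survivors:
                    raise RuntimeError("no surviving node for " + group.name)
                group.owner = survivors[0]  # hypothetical placement rule


cluster = Cluster()
cluster.add_node("NodeA")
cluster.add_node("NodeB")

share = ResourceGroup("FileShare", ["cluster IP address", "iSCSI disk"], "NodeA")
cluster.groups.append(share)

cluster.node_down("NodeA")   # reboot NodeA
owner_after_reboot = share.owner
print(owner_after_reboot)    # prints NodeB
```

The point of grouping the IP address and the disk is that they always move as a unit, so clients keep talking to the same address while a different machine answers.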
Clusters are not the end of service failures. The shared storage can fail, the network can fail, the software configuration or the humans could fail. With a proper technical staff implementing and maintaining them, however, clusters can be a useful tool in the quest for high availability.