Red Hat High Availability (in Toronto)

[Photo: CN Tower]

Last week I had the opportunity to attend a session of the four-day class, Red Hat Enterprise Clustering and Storage Management (RH436), held in Toronto.  It was a busy week, and that lovely view of the CN Tower above, as seen from my hotel room window, had to suffice for experiencing the city.  Fortunately I’ve been here before, looked through the Glass Floor, and generally done the tourist thing.  So let’s get down to business.

Purpose Of The Trip
At the radiology practice where I work, we’ve long relied on Red Hat Enterprise Linux as the operating system underpinning our PACS (Picture Archiving and Communication System), the heart of our medical imaging business.  For a while, most of the rest of our back-end systems ran atop the Microsoft family of server products, just as our workstations and laptops run Microsoft Windows Professional.  But over the last couple of years, that Microsoft-centric approach has gradually started to shift, as we build and deploy additional solutions on Linux.  (The reasons for this change have a lot to do with the low cost and abundance of open-source infrastructure technologies compared to their licensed Microsoft equivalents.)  And as we build out and begin to rely on more applications running on Linux, we have to invest time in making those platforms as reliable and fault-tolerant as possible.

Fault Tolerance, Generally Speaking
The term ‘fault tolerance’ is fairly self-explanatory, though in practice it can cover a substantial amount of ground where technical implementations are concerned.  Perhaps it’s best thought of as eliminating single points of failure everywhere we can.  At my employer, and perhaps for the majority of businesses our size and larger, there’s already a great deal of fault tolerance underneath any new ‘server’ that we deploy today.  For starters, our SAN storage environment includes fault tolerant RAID disk groups, redundant storage processors, redundant Fibre Channel paths to the storage, redundant power supplies on redundant electrical circuits, etc.  Connected to the storage is a chassis containing multiple redundant physical blade servers, all running VMware’s virtualization software, including their High Availability (HA) and Distributed Resource Scheduler (DRS) features.  Finally, we create virtual Microsoft and Linux servers on top of all this infrastructure.  Those virtual servers get passed around from one physical host to another – seamlessly – as the workload demands, or in the event of a hardware component failure.  That’s a lot of redundancy.  But what if we want to take this a step further, and implement fault tolerance at the operating system or application level, in this case leveraging Red Hat Enterprise Linux?  That is where Red Hat clustering comes into play.

Caveat Emptor
Before we go any further, we should note that Red Hat lists the following prerequisites in their RH436 course textbook: “Students who are senior Linux systems administrators with at least five years of full-time Linux experience, preferably using Red Hat Enterprise Linux,” and “Students should enter the class with current RHCE credentials.”  Neither of those applies to me, so what you’re about to read is filtered through the lens of someone who is arguably not in the same league as the intended audience.  Then again, we’re all here to learn.

What Red Hat Clustering Is…
In Red Hat parlance, the term ‘clustering’ can refer to multiple scenarios, including simple load balancing, high-performance computing clusters, and finally, high availability clusters.  Today we’ll focus on the latter, provided by Red Hat’s High Availability Add-On, an extra-cost module that starts at $399/year per 2-processor host.  With the HA add-on, we’re able to cluster instances of the Apache web server, a file system, an IP address, MySQL, an NFS client or server, an NFS/CIFS file system, OpenLDAP, Oracle 10g, PostgreSQL, Samba, a SAP database, Sybase, Tomcat or a virtual machine.  We’re also able to cluster any custom service that launches via an init script and returns status appropriately; a minimal skeleton of such a script follows below.  Generally speaking, a clustered resource runs in an active-passive configuration, with one node holding the resource until it fails, at which time another node takes over.
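
To make ‘returns status appropriately’ concrete, here is a bare-bones sketch of the kind of init script the cluster’s script resource could manage.  The service name, binary and pid file path are hypothetical, and a production script would want far more careful error handling.

    #!/bin/bash
    # Hypothetical /etc/init.d/myservice -- a bare-bones init script that a
    # cluster "script" resource could call with start, stop and status.
    # The binary and pid file below are placeholders, not anything Red Hat ships.

    PIDFILE=/var/run/myservice.pid

    case "$1" in
      start)
        /usr/local/bin/myservice --pidfile "$PIDFILE"
        ;;
      stop)
        [ -f "$PIDFILE" ] && kill "$(cat "$PIDFILE")" && rm -f "$PIDFILE"
        ;;
      status)
        # Exit 0 if the process is alive, non-zero otherwise; the cluster
        # uses this exit code to decide whether the service has failed.
        [ -f "$PIDFILE" ] && kill -0 "$(cat "$PIDFILE")" 2>/dev/null
        ;;
      restart)
        "$0" stop
        "$0" start
        ;;
      *)
        echo "Usage: $0 {start|stop|status|restart}"
        exit 2
        ;;
    esac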

…And What Red Hat HA Clustering Is Not
Less than two weeks prior to the RH436 class, I somehow managed to get through a half-hour phone conversation with a Red Hat Engineer without touching on one fundamental requirement of HA that, when later identified, shaped my understanding of Red Hat clustering going forward.  So perhaps the following point merits particular attention: Any service clustered via Red Hat’s HA add-on that also uses storage – say Apache or MySQL – requires that the cluster nodes have shared access to block level storage.  Let’s read it again: Red Hat’s HA clustering requires that all nodes have shared access to block level storage; the type typically provided by an iSCSI or Fibre Channel SAN.  Red Hat HA passes control of this shared storage back and forth among nodes as needed, rather than having some built-in facility for replicating a cluster’s user-facing content from one node to another.  For this reason and others, we can’t simply create discrete Red Hat servers here and there and combine them into a cluster, with no awareness of, nor regard for, our underlying storage and network infrastructure.  Yet before anyone goes dismissing any potential use cases out of hand, remember that like much of life and technology, the full story is always just a bit more complicated.

Traditional Cluster
Let’s begin by talking about how we might implement a traditional Red Hat HA cluster.  The following steps are vastly oversimplified, as a lot of planning is required around many of these actions prior to execution.  We’re not going to get into any command-line detail in today’s discussion, though that would make for an interesting post down the road.

  • We’ll begin with between two and sixteen physical or virtual servers running Red Hat Enterprise Linux with the HA add-on license.  Each server must support power fencing, a mechanism that lets a surviving node cut a failed node off from shared storage, typically by powering the failed node down before its services are taken over.  Fencing is supported on physical servers from Cisco, Dell, HP, IBM and others, and on VMware virtual machines.
  • We’ll need one or more shared block-level storage volumes accessible to all nodes, though typically mounted by only one node at a time.  In a traditional cluster, we’d make this available via an iSCSI or Fibre Channel SAN.
  • All nodes sit on the same network segment in the same address space, though it’s wise to isolate cluster communication on a separate VLAN from published services.  Multicast, IGMP and gratuitous ARP must be supported on those segments, and there’s no traditional layer 3 routing separating one cluster node from another.
  • We’d install a web-based cluster management application called Luci on a non-cluster node.  We’re not concerned about fault-tolerance of this management tool, as a new one can be spun up at a moment’s notice and pointed at an existing cluster.
  • Then we’d install a corresponding agent called Ricci (or, more likely, the all-encompassing “High Availability” and “Resilient Storage” groups from the Yum repository) on each cluster node, assign passwords, and set the services to start on boot.  A rough command sketch follows this list.
  • At this point we’d likely log into the Luci web interface, create a cluster, add nodes, set up fencing, set up failover, create shared resources (like an IP address, a file system or an Apache web service) and add those resources to a service group.  If that sounds like a lot, you’re right.  We could spend hours or days on this one bullet the first time around.
  • Before we declare Mission Accomplished, we’ll want to restart each node in the cluster and test every failover scenario that we can think of.  We don’t want to assume that we’ve got a functional cluster without proving it.
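
Without getting into full command-line detail, the Luci and Ricci bullets above boil down to something like the following on RHEL 6.  Treat this as a sketch and verify the package and group names, and the Luci port, against your own release.

    # On the dedicated management (non-cluster) node: install and start Luci
    yum install luci
    chkconfig luci on
    service luci start        # the web UI then listens on https://<host>:8084

    # On each cluster node: install Ricci (or the full HA package group),
    # give the ricci account a password, and start the agent on boot
    yum groupinstall "High Availability"     # or simply: yum install ricci
    passwd ricci
    chkconfig ricci on
    service ricci start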

What About Small Environments Without a SAN?
It’s conceivable that someone might want to cluster Red Hat servers in an environment without a SAN at all.  Or perhaps we have a SAN, but we’ve already provisioned the entire thing for use by VMware, and we’d rather not start carving out LUNs to present directly to every new clustered scenario that we deploy.  What then?  Well, there are free and non-free virtual iSCSI SAN products, including FreeNAS, Openfiler and others.  Some are offered in several forms, including a VMware VMDK file or virtual appliance, and can be installed and sharing iSCSI targets in minutes, where previously we had none.  Some virtual iSCSI solutions even offer replication from one instance to another, analogous to EMC MirrorView and the like.  In addition to eliminating yet another single point of failure, SAN replication provides a bit of a segue into what we’re going to talk about next.
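
Whichever iSCSI target we end up with, each cluster node would typically attach to it using the standard open-iscsi tools.  A minimal sketch follows; the portal address and IQN are made up for illustration.

    # Discover the targets offered by the (hypothetical) iSCSI portal
    iscsiadm -m discovery -t sendtargets -p 192.168.1.50

    # Log in to the discovered target so it appears as a local block device
    iscsiadm -m node -T iqn.2014-01.com.example:cluster-lun1 -p 192.168.1.50 --login

    # Have the session re-established automatically at boot
    iscsiadm -m node -T iqn.2014-01.com.example:cluster-lun1 -p 192.168.1.50 \
             --op update -n node.startup -v automatic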

What About Geographic Fault Tolerance?
As mentioned early on, at my office we already have several layers of fault tolerance built into our computing environment at our primary data center.  When looking into Red Hat HA, our ideal scenario might involve clustering a service or application across two data centers, separated in our case by around 25 miles, with 1 Gbit/s of network bandwidth between them and roughly 1 ms of latency.  Can we do it, and what about the shared storage requirement?  Fortunately, Red Hat supports certain Multi-Site Disaster Recovery Cluster and Stretch Cluster scenarios.  Let’s take a look at a few of the things involved, keeping in mind that there are other requirements beyond these.

  • A Stretch Cluster, for instance, requires the data volumes to be replicated via hardware or third-party software so that each site has access to a replica.
  • Further, a Stretch Cluster must span no more than two sites, and must have the same number of nodes at each location.
  • Both sites must share the same logical network, and routing between the two physical sites is not supported.  The network must also offer LAN-like latency that is less than or equal to 2 ms.
  • In the event of a site failure, human intervention is required to continue cluster operation, since a link failure would prevent the remaining site from initiating fencing.
  • Finally, all Stretch Clusters are subject to a Red Hat Architecture Review before they’ll be supported.  In fact, an Architecture Review might be a good idea in any cluster deployment, stretch or not.

Conclusion
While many enterprise computing environments already contain a great deal of fault tolerance these days, the clustering in Red Hat’s High Availability Add-On is one more tool that Systems Administrators may take advantage of as the need dictates.  Though generally designed around increasing the availability of enterprise workloads within a single data center, it can be scaled down to use virtual iSCSI storage, or stretched under certain specific circumstances to provide geographic fault tolerance.  In today’s 24×7 world, it’s good to have options.

Getting Started With Monit on Red Hat

[Screenshot: the Monit console viewed in a web browser]

As is common in many professional environments today, the technology infrastructure that we support and maintain at my day job includes a wide variety of platforms. Narrowing our focus to operating systems, we run not only Microsoft products, but also several Linux distributions including Red Hat Enterprise Linux, Ubuntu and openSUSE. And while we monitor the applications that ride atop these operating systems using a variety of means, we encountered an interesting new requirement this week: how do we monitor, restart and alert administrators to particular failed services on an instance of Red Hat Enterprise Linux, while running only a relatively lightweight process on the server itself?

Potential Solution
A few minutes with Google led us to Monit, where their web site suggests that it comes “With all features needed for system monitoring and error recovery.” Better yet, you can get “Up and running in 15 minutes!” It’s freely available for Linux, BSD and OS X. But would the claims prove true? Let’s put it to the test.

Getting Started
Those familiar with various Linux distros are well aware that package installation differs from one version to the next. The following exercise involves installing Monit on RHEL (Red Hat Enterprise Linux) 6.5. We’ll assume that we’re logged in as root for the duration of this process.

  1. First we have to enable the EPEL (Extra Packages for Enterprise Linux) repository on our RHEL server by installing the EPEL release RPM. A hedged sketch of the commands for steps 1 through 3 follows this list; the EPEL release version and URL change over time, so check the project site for the current package.
  2. Then we install Monit and start it.
  3. To this point we’ve spent maybe two minutes, and technically we’ve already installed Monit. Not bad, but we can’t use it yet. Monit serves its own web page, which runs on port 2812 by default. If you’re running a firewall, you may wish to open port 2812; an iptables example is included in the sketch after this list. Even so, Monit doesn’t allow access to its web page without further configuration beyond the firewall, which we’ll discuss later.
  4. All Monit configuration is done via the file /etc/monit.conf. We’ll use vim to edit it. If you’ve installed the Desktop portion of RHEL, you could use the graphical gedit editor in place of vim.
  5. You’ll see that virtually the entire monit.conf file is commented out with # signs by default. We’ll have to uncomment various sections and add our own parameters to make it work. Look for the set daemon portion and remove the # markers at the beginning of those lines. (Navigate within vim using the keyboard arrow keys, and hit Insert before attempting to add any new content.) You might set the daemon value down to 60 if you wish to check at one-minute intervals. Sample monit.conf fragments for steps 5 through 8 follow this list.
  6. Scrolling further, uncomment the set mailserver line and replace mail.bar.baz with the name or TCP/IP address of an internal SMTP mail server through which you want to send your e-mail alerts. We ran our alerts through our local Microsoft Exchange server.
  7. Monit has a default e-mail format that’s usable, but I found myself wanting to customize it. Everything following set mail-format, inside { }, represents the content of the alert messages; see the sample fragment after this list. You can insert these lines following the section of monit.conf that displays the default format.
  8. Next we have to identify who receives e-mail alerts. You’ll see the example syntax: set alert manager@foo.bar. I prefer to be alerted only on things I should be concerned about, so I might limit my alerts to certain event types; one way of doing that appears in the fragment after this list.
  9. Now we come to the parameters that control the Monit web page on port 2812, as discussed earlier. Uncomment the set httpd section; a sample covering this step and the two that follow appears after this list. A few things to note: we don’t need the use address parameter if we wish to listen on all interfaces; if we wish to connect from client workstations in a particular IP range, we can use an allow statement with a network number and subnet mask, as with 192.168.1.0 in the sample; and the allow admin:monit line requires that username and password when connecting, so you’ll probably want to change it from the default.
  10. Finally, we have to decide what to monitor. This is where it gets a bit more complicated, and Google is your friend. As an example, we can configure Monit to watch the Apache web server. Unfortunately, the path to httpd.pid shown in the stock monit.conf comments isn’t accurate for Red Hat, so check it against your system; the sketch after this list uses the RHEL 6 location. If you’re having trouble locating the pid file for a particular service, you can always try locate *.pid.
  11. Let’s also put in an entry to alert if our root file system gets more than 90% full.
  12. At this point we’re done editing monit.conf for purposes of this conversation. If editing with vim, we’ll hit Esc if we’re still in insert mode, and then :wq to write our changes and quit. Let’s restart Monit and give it a try; the restart and test commands are included in the last sketch after this list. You’ll know pretty quickly if you have a syntax error, as Monit will report the error and fail to restart. (Go back and modify /etc/monit.conf to fix any syntax errors.)
  13. Assuming Monit restarts successfully, you should now be able to connect via a web browser to your host’s name or IP address, followed by :2812.
  14. If you wish to test Monit, you might stop Apache to see whether Monit restarts it, and whether you receive an alert message. Keep in mind that Monit only checks every 120 seconds, or whatever interval we specified, so give it a moment.
  15. Assuming your e-mail configuration is correct, and that your SMTP server is relaying e-mail for you, you should receive one message indicating that apache is not running, and a second indicating that it is running once again.
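
For reference, here is a rough sketch of what steps 1 through 3 might look like on RHEL 6.5. The EPEL release RPM URL and version below are only illustrative and change over time, and your firewall handling may differ.

    # Step 1: enable the EPEL repository (check the EPEL project for the current
    # release RPM; this URL and version are only an example)
    wget http://download.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
    rpm -Uvh epel-release-6-8.noarch.rpm

    # Step 2: install Monit, start it, and have it start at boot
    yum install monit
    service monit start
    chkconfig monit on

    # Step 3: if iptables is running, open the Monit web port and save the rule
    iptables -I INPUT -p tcp --dport 2812 -j ACCEPT
    service iptables save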
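
And here, roughly, are the monit.conf fragments that steps 5 through 8 describe. The polling interval, mail server name, message wording and event list are all examples to adapt; only manager@foo.bar comes from the stock file.

    # Step 5: run Monit as a daemon; 120 seconds is the default interval,
    # while 60 would check every minute
    set daemon 120

    # Step 6: relay alerts through an internal SMTP server (hostname is hypothetical)
    set mailserver smtp.example.com

    # Step 7: a customized alert message format
    set mail-format {
      from: monit@$HOST
      subject: Monit alert -- $EVENT $SERVICE
      message: $EVENT on $SERVICE
        Date:        $DATE
        Host:        $HOST
        Action:      $ACTION
        Description: $DESCRIPTION
    }

    # Step 8: who receives alerts, limited to selected event types
    set alert manager@foo.bar only on { timeout, nonexist, resource, connection }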
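
The web interface and monitoring entries from steps 9 through 11 might look something like this; the IP range, credentials and thresholds are examples.

    # Step 9: the built-in web server on port 2812, reachable from one subnet,
    # protected by a username and password you should change
    set httpd port 2812 and
        allow 192.168.1.0/255.255.255.0
        allow admin:monit

    # Step 10: watch Apache via its pid file (RHEL 6 location) and restart it on failure
    check process httpd with pidfile /var/run/httpd/httpd.pid
        start program = "/etc/init.d/httpd start"
        stop program  = "/etc/init.d/httpd stop"

    # Step 11: alert when the root file system passes 90% usage
    check filesystem rootfs with path /
        if space usage > 90% then alert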
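
Finally, steps 12 through 14 boil down to a syntax check, a restart, and a deliberate failure to prove the alerting works.

    # Step 12: verify the configuration, then restart Monit to pick up the changes
    monit -t
    service monit restart

    # Step 13: browse to http://<your-server>:2812 and log in with the
    # credentials from the allow admin:monit line

    # Step 14: stop Apache, then wait out the polling interval; Monit should
    # restart it and send the corresponding alert messages
    service httpd stop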

Extending Monit Further
While Monit is free, there’s an M/Monit license that provides consolidated monitoring of multiple Monit clients, as well as a mobile monitoring site. Pricing starts at 65 Euros for five hosts.

Conclusion
Monit meets our requirement of having lightweight application monitoring, restart and alerting running directly atop specific Red Hat Enterprise Linux servers. And while it takes longer than fifteen minutes to fully understand one’s configuration options, one can indeed be “Up and running in 15 minutes” with a little practice. All in all, Monit is a pretty impressive free product.