This document is intended to demonstrate the use of the Coraid CLN Failover Kit with two Coraid Linux NAS units to improve storage service availability. After digesting the principles demonstrated by the simple examples here, the reader will have an outline which can be filled out by reading the documentation for the software they choose to use. With the acquired expertise, the reader should be able to create highly available services as desired.
Administering a highly available system using open source software is safe when the system administrator understands the technology. The Coraid support team is happy to serve as a resource as you acquire this understanding, especially if you are going beyond the scope of this HOWTO.
The reader should keep in mind that high availability is different from archival backup. A highly available system minimizes downtime, but an archival backup allows data to be restored after a catastrophic event. The two are complementary. Data on a highly available system should be backed up regularly.
The CLN is a general-purpose Linux system tuned toward exporting AoE storage via NFS and CIFS. This document builds on and assumes familiarity with the content of the CLN-HOWTO. Please do not assume that everything covered in the CLN-HOWTO is safe to use on a heartbeat cluster in the same way as on a single host.
The Failover Kit allows two CLN units to provide storage services with higher availability than a single unit could provide. With one CLN actively serving clients, the other passive CLN can monitor the active CLN's "heartbeat." If the heartbeat stops, it indicates that the active CLN has stopped working, and the passive CLN takes over after turning off the dead CLN. Clients should not notice more than a brief pause in ongoing service.
In this HOWTO, two CLN units, "pete" and "mclaws," form a simple "Active/Passive" two-node cluster. Together, pete and mclaws provide a more reliable NFS service than either one could alone.
They're using data from one shared AoE device with shelf address zero
and slot address zero. An XFS resides on the e0.0 device and is
being exported via NFS.
This example is deliberately simple for two reasons. Firstly, a short HOWTO should help the important principles stand out. Secondly, simple systems are usually more stable than complex ones, and it is expected that most readers will have availability as a priority.
For instance, by forgoing the flexibility provided by LVM, the admin can omit a large amount of software, and all software (not just LVM) sometimes has bugs. In general, reducing the number of components in a system increases its availability. So although the software stack in the example below is minimal, it probably is not unrealistic.
A simple and reliable way to achieve high availability is to let one computer perform a task while another computer stands ready to step in should the first one fail. This situation is called "Active/Passive," because only one of the machines is performing the service.
It is possible, but significantly more difficult, to achieve high availability by instead having both machines sharing in the performance of the same task, so that if one fails the other simply keeps going. This "Active/Active" situation requires more complex software, like cluster filesystems.
The Active/Passive application of the Heartbeat software package is a solid, well-tested way of achieving high availability. It does not rely on rapidly changing or complex software, and it is relatively easy to understand and configure.
This document describes the use of Heartbeat to configure an Active/Passive cluster of two CLNs.
When two machines support one another in offering a single service, they are acting as a cluster. Each machine is distinct and has its own IP. In our example, there are names for all the IP addresses.
pete:~# host pete
pete.coraid.com has address 205.185.197.218
pete:~# host mclaws
mclaws.coraid.com has address 205.185.197.217
The computers in a cluster are called nodes. The clients using the service provided by the cluster don't care which node is the "real" server. They connect to a special IP address that is used for the cluster.
pete:~# host clusterb
clusterb.coraid.com has address 205.185.197.220
In addition to its own IP address, whichever node has the active role will assume the cluster IP on its front side network interface.
When two cluster nodes are mounting a traditional (single host) filesystem from a shared block device, they must be sure that only a single host mounts the filesystem at any given time.
Cluster filesystems like GFS depend on programs that coordinate the different nodes, ensuring that access to the shared block storage proceeds in a consistent way. For GFS, the Distributed Lock Manager and the Cluster Manager software perform this coordination.
Traditional filesystems lack such cluster management because they were designed with the assumption that only one host will be able to access the block storage device, like a hard disk in a computer case.
Two CLNs can safely share a traditional filesystem only as long as they don't both attempt to use the filesystem at the same time.
Imagine that pete is the active node and mclaws notices that pete's heartbeat stops. Now mclaws wants to take over, but how can mclaws be sure that pete no longer has the XFS mounted? Well, mclaws can turn pete off. With pete's power disconnected, mclaws can be certain that it's the only node accessing the shared AoE storage.
So to be sure that the other machine can't be accessing the shared block storage device, mclaws can "Shoot The Other Node In The Head," a technique known by the acronym "STONITH."
The objective is a highly available service. A whole system of parts makes the service available. To increase availability, we can attempt to design a system where any single part can fail without interrupting the service itself.
It's important to concentrate on points of failure that cause a service interruption. Trying to make every part redundant is a wild goose chase.
Heartbeat is popular, well-tested high-availability software that can be installed with APT.
pete:~# apt-get update
pete:~# apt-get install heartbeat-2
It is normal to see an error at the end like the one below, because you haven't yet configured heartbeat.
Setting up heartbeat-2 (2.0.3-2) ...
Heartbeat not configured: /etc/ha.d/ha.cf not found.
Heartbeat failure [rc=1]. Failed.
If you see an error like the one in the example below, please read the subsection immediately following this one. It will help you to use the package from the stable distribution.
pete:~# apt-get install heartbeat-2
Reading package lists... Done
Building dependency tree... Done
Package heartbeat-2 is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or
is only available from another source
E: Package heartbeat-2 has no installation candidate
There are CLN-specific scripts that work with Heartbeat to ensure that failover works smoothly. They're installed with the command below.
pete:~# apt-get install coraid-ft-scripts
Currently the heartbeat-2 package is maintained in the "stable" Debian distribution but not in the "testing" distribution. This situation is new and is expected to change. Coraid Linux is based on the testing distribution, but you can selectively use packages from the stable distribution.
When heartbeat-2 does appear in the testing distribution, the procedure below will not be necessary. You can check the Debian website to see whether heartbeat-2 has already appeared in testing. After going to the URL below, you can click "Search package directories" and select "any" distribution from the drop-down list before searching for "heartbeat-2".
http://www.debian.org/distrib/packages
While heartbeat-2 remains only in stable, you can install it using the following method. First, make sure that your /etc/apt/apt.conf file specifies that testing is the default distribution.
pete:~# grep -i default /etc/apt/apt.conf
APT::Default-Release "testing";
If a default line does not exist, simply add it to that file (creating it if necessary).
Next, tell APT where to find a stable distribution repository.
pete:~# grep stable /etc/apt/sources.list
deb http://ftp.us.debian.org/debian/ stable main
You will probably have to add this line to your sources.list file, because the CLN ships without it.
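If you would rather script the two edits above than make them by hand, the sketch below shows one way to add a line to a file only when it is missing. The helper name ensure_line is hypothetical and is not part of the CLN or of APT. The example calls are left commented out, so nothing on your system is touched until you choose to run them.

```shell
#!/bin/sh
# ensure_line FILE LINE   (hypothetical helper, not part of the CLN or APT)
# Append LINE to FILE unless an identical line is already present.
# FILE is created if it does not exist yet.
ensure_line() {
    file=$1
    line=$2
    # -x: the whole line must match, -F: treat LINE as a fixed string
    if ! grep -qxF -- "$line" "$file" 2>/dev/null; then
        printf '%s\n' "$line" >> "$file"
    fi
}

# Example calls, using the two lines shown in this section:
# ensure_line /etc/apt/apt.conf 'APT::Default-Release "testing";'
# ensure_line /etc/apt/sources.list 'deb http://ftp.us.debian.org/debian/ stable main'
```

Running the helper a second time with the same arguments changes nothing, so it is safe to repeat on both CLNs.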
Update your APT repository information next.
pete:~# apt-get update # repeat if needed
If you see a message about key B5D0C804ADB11277 not being available, it's because you don't have the new stable distribution key installed yet. It is easy to upgrade your debian-archive-keyring package, though, so that the key is present.
pete:~# apt-get install debian-archive-keyring
pete:~# apt-get update
Install heartbeat-2 from the stable distribution by specifying it with the "-t" option as shown in the command below.
pete:~# apt-get -t stable install heartbeat-2
You can then free some local storage by using "apt-get clean".
The kermit program is a terminal emulator that can speak to the WTI
IPS-800 over a null-modem cable. We'll use it to set the IP address
on the IPS-800.
pete:~# apt-get install coraid-kermit
Install telnet on the CLNs in order to configure the WTI IPS-800 Internet Power Switch.
pete:~# apt-get install telnet
In general, you can get rid of old .deb files to free up some space
after installing or upgrading.
pete:~# apt-get clean
Install the WTI IPS-800 in your rack. After cleanly shutting down the
CLNs with shutdown -h now, plug the power cable of each CLN into the
IPS-800 so that each is on a different power circuit.
For the example, we plug pete into A1 and mclaws into B5.
The RJ-45 Ethernet port of the IPS-800 is plugged into a switch on the CLNs' "front" network. This connection is not a single point of failure: if it fails, the machines cannot turn one another off, but something else would also have to fail before service is interrupted.
We'll connect the serial ports of the two CLNs together later, after using the null-modem cable to first configure the WTI IPS-800 in the next section.
In the steps below, we give the IPS-800 an IP address so that the CLNs can connect to it over your network. The initial configuration is performed over a serial connection.
Using the null-modem cable supplied in the Failover Kit, connect the 9-pin serial port of one CLN to the 9-pin serial port of the IPS-800.
Using kermit, connect to the IPS-800 as follows.
pete:~# kermit
C-Kermit 8.0.211, 10 Apr 2004, for Linux
 Copyright (C) 1985, 2004,
  Trustees of Columbia University in the City of New York.
Type ? or HELP for help.
(/root/) C-Kermit>set line /dev/ttyS0
(/root/) C-Kermit>set speed 9600
/dev/ttyS0, 9600 bps
(/root/) C-Kermit>set carrier-watch off
(/root/) C-Kermit>connect
Connecting to /dev/ttyS0, speed 9600
 Escape character: Ctrl-\ (ASCII 28, FS): enabled
Type the escape character followed by C to get back,
or followed by ? to see other options.
----------------------------------------------------
Hit the enter key, and you'll see the IPS-800 menu.
Internet Power Switch v1.41h Site ID: (undefined)
Plug | Name             | Password    | Status | Boot/Seq. Delay | Default |
-----+------------------+-------------+--------+-----------------+---------+
 1   | (undefined)      | (undefined) |   ON   |     0.5 Secs    |   ON    |
 2   | (undefined)      | (undefined) |   ON   |     0.5 Secs    |   ON    |
 3   | (undefined)      | (undefined) |   ON   |     0.5 Secs    |   ON    |
 4   | (undefined)      | (undefined) |   ON   |     0.5 Secs    |   ON    |
 5   | (undefined)      | (undefined) |   ON   |     0.5 Secs    |   ON    |
 6   | (undefined)      | (undefined) |   ON   |     0.5 Secs    |   ON    |
 7   | (undefined)      | (undefined) |   ON   |     0.5 Secs    |   ON    |
 8   | (undefined)      | (undefined) |   ON   |     0.5 Secs    |   ON    |
-----+------------------+-------------+--------+-----------------+---------+
"/H" for help.
IPS>
To configure the IP address,
Press the escape key to return to the IPS> prompt. You can
disconnect by holding down the control key and pressing the backslash
key to get kermit's attention, and then hitting the "c" key. That
key sequence returns you to the kermit prompt, at which point you can
quit kermit.
(Back at pete.coraid.com)
----------------------------------------------------
(/root/) C-Kermit>quit
Closing /dev/ttyS0...OK
pete:~#
The rest of the IPS-800 configuration may be performed over the network, now that it has an IP address.
The STONITH plugin in Heartbeat for the WTI IPS-800 assumes that a password is needed to use the IPS-800.
To set the password, telnet to the IP you assigned to the IPS-800 in the previous section.
pete:~# host benjamin
benjamin.coraid.com has address 205.185.197.219
pete:~# telnet benjamin
At the IPS> prompt, use "/g" to set the general parameters. Enter
the number "1" to set the password. Type in a password that you won't
forget or lose, and hit the enter key. I'm using "changeme" for this
example.
Only one telnet session at a time is allowed on the IPS-800, so you'll have to log out in order to try out your new password.
Use "/x" to exit, and be sure to select "1" (the number) to save your changes.
Now that you're sure that you can get to the IPS-800 over your IP network, you can use the null-modem cable to connect the serial ports of your two CLNs. This cable will carry the actual heartbeat messages. On each CLN, connect the serial port that is right next to the VGA video port.
Heartbeat's STONITH plugin uses host names to control the CLN power outlets.
Telnet to the IPS-800 and set a name for plug number 1. Enter "/p1", and then enter "1". Type in the name of the host ("pete" in our example) connected to plug 1.
Now set the name for your other CLN on plug 5 after entering "/p5".
Save with "/e" and exit with "/x".
Before configuring Heartbeat, the storage itself should be configured so that it can be started up smoothly. For this example, we create an XFS on an AoE device that both mclaws and pete can use.
pete:~# aoe-stat | fgrep e0.0
e0.0 3298.534GB eth2 up
pete:~# modprobe xfs
pete:~# mkfs -t xfs /dev/etherd/e0.0
meta-data=/dev/etherd/e0.0 isize=256 agcount=32, agsize=25165824 blks
= sectsz=512 attr=0
data = bsize=4096 blocks=805306368, imaxpct=25
= sunit=0 swidth=0 blks, unwritten=1
naming =version 2 bsize=4096
log =internal log bsize=4096 blocks=32768, version=1
= sectsz=512 sunit=0 blks
realtime =none extsz=65536 blocks=0, rtextents=0
pete:~#
Now mclaws and pete can both see this new XFS on e0.0, but only one of them should have it mounted at any given time. The Heartbeat resource scripts will take care of that.
This step only needs to be performed on one host (because there's only one AoE device that they're both using). The rest of the configuration steps must be performed on both hosts.
It's important to understand that when Heartbeat is responsible for starting and stopping a service, that service must not also be started or stopped by the system's own boot and shutdown scripts.
The general system initialization scripts reside in /etc/init.d.
They are run based on the presence of symbolic links ("symlinks") in
the directories that are named after runlevels.
pete:~# ls -d /etc/rc*.d
/etc/rc0.d  /etc/rc2.d  /etc/rc4.d  /etc/rc6.d
/etc/rc1.d  /etc/rc3.d  /etc/rc5.d  /etc/rcS.d
The symlinks would be difficult to manage without tools, and the
update-rc.d tool makes it easy to remove all the symlinks for NFS.
That's necessary to make sure that only Heartbeat controls NFS.
pete:~# update-rc.d -f nfs-common remove
 Removing any system startup links for /etc/init.d/nfs-common ...
   /etc/rc0.d/K79nfs-common
   /etc/rc1.d/K79nfs-common
   /etc/rc2.d/S21nfs-common
   /etc/rc3.d/S21nfs-common
   /etc/rc4.d/S21nfs-common
   /etc/rc5.d/S21nfs-common
   /etc/rc6.d/K79nfs-common
pete:~# update-rc.d -f nfs-kernel-server remove
 Removing any system startup links for /etc/init.d/nfs-kernel-server ...
   /etc/rc0.d/K80nfs-kernel-server
   /etc/rc1.d/K80nfs-kernel-server
   /etc/rc2.d/S20nfs-kernel-server
   /etc/rc3.d/S20nfs-kernel-server
   /etc/rc4.d/S20nfs-kernel-server
   /etc/rc5.d/S20nfs-kernel-server
   /etc/rc6.d/K80nfs-kernel-server
In the future, after updating your nfs-common and
nfs-kernel-server packages, make a habit of running these commands
to make sure that the symlinks haven't been recreated.
On mclaws the system-wide symlinks are removed in the same way.
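A quick way to confirm that no startup symlinks have crept back, for example after a package upgrade, is to look for them directly. The sketch below is only one possible approach. The function name rc_links is hypothetical, and the optional second argument exists so that the function can be tried against a scratch directory instead of /etc.

```shell
#!/bin/sh
# rc_links SERVICE [ETCDIR]   (hypothetical helper)
# Print every runlevel start/kill symlink that still exists for SERVICE.
# ETCDIR defaults to /etc. No output means that no links were found.
rc_links() {
    service=$1
    etcdir=${2:-/etc}
    for link in "$etcdir"/rc?.d/[SK][0-9][0-9]"$service"; do
        # An unmatched glob stays a literal string, so check that it exists.
        if [ -e "$link" ] || [ -L "$link" ]; then
            printf '%s\n' "$link"
        fi
    done
}

# Example calls (commented out):
# rc_links nfs-common
# rc_links nfs-kernel-server
```

If either call prints anything, the update-rc.d commands shown above need to be run again on that host.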
Any Software RAID that is managed by Heartbeat should not be listed in
/etc/aoe/md.conf, nor should any filesystem managed by Heartbeat be
listed in /etc/aoe/fs.conf. (This HOWTO does not cover Linux
Software RAID, but it should be clear from the last statement and from
a basic understanding of heartbeat that heartbeat itself, and not the
rest of the system, must control all of the layers above a resource
shared by both nodes in an HA cluster.)
The heartbeat itself consists of messages that go between cluster
members. To help the members identify legitimate cluster members,
some cryptography is used. You can pick a secret that will be shared
by the members of the cluster and put it in the /etc/ha.d/authkeys
file.
pete:~# touch /etc/ha.d/authkeys
pete:~# chmod 600 /etc/ha.d/authkeys
Edit the file next. In the example contents below, the secret is "ABetterDayToStoreDataBetterIsToday".
pete:~# cat /etc/ha.d/authkeys
auth 1
1 sha1 ABetterDayToStoreDataBetterIsToday
This configuration file must be the same on both hosts.
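If you prefer a random secret to one you make up, something like the sketch below can generate one and write the file in the two-line format shown above. Both function names are hypothetical. The sketch assumes that /dev/urandom and the standard od and tr utilities are available, which is normally the case on a Linux system.

```shell
#!/bin/sh
# new_secret   (hypothetical helper)
# Print 40 random hexadecimal characters read from /dev/urandom.
new_secret() {
    od -An -N20 -tx1 /dev/urandom | tr -d ' \n'
    echo
}

# write_authkeys FILE SECRET   (hypothetical helper)
# Write an authkeys file that uses sha1, readable by its owner only.
write_authkeys() {
    file=$1
    secret=$2
    (
        umask 077    # a newly created file gets mode 600
        printf 'auth 1\n1 sha1 %s\n' "$secret" > "$file"
    )
    chmod 600 "$file"    # also tightens a file that already existed
}

# Example call (commented out):
# write_authkeys /etc/ha.d/authkeys "$(new_secret)"
```

Generate the secret once and use the same value on both hosts, since each host needs an identical file.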
The central configuration file for Heartbeat is /etc/ha.d/ha.cf.
The configuration file on pete is short, but Heartbeat comes with a
long example configuration file with comments explaining each of its
parts. You can read it with zless, which runs the less pager on
compressed files. (Hit "q" to quit zless. Use arrow keys to
navigate the text.)
pete:~# zless /usr/share/doc/heartbeat-2/ha.cf.gz
Here is pete's configuration file. (The remote power switch for
STONITH is connected to the network and is reachable at
benjamin.coraid.com.) The file on mclaws is the same.
pete:~# cat /etc/ha.d/ha.cf
keepalive 1
deadtime 10
warntime 5
baud 9600
serial /dev/ttyS0
auto_failback off
stonith_host pete wti_nps benjamin.coraid.com changeme
stonith_host mclaws wti_nps benjamin.coraid.com changeme
node pete
node mclaws
use_logd yes
The ha.cf on mclaws is the same.
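Because the Heartbeat configuration files are meant to be identical on both nodes, it can help to compare checksums instead of reading the files side by side. The sketch below is one way to do that. The function name ha_sums is hypothetical, and the list of file names covers the files used in this HOWTO, including the haresources file described later.

```shell
#!/bin/sh
# ha_sums [DIR]   (hypothetical helper)
# Print an MD5 checksum for each Heartbeat configuration file found in DIR
# (default /etc/ha.d). Run it on both nodes and compare the two outputs.
ha_sums() {
    dir=${1:-/etc/ha.d}
    for name in authkeys ha.cf haresources; do
        if [ -f "$dir/$name" ]; then
            # Read from stdin so that md5sum does not print the full path.
            printf '%s  %s\n' "$(md5sum < "$dir/$name" | cut -d' ' -f1)" "$name"
        fi
    done
}

# Example call (commented out); run it on pete and on mclaws:
# ha_sums
```

If the output differs between the two hosts, the file named on the differing line is the one to inspect.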
A pair of CLNs with Intel dual-port PCI-X NICs can use the onboard
network interface eth1 for heartbeat over ethernet instead of using a
null modem cable for the heartbeat messages. Such a configuration
would have a ha.cf like the one below instead of the one shown
above.
pete:~# cat /etc/ha.d/ha.cf
keepalive 1
deadtime 10
warntime 5
ucast eth1
auto_failback off
stonith_host pete wti_nps benjamin.coraid.com changeme
stonith_host mclaws wti_nps benjamin.coraid.com changeme
node pete
node mclaws
use_logd yes
In addition, the eth1 interface would require IP configuration. (On
these CLNs, eth2 is the "front" network and eth3 is the "back"
network, leaving eth1 free for handling heartbeat messages.)
pete:~# sed -n '/eth1/,$p' /etc/network/interfaces
auto eth1
iface eth1 inet static
address 192.168.1.1
netmask 255.255.255.0
network 192.168.1.0
broadcast 192.168.1.255
mclaws:~# sed -n '/eth1/,$p' /etc/network/interfaces
auto eth1
iface eth1 inet static
address 192.168.1.2
netmask 255.255.255.0
network 192.168.1.0
broadcast 192.168.1.255
When any machine boots, it performs a sequence of tasks in order to
get to a usable state. Assuming an active role in the cluster is much
like booting, and the /etc/ha.d/haresources file defines the order
of tasks that must be performed on takeover.
Make sure that the lines ending with backslashes don't really end with spaces or tabs.
pete:~# cat /etc/ha.d/haresources
pete IPaddr::205.185.197.220 \
	Filesystem::/dev/etherd/e0.0::/mnt/e0.0::xfs \
	killnfsd \
	nfs-common \
	nfs-kernel-server
This example file is the same on pete and mclaws. It says that pete
is the default active server. It lists scripts (found in
/etc/ha.d/resources.d) to run forwards when assuming the active role
or backwards when giving it up.
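Trailing spaces and tabs after a backslash are invisible in most editors, so it can be worth checking for them mechanically. The sketch below uses grep to report any such lines. The function name bad_continuations is hypothetical.

```shell
#!/bin/sh
# bad_continuations FILE   (hypothetical helper)
# Print, with line numbers, every line in FILE where a backslash is followed
# by trailing spaces or tabs. Returns 0 if the file is clean, 1 otherwise.
bad_continuations() {
    if grep -n '\\[[:space:]][[:space:]]*$' "$1"; then
        return 1
    fi
    return 0
}

# Example call (commented out):
# bad_continuations /etc/ha.d/haresources && echo "continuation lines look fine"
```

Any line that the function prints should be edited so that the backslash is the very last character on the line.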
Notice that everything is under the control of heartbeat, from the NFS export down to the AoE target. It is beyond the scope of this HOWTO, but if you were using md, LVM, or anything else, these additional layers would, of course, need to be represented in the above configuration.
If you haven't already configured syslog on each CLN, now is a good time to do that. It's explained in the CLN-HOWTO. By sending syslog messages to a remote host, you can more easily monitor both machines.
The messages from Heartbeat are copious and make for dry reading, but by reading the logged messages you can get an excellent understanding of how the failover process is working. You can also more easily identify and correct problems.
After changing /etc/syslog.conf on the CLN units, you can restart
syslog services to make sure the new configuration takes effect.
pete:~# /etc/init.d/sysklogd restart
Restarting system log daemon: syslogd.
pete:~# /etc/init.d/klogd restart
Restarting kernel log daemon: klogd.
Configure NFS to export the XFS on both hosts.
Create a mount point on each host for the XFS. Don't mount the XFS, though. That will be done by Heartbeat.
pete:~# mkdir /mnt/e0.0
Tell the NFS server to export the /mnt/e0.0 filesystem by adding a
line to /etc/exports.
pete:~# cat /etc/exports
# /etc/exports: the access control list for filesystems which may be exported
#               to NFS clients.  See exports(5).
/mnt/e0.0 205.185.197.0/24(rw,sync,no_root_squash)
The exports manpage and the CLN-HOWTO have more information about
the /etc/exports file.
It's important to be familiar with the workings of Heartbeat. Although Heartbeat is simple and reliable, it can occasionally be confusing. Being confused during testing isn't always comfortable, but it's always more comfortable than being confused when the system is in production.
From each node, you should be able to turn off the other node using
the stonith tool that's part of heartbeat-2. Before performing this
test, you should consider temporarily removing the symbolic links that
cause heartbeat to come up at boot.
mclaws:~# update-rc.d -f heartbeat remove
 Removing any system startup links for /etc/init.d/heartbeat ...
   /etc/rc0.d/K05heartbeat
   /etc/rc1.d/K05heartbeat
   /etc/rc2.d/S75heartbeat
   /etc/rc3.d/S75heartbeat
   /etc/rc4.d/S75heartbeat
   /etc/rc5.d/S75heartbeat
   /etc/rc6.d/K05heartbeat
To test it, first sync filesystem data on the node you're going to power off.
mclaws:~# sync
Next turn off the target node from the other node. In this example, we're turning off mclaws from pete. Note that this will suddenly kill the power that's going to mclaws.
pete:~# stonith -t wti_nps -p "benjamin.coraid.com changeme" mclaws
** INFO: Successful login to WTI Network Power Switch.
stonith: wti_nps device OK.
connect() failed: Connection refused
** INFO: Successful login to WTI Network Power Switch.
** INFO: Host is being rebooted: mclaws
** INFO: Power restored to host: mclaws
The "Connection refused" message doesn't matter, because mclaws does get powered down and powered back up. CLNs ship with the BIOS set so that power is restored to the last status (off or on) after power failure.
Before starting the heartbeat service on the CLNs, make sure that
the serial connection between the two CLNs is working properly. The
cat and echo commands are sufficient for a quick check.
First run cat on mclaws to listen for messages from pete.
mclaws:~# cat /dev/ttyS0
Now echo a message into the serial port device to send it to
mclaws.
pete:~# echo hello there > /dev/ttyS0
The cat process on mclaws prints out the message. Mine shows an
extra blank line after the message. Next, kill the cat process by
hitting control-c on mclaws and try sending a message the other way,
with cat first listening on pete.
If the parts are all working, it's time to start Heartbeat. Keep an eye on the syslog messages coming from the CLNs.
If you have previously removed the startup symlinks, now is a good time to add them again on each of the two hosts.
pete:~# update-rc.d heartbeat defaults
 Adding system startup for /etc/init.d/heartbeat ...
   /etc/rc0.d/K20heartbeat -> ../init.d/heartbeat
   /etc/rc1.d/K20heartbeat -> ../init.d/heartbeat
   /etc/rc6.d/K20heartbeat -> ../init.d/heartbeat
   /etc/rc2.d/S20heartbeat -> ../init.d/heartbeat
   /etc/rc3.d/S20heartbeat -> ../init.d/heartbeat
   /etc/rc4.d/S20heartbeat -> ../init.d/heartbeat
   /etc/rc5.d/S20heartbeat -> ../init.d/heartbeat
I try to start
heartbeat at about the same time on both hosts, using two xterm
windows on my desktop.
pete:~# sync; /etc/init.d/heartbeat start
logd is already running
Starting High-Availability services:
2006/04/14_12:35:23 INFO: IPaddr Resource is stopped
Done.
mclaws:~# sync; /etc/init.d/heartbeat start
logd is already running
Starting High-Availability services:
2006/04/14_12:35:25 INFO: IPaddr Resource is stopped
Done.
Now one of the hosts should have the cluster IP as well as its own
"personal" IP. I use the arping command (on a third system) as an
easy way to check that.
root@kokone ~# arping clusterb
ARPING 205.185.197.220
60 bytes from 00:30:48:88:36:d0 (205.185.197.220): index=0 time=128.984 usec
60 bytes from 00:30:48:88:36:d0 (205.185.197.220): index=1 time=132.084 usec

--- 205.185.197.220 statistics ---
2 packets transmitted, 2 packets received, 0% unanswered
root@kokone ~# arping pete
ARPING 205.185.197.218
60 bytes from 00:30:48:88:36:d0 (205.185.197.218): index=0 time=130.892 usec
60 bytes from 00:30:48:88:36:d0 (205.185.197.218): index=1 time=132.084 usec

--- 205.185.197.218 statistics ---
2 packets transmitted, 2 packets received, 0% unanswered
Notice that when I arping "clusterb," which is the name that
resolves to the cluster's IP address, I get the same MAC address,
00:30:48:88:36:d0, that I get when I arping pete's IP.
From that I can tell that pete's front-side network interface now has two IP addresses: its own and the cluster's.
With Heartbeat running and managing the NFS services on both nodes, you should be able to use the storage by way of the cluster IP. Be sure to use the cluster IP, not just the IP of one node.
root@kokone ~# mkdir /mnt/clusterb
root@kokone ~# mount -t nfs clusterb:/mnt/e0.0 /mnt/clusterb
root@kokone ~# df -h /mnt/clusterb
Filesystem            Size  Used Avail Use% Mounted on
clusterb:/mnt/e0.0    3.0T  512K  3.0T   1% /mnt/clusterb
root@kokone ~# cp -a /usr/share/doc/irb /mnt/clusterb
root@kokone ~# find /mnt/clusterb
/mnt/clusterb
/mnt/clusterb/irb
/mnt/clusterb/irb/changelog.Debian.gz
/mnt/clusterb/irb/copyright
There are some commands you can use to cause the active and passive
nodes to change roles. Now that pete is the active node, we can run
hb_standby on pete, changing pete to the passive role. Running
hb_takeover on mclaws would trigger the same role switch.
I like to make sure that NFS services aren't interrupted when I
perform this test. On kokone, my NFS client, I unmount and then
remount the NFS filesystem, so that the local cache isn't consulted.
Then after triggering failover, I run a find command on kokone,
requiring NFS service. After a brief pause, the find command should
run to completion.
Step one: Get a fresh NFS mount on kokone.
root@kokone ~# umount /mnt/clusterb
root@kokone ~# mount -t nfs clusterb:/mnt/e0.0 /mnt/clusterb
Step two: Trigger a failover on pete, performing step three immediately afterwards.
pete:~# /usr/lib/heartbeat/hb_standby
2006/04/14_13:18:55 Going standby [all].
Step three: Use the NFS service on the client.
root@kokone ~# wc /mnt/clusterb/irb/copyright
  73  429 3062 /mnt/clusterb/irb/copyright
Step four: Verify that the roles have been reversed.
root@kokone ~# arp | egrep 'pete|mclaws|clusterb'
pete.coraid.com      ether  00:30:48:88:36:D0  C  eth0
mclaws.coraid.com    ether  00:30:48:85:F3:2E  C  eth0
clusterb.coraid.com  ether  00:30:48:85:F3:2E  C  eth0
I can see that mclaws has the cluster IP. So mclaws is certainly the one answering kokone's NFS requests, but kokone never noticed a thing.
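The comparison I just made by eye can also be scripted. The sketch below reads arp-style lines like the ones in step four and prints the name of whichever node shares a MAC address with the cluster name. The function name cluster_owner is hypothetical, and the sketch assumes that the host name is in the first column and the MAC address in the third, as in the output above.

```shell
#!/bin/sh
# cluster_owner CLUSTERNAME   (hypothetical helper)
# Read arp-style lines on stdin (host name in column 1, MAC in column 3) and
# print each host whose MAC address matches that of CLUSTERNAME.
# Returns 1 if CLUSTERNAME does not appear in the input.
cluster_owner() {
    awk -v cluster="$1" '
        NF >= 3 { mac[$1] = toupper($3) }    # compare MACs case-insensitively
        END {
            if (!(cluster in mac)) exit 1
            for (host in mac)
                if (host != cluster && mac[host] == mac[cluster])
                    print host
        }'
}

# Example call (commented out):
# arp | egrep 'pete|mclaws|clusterb' | cluster_owner clusterb.coraid.com
```

Run against the output from step four, this prints mclaws.coraid.com.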
A final test is to actually cut the power of the active node, letting the standby node take over. Don't use the WTI IPS-800 to do it; just pull the power cable physically. The IPS-800 only allows one telnet session at a time, and Heartbeat needs to use it.
For testing purposes, you have the luxury of issuing the sync
command on the active node before you suddenly turn it off. Even
though the ext3 filesystem is quite robust, yanking the power is a
rude thing to do to your CLN.
Still, this test demonstrates conclusively that the failover works. NFS clients should experience only a several-second pause in service after you yank the active node's power.
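To put a rough number on that pause, you can time a read on the NFS client while the failover is happening. The sketch below is a minimal way to do that with whole-second resolution. The function name timed_read is hypothetical. As with the earlier test, a fresh NFS mount helps make sure the read is not simply answered from the client's cache.

```shell
#!/bin/sh
# timed_read FILE   (hypothetical helper)
# Read FILE once and print how many whole seconds the read took.
# Returns 1 if the file could not be read.
timed_read() {
    start=$(date +%s)
    cat -- "$1" > /dev/null || return 1
    end=$(date +%s)
    echo $((end - start))
}

# Example call on the NFS client (commented out), run right after the
# active node loses power:
# timed_read /mnt/clusterb/irb/copyright
```

The printed value is only an approximation of the interruption, because it depends on when the read started relative to the loss of power.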
After gaining initial experience with heartbeat, it becomes clear that despite providing a very real increase in fault tolerance, the "bare bones" configuration will not handle all faults. If the active host keeps performing the heartbeat, then failover won't occur, even if the highly available service is interrupted.
When the highly available service has been interrupted and the heartbeat has not, it is difficult for the servers to tell that failover is necessary in any straightforward and reliable way.
A machine using the highly available service is, however, in the perfect position to know when service has been interrupted. Such a machine can then tell the passive server to take over.
Software is available for performing this service monitoring function. A popular package is "Mon," the service monitoring daemon. Once you have successfully installed and configured a redundant pair of CLNs, it might be your next step.
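To illustrate the idea only, and not as a substitute for Mon, here is a minimal sketch of a client-side probe. It runs a check command several times and reports failure only if every attempt fails, so that a single slow response does not raise an alarm. The function name probe_retry and the PROBE_DELAY variable are hypothetical. The rpcinfo command in the commented example is an assumption about what a reasonable NFS check might look like, and how the client would then ask the passive node to take over is left open.

```shell
#!/bin/sh
# probe_retry COUNT CMD [ARGS...]   (hypothetical helper)
# Run CMD up to COUNT times. Return 0 as soon as one attempt succeeds, or 1
# if every attempt fails. PROBE_DELAY (seconds between attempts, default 5)
# is likewise a made-up name.
probe_retry() {
    tries=$1
    shift
    while [ "$tries" -gt 0 ]; do
        if "$@" > /dev/null 2>&1; then
            return 0
        fi
        tries=$((tries - 1))
        if [ "$tries" -gt 0 ]; then
            sleep "${PROBE_DELAY:-5}"
        fi
    done
    return 1
}

# Example call (commented out), assuming rpcinfo is installed on the client:
# probe_retry 3 rpcinfo -t clusterb nfs || echo "NFS on clusterb is not answering"
```

A real monitoring setup needs more care than this, for instance to avoid triggering a takeover when only the client's own network connection has failed.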
In setting up a highly available system, it is important to know what software you are using and how it works. A good place to visit next would be the Linux High Availability website.
There is an interesting Linux Journal article written by a CLN Failover Kit expert user, Daniel Bartholomew. It has a good "References" section of its own.