Beowulf Clusters


Beowulf technology, as pioneered by the Beowulf Project, is a method of networking a large number of commodity, off-the-shelf PCs so that they work together as a single powerful computer.

The Astronomy Unit at Queen Mary, University of London was able to obtain a JREI grant to build a Beowulf cluster for heavy-duty simulation in the fields of Space, Solar and Stellar research. I was involved in the construction of the cluster and was responsible for the installation and configuration of the software.

The cluster consists of a dual Pentium III Xeon server with 32 Pentium III nodes. I was also able to write a code to conduct full 3-D hybrid plasma simulations. We had not been able to do this kind of work previously because of the huge amount of memory required. A Beowulf cluster solves this problem by allowing the memory from all the nodes to be pooled. Inter-node communication is then conducted using PVM or, in our case, MPI.

Below are a few notes on the methods that I used to enable the cluster to remote boot from an operating system stored on the server. I found that much of the documentation on the internet relating to configuring Beowulfs was showing its age, and it is quite likely that by this time these notes are equally outdated.

Etherboot

In our configuration, each node loads a bootstrap program from a floppy disk and downloads the kernel from the server. We are using the Etherboot program, available at http://etherboot.sourceforge.net/. By default, Etherboot uses TFTP (the Trivial File Transfer Protocol) to download the kernel. When I tried this, however, the download consistently failed: Etherboot seemed to be flooding the TFTP daemon with requests, which caused the init process to suspend it. However, adding the -DDOWNLOAD_PROTO_NFS option to the Config file in Etherboot's source directory changes the initial download protocol to NFS, which appears to be reliable.

In our case, the bootstrap file we want is in the src/bin32 directory and is called 3c905c-tpo.lzrom. There is also a pre-compiled loader, floppyload.bin.pre, for booting from the floppy. To combine these files and produce a bootable floppy to bootstrap your node, use
cat floppyload.bin.pre 3c905c-tpo.lzrom > /dev/fd0
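For reference, the Config change and build described above might look something like the following. The CFLAGS32 variable name is an assumption on my part, so check the comments in your version's Config file before copying this:
cd etherboot/src
# append the NFS download option to the build flags
# (CFLAGS32 is an assumption; your Config file may use another variable)
echo 'CFLAGS32+= -DDOWNLOAD_PROTO_NFS' >> Config
make    # builds the .lzrom images under bin32/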

Remote Booting

The kernel requires compiled-in (not modular) support for the network card driver, the NFS file system, mounting the root file system over NFS, and IP kernel-level autoconfiguration via BOOTP or RARP; a sketch of the corresponding configuration options appears after the commands below. Making the kernel start from a root directory mounted via NFS requires a little black magic. First, create a block special file:
mknod /dev/nfsroot b 0 255
Now tell the kernel to use it as the root device by running the following command on the kernel image (a file called bzImage, in a directory called something like /usr/src/linux/arch/i386/boot):
rdev bzImage /dev/nfsroot
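For reference, the compiled-in support mentioned above corresponds, on a 2.4-series kernel (an assumption on my part; the option names differ slightly on older kernels), to a .config fragment roughly like this:
# IP: kernel level autoconfiguration, so the node can obtain its
# address via BOOTP or RARP at boot time
CONFIG_IP_PNP=y
CONFIG_IP_PNP_BOOTP=y
CONFIG_IP_PNP_RARP=y
# NFS client and root-on-NFS support, compiled in rather than modular
CONFIG_NFS_FS=y
CONFIG_ROOT_NFS=y
# the driver for our 3c905C cards (the 3c59x "Vortex/Boomerang" driver)
CONFIG_VORTEX=y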
Some HOWTOs will encourage you to change some settings in the nfsroot.c source file in order to change the default NFS root directory. We don't have to worry about this, because we can use the mknbi program later to change those settings.

In order to remote boot from your newly compiled kernel, you need to let Etherboot know how to start it running. For this purpose, Etherboot uses the mknbi utility to convert the kernel into tagged image format. Unfortunately, I found the mknbi utility supplied with Etherboot to be unusable, so I downloaded the version supplied with Netboot at http://www.han.de/~gero/netboot.html; you'll want the version in the mknbi-linux directory. The man pages supplied with mknbi were out of date, so I had to find the command line options by looking at the source code. In our case, I used the following command on a kernel image called bzImage with an NFS root directory of /nfsroot. The tagged image file is called kernel:
mknbi --root-dir=/nfsroot bzImage kernel
I also created an image file called kernel.install for installation purposes. This one uses a root directory of /nfsroot/install:
mknbi --root-dir=/nfsroot/install bzImage kernel.install
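For these tagged images to boot, the server must of course export the root directories over NFS. A minimal /etc/exports sketch, assuming the nodes sit on a 192.168.1.0/24 private network (the addresses are an assumption), might be:
# let the nodes mount their root directories; root squashing must be
# off so that root on a node can read and write its own system files
/nfsroot            192.168.1.0/255.255.255.0(rw,no_root_squash)
/nfsroot/install    192.168.1.0/255.255.255.0(rw,no_root_squash)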

Prototype node and installation script

In order to have a template from which to copy configuration information, I set up one node as a prototype. This node had an operating system and minimal software installed; it was set up in the usual way from a copy of the RedHat CDROM which I'd put on the server's /scratch disk, and its var and tmp directories were copied to a tar file. You also need a configuration file describing your partitions - in our case it's called /etc/rc.d/rc4.d/hda.out. I created this from the prototype node: the command to write its partition information in a format suitable for sfdisk was
sfdisk -d /dev/hda > hda.out
In order to configure a virgin node using this information, I used a bootstrap installation script, listed here (a rough sketch of its steps appears below). The script uses sfdisk to partition the local hard disk, creates two ext2 file systems and six swap areas, and finally copies over the node-specific system files. I ran it from /nfsroot/install/etc/rc.d/rc4.d/S99local. You'll also need to set the default runlevel to 4 in /nfsroot/install/etc/inittab. The reason I used runlevel 4 is that I really don't want this script to be run accidentally (and runlevel 4 is unused on RedHat systems).
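The script itself is linked above; what follows is only a sketch of the steps it performs. The partition numbering, mount point and tar file name are all assumptions based on the description:
#!/bin/sh
# S99local - one-shot bootstrap installer, run at runlevel 4 on a virgin node

# replay the prototype's partition table on the local disk
sfdisk /dev/hda < /etc/rc.d/rc4.d/hda.out

# two ext2 file systems and six swap areas
# (the partition numbers here are assumptions)
mke2fs /dev/hda1
mke2fs /dev/hda2
for part in 5 6 7 8 9 10
do
    mkswap /dev/hda$part
done

# copy over the node-specific system files, e.g. the tar file of the
# prototype's var and tmp directories (the file name is an assumption)
mount /dev/hda1 /mnt
tar xf /var-tmp.tar -C /mnt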


Page Maintained By : Rob Lowe
Last Revision : 24th February 2003