Despite the increasing computational power of modern computers, there are many contexts in which more resources are useful. Parallel computation is one way to obtain them: no matter how powerful a single workstation processor is, a program written to scale from one processor to thousands will take advantage of the additional computational power that parallelism provides. Even high-end workstation vendors acknowledge this by offering multiprocessor machines and support for clusters.
In global optimization, finding a solution (a global optimum) cannot be guaranteed in a finite number of steps, so the performance of an algorithm is also bounded by the time available. For that reason we chose to invest some of our resources and (a lot of) our time in building a working cluster. We wrote this page primarily because we needed a record of what we did, and in the hope that someone will find our experience useful.
The cluster currently consists of 5 machines: a master and 4 compute nodes.
We partitioned the master node hard disk as follows:
/home 197 GB
/swap 2.8 GB
/ 100 GB for the root directory
We installed a Debian Linux (Sarge) system, with kernel 2.6.8-2 (2.4
may be unable to recognize the SATA hard disk), using a net-install CD,
only on the master node.
The SATA hard disk is seen as device sda, the DVD-RW as device hda.
We used as download mirror http://ftp.it.debian.org/debian.
We selected the base-config installing the desktop
environment, the DNS server and the file server.
At the end of the installation we also added a line in /etc/hosts:
192.168.100.100 nodo00.dsi.unifi.it nodo00
in order to define the hostname in the internal network (the cluster network). It is a good idea to replace the word stable with sarge in /etc/apt/sources.list: when we later update the system, we will not risk the "changed-release" surprise that occurs when a new Debian release becomes stable!
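A minimal sketch of how the stable-to-sarge substitution could be scripted. The function name pin_release is illustrative and not part of the original setup; it filters standard input, so the result can be previewed without touching the real file.

```bash
# Hypothetical helper: replace the suite name "stable" with "sarge"
# in sources.list-style lines read from standard input.
# Only the whole word is replaced, so "unstable" is left alone.
pin_release() {
    sed -E 's#(^|[[:space:]])stable([[:space:]/]|$)#\1sarge\2#g'
}

# Example: preview the result without modifying the file.
# pin_release < /etc/apt/sources.list
```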
In order to quickly reproduce the same installation on multiple machines we used an automatic installation tool called FAI (Fully Automatic Installation), which is also available as a .deb package. We installed the Debian package fai using apt-get; we also installed/updated the following packages:
FAI requires the creation of a local Debian mirror in order to retrieve packages for the remote installation of the other nodes. We created the following directory for the mirror:
#mkdir -p /files/scratch/debmirror
We added the following line to the file /etc/exports:
/files/scratch/debmirror *(ro)
We copied /etc/apt/sources.list to /etc/fai/.
We modified the following FAI configuration files:
We added a new user, tom, to the group linuxadmin, logged in as tom, and started the mirror creation (this operation takes several hours):
#/usr/share/doc/fai/examples/utils/mkdebmirror --debug
We made a backup of the mirror in /debmirror_backup because in
some cases a failed FAI installation may completely erase the
mirror (!).
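The backup step can be sketched as a small function. This is an illustration under the assumption that a plain recursive copy is sufficient; the name backup_mirror is hypothetical and not part of FAI.

```bash
# Hypothetical helper: copy a mirror tree to a backup location,
# preserving permissions and timestamps (cp -a).
backup_mirror() {
    src=$1
    dst=$2
    [ -d "$src" ] || { echo "no such directory: $src" >&2; return 1; }
    mkdir -p "$dst" && cp -a "$src"/. "$dst"/
}

# Example (paths as used on this page):
# backup_mirror /files/scratch/debmirror /debmirror_backup
```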
In order to avoid a mount error during the installation we modified, in /usr/sbin/make-fai-nfsroot, the line:
mount -o ro,noatime,rsize=8192 $FAI_DEBMIRROR $NFSROOT/$MNTPOINT || \
in:
mount --bind -o ro,noatime,rsize=8192 $FAI_DEBMIRROR $NFSROOT/$MNTPOINT || \
We launched the FAI setup:
#fai-setup -v
After the setup completed successfully, we ran the following
commands and modified the following files in order to fix a number of
problems encountered in our first attempts at remote node installation.
We used the Beowulf example as a starting point for the FAI configuration files:
#cp -a /usr/share/doc/fai/examples/beowulf/* /usr/local/share/fai
#chown -R tom /usr/local/share/fai
In each file in /usr/local/share/fai we replaced atom00
with nodo00.
We put UTC=no in /usr/local/share/fai/class/atomclient.var.
We modified /usr/local/share/fai/disk_config/ATOMCLIENT as:
disk_config sda
primary / 75000- rw, ;ext3
logical swap 2000 rw
We replaced rsh with ssh in /usr/local/share/fai/package_config/beowulf.
We collected the MAC addresses in the following way: on the master we simply typed
#ifconfig -a
and we took the MAC address of the internal network interface. On each node (note that at this point no node has an operating system installed yet!), in order to activate LAN boot, we changed the BIOS configuration (at boot time):
advanced -> onboard devices configuration -> onboard lan boot rom ENABLED
then we rebooted and, again in BIOS configuration:
boot -> boot devices priority -> first boot devices YUKON PXE
then we rebooted and waited for the error message that contains the MAC address.
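On the master, the MAC address can be picked out of the ifconfig output with a short filter. The sketch below assumes the classic net-tools output format, in which the address follows the word "HWaddr"; the function name list_macs is hypothetical.

```bash
# Hypothetical helper: print "interface MAC" pairs from ifconfig -a
# output read on standard input. Assumes the classic net-tools format,
# where the MAC address follows the word "HWaddr".
list_macs() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "HWaddr") print $1, $(i + 1) }'
}

# Example: ifconfig -a | list_macs
```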
After MAC collection, we performed the following actions, as suggested by FAI documentation:
#chgrp -R linuxadmin /boot/fai/
#chmod -R g+rwx /boot/fai/
#cp /usr/share/doc/fai/examples/utils/* /usr/local/bin/
#apt-get install tftpd-hpa
#cp /usr/share/doc/fai/examples/etc/dhcpd.conf /etc/dhcp3/dhcpd.conf
In /etc/dhcp3/dhcpd.conf we replaced FAISERVER with nodo00,
then we uncommented the line:
option nis-domain "beowulf"
In /etc/inetd.conf we added the line:
tftp dgram udp wait root /usr/sbin/in.tftpd in.tftpd -s /boot/fai/
then we created the file /etc/ethers:
[nodo00 MAC address] 192.168.100.100
[nodo01 MAC address] 192.168.100.101
[nodo02 MAC address] 192.168.100.102
[nodo03 MAC address] 192.168.100.103
[nodo04 MAC address] 192.168.100.104
In order to assign the internal network addresses via DHCP we modified /etc/dhcp3/dhcpd.conf.
We copied /etc/hosts to /usr/lib/fai/nfsroot/etc/hosts.
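A sketch of how the /etc/hosts entries for the whole cluster could be generated. It assumes that the compute nodes follow the same naming and addressing pattern as the nodo00 line and the /etc/ethers table (nodo01 at 192.168.100.101, and so on); the function name cluster_hosts is hypothetical.

```bash
# Hypothetical helper: print /etc/hosts lines for the master (nodo00)
# and N compute nodes, following the addressing used in /etc/ethers.
# The two-digit names limit this numbering scheme to 99 nodes.
cluster_hosts() {
    n=$1
    i=0
    while [ "$i" -le "$n" ]; do
        printf '192.168.100.%d nodo%02d.dsi.unifi.it nodo%02d\n' \
            "$((100 + i))" "$i" "$i"
        i=$((i + 1))
    done
}

# Example: cluster_hosts 4
```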
Before the creation of the system image we had to modify a number of
FAI scripts:
#cp /usr/share/doc/fai/examples/simple/class/20-hwdetect.source /usr/local/share/fai/class
In /usr/lib/fai/nfsroot/etc/fai/fai.conf we modified the line:
FAI_DEBMIRROR=/files/scratch/debmirror/debian
in:
FAI_DEBMIRROR=$mirrorhost:/files/scratch/debmirror/debian
In /usr/lib/fai/nfsroot/usr/sbin/fai we put
#set -xv #for full debugging
In order to use our version of the kernel on the nodes (it is different from the FAI one!), we modified the FAI scripts to copy the kernel and module files from the Sarge installation on the master to the nodes' hard disks at FAI installation time. To do this we performed the following actions:
#mkdir -p /files/scratch/debmirror/debian/kernel
#cp /boot/System.map-2.6.8-2-386 /files/scratch/debmirror/debian/kernel/
#cp /boot/config-2.6.8-2-386 /files/scratch/debmirror/debian/kernel/
#cp /boot/initrd.img-2.6.8-2-386 /files/scratch/debmirror/debian/kernel/
#cp /boot/vmlinuz-2.6.8-2-386 /files/scratch/debmirror/debian/kernel/
#cd /lib/modules/2.6.8-2-386
#tar cvzf moduli.tar.gz *
#cp moduli.tar.gz /files/scratch/debmirror/debian/kernel/
#cd /usr/lib/fai/nfsroot/var/lib/apt/lists/
#cp ftp.it.debian.org_* _mnt2_*
#cp /lib/modules/2.6.8-2-386/modules.dep /files/scratch/debmirror/debian/kernel
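The module archive above is created inside the directory that is being archived, so tar encounters its own output file while it runs. A possible alternative, sketched here under the assumption that the archive only needs to hold the directory contents, writes the tarball elsewhere and lists what was stored. The function name pack_modules is hypothetical.

```bash
# Hypothetical helper: archive the contents of a modules directory into
# a tarball written outside that directory, then list what was stored.
pack_modules() {
    moddir=$1
    out=$2
    tar -C "$moddir" -czf "$out" . && tar -tzf "$out"
}

# Example (paths as used on this page):
# pack_modules /lib/modules/2.6.8-2-386 \
#     /files/scratch/debmirror/debian/kernel/moduli.tar.gz
```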
In the file /usr/lib/fai/nfsroot/usr/share/fai/subroutines-linux inside the task_prepareapt function body, after the line:
echo $hostname > $fairoot/etc/hostname
we added the lines:
mount nodo00:/files/scratch/debmirror/ /tmp/target/mnt2/ # LINE ADDED BY US
cp -v /var/lib/apt/lists/_mnt2* $FAI_ROOT/var/lib/apt/lists/ # LINE ADDED BY US
In task_instsoft function body, at the beginning, we added the
lines:
mount nodo00:/files/scratch/debmirror/ /tmp/target/mnt2/ # LINE ADDED BY US
cp -v /var/lib/apt/lists/_mnt2* $FAI_ROOT/var/lib/apt/lists/ # LINE ADDED BY US
cp -v /tmp/target/mnt2/debian/kernel/* /tmp/target/boot/ # LINE ADDED BY US
mkdir -p /tmp/target/lib/modules/2.6.8-2-386/ # LINE ADDED BY US
cp -v /tmp/target/boot/modul* /tmp/target/lib/modules/2.6.8-2-386 # LINE ADDED BY US
cd /tmp/target/lib/modules/2.6.8-2-386/ # LINE ADDED BY US
tar xvzf moduli.tar.gz # LINE ADDED BY US
In the file /usr/lib/fai/nfsroot/etc/fai/fai.conf we put:
FAI_DEBMIRROR=$mirrorhost:/files/scratch/debmirror
In /usr/lib/fai/nfsroot/etc/apt/sources.list we put:
deb file:///mnt2/debian sarge main contrib non-free
deb file:///mnt2/debian-non-us main contrib non-free
deb file:///mnt2/debian-security/sarge/updates main contrib non-free
In /usr/local/share/fai/package_config/BEOWULF we deleted apache
and we added vim, module-init-tools, modutils, nfs-common,
nfs-kernel-server under packages install.
In /usr/local/share/fai/package_config/DEFAULT we deleted memtest86+
and we added kernel-image-2.6.8-i386 under i386.
In /usr/lib/fai/nfsroot/etc/network/interfaces we put:
auto lo
iface lo inet loopback
auto eth0
iface eth0 inet static
address [node address]
netmask 255.255.255.0
network 192.168.100.0
dns-nameservers 192.168.100.100
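Since only the address line differs between nodes, the file can be produced from a template. The following is a sketch with a hypothetical function name, node_interfaces; the fixed values are the ones listed above.

```bash
# Hypothetical helper: print the interfaces file for one node, given
# its address on the internal network.
node_interfaces() {
    addr=$1
    cat <<EOF
auto lo
iface lo inet loopback
auto eth0
iface eth0 inet static
address $addr
netmask 255.255.255.0
network 192.168.100.0
dns-nameservers 192.168.100.100
EOF
}

# Example: node_interfaces 192.168.100.101
```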
At the end of the file /etc/dhcp3/dhcpd.conf we added the
following lines:
host nodo01 {hardware ethernet [nodo01 MAC address]; fixed-address nodo01;}
host nodo02 {hardware ethernet [nodo02 MAC address]; fixed-address nodo02;}
host nodo03 {hardware ethernet [nodo03 MAC address]; fixed-address nodo03;}
host nodo04 {hardware ethernet [nodo04 MAC address]; fixed-address nodo04;}
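The host declarations can be generated from a list of MAC addresses and node names instead of being typed by hand. A minimal sketch, assuming the input has one "MAC hostname" pair per line; the function name dhcpd_hosts is hypothetical. Each statement inside the braces ends with a semicolon, as the dhcpd configuration syntax requires.

```bash
# Hypothetical helper: read "MAC hostname" pairs on standard input and
# print one dhcpd host declaration per pair. Incomplete lines are skipped.
dhcpd_hosts() {
    while read -r mac name; do
        [ -n "$mac" ] && [ -n "$name" ] || continue
        printf 'host %s {hardware ethernet %s; fixed-address %s;}\n' \
            "$name" "$mac" "$name"
    done
}

# Example:
# printf '00:11:22:33:44:55 nodo01\n' | dhcpd_hosts
```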
Then we started NFS daemons and exported shared directories:
#/etc/init.d/nfs-kernel-server restart
#/etc/init.d/nfs-common restart
#exportfs -ra
Finally, we logged in as tom, enabled the network installation of each node:
#su tom
#fai-chboot -IFv [nodoxx]
and booted each node with LAN boot activated.
IMPORTANT: make sure to backup your FAI scripts.
After the successful installation on each node, we needed to configure ssh and NFS daemons.
In the file /etc/hosts.allow we put:
portmap: 192.168.100.0/255.255.255.0
In the file /etc/exports we added the line:
/home 192.168.100.100/255.255.255.0(async,rw)
On each node, in /etc/fstab we put the lines:
nodo00:/files/scratch/debmirror /mnt2 nfs ro 0 0
nodo00:/home/ /home/ nfs rw,exec,async 0 0
In this way each user's home directory is automatically shared.
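In /etc/exports the option list must follow the client specification without any whitespace; if a space separates them, the options apply to all hosts rather than to the named client. The sketch below flags such lines. The function name lint_exports is hypothetical.

```bash
# Hypothetical helper: read an exports file on standard input and print
# "line number: line" for every non-comment line in which whitespace
# precedes an opening parenthesis.
lint_exports() {
    awk '!/^[ \t]*#/ && /[ \t]\(/ { print NR ": " $0 }'
}

# Example: lint_exports < /etc/exports
```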
In order to allow user authentication in ssh communication, we followed this procedure:
#su [user name]
#ssh-keygen -t rsa
#cp /home/[user home]/.ssh/id_rsa.pub /home/[user home]/.ssh/authorized_keys
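With its default StrictModes setting, sshd ignores an authorized_keys file whose permissions are too open. A sketch of the authorization step that also tightens the permissions follows; the function name authorize_own_key is hypothetical, and unlike the cp above it appends the key rather than overwriting the file.

```bash
# Hypothetical helper: authorize a user's own public key and restrict
# permissions on the .ssh directory and the authorized_keys file.
# The argument is the user's home directory.
authorize_own_key() {
    sshdir=$1/.ssh
    [ -f "$sshdir/id_rsa.pub" ] || { echo "missing $sshdir/id_rsa.pub" >&2; return 1; }
    cat "$sshdir/id_rsa.pub" >> "$sshdir/authorized_keys"
    chmod 700 "$sshdir"
    chmod 600 "$sshdir/authorized_keys"
}

# Example: authorize_own_key "$HOME"
```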
We did not use the .deb package for MPI; instead we downloaded the source tarball mpich_1.2.7p1.tar.gz directly from the website http://www-unix.mcs.anl.gov/mpi/mpich/download.html, then compiled and installed it on each node:
#tar xvzf mpich_1.2.7p1.tar.gz
#export RSHCOMMAND=ssh
#cd mpich_1.2.7p1
#./configure --prefix=/usr/local/
#make
#make install
We added this line to /etc/profile in order to use ssh for MPI communication:
export RSHCOMMAND="ssh"
From the website http://charm.cs.uiuc.edu/
we downloaded charm-5.9_src.tar.gz and we launched:
#tar xvzf charm-5.9_src.tar.gz
#cd charm-5.9
#./build net-linux
From the website http://clusterresources.com/downloads/torque
we downloaded torque2.0p5.tar.gz and we installed it on each
node:
#tar -xzvf torque2.0p5.tar.gz
#cd torque2.0p5
#./configure
#make
#make install
In the file /usr/spool/PBS/servername we put:
nodo00
In the file /usr/spool/PBS/mom_priv/config we added this line:
$usecp *:/home /home
Then, only on the master, we launched:
#cd /root/torque2.0p5
#./torque.setup
On each node we started the "mom" daemon:
#/usr/local/sbin/pbs_mom
Then only on the master we started the server daemon:
#/usr/local/sbin/pbs_server
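Once the daemons are running, a job can be submitted with qsub. The sketch below prints a small PBS job script for an MPI program; the function name make_job, the job name and the program ./a.out are hypothetical, and passing $PBS_NODEFILE to mpirun with -machinefile is an assumption about how MPICH and Torque are combined on this cluster.

```bash
# Hypothetical helper: print a PBS job script that runs an MPI program
# on the requested number of nodes.
make_job() {
    name=$1
    nodes=$2
    prog=$3
    cat <<EOF
#!/bin/sh
#PBS -N $name
#PBS -l nodes=$nodes
cd \$PBS_O_WORKDIR
mpirun -np $nodes -machinefile \$PBS_NODEFILE $prog
EOF
}

# Example: make_job test 4 ./a.out > test.pbs && qsub test.pbs
```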
We did not use the standard PBS scheduler; instead we installed the more
flexible Maui scheduler. We downloaded the tarball maui-3.2.6p13.tar.gz
from the same website and installed it:
#tar xvzf maui-3.2.6p13.tar.gz
#cd maui-3.2.6p13
#./configure --prefix=/usr/local/
After the installation we wrote a series of useful scripts to quickly execute frequently used sequences of commands.
addusercluster [username]
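Adds a user [username] and authorizes the user to log in to each node by ssh
without a password prompt.
#!/bin/bash
# Create the user on the master (adduser also creates the home directory).
adduser "$1"
numero_nodi=4
# Create the same account on every compute node.
for ((contatore=1; contatore<=numero_nodi; contatore++)); do
    ssh nodo0$contatore useradd "$1"
done
# Generate a key pair without passphrase and authorize it; /home is shared
# over NFS, so the key is valid on every node.
sudo -u "$1" ssh-keygen -t rsa -N '' -f "/home/$1/.ssh/id_rsa"
sudo -u "$1" cp "/home/$1/.ssh/id_rsa.pub" "/home/$1/.ssh/authorized_keys"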
delusercluster [username]
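Deletes the user [username] and the user's home directory.
#!/bin/bash
# Refuse to run without a username: otherwise rm would act on /home/ itself.
[ -n "$1" ] || { echo "usage: delusercluster [username]" >&2; exit 1; }
deluser "$1"
numero_nodi=4
for ((contatore=1; contatore<=numero_nodi; contatore++)); do
    ssh nodo0$contatore userdel "$1"
done
rm -fR "/home/$1"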
checkalljobs [keyword]
Prints all processes that contain the word [keyword], in lists separated by node. It can be used for job names, user names, daemons and so on.
#!/bin/bash
echo " "
echo nodo00:
ps aux | grep "$1"
numero_nodi=4
for ((contatore=1; contatore<=numero_nodi; contatore++)); do
    echo " "
    echo nodo0$contatore:
    # grep runs on the master and filters the output received from the node
    ssh nodo0$contatore ps aux | grep "$1"
done
rebootcluster
Reboots the compute nodes first and the master last.
#!/bin/bash
numero_nodi=4
for ((contatore=1; contatore<=numero_nodi; contatore++)); do
    ssh nodo0$contatore reboot
done
reboot
PBScluster
After a reboot, starts the PBS daemons on every node and the Maui scheduler on the master.
#!/bin/bash
/usr/local/sbin/pbs_mom
numero_nodi=4
for ((contatore=1; contatore<=numero_nodi; contatore++)); do
    ssh nodo0$contatore /usr/local/sbin/pbs_mom
done
/usr/local/sbin/pbs_server
/usr/local/maui/sbin/maui
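The scripts above repeat the same loop and build node names as nodo0 followed by a counter, which stops working beyond nine nodes. A possible refactoring is sketched below; the function names node_names and on_nodes and the REMOTE variable are hypothetical. Setting REMOTE=echo prints the commands instead of running them, which gives a dry run.

```bash
# Hypothetical helpers that could be shared by the cluster scripts.
# node_names N prints nodo01 ... nodoN, zero-padded to two digits.
node_names() {
    i=1
    while [ "$i" -le "$1" ]; do
        printf 'nodo%02d\n' "$i"
        i=$((i + 1))
    done
}

# on_nodes N CMD... runs CMD on each of the N compute nodes. The remote
# shell is taken from the REMOTE variable (default: ssh).
on_nodes() {
    n=$1
    shift
    for node in $(node_names "$n"); do
        ${REMOTE:-ssh} "$node" "$@" </dev/null
    done
}

# Example (dry run): REMOTE=echo on_nodes 4 /usr/local/sbin/pbs_mom
```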
In case a node fails (for example, when physical hard-disk damage occurs) and you need to reinstall the system on the node [nodexx], perform the following operations:
#fai-chboot -IFv [nodexx]
#scp [master node]:/etc/passwd /etc
#scp [master node]:/etc/group /etc
#scp [master node]:/etc/shadow /etc
#scp [master node]:/root/.ssh/* /root/.ssh/