Linux High Performance Clustering

I work in the High Performance Computing Center at the University of Southern California. Here are a few things of interest to the community...

xCAT

xCAT is an extensible and powerful cluster management suite by Egan Ford at IBM. For legal reasons, the OSS components of xCAT are hosted here.

xcat-dist-oss-1.2.0.tgz md5 signed md5 Manifest, xCAT OSS package

xcat-dist-oss-1.3.0.tgz md5 signed md5 Manifest, xCAT OSS package

xcat-ipmitool.tgz md5 signed md5 README.ipmitool, ipmitool support for xCAT for 1.1.x (1.2.x already includes it)
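Each tarball above comes with an md5 file, so it is worth comparing digests after downloading. Below is a minimal sketch of that check. It assumes the md5 file holds a line in the usual md5sum layout (hex digest, whitespace, file name), which this page does not confirm, and the helper names are made up for illustration. It does not check the GPG signature on the signed md5; that still has to be done with gpg.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Read fixed-size chunks so large tarballs are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_md5_line(line):
    """Extract the digest from an md5sum-style line (assumed format: '<hex>  <name>')."""
    return line.split()[0].lower()


def verify(path, md5_line):
    """Return True if the file's digest matches the digest on `md5_line`."""
    return md5_of_file(path) == parse_md5_line(md5_line)
```

For example, `verify("xcat-dist-oss-1.3.0.tgz", open("xcat-dist-oss-1.3.0.tgz.md5").read())` would return True for an intact download, assuming the md5 file is named that way.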

PBS Utilities

perl-PBS, a Perl module for the PBS client libraries; it also includes a newer pbstop. This is still alpha code.

dumpmom, dumps some info from pbs_mom for scripting/debugging purposes. The configure script and makefile require TORQUE-2.1.0 or newer.

submit_p4shmem csh script, PBS job script for the mpich p4shmem device, it tries to fix some lameness in mpirun.

TORQUE 1.2.0 patches

These are extra or experimental patches for the TORQUE resource manager.

torque-1.1.0p4-qstat-empty-headers.patch makes qstat print column headers even with empty output.

torque-1.2.0b0-dumpmom.patch adds the dumpmom command.

torque-1.2.0b0-down_on_error.patch marks nodes down if they have an ERROR message (see health check docs).

torque-1.2.0p1-momupdateinternval.patch is a trivial patch that increases the mom stat update interval. The non-configurable default is too low in my opinion.

torque-1.2.0p5-jobnanny.patch protects against jobs that are stuck in an exiting or preexiting state by adding a "job deletion nanny" that periodically retries killing jobs that have already been deleted (by qdel or your scheduler). This mechanism also purges jobs that don't exist on the mother superior node. The code is disabled unless JOB_DELETE_NANNY is defined at compile time.

torque-1.2.0p5-jobdepterm2.patch ensures that deleted or aborted jobs also remove any dependent jobs.
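The job deletion nanny described above can be pictured as a periodic pass over the server's job list. The sketch below only illustrates that idea and is not the patch's code: the `deleted` flag, the `mom_jobs` set and the `send_kill` callback are all hypothetical stand-ins for whatever the server really tracks.

```python
def nanny_pass(jobs, mom_jobs, send_kill):
    """Run one nanny pass and return the jobs the server should keep tracking.

    jobs      -- dict of job id -> {"deleted": bool} (hypothetical job record)
    mom_jobs  -- set of job ids the mother superior node still knows about
    send_kill -- callback that re-sends a kill request for one job id
    """
    remaining = {}
    for job_id, job in jobs.items():
        if not job["deleted"]:
            # Nobody asked for this job to die; leave it alone.
            remaining[job_id] = job
        elif job_id not in mom_jobs:
            # Deleted and unknown to the mother superior: purge it.
            continue
        else:
            # Deleted but still present: try to kill it again.
            send_kill(job_id)
            remaining[job_id] = job
    return remaining
```

Calling this on a timer means a job whose first kill request was lost gets another one on each pass until it finally goes away.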

Linux 32bit uid/gid process accounting fix

BSD process accounting on Linux is broken if you have UIDs and/or GIDs of 65536 (2^16) or higher. These patches fix the problem while maintaining backwards compatibility.
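To see why large IDs break, here is a small sketch of what happens when an ID is squeezed into a 16-bit field. It assumes the unpatched accounting record stores the UID/GID as an unsigned 16-bit value; the function name is made up for illustration.

```python
import struct


def stored_uid16(uid):
    """Return the value that survives when `uid` is written to a 16-bit field."""
    # Keep only the low 16 bits, pack them as an unsigned short, and read them back.
    (stored,) = struct.unpack("<H", struct.pack("<H", uid & 0xFFFF))
    return stored


for uid in (500, 65535, 65536, 70000):
    print(uid, "->", stored_uid16(uid))
```

Under that assumption, IDs up to 65535 come through unchanged, while anything from 65536 up wraps around and gets recorded as some other, smaller ID.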

32bit-pacct-howto.txt read this before doing anything else

linux-acct-uid32.patch, fixes BSD process accounting in Linux for uids/gids of 2^16 and above.

acct.h-uid32.patch, fix up /usr/include/sys/acct.h if you've patched the kernel with linux-acct-uid32.patch.

process accounting tools: acct-6.3.2-32bit.patch and psacct-6.3.2-9uid32.src.rpm, for psacct-6.3.2 on Red Hat 7.2

Linux 2.4.20 patches

symlink_unbalanced_kunmap.diff, fixes a kernel oops when many nodes create the same symlink at the same time in an NFS mount using a Solaris server

big-ring-buffer.patch, if the top of "dmesg" is getting lost, use this patch

linux-2.4.20-ext3.patch, important ext3 fixes for 2.4.20

irqbalance-2.4.20-MRC.patch, IRQ load balancing performance enhancement

linux-2.4.20-VFS-lock.patch, filesystem locking within the VFS (mostly for LVM and ext3/quotas)

linux-2.4.20-mrc-base.patch, fix filesystem quotas for 32bit uids

linux-ipmi-2.4.20-v21.diff, OpenIPMI driver

preempt-kernel-rml-2.4.20-3.patch, improve system responsiveness with Robert Love's wonderful preemptible patch (it's no coincidence that his site looks just like mine; I stole his stylesheet!)

tg3-2.4.20.patch, version 1.4 of the tigon3 driver (use this instead of that nappy bcm5700 driver).

Linux 2.4.21 patches

big-ring-buffer.patch, if the top of "dmesg" is getting lost, use this patch

preempt-kernel-rml-2.4.21-1.patch, improve system responsiveness with Robert Love's wonderful preemptible patch

Send questions to garrick@usc.edu
Consider everything on this page (unless noted as from another author) to be GPL'd
My GPG pubkey