Theory

  • MPI is an environment for running parallel jobs (which can communicate with each other). There are many implementations of MPI, most notably LAM-MPI and OpenMPI. I am using LAM-MPI here.
  • MPI alone is enough if you run parallel jobs on your own, e.g. on a home cluster. On a cluster shared with colleagues, you additionally need a scheduler like SGE, SLURM, or Torque.
  • In R, you can use one of several packages to run your programs in parallel. snowfall seems the simplest to me, because you can run your script on the supercluster as well as on your local machine (in parallel or sequentially) without changing much code.
  • Alternatives would have been snow (but snowfall is just a convenient wrapper around snow) and Rmpi (I'm not sure if you can run your code on a single computer with this package). Also, snow/snowfall use Rmpi functionality if you run them in MPI mode.

With snowfall, you use list functions (lapply() etc.). Ideally, have a list with chunks of data, so that each worker gets one chunk of the data, as in the sketch below.
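A minimal sketch of the chunking idea (my_data and n_workers are hypothetical stand-ins for your own data and worker count):

my_data   <- rnorm(1e6)   # stand-in for your actual data
n_workers <- 16
## split the data into one roughly equal-sized chunk per worker
chunks <- split(my_data, cut(seq_along(my_data), n_workers, labels = FALSE))
## each worker then processes one chunk, e.g.:
## chunk_means <- sfLapply(chunks, mean)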

Write your R script

Set up an R script:

## testscript.R
library(snowfall)

sfInit(parallel = TRUE, cpus = 16, type = "MPI")  # initialize the snowfall environment

mxmult <- function(i, n = 1000){
    ## 'i' is the (unused) list index that sfSapply() passes as first argument
    ## cat() within a cluster function doesn't work :(
    A <- matrix(rnorm(n^2), nrow = n)
    B <- matrix(rnorm(n^2), nrow = n)
    return(det(A %*% B))
}

## use sfExport() to export any functions or data that you will need
##  on the workers in addition, e.g. sfExport("my_data")

dets <- sfSapply(1:32, mxmult, n = 2000)  # snowfall version of sapply()

save(dets, file = "testscript-mxmult.RData")

sfStop()  # always stop your cluster

Testing

To test your script, use a small subset of the data, and instead of qsub, which submits the job to the scheduler, run it via qrsh (with appropriate parameters, e.g. qrsh -pe lammpi 10-32 myscript.sh), an "interactive" shell that shows you stdout directly. If this runs through, proceed.
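For the script above, a scaled-down test could simply shrink the problem, e.g. replace the sfSapply() call with (4 tasks and n = 200 are arbitrary small values):

dets <- sfSapply(1:4, mxmult, n = 200)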

Set up a shell script

Then, set up a shell script that runs this R script. The best way is to put the SGE parameters into the script file itself, like this:

#!/bin/bash
## first_script.sh  (the shebang must be the very first line)

## Instead of passing command line parameters to qsub, do this:
#$ -cwd
#$ -o output.out
#$ -M alex@chaotic-neutral.de
#$ -m beas

## Resource requests:
#$ -l h_rt=01:00:00
#$ -l h_vmem=1G

## Parallel options:
#$ -pe lammpi 16

R --no-save < testscript.R
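Submit it with qsub first_script.sh; SGE reads the #$ lines as if they had been passed as command line options to qsub.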

Advanced

Check cores and RAM available

Use qfree to see all nodes and what's available. The parameter -l h_vmem applies per slot (i.e. per worker), not to the job as a whole: with -pe lammpi 16 and -l h_vmem=1G, the job may use up to 16 GB in total.

Script should run on local machine and cluster

The command qsub sets a few environment variables (see man qsub for which ones). You can query one of them, e.g. NSLOTS: if it is set, initialize an MPI cluster, and if it is unset, initialize a parallel environment on your local machine (with 4 cores in this example). Replace the sfInit() call with this snippet:

n_cpus <- Sys.getenv("NSLOTS")  # set by qsub; empty when running locally

if (n_cpus == "") {
    sfInit(parallel = TRUE, cpus = 4, type = "SOCK")  # local parallelization
} else {
    sfInit(parallel = TRUE, cpus = as.numeric(n_cpus), type = "MPI")  # SGE
}

Now the code runs regardless of whether you submit it to the SGE or run it locally.

Network random number generator

Start the network random number generator to make sure each worker uses a different random number stream: sfClusterSetupRNG().
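A placement sketch (reusing mxmult() from the script above): the call goes right after sfInit() and before the first sf*apply():

sfInit(parallel = TRUE, cpus = 4, type = "SOCK")
sfClusterSetupRNG()  # give each worker its own random number stream
dets <- sfSapply(1:8, mxmult, n = 500)
sfStop()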

Load balancing

If you use sfClusterApplyLB instead of sfLapply, faster workers are handed new tasks as they finish, instead of everyone waiting for the slowest machine; see the sketch below.
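For example, the computation from the script above in load-balanced form; note that sfClusterApplyLB() returns a list, hence the unlist():

dets <- unlist(sfClusterApplyLB(1:32, mxmult, n = 2000))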

Saving/Restoring intermediate results

See ?sfClusterApplySR for that.
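A minimal sketch, assuming only the basic sfClusterApplySR(x, fun, ...) interface: it works like sfLapply(), but saves intermediate results so that an interrupted run can be restored (see ?sfClusterApplySR for the restore options):

dets <- unlist(sfClusterApplySR(1:32, mxmult, n = 2000))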

Supply a range of cores, if number is not too important

Use #$ -pe lammpi 32-64 if you would like 64 slots but would also be happy with 32 slots if that means less waiting time. Since the number of granted slots then varies, read it from NSLOTS as shown above instead of hard-coding it.

High memory needed

Submit your job to the queue maxmem.q: qsub -q maxmem.q myscript.sh

Lessons learned

  • Learn how to use the scheduler: qstat, qfree, and the like.
  • Before running the big job, find out how much memory and how many cores are free (qfree). Then find out how much memory your single jobs need:
  • Run the job on a small subset (or with -l h_rt=00:30:00), then check the maximum vmem it used (qstat -j <jobid> while it is running, qacct -j <jobid> after it has finished). Use this value for -l h_vmem when submitting the giant job.

Sources