New HPC Cluster Grace

Thursday, May 29, 2014 - 5:39am

The High Performance Computing team is pleased to announce the upcoming availability of a new HPC cluster – Grace. As final testing concludes, we would like to share more information about Grace, which is named after computer scientist and United States Navy Rear Admiral Grace Murray Hopper, who received her Ph.D. in Mathematics from Yale in 1934.

Grace is an IBM System x High Performance Computing Cluster installed at Yale's West Campus Data Center. The cluster consists of 72 compute nodes, each with 20 cores and 128 GB of RAM. The processors are Intel Xeon E5-2660 v2s running at 2.2 GHz. All nodes run RHEL 6.4. Attached storage is provided by 1 PB of GPFS (General Parallel File System). The cluster nodes are connected internally via FDR InfiniBand.

The expected general availability date is Monday, June 2nd, at 12:00 pm.

All users with accounts on the BulldogJ, BulldogK and Omega clusters will be provisioned accounts on Grace. Like Omega, Grace will support only SSH key authentication. For those with accounts on Omega, we will copy the public key(s) installed on Omega to Grace. Other users will receive an email notification containing instructions for providing a public SSH key.
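
If you have never created an SSH key pair, one can be generated on your own workstation with the standard OpenSSH tools; a minimal sketch (the comment string is a placeholder):

    # Generate an RSA key pair; accept the default file location and choose a passphrase
    ssh-keygen -t rsa -C "netid@yale.edu"

    # The *public* key, which is what you will be asked to provide, is the .pub file
    cat ~/.ssh/id_rsa.pub

Only the public key should ever be sent; the private key (the file without the .pub extension) stays on your machine.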

Please be aware that the scheduling, managing, monitoring and reporting of cluster workloads will be handled differently on Grace than on our other clusters. Instead of Moab, Grace will run the IBM Platform LSF (or simply, LSF) scheduler. Simplified documentation will be made available shortly by the HPC team, and comprehensive documentation will be added to the HPC website.
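
As a preview of the difference, LSF batch scripts use #BSUB directives, and jobs are submitted with bsub and monitored with bjobs rather than qsub and showq. The sketch below is illustrative only; the queue name, resource limits, filenames and program name are placeholders, and the forthcoming documentation will describe Grace's actual queues and defaults.

    #!/bin/bash
    #BSUB -J myjob            # job name
    #BSUB -q shared           # queue name (placeholder)
    #BSUB -n 4                # number of cores
    #BSUB -W 2:00             # wall-clock limit (hh:mm)
    #BSUB -o myjob.%J.out     # standard output file (%J expands to the job ID)

    ./my_program

Such a script would be submitted with "bsub < myjob.sh" and its status checked with "bjobs", the LSF counterparts of "qsub myjob.sh" and "showq" under Moab.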

Please consider the following guidelines when deciding where to run jobs:

Run on Grace if:

  1. your jobs do not depend heavily on MPI
  2. your jobs require many cores and/or lots of memory on a single node (see the sketch after these guidelines)
  3. your jobs are embarrassingly parallel (e.g. they use SimpleQueue)
  4. your jobs can share a node with other jobs
  5. your jobs tend to use many small-to-medium files

Run on Omega if:

  1. you depend on MPI to run in parallel on large numbers of cores/nodes
  2. you need high-performance (parallel) I/O, particularly with large files
  3. you have special node/queue privileges on Omega
  4. your jobs require exclusive node access
  5. you need GPUs

In other cases, you may wish to select a cluster based on the observed cluster load or according to which one has the proper software for your work.
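
For guideline 2 in the Grace list, LSF resource-requirement strings can express "many cores and lots of memory on a single node" directly; a minimal sketch, with a placeholder queue name and program:

    # Request all 20 cores of one Grace node for a threaded (e.g. OpenMP) job;
    # span[hosts=1] asks LSF to place every core on the same host, and rusage[]
    # reserves memory per core (units depend on the site's LSF configuration).
    bsub -q shared -n 20 -R "span[hosts=1] rusage[mem=4000]" -W 4:00 ./my_threaded_program

Jobs that request fewer than a node's 20 cores may share that node with other jobs, per guideline 4.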

As previously communicated, steps are being taken to decommission the BulldogJ and BulldogK clusters. If you are affected by this, please begin identifying data that needs to be retained and moved elsewhere. We strongly recommend deleting unnecessary data now, prior to migration. Expect that transferring 1 TB of data to a new location may take up to 10 hours, though actual transfer times will depend on numerous factors. Data transfers must be completed by July 11th.
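
The 10-hour estimate corresponds to a sustained rate of roughly 30 MB/s (1 TB is about 10^6 MB, and 10 hours is 36,000 seconds, giving approximately 28 MB/s). One common way to move data, sketched here with placeholder paths and hostname, is rsync, which preserves permissions and timestamps and can resume an interrupted copy:

    # Copy a directory tree to its new location; -a preserves permissions and
    # timestamps, -v reports each file as it is transferred, and --partial lets
    # a restarted transfer pick up where it left off.
    rsync -av --partial /path/to/your/data/ netid@destination.example.edu:/path/to/new/location/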

If you are currently using the Omega cluster, please consider using
Grace instead based on the guidelines above.

If you have any questions or concerns about this exciting new offering,
please contact hpc@yale.edu.