9/27/2022: One-day maintenance will affect some Grace nodes and all Milgram compute nodes

In preparation for adding hardware, maintenance will be performed on the electrical supply that powers part of the HPC Data Center at West Campus. As a result, some compute nodes will be unavailable starting on Tuesday, September 27, 2022, at 8:00 am. Maintenance is expected to be completed by the end of the day, after which the nodes will be re-enabled.

The impacted nodes are all compute nodes on Milgram and the nodes on Grace whose names start with “p08”. The following commons and PI partitions are affected, although in some cases only some of the nodes in a partition are:
 

  Milgram
    All compute nodes
  Grace
    bigmem 3 nodes (5 nodes unaffected)
    day 66 nodes (233 nodes unaffected)
    gpu 4 nodes with V100 GPUs, 5 nodes with RTX 2080 Ti GPUs (22 nodes with a100, k80, p100, rtx5000 GPUs unaffected)
    gpu_devel 1 node
    mpi 88 nodes (44 nodes unaffected)
    transfer 2 nodes
    week 17 nodes (8 nodes unaffected)
    pi_balou 9 nodes (44 nodes unaffected)
    pi_berry 1 node
    pi_econ_io 6 nodes
    pi_econ_lp 5 nodes (8 nodes unaffected)
    pi_esi 36 nodes
    pi_gelernter 1 node (1 node unaffected)
    pi_hodgson 1 node
    pi_howard 1 node
    pi_jorgensen 3 nodes
    pi_levine 20 nodes
    pi_lora 4 nodes
    pi_manohar 4 nodes (11 nodes unaffected)
    pi_ohern 2 nodes (20 nodes unaffected)
    pi_polimanti 2 nodes
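
The rule for which nodes are impacted can be written as a small check. This is only an illustrative sketch in Python: the function name `is_affected` and the sample node names are hypothetical, and it assumes you already have individual node names (compressed node lists from squeue can be expanded with `scontrol show hostnames`).

```python
def is_affected(cluster: str, node: str) -> bool:
    """Apply the rule from this announcement to a single node name."""
    cluster = cluster.lower()
    if cluster == "milgram":
        # All Milgram compute nodes are impacted.
        return True
    if cluster == "grace":
        # On Grace, only nodes whose names start with "p08" are impacted.
        return node.startswith("p08")
    # Assumption: clusters not named in the announcement are not impacted.
    return False


if __name__ == "__main__":
    # The node names below are made up for illustration.
    for cluster, node in [("grace", "p08r01n01"), ("grace", "p09r01n01")]:
        print(cluster, node, is_affected(cluster, node))
```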

The system will automatically start using the nodes again once they are available. An email notification will be sent when the maintenance has been completed and the nodes are available.

As the maintenance window approaches, the Slurm scheduler will not start any job on the impacted nodes if the job’s requested wallclock time extends past the start of the maintenance period (8:00 am on September 27, 2022). If you run squeue, such jobs will show as pending with the reason “ReqNodeNotAvail”. (If your job can actually be completed in less time than you requested, you may be able to avoid this by requesting an appropriate time limit using “-t” or “--time”.)
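
To make the arithmetic concrete, the following Python sketch estimates whether a requested time limit would end before the maintenance window begins, and what the longest usable limit would be. It is an approximation, not a description of Slurm's internals: it assumes the job starts at the moment you pass in as `now`, treats a job ending exactly at 8:00 am as acceptable, and uses hypothetical helper names (`fits_before_maintenance`, `longest_time_limit`, `as_slurm_time`).

```python
from datetime import datetime, timedelta

# Start of the maintenance window, taken from this announcement.
MAINTENANCE_START = datetime(2022, 9, 27, 8, 0)


def fits_before_maintenance(now: datetime, time_limit: timedelta) -> bool:
    """True if a job starting at `now` with this limit ends by the window start."""
    return now + time_limit <= MAINTENANCE_START


def longest_time_limit(now: datetime) -> timedelta:
    """Longest limit that still ends by the window start (zero if already past)."""
    return max(MAINTENANCE_START - now, timedelta(0))


def as_slurm_time(limit: timedelta) -> str:
    """Format a duration as D-HH:MM:SS, a form accepted by -t / --time."""
    total = int(limit.total_seconds())
    days, rest = divmod(total, 86400)
    hours, rest = divmod(rest, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{days}-{hours:02d}:{minutes:02d}:{seconds:02d}"


if __name__ == "__main__":
    now = datetime.now()
    print("Longest limit that ends before maintenance:",
          as_slurm_time(longest_time_limit(now)))
```

For example, a job submitted at 8:00 am on September 26 with a 12-hour limit would fit, while the same job with a 2-day limit would stay pending on the impacted nodes until the maintenance is over.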