Some compute nodes will be unavailable starting Tuesday, September 27, 2022, at 8:00 am while maintenance is performed on the electrical supply that powers part of the HPC Data Center at West Campus, in preparation for adding new hardware. The maintenance is expected to be completed by the end of the day, after which the nodes will be re-enabled.
The impacted nodes are all compute nodes on Milgram and the nodes on Grace whose names start with “p08”. The following commons and PI partitions are affected; in some cases, only some of the nodes in a partition are affected:
| Cluster | Partition | Affected nodes |
|---|---|---|
| Milgram | All | All compute nodes |
| Grace | bigmem | 3 nodes (5 nodes unaffected) |
| Grace | day | 66 nodes (233 nodes unaffected) |
| Grace | gpu | 4 nodes with V100 GPUs and 5 nodes with RTX 2080 Ti GPUs (22 nodes with A100, K80, P100, and RTX 5000 GPUs unaffected) |
| Grace | gpu_devel | 1 node |
| Grace | mpi | 88 nodes (44 nodes unaffected) |
| Grace | transfer | 2 nodes |
| Grace | week | 17 nodes (8 nodes unaffected) |
| Grace | pi_balou | 9 nodes (44 nodes unaffected) |
| Grace | pi_berry | 1 node |
| Grace | pi_econ_io | 6 nodes |
| Grace | pi_econ_lp | 5 nodes (8 nodes unaffected) |
| Grace | pi_esi | 36 nodes |
| Grace | pi_gelernter | 1 node (1 node unaffected) |
| Grace | pi_hodgson | 1 node |
| Grace | pi_howard | 1 node |
| Grace | pi_jorgensen | 3 nodes |
| Grace | pi_levine | 20 nodes |
| Grace | pi_lora | 4 nodes |
| Grace | pi_manohar | 4 nodes (11 nodes unaffected) |
| Grace | pi_ohern | 2 nodes (20 nodes unaffected) |
| Grace | pi_polimanti | 2 nodes |
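The rule for which nodes are affected can be expressed as a small check. The sketch below is a hypothetical helper, not an official tool: it assumes you already have an individual compute node name as Slurm reports it (for example, from the NODELIST column of `squeue`, expanded with `scontrol show hostnames` if it is a compressed host list), and the example node names in it are made up for illustration.

```python
# Hypothetical helper encoding the rule stated above for the
# September 27, 2022 maintenance. Assumes `node_name` is a single,
# already-expanded compute node name.
def is_affected(cluster: str, node_name: str) -> bool:
    """Return True if the given compute node is taken down for maintenance."""
    cluster = cluster.strip().lower()
    if cluster == "milgram":
        # Every compute node on Milgram is affected.
        return True
    if cluster == "grace":
        # On Grace, only nodes whose names start with "p08" are affected.
        return node_name.startswith("p08")
    # The announcement covers only Milgram and Grace.
    return False


if __name__ == "__main__":
    # Made-up node names, for illustration only.
    for cluster, node in [("Grace", "p08r01n01"), ("Grace", "r01n01"), ("Milgram", "n001")]:
        print(cluster, node, is_affected(cluster, node))
```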
The scheduler will automatically begin using the nodes again once they are available. An email notification will be sent when the maintenance is complete and the nodes are available.
As the maintenance window approaches, the Slurm scheduler will not start any job on the impacted nodes if the job’s requested wall-clock time would extend past the start of the maintenance period (8:00 am on September 27, 2022). If you run squeue, such jobs will appear as pending with the reason “ReqNodeNotAvail”. (If your job can actually finish in less time than you requested, you may be able to avoid this by requesting a more accurate time limit with “-t” or “--time”.)
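As a rough illustration of that rule, the sketch below checks whether a requested time limit would end no later than the start of the maintenance, and computes the longest limit that would, formatted in the days-hours:minutes:seconds syntax that “--time” accepts. It assumes naive datetimes in the cluster’s local time and that the job starts immediately; any time spent waiting in the queue shrinks the usable limit, so requesting somewhat less is safer. The function names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional

# Start of the maintenance window from the announcement (assumed local time).
MAINTENANCE_START = datetime(2022, 9, 27, 8, 0)


def fits_before_maintenance(now: datetime, time_limit: timedelta,
                            start: datetime = MAINTENANCE_START) -> bool:
    """True if a job starting at `now` with `time_limit` ends by `start`."""
    return now + time_limit <= start


def slurm_time(delta: timedelta) -> str:
    """Format a timedelta as Slurm's days-hours:minutes:seconds."""
    total = int(delta.total_seconds())
    days, rem = divmod(total, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{days}-{hours:02d}:{minutes:02d}:{seconds:02d}"


def longest_safe_time_limit(now: datetime,
                            start: datetime = MAINTENANCE_START) -> Optional[str]:
    """Upper bound on --time for a job starting at `now`.

    Returns None once the maintenance window has begun.
    """
    remaining = start - now
    if remaining <= timedelta(0):
        return None
    return slurm_time(remaining)


if __name__ == "__main__":
    # A job submitted at 2:30 pm the day before could request at most 17.5 hours.
    print(longest_safe_time_limit(datetime(2022, 9, 26, 14, 30)))  # 0-17:30:00
```

The resulting value could then be passed on the command line, for example `sbatch --time=0-17:30:00 job.sh`, where `job.sh` stands in for your own batch script.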