Slurm Controller Deployment Guide#

The following instructions are for deploying the Slurm Controller.

Prerequisites#

This guide is written for a Red Hat Enterprise Linux 8 based operating system which is operating within a cluster of systems and the following are the prerequisites:

Follow-on Deployments#

The following guides can be applied after the deployment of their associated nodes.

References#

These instructions were written for Slurm 22.05.6.

The official Slurm source code can be downloaded from:
https://www.schedmd.com/downloads.php

Deployment Scripts#

An example bash script of the instructions has been provided: deploy-slurm-controller.sh

The required source files have been provided:
slurm-22.05.6.tar.bz2
SHA1: eb282954911b807b365d35298bddff459eb7383a

The example configuration files have been provided:
cgroup.conf
slurm.conf
slurmdbd.conf

Deployment Steps#

Important

The Red Hat’s CodeReady Linux Builder Repository or AlmaLinux PowerTools/CRB Repository are required for building Slurm. These additional repositories are not needed on the user/compute nodes.

Note

Instructions assume execution using the root account.

  1. Connect the system to the NFS Server:

  1. Connect the system to the IdM Server:

Build Setup#

  1. Enable Red Hat’s CodeReady Linux Builder/AlmaLinux PowerTools/CRB Repository:

For Red Hat Enterprise Linux 8

subscription-manager repos --enable codeready-builder-for-rhel-8-x86_64-rpms
dnf distro-sync

For AlmaLinux 8

dnf config-manager --set-enabled powertools
dnf distro-sync
  1. Install dependencies:

dnf -y install gcc binutils ncurses-devel automake autoconf \
    rpm-devel rpm-build pam-devel readline-devel python3 \
    perl-ExtUtils-MakeMaker mariadb-server mariadb-devel \
    mariadb-connector-odbc openssl-devel pkgconf-pkg-config \
    zlib bzip2 bzip2-devel make hwloc-devel hwloc hwloc-libs \
    hwloc-plugins munge munge-devel munge-libs lua lua-devel \
    lua-json lua-libs lua-posix lua-socket rrdtool-lua \
    lua-filesystem rrdtool rrdtool-devel python3-rrdtool \
    rrdtool-perl mailx lz4 lz4-libs lz4-devel libcurl \
    libcurl-devel infiniband-diags-devel ibacm
  1. Exclude DNF/Yum Slurm packages:

cat >> /etc/dnf/dnf.conf <<EOL

# Exclude slurm Packages
excludepkgs=slurm*

EOL
  1. Create the build directory and go to it:

mkdir -p /root/tmp/
cd /root/tmp/
  1. Upload files:

Note

Any file transfer method can be used.

scp slurm-22.05.6.tar.bz2 root@slurm.engwsc.example.com:/root/tmp/
scp cgroup.conf           root@slurm.engwsc.example.com:/root/tmp/
scp slurm.conf            root@slurm.engwsc.example.com:/root/tmp/
scp slurmdbd.conf         root@slurm.engwsc.example.com:/root/tmp/
  1. Set a database password for Slurm Accounting:

Important

Replace MARIADB_SLURM_USER_PASSWD with a confidential secure password that will be used for the Slurm Accouting Database.

sed -i 's/StoragePass=.*/StoragePass=MARIADB_SLURM_USER_PASSWD/g' /root/tmp/slurmdbd.conf
  1. Set a ClusterName, SlurmctldHost, and AccountingStorageHost:

Important

Replace engwsc.example.com with the domain name of your network.

sed -i 's/ClusterName=.*/ClusterName=engwsc/g' /root/tmp/slurm.conf
sed -i 's/SlurmctldHost=.*/SlurmctldHost=slurm/g' /root/tmp/slurm.conf
sed -i 's/AccountingStorageHost=.*/AccountingStorageHost=slurm.engwsc.example.com/g' /root/tmp/slurm.conf
  1. Modify /root/tmp/slurm.conf with the configuration of the cluster:

Important

The following is just an example configuration. Replace with the configuration of your cluster.

# COMPUTE NODES (Adjust to match real hardware)
NodeName=comp01.engwsc.example.com CPUs=32 State=UNKNOWN
NodeName=comp02.engwsc.example.com CPUs=32 State=UNKNOWN
NodeName=comp03.engwsc.example.com CPUs=32 State=UNKNOWN
NodeName=comp04.engwsc.example.com CPUs=32 State=UNKNOWN
PartitionName=common Nodes=ALL Default=YES MaxTime=1440 State=UP

Munge#

  1. Create a Munge authentication key:

Important

/etc/munge/munge.key should be readable only by the munge account and remain confidential.

mkdir -p /etc/munge/
chown -R munge:munge /etc/munge
chmod 700 /etc/munge
/usr/sbin/create-munge-key -f
chmod 600 /etc/munge/munge.key
  1. Enable Munge service:

systemctl enable --now munge

Slurm#

  1. Create local Slurm user account and group:

groupadd -g 64030 slurm
useradd -m -c "SLURM workload manager" -d /var/lib/slurm -u 64030 -g slurm -s /bin/bash slurm
  1. Build Slurm RPM:

rpmbuild -ta slurm-22.05.6.tar.bz2
  1. Install Slurm Controller RPMs:

rpm -ivh \
    /root/rpmbuild/RPMS/x86_64/slurm-22.05.6-1.el8.x86_64.rpm \
    /root/rpmbuild/RPMS/x86_64/slurm-example-configs-22.05.6-1.el8.x86_64.rpm \
    /root/rpmbuild/RPMS/x86_64/slurm-perlapi-22.05.6-1.el8.x86_64.rpm \
    /root/rpmbuild/RPMS/x86_64/slurm-slurmctld-22.05.6-1.el8.x86_64.rpm \
    /root/rpmbuild/RPMS/x86_64/slurm-slurmd-22.05.6-1.el8.x86_64.rpm \
    /root/rpmbuild/RPMS/x86_64/slurm-slurmdbd-22.05.6-1.el8.x86_64.rpm \
    /root/rpmbuild/RPMS/x86_64/slurm-torque-22.05.6-1.el8.x86_64.rpm
  1. Copy configuration files to Slurm RPMs path for inclusion in the tar file:

Note

Files munge.key, slurm.conf, and cgroup.conf are required to be installed on all nodes.
File slurmdbd.conf is only required on the accounting/controller node.

cp /etc/munge/munge.key    /root/rpmbuild/RPMS/x86_64/
cp /root/tmp/cgroup.conf   /root/rpmbuild/RPMS/x86_64/
cp /root/tmp/slurm.conf    /root/rpmbuild/RPMS/x86_64/
cp /root/tmp/slurmdbd.conf /root/rpmbuild/RPMS/x86_64/
  1. TAR RPMs for use by Compute/Client systems:

EXEC_PWD=`pwd`

cd /root/rpmbuild/RPMS/x86_64/
tar -cf ${EXEC_PWD}/slurm-22.05.6.tar slurm-*.rpm cgroup.conf slurm.conf slurmdbd.conf munge.key

cd ${EXEC_PWD}
gzip -9 slurm-22.05.6.rpm.tar
sha256sum -b slurm-22.05.6.rpm.tar.gz > slurm-22.05.6.rpm.tar.gz.sha256
  1. Install the configuration files:

Warning

File /etc/slurm/slurm.conf is hashed by the controller and must be exactly the same on all systems or the controller service (slurmctld) will not start.

mkdir -p /etc/slurm/

cp cgroup.conf   /etc/slurm/
cp slurm.conf    /etc/slurm/
cp slurmdbd.conf /etc/slurm/

chmod 755 /etc/slurm/
chmod 600 /etc/slurm/cgroup.conf
chmod 600 /etc/slurm/slurm.conf
chmod 600 /etc/slurm/slurmdbd.conf

chown -R slurm:slurm /etc/slurm/
  1. Create log files:

mkdir -p /var/log/slurm/

touch /var/log/slurm/slurm.jobcomp.log
touch /var/log/slurm/slurmctld.log
touch /var/log/slurm/slurmdbd.log

chmod 755 /var/log/slurm/
chmod 644 /var/log/slurm/*.log

chown -R slurm:slurm /var/log/slurm
  1. Create spool files:

mkdir -p /var/spool/slurmctld/
mkdir -p /var/spool/slurmdbd/

chmod 700 /var/spool/slurmctld
chmod 700 /var/spool/slurmdbd

chown -R slurm:slurm /var/spool/slurmctld
chown -R slurm:slurm /var/spool/slurmdbd
  1. Setup log rotation:

cat >> /etc/logrotate.d/slurm <<EOL
/var/log/slurm/slurm.jobcomp.log
/var/log/slurm/slurmctld.log
/var/log/slurm/slurmdbd.log
{
    missingok
    notifempty
    rotate 4
    weekly
    create
}

EOL

chmod 644 /etc/logrotate.d/slurm

MariaDB#

  1. Enable MariaDB service:

systemctl enable --now mariadb
  1. Secure databases:

Important

Replace MARIADB_ROOT_PASSWD with a confidential secure password that will be used for the MariaDB root account.

# Enter current password for root (enter for none): <enter>
# Set root password? [Y/n] Y
# New password: MARIADB_ROOT_PASSWD
# Re-enter new password: MARIADB_ROOT_PASSWD
# Remove anonymous users? [Y/n] Y
# Disallow root login remotely? [Y/n] Y
# Remove test database and access to it? [Y/n] Y
# Reload privilege tables now? [Y/n] Y

mysql_secure_installation
  1. Modify the innodb configuration:

mkdir -p /etc/my.cnf.d/
cat >> /etc/my.cnf.d/innodb.cnf <<EOL
[mysqld]
innodb_buffer_pool_size=1024M
innodb_log_file_size=64M
innodb_lock_wait_timeout=900

EOL
  1. Delete MariaDB logs and restart service:

systemctl stop mariadb
rm -f /var/lib/mysql/ib_logfile?
systemctl start mariadb
  1. Create accounting database for Slurm:

Important

Replace MARIADB_ROOT_PASSWD with the password provided in step 22.
Replace MARIADB_SLURM_USER_PASSWD with the password provided in step 7.

mysql -u root -pMARIADB_ROOT_PASSWD --execute "grant all on slurm_acct_db.* TO 'slurm'@'localhost' identified by 'MARIADB_SLURM_USER_PASSWD' with grant option;"
mysql -u root -pMARIADB_ROOT_PASSWD --execute "create database slurm_acct_db;"

Firewall#

  1. Configure firewalld rules:

Important

Replace the IPv4 Address and Subnet mask with the value of your network.

systemctl enable --now firewalld
firewall-cmd --remove-port=3306/tcp --permanent 2> /dev/null
firewall-cmd --remove-port=3306/udp --permanent 2> /dev/null

firewall-cmd --zone=public --add-source=192.168.1.0/24 --permanent
firewall-cmd --zone=public --add-port={6817/tcp,6818/tcp,6819/tcp} --permanent
firewall-cmd --reload

Slurm Controller Services#

  1. Enable Slurm Controller services:

systemctl enable --now slurmdbd
systemctl enable --now slurmctld