Slurm Controller Deployment Guide#
The following instructions are for deploying the Slurm Controller.
Prerequisites#
This guide is written for a Red Hat Enterprise Linux 8 based operating system running within a cluster of systems. The prerequisites are:
Follow-on Deployments#
The following guides can be applied after the deployment of their associated nodes.
References#
These instructions were written for Slurm 22.05.6.
The official Slurm source code can be downloaded from:
https://www.schedmd.com/downloads.php
The following instructions are based on the official and unofficial documentation found here:
https://github.com/dun/munge/wiki/Installation-Guide
https://slurm.schedmd.com/quickstart_admin.html
https://southgreenplatform.github.io/trainings/hpc/slurminstallation/
Slurm Configuration References:
- Slurm Configuration Tool: https://slurm.schedmd.com/configurator.html
- cgroup.conf: https://slurm.schedmd.com/cgroup.conf.html
- slurm.conf: https://slurm.schedmd.com/slurm.conf.html
- slurmdbd.conf: https://slurm.schedmd.com/slurmdbd.conf.html
Deployment Scripts#
An example bash script of the instructions has been provided:
deploy-slurm-controller.sh
The required source files have been provided:
slurm-22.05.6.tar.bz2
SHA1: eb282954911b807b365d35298bddff459eb7383a
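Before building, the downloaded archive can be checked against the SHA1 value listed above. The following is a minimal sketch, assuming GNU coreutils `sha1sum` is available; `verify_sha1` is a hypothetical helper name and not part of the provided deployment script.

```shell
# verify_sha1 FILE EXPECTED_SHA1   (hypothetical helper, not part of the guide's script)
# Returns 0 when the SHA-1 digest of FILE equals EXPECTED_SHA1, non-zero otherwise.
verify_sha1() {
    file="$1"
    expected="$2"
    # sha1sum prints "<digest>  <filename>"; keep only the digest field.
    actual=$(sha1sum "$file" | awk '{print $1}')
    [ "$actual" = "$expected" ]
}
```

Example use: `verify_sha1 slurm-22.05.6.tar.bz2 eb282954911b807b365d35298bddff459eb7383a && echo "checksum OK"`.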
The example configuration files have been provided:
cgroup.conf
slurm.conf
slurmdbd.conf
Deployment Steps#
Important
Red Hat’s CodeReady Linux Builder repository or the AlmaLinux PowerTools/CRB repository is required for building Slurm. These additional repositories are not needed on the user/compute nodes.
Note
Instructions assume execution using the root account.
Connect the system to the NFS Server:
See Guide: NFS Client Deployment Guide
Connect the system to the IdM Server:
See Guide: IdM Client Deployment Guide
Build Setup#
Enable Red Hat’s CodeReady Linux Builder/AlmaLinux PowerTools/CRB Repository:
For Red Hat Enterprise Linux 8
subscription-manager repos --enable codeready-builder-for-rhel-8-x86_64-rpms
dnf distro-sync
For AlmaLinux 8
dnf config-manager --set-enabled powertools
dnf distro-sync
Install dependencies:
dnf -y install gcc binutils ncurses-devel automake autoconf \
    rpm-devel rpm-build pam-devel readline-devel python3 \
    perl-ExtUtils-MakeMaker mariadb-server mariadb-devel \
    mariadb-connector-odbc openssl-devel pkgconf-pkg-config \
    zlib bzip2 bzip2-devel make hwloc-devel hwloc hwloc-libs \
    hwloc-plugins munge munge-devel munge-libs lua lua-devel \
    lua-json lua-libs lua-posix lua-socket rrdtool-lua \
    lua-filesystem rrdtool rrdtool-devel python3-rrdtool \
    rrdtool-perl mailx lz4 lz4-libs lz4-devel libcurl \
    libcurl-devel infiniband-diags-devel ibacm
Exclude DNF/Yum Slurm packages:
cat >> /etc/dnf/dnf.conf <<EOL
# Exclude slurm Packages
excludepkgs=slurm*
EOL
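Because the exclusion is appended with `>>`, running the step a second time adds a duplicate line. The sketch below shows one possible way to append a line only when it is not already present; `ensure_line` is a hypothetical helper introduced here for illustration.

```shell
# ensure_line FILE LINE   (hypothetical helper)
# Appends LINE to FILE unless an identical line is already present.
ensure_line() {
    file="$1"
    line="$2"
    # -q: quiet, -x: match the whole line, -F: treat LINE as a fixed string
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}
```

Example use: `ensure_line /etc/dnf/dnf.conf 'excludepkgs=slurm*'`.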
Create the build directory and go to it:
mkdir -p /root/tmp/
cd /root/tmp/
Upload files:
Note
Any file transfer method can be used.
scp slurm-22.05.6.tar.bz2 root@slurm.engwsc.example.com:/root/tmp/
scp cgroup.conf root@slurm.engwsc.example.com:/root/tmp/
scp slurm.conf root@slurm.engwsc.example.com:/root/tmp/
scp slurmdbd.conf root@slurm.engwsc.example.com:/root/tmp/
Set a database password for Slurm Accounting:
Important
Replace MARIADB_SLURM_USER_PASSWD with a confidential, secure password that will be used for the Slurm Accounting Database.
sed -i 's/StoragePass=.*/StoragePass=MARIADB_SLURM_USER_PASSWD/g' /root/tmp/slurmdbd.conf
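A password containing `/` or `&` would be misinterpreted by the `sed` expression above. The following sketch, assuming a hexadecimal-only password is acceptable at your site, generates one from `/dev/urandom` and applies it; both function names are hypothetical helpers introduced for illustration.

```shell
# gen_hex_password   (hypothetical helper)
# Prints 32 hexadecimal characters read from /dev/urandom.
gen_hex_password() {
    head -c 16 /dev/urandom | od -An -tx1 | tr -d ' \n'
    echo
}

# set_storage_pass FILE PASSWORD   (hypothetical helper)
# Rewrites the StoragePass= entry in FILE, using the same pattern as the guide.
# PASSWORD must not contain characters special to sed such as "/" or "&".
set_storage_pass() {
    sed -i "s/StoragePass=.*/StoragePass=$2/g" "$1"
}
```

Example use: `set_storage_pass /root/tmp/slurmdbd.conf "$(gen_hex_password)"`. Record the generated value, since it is needed again when the accounting database is created.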
Set a ClusterName, SlurmctldHost, and AccountingStorageHost:
Important
Replace engwsc.example.com with the domain name of your network.
sed -i 's/ClusterName=.*/ClusterName=engwsc/g' /root/tmp/slurm.conf
sed -i 's/SlurmctldHost=.*/SlurmctldHost=slurm/g' /root/tmp/slurm.conf
sed -i 's/AccountingStorageHost=.*/AccountingStorageHost=slurm.engwsc.example.com/g' /root/tmp/slurm.conf
Modify /root/tmp/slurm.conf with the configuration of the cluster:
Important
The following is just an example configuration. Replace with the configuration of your cluster.
# COMPUTE NODES (Adjust to match real hardware)
NodeName=comp01.engwsc.example.com CPUs=32 State=UNKNOWN
NodeName=comp02.engwsc.example.com CPUs=32 State=UNKNOWN
NodeName=comp03.engwsc.example.com CPUs=32 State=UNKNOWN
NodeName=comp04.engwsc.example.com CPUs=32 State=UNKNOWN
PartitionName=common Nodes=ALL Default=YES MaxTime=1440 State=UP
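For clusters with many uniformly named nodes, the NodeName lines can be generated rather than typed by hand. This is a minimal sketch assuming nodes are numbered with two digits starting at 01; `node_lines` and its parameters are hypothetical and only mirror the example configuration above.

```shell
# node_lines PREFIX DOMAIN COUNT CPUS   (hypothetical helper)
# Prints one NodeName line per node, numbered 01..COUNT.
node_lines() {
    prefix="$1"
    domain="$2"
    count="$3"
    cpus="$4"
    i=1
    while [ "$i" -le "$count" ]; do
        printf 'NodeName=%s%02d.%s CPUs=%s State=UNKNOWN\n' "$prefix" "$i" "$domain" "$cpus"
        i=$((i + 1))
    done
}
```

Example use: `node_lines comp engwsc.example.com 4 32` prints the four NodeName lines shown in the example configuration; review the output before adding it to /root/tmp/slurm.conf.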
Munge#
Create a Munge authentication key:
Important
/etc/munge/munge.key should be readable only by the munge account and remain confidential.
mkdir -p /etc/munge/
chown -R munge:munge /etc/munge
chmod 700 /etc/munge
/usr/sbin/create-munge-key -f
chmod 600 /etc/munge/munge.key
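The permissions set above can be confirmed before the Munge service is started. The sketch below assumes GNU `stat` (as shipped with Red Hat Enterprise Linux 8); `check_mode` is a hypothetical helper name.

```shell
# check_mode PATH MODE   (hypothetical helper)
# Returns 0 when the octal permission bits of PATH equal MODE (for example 600).
check_mode() {
    # stat -c '%a' is GNU coreutils syntax and prints the octal mode.
    [ "$(stat -c '%a' "$1")" = "$2" ]
}
```

Example use: `check_mode /etc/munge 700 && check_mode /etc/munge/munge.key 600 && echo "munge permissions OK"`.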
Enable Munge service:
systemctl enable --now munge
Slurm#
Create local Slurm user account and group:
groupadd -g 64030 slurm
useradd -m -c "SLURM workload manager" -d /var/lib/slurm -u 64030 -g slurm -s /bin/bash slurm
Build Slurm RPM:
rpmbuild -ta slurm-22.05.6.tar.bz2
Install Slurm Controller RPMs:
rpm -ivh \
    /root/rpmbuild/RPMS/x86_64/slurm-22.05.6-1.el8.x86_64.rpm \
    /root/rpmbuild/RPMS/x86_64/slurm-example-configs-22.05.6-1.el8.x86_64.rpm \
    /root/rpmbuild/RPMS/x86_64/slurm-perlapi-22.05.6-1.el8.x86_64.rpm \
    /root/rpmbuild/RPMS/x86_64/slurm-slurmctld-22.05.6-1.el8.x86_64.rpm \
    /root/rpmbuild/RPMS/x86_64/slurm-slurmd-22.05.6-1.el8.x86_64.rpm \
    /root/rpmbuild/RPMS/x86_64/slurm-slurmdbd-22.05.6-1.el8.x86_64.rpm \
    /root/rpmbuild/RPMS/x86_64/slurm-torque-22.05.6-1.el8.x86_64.rpm
Copy configuration files to Slurm RPMs path for inclusion in the tar file:
Note
Files munge.key, slurm.conf, and cgroup.conf are required to be installed on all nodes. File slurmdbd.conf is only required on the accounting/controller node.
cp /etc/munge/munge.key /root/rpmbuild/RPMS/x86_64/
cp /root/tmp/cgroup.conf /root/rpmbuild/RPMS/x86_64/
cp /root/tmp/slurm.conf /root/rpmbuild/RPMS/x86_64/
cp /root/tmp/slurmdbd.conf /root/rpmbuild/RPMS/x86_64/
TAR RPMs for use by Compute/Client systems:
EXEC_PWD=`pwd`
cd /root/rpmbuild/RPMS/x86_64/
tar -cf ${EXEC_PWD}/slurm-22.05.6.rpm.tar slurm-*.rpm cgroup.conf slurm.conf slurmdbd.conf munge.key
cd ${EXEC_PWD}
gzip -9 slurm-22.05.6.rpm.tar
sha256sum -b slurm-22.05.6.rpm.tar.gz > slurm-22.05.6.rpm.tar.gz.sha256
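After the archive and its .sha256 file are copied to a compute or client system, the checksum can be verified before the archive is unpacked. The sketch below assumes both files sit in the same directory, because the checksum file stores a relative file name; `verify_bundle` is a hypothetical helper name.

```shell
# verify_bundle DIR CHECKSUM_FILE   (hypothetical helper)
# Returns 0 when every file listed in CHECKSUM_FILE matches its SHA-256 digest.
verify_bundle() {
    # Run in a subshell so the caller's working directory is left unchanged.
    # --status suppresses output; only the exit code reports the result.
    ( cd "$1" && sha256sum -c --status "$2" )
}
```

Example use: `verify_bundle /root/tmp slurm-22.05.6.rpm.tar.gz.sha256 && echo "archive OK"`.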
Install the configuration files:
Warning
File /etc/slurm/slurm.conf is hashed by the controller and must be exactly the same on all systems or the controller service (slurmctld) will not start.
mkdir -p /etc/slurm/
cp cgroup.conf /etc/slurm/
cp slurm.conf /etc/slurm/
cp slurmdbd.conf /etc/slurm/
chmod 755 /etc/slurm/
chmod 600 /etc/slurm/cgroup.conf
chmod 600 /etc/slurm/slurm.conf
chmod 600 /etc/slurm/slurmdbd.conf
chown -R slurm:slurm /etc/slurm/
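Since slurm.conf must be identical on every system, it can help to compare a digest of the file across nodes. The sketch below uses SHA-256 purely as a convenient comparison; it is not the hash that Slurm computes internally, and `conf_digest` is a hypothetical helper name.

```shell
# conf_digest FILE   (hypothetical helper)
# Prints the SHA-256 digest of FILE so it can be compared between systems.
conf_digest() {
    # sha256sum prints "<digest>  <filename>"; keep only the digest field.
    sha256sum "$1" | awk '{print $1}'
}
```

Example use: run `conf_digest /etc/slurm/slurm.conf` on the controller and on each node; every system should print the same value.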
Create log files:
mkdir -p /var/log/slurm/
touch /var/log/slurm/slurm.jobcomp.log
touch /var/log/slurm/slurmctld.log
touch /var/log/slurm/slurmdbd.log
chmod 755 /var/log/slurm/
chmod 644 /var/log/slurm/*.log
chown -R slurm:slurm /var/log/slurm
Create spool files:
mkdir -p /var/spool/slurmctld/
mkdir -p /var/spool/slurmdbd/
chmod 700 /var/spool/slurmctld
chmod 700 /var/spool/slurmdbd
chown -R slurm:slurm /var/spool/slurmctld
chown -R slurm:slurm /var/spool/slurmdbd
Setup log rotation:
cat >> /etc/logrotate.d/slurm <<EOL
/var/log/slurm/slurm.jobcomp.log
/var/log/slurm/slurmctld.log
/var/log/slurm/slurmdbd.log {
    missingok
    notifempty
    rotate 4
    weekly
    create
}
EOL
chmod 644 /etc/logrotate.d/slurm
MariaDB#
Enable MariaDB service:
systemctl enable --now mariadb
Secure databases:
Important
Replace MARIADB_ROOT_PASSWD with a confidential, secure password that will be used for the MariaDB root account.
# Enter current password for root (enter for none): <enter>
# Set root password? [Y/n] Y
# New password: MARIADB_ROOT_PASSWD
# Re-enter new password: MARIADB_ROOT_PASSWD
# Remove anonymous users? [Y/n] Y
# Disallow root login remotely? [Y/n] Y
# Remove test database and access to it? [Y/n] Y
# Reload privilege tables now? [Y/n] Y
mysql_secure_installation
Modify the innodb configuration:
mkdir -p /etc/my.cnf.d/
cat >> /etc/my.cnf.d/innodb.cnf <<EOL
[mysqld]
innodb_buffer_pool_size=1024M
innodb_log_file_size=64M
innodb_lock_wait_timeout=900
EOL
Delete MariaDB logs and restart service:
systemctl stop mariadb
rm -f /var/lib/mysql/ib_logfile?
systemctl start mariadb
Create accounting database for Slurm:
Important
Replace MARIADB_ROOT_PASSWD with the password set in the "Secure databases" step.
Replace MARIADB_SLURM_USER_PASSWD with the password set in the "Set a database password for Slurm Accounting" step.
mysql -u root -pMARIADB_ROOT_PASSWD --execute "grant all on slurm_acct_db.* TO 'slurm'@'localhost' identified by 'MARIADB_SLURM_USER_PASSWD' with grant option;"
mysql -u root -pMARIADB_ROOT_PASSWD --execute "create database slurm_acct_db;"
Firewall#
Configure firewalld rules:
Important
Replace the IPv4 Address and Subnet mask with the value of your network.
systemctl enable --now firewalld
firewall-cmd --remove-port=3306/tcp --permanent 2> /dev/null
firewall-cmd --remove-port=3306/udp --permanent 2> /dev/null
firewall-cmd --zone=public --add-source=192.168.1.0/24 --permanent
firewall-cmd --zone=public --add-port={6817/tcp,6818/tcp,6819/tcp} --permanent
firewall-cmd --reload
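The `--add-port={...}` form relies on Bash brace expansion, which turns the single word into three separate `--add-port=` options (6817, 6818, and 6819 are the default slurmctld, slurmd, and slurmdbd ports). In a shell without brace expansion, the same arguments can be built explicitly. This is a minimal sketch; `port_args` is a hypothetical helper name.

```shell
# port_args PORT...   (hypothetical helper)
# Prints one "--add-port=PORT/tcp" option per argument, separated by spaces.
port_args() {
    args=""
    for port in "$@"; do
        args="$args --add-port=${port}/tcp"
    done
    # Remove the single leading space left by the loop.
    echo "${args# }"
}
```

Example use: `firewall-cmd --zone=public $(port_args 6817 6818 6819) --permanent`. The command substitution is left unquoted on purpose so that each option becomes its own argument.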
Slurm Controller Services#
Enable Slurm Controller services:
systemctl enable --now slurmdbd
systemctl enable --now slurmctld
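Once both services are enabled, their state can be checked in the same order they were started. The sketch below assumes `systemctl is-active --quiet` as the check; `check_services` and the `CHECK_CMD` override variable are hypothetical names introduced here so the check command can be swapped out.

```shell
# check_services SERVICE...   (hypothetical helper)
# Reports whether each service is active; returns non-zero if any is not.
# CHECK_CMD is a hypothetical override for the command used to test a service.
check_services() {
    failed=0
    for svc in "$@"; do
        # Left unquoted on purpose so the default splits into command and options.
        if ${CHECK_CMD:-systemctl is-active --quiet} "$svc"; then
            echo "$svc: active"
        else
            echo "$svc: NOT active"
            failed=1
        fi
    done
    return "$failed"
}
```

Example use: `check_services slurmdbd slurmctld`. If a service is reported as not active, the log files under /var/log/slurm/ are the first place to look.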