## How do I update slurm's configuration?
The slurm configuration needs to be modified on the slurm master node (slurm-01) as root. The configuration files are located in /etc/slurm:

* allowed_devices_file.conf : devices allowed by slurm
* cgroup : folder that contains the configuration for cgroup suspension and affinity
* cgroup.conf : cgroups setup
* epilog : folder that contains the scripts to be executed in the epilog stage
* gres.conf : configuration of the Intel MIC and GPU nodes (see the sketch after this list)
* partitions.conf : configuration of the partitions (aka queues)
* nodes.conf : configuration of the compute nodes
* prolog : folder that contains the scripts to be executed in the prolog stage
* slurm.conf : main config file

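A gres.conf entry maps a generic resource (GRES) name to the device files that provide it on each node. A minimal sketch, assuming hypothetical node names and device paths (adapt to the real hardware):
```
# Hypothetical GPU nodes, each exposing two NVIDIA devices
NodeName=gpu[01-02] Name=gpu Type=tesla File=/dev/nvidia[0-1]
# Hypothetical Intel MIC node exposing one coprocessor
NodeName=mic01 Name=mic File=/dev/mic0
```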
After applying changes to the slurm configuration, you will need to propagate the changed files across the cluster and, after that, run “scontrol reconfigure”. It is good practice to review the system logs and ensure that everything is working correctly, and also to keep this configuration under source control.

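A minimal sketch of that workflow, run as root on slurm-01 (the node list is taken from sinfo; the log path is an assumption and may differ on your system):
```
# Copy the configuration to every node known to slurm, then reconfigure
for node in $(sinfo -h -o "%n" | sort -u); do
    scp /etc/slurm/slurm.conf ${node}:/etc/slurm/
done
scontrol reconfigure
# Review the controller log afterwards (path is an assumption)
tail -n 50 /var/log/slurm/slurmctld.log
```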
## How do I suspend and resume jobs?
```
scontrol suspend <jobid>
scontrol resume <jobid>
```
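To confirm the state change, check the job with squeue (job id 5001 is just an example):
```
scontrol suspend 5001
squeue -j 5001 -o "%i %T"   # state should show SUSPENDED
scontrol resume 5001
squeue -j 5001 -o "%i %T"   # state should show RUNNING again
```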
## How do I create a reservation?
```
scontrol create res StartTime=2014-04-01T08:00:00 Duration=5:00:00 Users=jordi NodeCnt=338
Reservation created: jordi_1
```
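Once created, the reservation can be inspected to verify its start time, node count and users:
```
scontrol show reservation jordi_1
```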
Update a reservation
```
scontrol update Reservation=jordi_1 Flags=Maint NodeCnt=20
```

Delete a reservation
```
scontrol delete Reservation=jordi_1
```
## How do I use a previously created reservation?
Submit a job and use a reservation to allocate resources (e.g. reservation jordi_1).
```
sbatch --reservation=jordi_1 submitscript.sl
```
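The same flag works for interactive allocations; a quick way to check that the reservation is usable:
```
srun --reservation=jordi_1 -N1 hostname
```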
## How do I update a node status?
Generally only “DOWN”, “DRAIN”, “FAIL” and “RESUME” should be used. Be aware that some of these states require a reason to be given.
Resume
```
scontrol update NodeName=wm074 state=RESUME
```
Drain/Down
```
scontrol update NodeName=wm[081-088] state=DOWN reason="not yet integrated into slurm cluster"
```
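To verify the resulting state and the recorded reason:
```
sinfo -n wm[081-088] -o "%n %T %E"   # node, state and reason columns
```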
## How do I check that high-availability is working?

```
scontrol ping
Slurmctld(primary/backup) at slurm01/slurm02 are UP/UP
```
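To exercise the failover itself rather than just ping both daemons, scontrol can ask the backup controller to take over; use with care on a production system:
```
scontrol takeover   # the backup slurmctld requests control from the primary
scontrol ping       # check the reported state of both controllers afterwards
```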
## How do I dump the current configuration in memory?
```
scontrol show config
```
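The output is long, so it is often combined with grep to check a single parameter, for example:
```
scontrol show config | grep -i SchedulerType
```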
## How do I modify a requested job allocation?
If changing the time limit of a job or step, either specify a new time limit value or precede the time with a “+” or “-” to increment or decrement the current time limit (e.g. “TimeLimit=+30”). In order to increment or decrement the current time limit, the JobId or StepId specification must precede the TimeLimit specification.
```
scontrol update jobid=5001 TimeLimit=+10:00:00 # This will increase the time limit by 10 hours
```
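The new limit can be verified on the job itself:
```
scontrol show job 5001 | grep -i TimeLimit
```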
## How do I increase the priority of a job?
```
scontrol update jobid=5001 priority=3000
```
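This sets the job's priority to an absolute value. When the multifactor priority plugin is in use, the current priority and its per-factor breakdown can be inspected with:
```
sprio -l -j 5001
```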
## How do I back up and restore the slurm accounting database?
* Generate a backup file with the QoS and accounting associations: sacctmgr dump mycluster File=/var/local/backup/slurm/acct_backup-XXXXX.cfg
* Stop the slurmdbd daemon on slurmdb01: systemctl stop slurmdbd
* Drop the database: mysql -u slurm --password=XXXXXXXXXX slurm_acct_db -e "DROP DATABASE slurm_acct_db;"
* Remove all the files in the spool folder: rm -fr /var/spool/slurm/*
* Start slurmdbd on slurmdb01: systemctl start slurmdbd
* Create the cluster 'mycluster' in the slurm database from slurmdb01: sacctmgr create cluster mycluster
* Start slurmctld in clean mode on the master node (slurm01), wait for 60 seconds and press Ctrl+C: slurmctld -D -c -vvv
* Load the QoS and accounting associations from the latest backup file: sacctmgr load /var/local/backup/slurm/acct_backup-XXXXX.cfg
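A minimal sketch for automating the first step with a date-stamped file name (the naming convention is an assumption):
```
# Hypothetical daily dump of the QoS and accounting associations
sacctmgr dump mycluster File=/var/local/backup/slurm/acct_backup-$(date +%Y%m%d).cfg
```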