Commit ab10200

Create 10-most-common-administrator-operations.md
1 parent 3ed7897 commit ab10200

1 file changed

+77
-0
lines changed

@@ -0,0 +1,77 @@
## How do I update Slurm's configuration?
The Slurm configuration must be modified as root on the Slurm master node (slurm-01). The configuration files are located in /etc/slurm:

* allowed_devices_file.conf : devices allowed by Slurm
* cgroup : folder containing configuration for cgroup suspension and affinity
* cgroup.conf : cgroups setup
* epilog : folder containing the scripts executed in the epilog stage
* gres.conf : configuration of Intel MIC and GPU nodes
* partitions.conf : configuration of the partitions (aka queues)
* nodes.conf : configuration of the compute nodes
* prolog : folder containing the scripts executed in the prolog stage
* slurm.conf : main config file

After changing the Slurm configuration, propagate the changes across the cluster and then run “scontrol reconfigure”. It is good practice to review the system logs afterwards to confirm that everything is working as expected, and to keep this configuration under source control.

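The propagate-then-reconfigure step could be scripted roughly as follows. This is only a sketch: it assumes passwordless root SSH from the master to every node and that rsync is installed everywhere, neither of which is stated above. The function name `push_slurm_conf` is hypothetical.

```shell
#!/bin/sh
# Sketch: copy /etc/slurm to each listed node, then ask the daemons to re-read it.
# Assumes passwordless root SSH and rsync on every node (an assumption, not a given).
push_slurm_conf() {
    conf_dir=/etc/slurm
    for node in "$@"; do
        # -a preserves permissions, ownership and timestamps
        rsync -a "$conf_dir/" "root@$node:$conf_dir/" || {
            echo "sync to $node failed; not reconfiguring" >&2
            return 1
        }
    done
    # Reconfigure only once every node has received the new files
    scontrol reconfigure
}
```

It would be called with the list of nodes to update, for example `push_slurm_conf wm074 wm075`. Stopping at the first failed copy avoids reconfiguring a cluster whose nodes hold different versions of the files.
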
## How do I suspend and resume jobs?
```
scontrol suspend <jobid>
scontrol resume <jobid>
```
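When several jobs need to be suspended at once, the same command can be driven from `squeue` output. The sketch below suspends every running job of one user; the function name `suspend_user_jobs` is hypothetical, and it must be run by an account with permission to suspend those jobs.

```shell
#!/bin/sh
# Sketch: suspend every running job owned by one user.
suspend_user_jobs() {
    user=$1
    # -h drops the header, -t RUNNING filters by state, -o %i prints only the job ID
    squeue -h -u "$user" -t RUNNING -o %i | while read -r jobid; do
        scontrol suspend "$jobid"
    done
}
```

For example, `suspend_user_jobs jordi` would issue one `scontrol suspend` per running job of user jordi.
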
## How do I create a reservation?
```
scontrol create res StartTime=2014-04-01T08:00:00 Duration=5:00:00 Users=jordi NodeCnt=338
Reservation created: jordi_1
```
Update a reservation:
```
scontrol update Reservation=jordi_1 Flags=Maint NodeCnt=20
```

Delete a reservation:
```
scontrol delete Reservation=jordi_1
```
## How do I use a previously created reservation?
Submit a job with the --reservation option to allocate its resources from a reservation (e.g. reservation jordi_1):
```
sbatch --reservation=jordi_1 submitscript.sl
```
## How do I update a node's status?
Generally only “DOWN”, “DRAIN”, “FAIL” and “RESUME” should be used. Be aware that some of them require a reason, as in the DOWN example below.

Resume:
```
scontrol update NodeName=wm074 state=RESUME
```
Drain/Down:
```
scontrol update NodeName=wm[081-088] state=DOWN reason="not yet integrated into slurm cluster"
```
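Because a missing reason is an easy mistake, a small wrapper can refuse to run without one. This is a sketch only; the function name `drain_node` and its argument order are hypothetical.

```shell
#!/bin/sh
# Sketch: drain one node (or a node range) and insist on a reason.
drain_node() {
    nodes=$1
    reason=$2
    if [ -z "$reason" ]; then
        echo "usage: drain_node <nodelist> <reason>" >&2
        return 2
    fi
    # The quoted expansion keeps a multi-word reason as a single argument
    scontrol update NodeName="$nodes" State=DRAIN Reason="$reason"
}
```

A call such as `drain_node 'wm[081-088]' "not yet integrated into slurm cluster"` then maps onto the same `scontrol update` form shown above, with DRAIN in place of DOWN.
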
## How do I check that high-availability is working?

```
scontrol ping
Slurmctld(primary/backup) at slurm01/slurm02 are UP/UP
```
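For monitoring, the ping output can be turned into an exit status. The sketch below assumes the output contains the words UP and DOWN as in the sample above; the exact wording may differ between Slurm versions, so the pattern should be checked against the local output. The function name `check_slurmctld_ha` is hypothetical.

```shell
#!/bin/sh
# Sketch: return 0 only if no controller is reported DOWN.
# Assumes "UP"/"DOWN" appear in the output as in the sample above.
check_slurmctld_ha() {
    status=$(scontrol ping) || return 1
    echo "$status"
    case $status in
        *DOWN*) return 1 ;;   # at least one controller is down
        *UP*)   return 0 ;;   # every reported controller is up
        *)      return 1 ;;   # unrecognised output: treat as a failure
    esac
}
```
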
## How do I dump the current configuration in memory?
```
scontrol show config
```
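The dump is long, so it can help to pull out a single parameter. The sketch below assumes the usual `Name = Value` layout of the dump and matches the parameter name exactly (case-sensitive); the function name `slurm_conf_value` is hypothetical.

```shell
#!/bin/sh
# Sketch: print the in-memory value of one configuration parameter.
# Assumes each line of the dump looks like "Name   = Value".
slurm_conf_value() {
    key=$1
    scontrol show config | awk -v k="$key" '
        {
            n = index($0, "=")            # split on the first "=" only
            if (n == 0 || found) next     # skip lines without "=" and later matches
            name = substr($0, 1, n - 1)
            sub(/[ \t]+$/, "", name)      # trim padding after the name
            if (name == k) {
                val = substr($0, n + 1)
                sub(/^[ \t]+/, "", val)   # trim padding before the value
                print val
                found = 1
            }
        }'
}
```

For example, `slurm_conf_value SlurmctldTimeout` prints only that parameter's current value. Splitting on the first `=` keeps values that themselves contain `=` intact.
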
## How do I modify a requested job allocation?
To change the time limit of a job, either specify a new time limit value or precede the time with a “+” or “-” to increment or decrement the current time limit (e.g. “TimeLimit=+30”). To increment or decrement the current time limit, the JobId specification must precede the TimeLimit specification. The same applies to job steps, using StepId instead of JobId.
```
scontrol update jobid=5001 TimeLimit=+10:00:00 # increases the time limit by 10 hours
```
## How do I increase the priority of a job?
```
scontrol update jobid=5001 priority=3000
```
## Backup
* Generate a backup file with the QoS and accounting associations: sacctmgr dump mycluster File=/var/local/backup/slurm/acct_backup-XXXXX.cfg
* Stop the slurmdbd daemon on slurmdb01: systemctl stop slurmdbd
* Drop the database: mysql -u slurm --password=XXXXXXXXXX slurm_acct_db -e "DROP DATABASE slurm_acct_db;"
* Remove all the files in the spool folder: rm -fr /var/spool/slurm/*
* Start slurmdbd on slurmdb01: systemctl start slurmdbd
* Create the cluster 'mycluster' in the Slurm database from slurmdb01: sacctmgr create cluster mycluster
* Start slurmctld in clean mode on the master node (slurm01), wait 60 seconds and press Ctrl+C: slurmctld -D -c -vvv
* Load the QoS and accounting associations from the latest backup file: sacctmgr load /var/local/backup/slurm/acct_backup-XXXXX.cfg
