AllowGroups Not Allowed

Symptom

After AllowGroups is configured on a SLURM partition, members of the allowed group who log in to the cluster cannot see the configured partition.

Troubleshooting

  1. SLURM refreshes its group membership information according to the GroupUpdateForce setting
# scontrol show conf | grep GroupUpdate
GroupUpdateForce = 1
GroupUpdateTime = 600 sec

https://slurm.schedmd.com/slurm.conf.html#OPT_GroupUpdateForce

  • GroupUpdateTime is the interval between synchronizations
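To read these two values from a script, a minimal sketch follows. It assumes the `Key = value` layout shown in the output above. The function name `slurm_conf_value` is hypothetical, and it reads `scontrol show conf` output from standard input so it can be tried without a cluster.

```shell
# Hypothetical helper: print the value of one key from `scontrol show conf`
# output supplied on stdin (assumes "Key = value" lines, as shown above).
slurm_conf_value() {
    awk -F' *= *' -v key="$1" '$1 == key { print $2 }'
}

# Demonstration with the two lines captured above.
printf '%s\n' 'GroupUpdateForce = 1' 'GroupUpdateTime = 600 sec' |
    slurm_conf_value GroupUpdateTime
```

On a cluster node the same function could be fed with `scontrol show conf | slurm_conf_value GroupUpdateTime`.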
  2. The SLURM source code shows that group information is fetched with getgrnam_r()

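The getgrnam() function returns a pointer to a structure containing the broken-out fields of the record in the group database (e.g., the local group file /etc/group, NIS, and LDAP) that matches the group name name.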

https://github.com/SchedMD/slurm/blob/slurm-22.05/src/slurmctld/groups.c
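Because the lookup goes through NSS, `getent group` gives a rough view of what such a call returns on the node. The sketch below checks whether a user appears in the member field of a group line. The helper name `group_lists_user` and the sample group and users are hypothetical, and users who hold the group only as their primary GID are not listed in that field.

```shell
# Hypothetical helper: succeed if USER appears in the member field of a
# group(5) line "name:passwd:gid:user1,user2" read from stdin.
# Users who have the group only as their primary GID are not listed there.
group_lists_user() {
    awk -F: -v user="$1" '
        { n = split($4, members, ",")
          for (i = 1; i <= n; i++) if (members[i] == user) found = 1 }
        END { exit (found ? 0 : 1) }'
}

# On a cluster node this would be fed by NSS (group and user are placeholders):
#   getent group mygroup | group_lists_user alice && echo "alice is listed"
printf 'mygroup:*:5001:alice,bob\n' | group_lists_user bob && echo "bob is listed"
```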

  3. While scontrol reconfig refreshes the cached user information, the system log shows nslcd errors
Jun 03 19:39:37 login03 nslcd[3738472]: [bd3aa3] <passwd(all)> ldap_result() failed: Size limit exceeded
Jun 03 19:39:37 login03 nslcd[3738472]: [0bad55] <passwd(all)> ldap_result() failed: Size limit exceeded
Jun 03 19:39:37 login03 nslcd[3738472]: [175ef2] <passwd(all)> ldap_result() failed: Size limit exceeded
Jun 03 19:39:37 login03 nslcd[3738472]: [1bccf5] <passwd(all)> ldap_result() failed: Size limit exceeded
Jun 03 19:39:37 login03 nslcd[3738472]: [466daf] <passwd(all)> ldap_result() failed: Size limit exceeded
Jun 03 19:39:37 login03 nslcd[3738472]: [5b6c3a] <passwd(all)> ldap_result() failed: Size limit exceeded
Jun 03 19:39:37 login03 nslcd[3738472]: [2a9914] <passwd(all)> ldap_result() failed: Size limit exceeded
Jun 03 19:39:38 login03 nslcd[3738472]: [b3ea93] <passwd(all)> ldap_result() failed: Size limit exceeded
Jun 03 19:39:38 login03 nslcd[3738472]: [9e1415] <passwd(all)> ldap_result() failed: Size limit exceeded
Jun 03 19:39:38 login03 nslcd[3738472]: [d4a2f2] <passwd(all)> ldap_result() failed: Size limit exceeded

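This suggests that the OpenLDAP limit on the number of entries returned per query prevents nslcd from retrieving the user entries beyond that limit.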
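To tell whether the failures stop after a change, it can help to count them rather than read the log by eye. A minimal sketch, assuming the message text shown in the log above; the function name `count_size_limit_errors` is hypothetical and it reads log text from standard input.

```shell
# Hypothetical helper: count nslcd "Size limit exceeded" failures in log
# text read from stdin (for example the output of journalctl -u nslcd).
count_size_limit_errors() {
    # grep -c exits non-zero when the count is 0; keep the function quiet about it.
    grep -cF 'ldap_result() failed: Size limit exceeded' || true
}

# Demonstration with two of the lines captured above.
printf '%s\n' \
    'Jun 03 19:39:37 login03 nslcd[3738472]: [bd3aa3] <passwd(all)> ldap_result() failed: Size limit exceeded' \
    'Jun 03 19:39:38 login03 nslcd[3738472]: [d4a2f2] <passwd(all)> ldap_result() failed: Size limit exceeded' |
    count_size_limit_errors
```

On the login node this could be run as `journalctl -u nslcd --since today | count_size_limit_errors`.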

  4. Try a configuration change to fix the nslcd LDAP query problem

pagesize NUMBER

Set this to a number greater than 0 to request paged results from the LDAP server in accordance with RFC2696. The default (0) is to not request paged results.

This is useful for LDAP servers that contain a lot of entries (e.g. more than 500) and limit the number of entries that are returned with one request. For OpenLDAP servers you may need to set sizelimit size.prtotal=unlimited for allowing more entries to be returned over multiple pages.
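Before changing anything on the client, it is worth checking whether the pagesize option quoted above is already set. The sketch below reads it from an nslcd.conf-style file. The helper name `nslcd_pagesize`, the sample URI and base, and the path /etc/nslcd.conf mentioned in the comment are assumptions, not taken from this cluster.

```shell
# Hypothetical helper: print the pagesize configured in an nslcd.conf-style
# file, or 0 (the documented default) when the option is absent.
nslcd_pagesize() {
    awk '$1 == "pagesize" { value = $2 }
         END { print (value == "" ? 0 : value) }' "$1"
}

# Demonstration on a temporary file; on a real node the argument would be
# the nslcd configuration file (commonly /etc/nslcd.conf, an assumption here).
conf=$(mktemp)
printf '%s\n' 'uri ldap://ldap.example.com/' 'base dc=example,dc=com' 'pagesize 1000' > "$conf"
nslcd_pagesize "$conf"
rm -f "$conf"
```

nslcd reads its configuration at startup, so a changed pagesize only applies after the service is restarted.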

Solution

  1. Log in to the LDAP node and raise the query size limit
ssh ldap
cd /etc/openldap
vim sizelimit.ldif

sizelimit.ldif

dn: cn=config
changetype: modify
replace: olcSizeLimit
olcSizeLimit: 5000
  2. Save the file and apply it
ldapmodify -Y EXTERNAL -H ldapi:/// -f sizelimit.ldif
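The file creation and the ldapmodify call above can also be done without an editor. A minimal sketch: the helper name `make_sizelimit_ldif` is hypothetical, and it deliberately accepts only plain integers, although olcSizeLimit supports other forms as well.

```shell
# Hypothetical helper: print the same LDIF as sizelimit.ldif for a given limit.
make_sizelimit_ldif() {
    limit=$1
    case $limit in
        ''|*[!0-9]*) echo "limit must be a non-negative integer" >&2; return 1 ;;
    esac
    cat <<EOF
dn: cn=config
changetype: modify
replace: olcSizeLimit
olcSizeLimit: $limit
EOF
}

make_sizelimit_ldif 5000
# To apply it directly on the LDAP node (not executed here); ldapmodify
# reads the LDIF from stdin when no -f option is given:
#   make_sizelimit_ldif 5000 | ldapmodify -Y EXTERNAL -H ldapi:///
```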
  3. Verify that the change took effect
# ldapsearch -x |grep dn: |wc -l
2429
  4. Verify the nslcd service
# getent passwd |wc -l
1984
  • journalctl -u nslcd -f no longer shows any errors in the log
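A line count alone does not show whether a particular user is covered by the enumeration. A lookup by name such as `getent passwd alice` is a single-entry query and may succeed even when the full listing is truncated, so the sketch below checks the enumerated output instead. The helper name `passwd_has_user` and the sample accounts are hypothetical.

```shell
# Hypothetical helper: succeed if USER is present in passwd(5)-format lines
# read from stdin (for example the output of `getent passwd`).
passwd_has_user() {
    awk -F: -v user="$1" '$1 == user { found = 1 }
                          END { exit (found ? 0 : 1) }'
}

# On a cluster node (alice is a placeholder user name):
#   getent passwd | passwd_has_user alice || echo "alice missing from enumeration"
printf '%s\n' 'root:x:0:0:root:/root:/bin/bash' 'alice:*:10001:5001::/home/alice:/bin/bash' |
    passwd_has_user alice && echo "alice is enumerated"
```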
  5. Refresh the SLURM cache
scontrol reconfig
  6. Verify that the SLURM configuration has taken effect
$ sinfo
PARTITION       AVAIL TIMELIMIT NODES STATE  NODELIST
7542-64C-512G   up    infinite      3 alloc  fat[07-09]
6126-24C-768G   up    infinite      1 mix    fat05
6126-24C-768G   up    infinite      1 alloc  fat06
gpu_v100        up    infinite      2 down*  gpu[13-14]
gpu_v100        up    infinite      1 drng   gpu05
gpu_v100        up    infinite      3 drain  gpu[12,16-17]
gpu_v100        up    infinite      5 mix    gpu[06-07,10-11,20]
normal_test     up    infinite      1 idle   fat01
qigl            up    infinite      1 idle   gpu18
2680v4-28C-256G up    infinite      2 idle   fat[02,04]
normal          up    infinite      2 alloc  c02n[01-02]
normal          up    infinite     30 idle   c02n[03-16],c03n[01-16]
6240-36C-192G   up    infinite     13 alloc  sd01n[02-03],sd02n[01,03-04],sd03n[03-04],sd04n[01-03],sd05n04,sd06n02,sd07n04
6240-36C-192G   up    infinite     13 idle   sd01n[01,04],sd02n02,sd03n[01-02],sd04n04,sd05n03,sd06n[01,03-04],sd07n[01-03]
6126-24C-192G   up    infinite     10 idle   c07n[07-16]
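To check a single partition without scanning the listing, the sinfo output can be filtered. A minimal sketch, assuming the default column layout shown above with the partition name in the first column; the helper name `partition_visible` is hypothetical.

```shell
# Hypothetical helper: succeed if PARTITION appears in the first column of
# `sinfo` output read from stdin. A trailing "*" marks the default partition.
partition_visible() {
    awk -v part="$1" 'NR > 1 { name = $1; sub(/\*$/, "", name)
                               if (name == part) found = 1 }
                      END { exit (found ? 0 : 1) }'
}

# Demonstration with two rows from the listing above.
printf '%s\n' \
    'PARTITION AVAIL TIMELIMIT NODES STATE NODELIST' \
    'normal_test up infinite 1 idle fat01' \
    'qigl up infinite 1 idle gpu18' |
    partition_visible qigl && echo "qigl is visible"
```

Run as a member of the allowed group, `sinfo | partition_visible qigl` would then confirm that the restricted partition is listed for that user.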