Symptom
After AllowGroups was configured on a SLURM partition, members of the configured group logged in to the cluster but could not see the partition.
Troubleshooting
SLURM refreshes its cached group membership information according to the GroupUpdateForce and GroupUpdateTime settings:
    GroupUpdateForce = 1
    GroupUpdateTime = 600 sec
https://slurm.schedmd.com/slurm.conf.html#OPT_GroupUpdateForce
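The SLURM source code shows that group information is obtained with getgrnam_r(), the reentrant variant of getgrnam().
The getgrnam() function returns a pointer to a structure containing the broken-out fields of the record in the group database (for example, the local group file /etc/group, NIS, and LDAP) that matches the group name name.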
https://github.com/SchedMD/slurm/blob/slurm-22.05/src/slurmctld/groups.c
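The same NSS group database that getgrnam_r() reads can be queried from the shell with getent group, which is a quick way to see what the group lookup returns on a given node. The sketch below is only an illustration: the helper names group_members and is_group_member, the group hpc_users, and the user alice are hypothetical. It parses a getent-style record (name:passwd:gid:members) from standard input, so it only sees members listed explicitly in the group record, not users who have the group as their primary group.

```shell
# Print the members listed in a getent-style group record read from stdin,
# one per line. Record format: name:passwd:gid:member1,member2,...
group_members() {
    awk -F: '{
        n = split($4, m, ",")
        for (i = 1; i <= n; i++) if (m[i] != "") print m[i]
    }'
}

# Succeed if user $1 is listed in the group record read from stdin.
is_group_member() {
    group_members | grep -qxF -- "$1"
}

# Usage on a cluster node (hpc_users and alice are placeholder names):
#   getent group hpc_users | is_group_member alice && echo "alice is listed"
```

If the user is missing here, the problem lies in the NSS/LDAP layer rather than in SLURM itself.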
When scontrol reconfig was run to refresh the cached user information, nslcd reported errors in the system log:
    Jun 03 19:39:37 login03 nslcd[3738472]: [bd3aa3] <passwd(all)> ldap_result() failed: Size limit exceeded
    Jun 03 19:39:37 login03 nslcd[3738472]: [0bad55] <passwd(all)> ldap_result() failed: Size limit exceeded
    Jun 03 19:39:37 login03 nslcd[3738472]: [175ef2] <passwd(all)> ldap_result() failed: Size limit exceeded
    Jun 03 19:39:37 login03 nslcd[3738472]: [1bccf5] <passwd(all)> ldap_result() failed: Size limit exceeded
    Jun 03 19:39:37 login03 nslcd[3738472]: [466daf] <passwd(all)> ldap_result() failed: Size limit exceeded
    Jun 03 19:39:37 login03 nslcd[3738472]: [5b6c3a] <passwd(all)> ldap_result() failed: Size limit exceeded
    Jun 03 19:39:37 login03 nslcd[3738472]: [2a9914] <passwd(all)> ldap_result() failed: Size limit exceeded
    Jun 03 19:39:38 login03 nslcd[3738472]: [b3ea93] <passwd(all)> ldap_result() failed: Size limit exceeded
    Jun 03 19:39:38 login03 nslcd[3738472]: [9e1415] <passwd(all)> ldap_result() failed: Size limit exceeded
    Jun 03 19:39:38 login03 nslcd[3738472]: [d4a2f2] <passwd(all)> ldap_result() failed: Size limit exceeded
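The likely cause is OpenLDAP's limit on the number of entries returned by a single query: nslcd cannot retrieve the user entries beyond that limit.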
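To confirm how often the limit is being hit, the nslcd log can be filtered for this specific error. The following is a minimal sketch: the function name count_size_limit_errors is hypothetical, and it simply counts matching lines on standard input. The <passwd(all)> tag in these messages appears to correspond to a request that enumerates every passwd entry, which is the kind of query most likely to exceed a server-side size limit.

```shell
# Count nslcd "Size limit exceeded" errors in log text read from stdin.
# Prints 0 (and still succeeds) when there are no matches.
count_size_limit_errors() {
    grep -cF 'ldap_result() failed: Size limit exceeded' || true
}

# Usage on the affected node:
#   journalctl -u nslcd --since "1 hour ago" | count_size_limit_errors
```

A count that rises each time the cache is refreshed points at the enumeration query rather than at individual user lookups.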
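The first attempt was to fix nslcd's LDAP queries by changing its configuration. The nslcd.conf man page describes the pagesize option: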
pagesize NUMBER
Set this to a number greater than 0 to request paged results from the LDAP server in accordance with RFC2696. The default (0) is to not request paged results.
This is useful for LDAP servers that contain a lot of entries (e.g. more than 500) and limit the number of entries that are returned with one request. For OpenLDAP servers you may need to set sizelimit size.prtotal=unlimited for allowing more entries to be returned over multiple pages.
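As an illustration of the option quoted above, a client-side change might look like the fragment below. The value 500 is an assumption chosen for the example, not a setting taken from this cluster, and the source does not say whether paging alone was enough here. As the man page notes, an OpenLDAP server may still cap the total unless its own size limit is adjusted.

```conf
# /etc/nslcd.conf (fragment)
# Request paged results (RFC 2696), 500 entries per page.
# The page size is an illustrative value, not one from the original setup.
pagesize 500
```

nslcd has to be restarted for the change to take effect, for example with systemctl restart nslcd.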
Solution
Log in to the LDAP node and raise the query size limit:
    ssh ldap
    cd /etc/openldap
    vim sizelimit.ldif
Contents of sizelimit.ldif:
    dn: cn=config
    changetype: modify
    replace: olcSizeLimit
    olcSizeLimit: 5000
Save the file and apply it:
    ldapmodify -Y EXTERNAL -H ldapi:/// -f sizelimit.ldif
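After the modification is applied, the value can be read back from cn=config to confirm that the server accepted it. This is a sketch under the assumption that the same EXTERNAL/ldapi access used for ldapmodify also permits reading cn=config. The helper name current_size_limit is hypothetical and only extracts the attribute value from LDIF text on standard input.

```shell
# Extract the olcSizeLimit value from LDIF text read from stdin.
current_size_limit() {
    sed -n 's/^olcSizeLimit: *//p'
}

# Usage on the LDAP node:
#   ldapsearch -Q -Y EXTERNAL -H ldapi:/// -LLL -b cn=config -s base olcSizeLimit \
#       | current_size_limit
```

With the LDIF shown above applied, the expected value is 5000.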
Verification
Verify the nslcd service
Following the log with journalctl -u nslcd -f shows that the errors no longer appear.
Refresh the SLURM cache
Verify that the SLURM configuration has taken effect:
    $ sinfo
    PARTITION       AVAIL  TIMELIMIT  NODES  STATE  NODELIST
    7542-64C-512G   up     infinite       3  alloc  fat[07-09]
    6126-24C-768G   up     infinite       1  mix    fat05
    6126-24C-768G   up     infinite       1  alloc  fat06
    gpu_v100        up     infinite       2  down*  gpu[13-14]
    gpu_v100        up     infinite       1  drng   gpu05
    gpu_v100        up     infinite       3  drain  gpu[12,16-17]
    gpu_v100        up     infinite       5  mix    gpu[06-07,10-11,20]
    normal_test     up     infinite       1  idle   fat01
    qigl            up     infinite       1  idle   gpu18
    2680v4-28C-256G up     infinite       2  idle   fat[02,04]
    normal          up     infinite       2  alloc  c02n[01-02]
    normal          up     infinite      30  idle   c02n[03-16],c03n[01-16]
    6240-36C-192G   up     infinite      13  alloc  sd01n[02-03],sd02n[01,03-04],sd03n[03-04],sd04n[01-03],sd05n04,sd06n02,sd07n04
    6240-36C-192G   up     infinite      13  idle   sd01n[01,04],sd02n02,sd03n[01-02],sd04n04,sd05n03,sd06n[01,03-04],sd07n[01-03]
    6126-24C-192G   up     infinite      10  idle   c07n[07-16]
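Rather than scanning the full listing by eye, the check for one specific partition can be scripted. The sketch below assumes it is run as a member of the allowed group. The function name partition_visible is hypothetical, and it reads a list of partition names from standard input, stripping the trailing * that sinfo appends to the default partition.

```shell
# Succeed if partition $1 appears in a list of partition names on stdin.
# A trailing '*' (sinfo's marker for the default partition) is ignored.
partition_visible() {
    sed 's/\*$//' | grep -qxF -- "$1"
}

# Usage as the affected user (gpu_v100 is taken from the listing above):
#   sinfo -h -o '%P' | partition_visible gpu_v100 && echo "partition visible"
```

Running this before and after the cache refresh shows directly whether the AllowGroups change has reached the user.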