Slurm Enroot Usage Guide

Interacting with enroot container jobs

  1. Starting a container

Interactive terminal

$ srun -p 1080 \
--container-name=ai-test \
--container-image=/share/home/mark/nvidia+pytorch+21.04-py3.sqsh \
--container-mounts=/share/home/mark/workspace:/workspace \
--container-remap-root \
--container-writable \
--pty bash
root@gpu03:/workspace#
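
Because --container-name is used, the imported container persists after the job ends, so a later srun can reuse it without specifying --container-image again (a sketch; add -w gpu03 if the container store is node-local and you need to land on the same node):

$ srun -p 1080 \
--container-name=ai-test \
--container-mounts=/share/home/mark/workspace:/workspace \
--container-remap-root \
--container-writable \
--pty bash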

jupyter lab

$ srun -p 1080 \
--container-name=ai-test \
--container-image=/share/home/mark/nvidia+pytorch+21.04-py3.sqsh \
--container-mounts=/share/home/mark/workspace:/workspace \
--container-remap-root \
--container-writable \
jupyter lab \
--allow-root \
--no-browser \
--notebook-dir=/workspace \
--ip=0.0.0.0 \
--port=11999
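
To open the notebook from a local browser, one option is an SSH tunnel through a login node to the compute node and port used above (a sketch; login-node stands in for your cluster's login host):

$ ssh -N -L 11999:gpu03:11999 mark@login-node

Then browse to http://localhost:11999 and paste the token printed in the srun output.
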
  2. Terminal interaction (with a running container)

2.1 sattach can attach to the interactive terminal started by the job

$ sattach --pty <jobid.stepid>
root@gpu03:/workspace#

Known limitation: exiting the terminal terminates the job.

2.2 enroot exec can enter the container on the node where the job is running

$ enroot exec <pid> bash

Known limitation: you must first switch to the node where the job is running.

The following script-based approach (built on 2.2) has been tested to connect to a job's container directly from any node:

exec() {
    local job_id="$1"
    local rc="$2"

    # Check arguments
    if [[ -z "$job_id" ]]; then
        echo "Error: Job ID must be specified."
        usage    # 'usage' is assumed to be defined elsewhere in the wrapper script
        exit 1
    fi

    # Default to bash
    if [[ -z "$rc" ]]; then
        rc="/bin/bash"
    fi

    # Get the node the job is running on
    local node
    node=$(scontrol show job "$job_id" | grep -owP 'NodeList=\K\S+')
    if [[ -z "$node" ]]; then
        echo "Error: Failed to get node for job $job_id"
        exit 1
    fi

    # SSH to the node and get the PID of the job step
    local pid
    pid=$(ssh "$node" "scontrol listpids ${job_id}.0 | awk 'NR==2 {print \$1}'")
    if [[ -z "$pid" ]]; then
        echo "Error: Failed to get PID on node $node for job $job_id"
        exit 1
    fi

    echo "Job $job_id is running on node $node with PID $pid"

    # SSH to the node and enter the container
    ssh -t "$node" "export XDG_RUNTIME_DIR=/tmp && enroot exec -e TERM=xterm $pid $rc"
}
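
For example, assuming the function is saved in a wrapper script (hypothetically enroot-connect.sh) that forwards its command-line arguments to the function:

$ ./enroot-connect.sh 123456              # attach with the default /bin/bash
$ ./enroot-connect.sh 123456 nvidia-smi   # or run a single command inside the container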

sbatch job scripts

Jupyter Lab job template

#!/bin/bash

#SBATCH --job-name=${jobname}
#SBATCH --partition=${partition}
#SBATCH -N 1
#SBATCH --cpus-per-task=${cpu_num}
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-task=${gpu_num}
#SBATCH -o %j.out
#SBATCH -e %j.err

# Pick a random starting port
port=$((RANDOM % (65535 - 2048) + 1024))

# Bump the port until a free one is found
while [[ $(ss -tuln | grep ":$port ") ]]; do
    port=$((port + 1))
done

# Resolve the IP address and short hostname of the node running this job
node=$(grep -m 1 "$(hostname -s)" /etc/hosts | awk '{print $1}')
host=$(hostname -s)

relative_path=$(pwd | sed "s|^$HOME/||")

container_workdir="/workspace/$relative_path"

container_name=$(echo "${job_container}" | sed 's/^pyxis_//')

export XDG_RUNTIME_DIR=/tmp

# On CPU-only jobs, disable GPU injection
if [ ${gpu_num} -eq 0 ]; then
    export NVIDIA_VISIBLE_DEVICES=void
fi

srun_common_options="\
--container-name=$container_name \
--container-mounts=$HOME:/workspace \
--container-workdir=$container_workdir \
--container-writable \
"

jupyter_common_options="\
jupyter lab \
--allow-root \
--no-browser \
--notebook-dir=/workspace \
--NotebookApp.base_url=/interact/$host/$port \
--NotebookApp.allow_origin='*' \
--ip=$node \
--port=$port \
"

if [ -n "${enable_root}" ]; then
    eval "srun $srun_common_options --container-remap-root $jupyter_common_options" > out.log 2>&1 &
else
    eval "srun $srun_common_options --no-container-remap-root $jupyter_common_options" > out.log 2>&1 &
fi

# Wait until Jupyter has written its access token to the log
token=$(grep -oP 'token=\K[^ ]+' out.log | head -n 1)
while true; do
    if [[ "$token" == "" ]]; then
        sleep 2
        token=$(grep -oP 'token=\K[^ ]+' out.log | head -n 1)
    else
        break
    fi
done

# Print the access URL to the job output
url="http://$node:$port/interact/$host/$port?token=$token"
echo "$url"

sleep infinity
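
The ${jobname}, ${partition}, ${cpu_num}, ${gpu_num}, ${job_container} and ${enable_root} placeholders are expected to be filled in before submission (e.g. by a portal). A minimal sketch of rendering the template by hand, assuming it is saved as a hypothetical jupyter-template.sbatch:

$ export jobname=jlab partition=1080 cpu_num=4 gpu_num=1 job_container=pyxis_ai-test enable_root=true
$ envsubst '${jobname} ${partition} ${cpu_num} ${gpu_num} ${job_container} ${enable_root}' \
    < jupyter-template.sbatch > jupyter-job.sbatch
$ sbatch jupyter-job.sbatch

Restricting envsubst to the listed placeholders keeps runtime variables such as $HOME, $port and $token in the script untouched.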

SSH job template

#!/bin/bash

#SBATCH --job-name=${job_name}
#SBATCH --partition=${partition}
#SBATCH -N 1
#SBATCH --cpus-per-task=${cpu_num}
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-task=${gpu_num}
#SBATCH -o %j.out
#SBATCH -e %j.err

# Pick a random starting port
port=$((RANDOM % (65535 - 2048) + 1024))

# Bump the port until a free one is found
while [[ $(ss -tuln | grep ":$port ") ]]; do
    port=$((port + 1))
done

# Resolve the IP address and short hostname of the node running this job
node=$(grep -m 1 "$(hostname -s)" /etc/hosts | awk '{print $1}')
host=$(hostname -s)

relative_path=$(pwd | sed "s|^$HOME/||")

container_workdir="/workspace/$relative_path"

container_name=$(echo "${job_container}" | sed 's/^pyxis_//')

export XDG_RUNTIME_DIR=/tmp

# On CPU-only jobs, disable GPU injection
if [ ${gpu_num} -eq 0 ]; then
    export NVIDIA_VISIBLE_DEVICES=void
fi

srun_common_options="\
--container-name=$container_name \
--container-mounts=$HOME/.ssh:$HOME/.ssh \
--container-workdir=$HOME \
--container-writable \
"

ssh_common_options="\
sshd -D -e -p $port
"

# Use the login node as a jump host
jump_node="10.10.22.32"
ssh_cmd="ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -J $USER@$jump_node $USER@$host -p $port"

# Print the connection command to the job output
echo "$ssh_cmd"

eval "srun $srun_common_options --no-container-remap-root $ssh_common_options"

Known issues

How do I convert a Docker container into an enroot container?

  1. Build a hello-world image (hello is a binary executable)
# syntax=docker/dockerfile:1
FROM ubuntu
ADD hello /
CMD ["./hello"]
$ docker build --tag hello .
  2. Create a container and export its filesystem into a directory
$ docker create --name hello-container hello
$ mkdir hello-world
$ docker export hello-container | tar -C hello-world -p -s --same-owner -xv
  • -p: preserve file permissions (keep the original permissions).
  • --same-owner: keep the same file ownership as when the archive was created, restoring ownership information.

tar2sqfs can convert the tar archive into a .sqsh file directly, skipping the intermediate directory step:
https://github.com/AgentD/squashfs-tools-ng

tar2sqfs hello-world.sqsh < hello-world.tar
  3. Convert to a SquashFS filesystem
$ mksquashfs hello-world hello-world.sqsh
  4. Run it with enroot
$ enroot create hello-world hello-world.sqsh
$ enroot start hello-world /hello

Complete workflow without third-party tools

$ docker pull tensorflow/tensorflow
$ docker create --name tensorflow tensorflow/tensorflow
$ docker export tensorflow -o tensorflow.tar
$ mkdir tensorflow
$ tar -xf tensorflow.tar -C tensorflow
$ mksquashfs tensorflow tensorflow.sqsh
$ enroot create tensorflow.sqsh
$ enroot start -w tensorflow
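
enroot can also import an image straight from a registry, which collapses the docker pull/create/export steps above into a single command (a sketch; the generated file name follows enroot's image-naming convention and may differ slightly):

$ enroot import docker://tensorflow/tensorflow
$ enroot create tensorflow+tensorflow.sqsh
$ enroot start -w tensorflow+tensorflow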

Jupyter started from an enroot container: the page reports a kernel error

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/tornado/web.py", line 1704, in _execute
    result = await result
  File "/opt/conda/lib/python3.8/site-packages/tornado/gen.py", line 769, in run
    yielded = self.gen.throw(*exc_info)  # type: ignore
  File "/opt/conda/lib/python3.8/site-packages/notebook/services/sessions/handlers.py", line 69, in post
    model = yield maybe_future(
  File "/opt/conda/lib/python3.8/site-packages/tornado/gen.py", line 762, in run
    value = future.result()
  File "/opt/conda/lib/python3.8/site-packages/tornado/gen.py", line 769, in run
    yielded = self.gen.throw(*exc_info)  # type: ignore
  File "/opt/conda/lib/python3.8/site-packages/notebook/services/sessions/sessionmanager.py", line 88, in create_session
    kernel_id = yield self.start_kernel_for_session(session_id, path, name, type, kernel_name)
  File "/opt/conda/lib/python3.8/site-packages/tornado/gen.py", line 762, in run
    value = future.result()
  File "/opt/conda/lib/python3.8/site-packages/tornado/gen.py", line 769, in run
    yielded = self.gen.throw(*exc_info)  # type: ignore
  File "/opt/conda/lib/python3.8/site-packages/notebook/services/sessions/sessionmanager.py", line 100, in start_kernel_for_session
    kernel_id = yield maybe_future(
  File "/opt/conda/lib/python3.8/site-packages/tornado/gen.py", line 762, in run
    value = future.result()
  File "/opt/conda/lib/python3.8/site-packages/notebook/services/kernels/kernelmanager.py", line 176, in start_kernel
    kernel_id = await maybe_future(self.pinned_superclass.start_kernel(self, **kwargs))
  File "/opt/conda/lib/python3.8/site-packages/jupyter_client/multikernelmanager.py", line 185, in start_kernel
    km, kernel_name, kernel_id = self.pre_start_kernel(kernel_name, kwargs)
  File "/opt/conda/lib/python3.8/site-packages/jupyter_client/multikernelmanager.py", line 170, in pre_start_kernel
    km = self.kernel_manager_factory(connection_file=os.path.join(
  File "/opt/conda/lib/python3.8/site-packages/jupyter_client/multikernelmanager.py", line 87, in create_kernel_manager
    km.shell_port = self._find_available_port(km.ip)
  File "/opt/conda/lib/python3.8/site-packages/jupyter_client/multikernelmanager.py", line 101, in _find_available_port
    tmp_sock.bind((ip, 0))
OSError: [Errno 99] Cannot assign requested address

Cause: (probably) the host network address cannot be reached from inside the Jupyter container.

Fix: installing iproute2 inside the container and restarting the container resolves the issue (sufficient, if perhaps not strictly necessary).

# apt-get update
# apt-get install -y iproute2

Job log reports slurmstepd: error: pyxis: mkdir: cannot create directory ‘/run/user/1006’: Permission denied

Cause: enroot uses XDG_RUNTIME_DIR as the parent of its runtime path; when the environment variable is missing it falls back to /run/user/<uid>, which the job cannot create, hence the permission error.

#ENROOT_RUNTIME_PATH        ${XDG_RUNTIME_DIR}/enroot

ref:
https://github.com/NVIDIA/enroot/blob/master/doc/configuration.md#configuration
https://github.com/NVIDIA/enroot/issues/13

Fix: initialize the environment variable (in the job template):

export XDG_RUNTIME_DIR=/tmp

Alternative via sbatch directives:

#SBATCH --export=none,XDG_RUNTIME_DIR=/tmp
#SBATCH --get-user-env

Container job fails to start with [ERROR] Command not found: nvidia-container-cli

Cause: enroot depends on libnvidia-container.

Fix: refer to the libnvidia-container installation steps.
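
As a sketch, on Ubuntu nodes that already have NVIDIA's libnvidia-container repository configured, the missing binary is provided by the libnvidia-container-tools package:

$ sudo apt-get update
$ sudo apt-get install -y libnvidia-container-tools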

When starting a container on a non-GPU node, you must set export NVIDIA_VISIBLE_DEVICES=void.

ref: https://github.com/NVIDIA/enroot/issues/105

Starting sshd fails with /var/empty must be owned by root and not group or world-writable.

Cause: OpenSSH's privilege separation mechanism checks the ownership of the /var/empty directory.

Fix: enter the container from the command line and edit the /etc/ssh/sshd_config configuration file:

UsePrivilegeSeparation no

Exporting a container fails with Unrecognised xattr prefix lustre.lov

Cause: SquashFS does not support ACLs, so the Lustre-specific lustre.lov extended attribute cannot be stored.

ref: https://github.com/AgentD/squashfs-tools-ng/issues/25

Container fails to start with enroot-mount: failed to mount: /dev/cdrom at /share/<user>/.local/share/enroot/alpine-docker/media/cdrom: Operation not permitted

Cause (tl;dr): mount entries baked into a SquashFS converted from a Docker container interfere with container startup (containers created with enroot import are unaffected).

The official Alpine image ships the following device mounts:

/dev/cdrom	/media/cdrom	iso9660	noauto,ro 0 0
/dev/usbdisk /media/usb vfat noauto,ro 0 0

The compute nodes have no CD-ROM/USB hardware (or no permission to access it), so these mounts fail.

Fix: delete these mount entries from the container's /etc/fstab file.
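
For example (a sketch; the path is the unpacked container root taken from the error message above):

$ sed -i '/\/media\/cdrom\|\/media\/usb/d' ~/.local/share/enroot/alpine-docker/etc/fstab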

Why can't an SSH container be started and connected to as root?

Cause: OpenSSH does not work with the remapped-root setup inside the container.

ref:
https://github.com/NVIDIA/pyxis/issues/85
https://github.com/NVIDIA/enroot/issues/92