Interacting with enroot container jobs
- Start a container
Interactive terminal
    $ srun -p 1080 \
Jupyter Lab
    $ srun -p 1080 \
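The two launch commands above are cut off after the partition flag. As a rough sketch of what such launches can look like, the wrappers below assume the cluster uses the pyxis plugin (which provides `--container-image`); the function names, the image argument, and the Jupyter flags are illustrative choices, not the original commands.

```shell
# Sketch only: function names and arguments are hypothetical.
# --container-image is provided by the pyxis SPANK plugin for Slurm.

# Usage: start_shell IMAGE
start_shell() {
    local image="$1"
    # --pty attaches a pseudo-terminal so the shell is interactive
    srun -p 1080 --container-image="$image" --pty bash
}

# Usage: start_jupyter IMAGE [PORT]
start_jupyter() {
    local image="$1" port="${2:-8888}"
    # Listen on all interfaces so the server is reachable from outside the node
    srun -p 1080 --container-image="$image" \
        jupyter lab --ip=0.0.0.0 --port="$port" --no-browser
}
```

For example, `start_jupyter ./jupyter.sqsh 9000` would start Jupyter Lab from a local SquashFS image (a placeholder path) on port 9000.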
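- Terminal access (for a running container)
2.1 Using sattach
Attaches to the interactive terminal that the job started.
    $ sattach --pty <jobid.stepid>
Known limitation: exiting the terminal ends the job.
2.2 Using enroot exec
Enters the container from the job's compute node.
    $ enroot exec <pid> bash
Known limitation: you must first switch to the job's node.
In testing, the following script (based on 2.2) can connect to the container directly from any node:
    exec() {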
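The script above is truncated after its first line. One possible shape for such a helper is sketched below; it assumes SSH access to compute nodes is allowed and that the container PID is already known. The function name `enroot_attach` is hypothetical (it avoids `exec`, which would shadow the shell builtin).

```shell
# Sketch only: assumes passwordless SSH to the compute node and a known container PID.
# Usage: enroot_attach JOBID PID
enroot_attach() {
    local jobid="$1" pid="$2" node
    # -h drops the header; %N prints the node list (one hostname for a single-node job)
    node="$(squeue -h -j "$jobid" -o %N)"
    if [ -z "$node" ]; then
        echo "job $jobid not found or not running" >&2
        return 1
    fi
    # -t allocates a TTY so the shell inside the container is interactive
    ssh -t "$node" enroot exec "$pid" bash
}
```

A multi-node job would return a node list expression rather than a single hostname, so this sketch only covers single-node jobs.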
sbatch job scripts
Jupyter Lab job template
SSH job template
    #!/bin/bash
Known issues
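How do I convert a docker container to an enroot container?
- Build a hello-world image (hello is a binary executable)
    # syntax=docker/dockerfile:1
    $ docker build --tag hello .
- Convert the container and export the image to a directory
    $ docker create --name hello-container hello
-p: preserve file permissions (keep the original modes).
--same-owner: keep file ownership as it was at export time, restoring the owner information.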
tar2sqfs can convert the tar archive to sqsh directly, skipping the intermediate directory:
https://github.com/AgentD/squashfs-tools-ng
    tar2sqfs hello-world.sqsh < hello-world.tar
- Convert to a SquashFS file system
    $ mksquashfs hello-world hello-world.sqsh
- Run with enroot
    $ enroot create --name hello-world hello-world.sqsh
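The steps above can be chained into one helper. This is a sketch under assumptions: the function name and the `NAME-container` naming are illustrative, `docker export -o` is used in place of whatever the truncated command did, and `--same-owner` only takes effect when tar runs as root.

```shell
# Sketch only: helper name and intermediate file names are hypothetical.
# Usage: docker_to_enroot IMAGE NAME
docker_to_enroot() {
    local image="$1" name="$2"
    # Create a stopped container so its filesystem can be exported
    docker create --name "${name}-container" "$image" || return 1
    docker export -o "${name}.tar" "${name}-container" || return 1
    docker rm "${name}-container" >/dev/null
    # Unpack into a directory, preserving modes (-p) and ownership (--same-owner)
    mkdir -p "$name"
    tar -xpf "${name}.tar" -C "$name" --same-owner || return 1
    # Pack the directory as SquashFS; -noappend overwrites an existing image
    mksquashfs "$name" "${name}.sqsh" -noappend || return 1
    # Register the image as an enroot container root filesystem
    enroot create --name "$name" "${name}.sqsh"
}
```

Calling `docker_to_enroot hello hello-world` would leave `hello-world.sqsh` in the current directory; the container can then be started with `enroot start hello-world`.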
Complete workflow without third-party tools
    $ docker pull tensorflow/tensorflow
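The remainder of this workflow is missing from the page. One way to finish it with only docker and enroot is to let `enroot import` read the image from the local Docker daemon through the `dockerd://` scheme. Whether the original used this route is an assumption; the function name is hypothetical.

```shell
# Sketch only: assumes the local Docker daemon is reachable by the current user.
# Usage: import_from_dockerd IMAGE NAME
import_from_dockerd() {
    local image="$1" name="$2"
    docker pull "$image" || return 1
    # dockerd:// tells enroot to read the image from the local Docker daemon
    enroot import -o "${name}.sqsh" "dockerd://${image}" || return 1
    enroot create --name "$name" "${name}.sqsh"
}
```

For instance, `import_from_dockerd tensorflow/tensorflow tensorflow` would produce `tensorflow.sqsh` and a container named `tensorflow`.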
Jupyter started from an enroot container: the kernel fails in the web page
    Traceback (most recent call last):
      File "/opt/conda/lib/python3.8/site-packages/tornado/web.py", line 1704, in _execute
        result = await result
      File "/opt/conda/lib/python3.8/site-packages/tornado/gen.py", line 769, in run
        yielded = self.gen.throw(*exc_info)  # type: ignore
      File "/opt/conda/lib/python3.8/site-packages/notebook/services/sessions/handlers.py", line 69, in post
        model = yield maybe_future(
      File "/opt/conda/lib/python3.8/site-packages/tornado/gen.py", line 762, in run
        value = future.result()
      File "/opt/conda/lib/python3.8/site-packages/tornado/gen.py", line 769, in run
        yielded = self.gen.throw(*exc_info)  # type: ignore
      File "/opt/conda/lib/python3.8/site-packages/notebook/services/sessions/sessionmanager.py", line 88, in create_session
        kernel_id = yield self.start_kernel_for_session(session_id, path, name, type, kernel_name)
      File "/opt/conda/lib/python3.8/site-packages/tornado/gen.py", line 762, in run
        value = future.result()
      File "/opt/conda/lib/python3.8/site-packages/tornado/gen.py", line 769, in run
        yielded = self.gen.throw(*exc_info)  # type: ignore
      File "/opt/conda/lib/python3.8/site-packages/notebook/services/sessions/sessionmanager.py", line 100, in start_kernel_for_session
        kernel_id = yield maybe_future(
      File "/opt/conda/lib/python3.8/site-packages/tornado/gen.py", line 762, in run
        value = future.result()
      File "/opt/conda/lib/python3.8/site-packages/notebook/services/kernels/kernelmanager.py", line 176, in start_kernel
        kernel_id = await maybe_future(self.pinned_superclass.start_kernel(self, **kwargs))
      File "/opt/conda/lib/python3.8/site-packages/jupyter_client/multikernelmanager.py", line 185, in start_kernel
        km, kernel_name, kernel_id = self.pre_start_kernel(kernel_name, kwargs)
      File "/opt/conda/lib/python3.8/site-packages/jupyter_client/multikernelmanager.py", line 170, in pre_start_kernel
        km = self.kernel_manager_factory(connection_file=os.path.join(
      File "/opt/conda/lib/python3.8/site-packages/jupyter_client/multikernelmanager.py", line 87, in create_kernel_manager
        km.shell_port = self._find_available_port(km.ip)
      File "/opt/conda/lib/python3.8/site-packages/jupyter_client/multikernelmanager.py", line 101, in _find_available_port
        tmp_sock.bind((ip, 0))
    OSError: [Errno 99] Cannot assign requested address
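Cause: (probably) the host network cannot be reached from inside the jupyter container.
Solution: install iproute2 in the container, then restart the container. This was sufficient to fix the error, though it may not be the only fix.
    # apt-get update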
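Job log error: slurmstepd: error: pyxis: mkdir: cannot create directory ‘/run/user/1006’: Permission denied
Cause: enroot uses XDG_RUNTIME_DIR as the parent of its runtime path. When the environment variable cannot be read, it falls back to /run/user/<uid> as the temporary path, and the job user has no permission to create that directory.
    #ENROOT_RUNTIME_PATH ${XDG_RUNTIME_DIR}/enroot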
ref:
https://github.com/NVIDIA/enroot/blob/master/doc/configuration.md#configuration
https://github.com/NVIDIA/enroot/issues/13
Solution: initialize the environment variable (in the job template)
    export XDG_RUNTIME_DIR=/tmp
sbatch option
    #SBATCH --export=none,XDG_RUNTIME_DIR=/tmp
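If the job template should keep a working XDG_RUNTIME_DIR where one exists and only fall back to /tmp otherwise, a small guard like the one below may be enough. It is a sketch; the function name is hypothetical, and /tmp is simply the fallback used above.

```shell
# Sketch only: call this in the job script before any srun/enroot command.
ensure_runtime_dir() {
    # Keep an existing, writable runtime directory; otherwise fall back to /tmp
    if [ -z "${XDG_RUNTIME_DIR:-}" ] || [ ! -w "$XDG_RUNTIME_DIR" ]; then
        export XDG_RUNTIME_DIR=/tmp
    fi
}
```

The guard avoids overriding a valid per-user runtime directory on nodes where the login session has already created one.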
Container job fails to start: [ERROR] Command not found: nvidia-container-cli
Cause: enroot depends on libnvidia-container.
Solution: follow the installation steps.
To start a container on a non-GPU node, set export NVIDIA_VISIBLE_DEVICES=void
ref: https://github.com/NVIDIA/enroot/issues/105
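For job scripts that run on both GPU and non-GPU nodes, the variable can be set conditionally. The sketch below treats the presence of /dev/nvidiactl as the sign that an NVIDIA driver is loaded; that heuristic and the function name are assumptions, not something the enroot documentation prescribes.

```shell
# Sketch only: the device-node check is a heuristic for "this node has a GPU driver".
# Usage: disable_gpu_hook_if_no_gpu [DEVICE_NODE]
disable_gpu_hook_if_no_gpu() {
    local dev="${1:-/dev/nvidiactl}"
    if [ ! -e "$dev" ]; then
        # "void" tells the NVIDIA hook to skip GPU setup entirely
        export NVIDIA_VISIBLE_DEVICES=void
    fi
}
```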
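Starting SSHD fails: /var/empty must be owned by root and not group or world-writable.
Cause: OpenSSH's privilege separation mechanism checks the ownership of the /var/empty directory.
Solution: enter the container from the command line and edit the /etc/ssh/sshd_config configuration file:
    UsePrivilegeSeparation no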
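The edit can also be scripted so it is repeatable when the image is rebuilt. The helper below is a sketch with a hypothetical name; it replaces an existing (possibly commented-out) setting or appends the option if it is absent. Newer OpenSSH releases deprecate UsePrivilegeSeparation, so this likely applies only to older sshd builds in the image.

```shell
# Sketch only: run inside the container against its own sshd_config.
# Usage: set_sshd_option CONFIG_FILE KEY VALUE
set_sshd_option() {
    local conf="$1" key="$2" value="$3" tmp
    tmp="$(mktemp)" || return 1
    if grep -qE "^[#[:space:]]*${key}[[:space:]]" "$conf"; then
        # Rewrite the existing line, whether active or commented out
        sed -E "s|^[#[:space:]]*${key}[[:space:]].*|${key} ${value}|" "$conf" > "$tmp"
    else
        # Option not present: keep the file and append the setting
        cat "$conf" > "$tmp"
        printf '%s %s\n' "$key" "$value" >> "$tmp"
    fi
    # Write back in place so the file keeps its owner and mode
    cat "$tmp" > "$conf"
    rm -f "$tmp"
}
```

Inside the container this would be called as `set_sshd_option /etc/ssh/sshd_config UsePrivilegeSeparation no`.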
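Container export fails: Unrecognised xattr prefix lustre.lov
Cause: SquashFS does not support ACLs, and the Lustre-specific lustre.lov extended attribute cannot be stored in the image.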
ref: https://github.com/AgentD/squashfs-tools-ng/issues/25
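Container start fails: enroot-mount: failed to mount: /dev/cdrom at /share/<user>/.local/share/enroot/alpine-docker/media/cdrom: Operation not permitted
Cause:
tl;dr: mount entries inside a SquashFS converted from a docker container break container startup (containers created with enroot import start normally).
The official alpine image includes mounts for the following devices:
    /dev/cdrom /media/cdrom iso9660 noauto,ro 0 0
In practice the node has no optical drive or USB hardware, or access to it is not permitted, so the mount fails.
Solution: delete the mount entries from the container's /etc/fstab file.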
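Removing the entries can be done with a short filter over the container's fstab. This sketch drops every line whose first field is a /dev/ path, which is broader than the single cdrom entry shown above; the function name is hypothetical, and the path to the container root filesystem has to be supplied by the caller.

```shell
# Sketch only: removes all /dev/* entries, not just /dev/cdrom.
# Usage: strip_device_mounts PATH_TO_FSTAB
strip_device_mounts() {
    local fstab="$1" tmp
    tmp="$(mktemp)" || return 1
    # Keep every line whose first field does not start with /dev/
    awk '$1 !~ /^\/dev\//' "$fstab" > "$tmp" || return 1
    # Write back in place so the file keeps its owner and mode
    cat "$tmp" > "$fstab"
    rm -f "$tmp"
}
```

Comment lines and non-device mounts such as tmpfs are left untouched.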
Why can't an SSH container be started and connected to as root?
Cause: OpenSSH does not support a root user that has been remapped inside the container.
ref:
https://github.com/NVIDIA/pyxis/issues/85
https://github.com/NVIDIA/enroot/issues/92