1. Introduction
On a node where the NVIDIA Container Toolkit is installed, let's first start an ordinary container, exec into it, and run nvidia-smi. The command cannot be found:
# docker run -d --name normal-container nginx:latest
992ed0b4cb7134b7cb528124b4ebed193215f0987ed288a582fb088486a9b67a
# docker exec -ti normal-container bash
root@992ed0b4cb71:/# nvidia-smi
bash: nvidia-smi: command not found
Now add the --gpus all flag to the command above to create another container, exec into it, and run nvidia-smi again. This time it works:
# docker run -d --name nvidia-container --gpus all nginx:latest
81281dc9dc0a7d3c9de5e90dffdfa593975976c5a2a07c7a5ebddfd4e704bbe3
# docker exec -ti nvidia-container bash
root@81281dc9dc0a:/# nvidia-smi
Sun May 18 12:49:09 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.02 Driver Version: 560.94 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4060 Ti On | 00000000:01:00.0 On | N/A |
| 0% 43C P8 8W / 165W | 835MiB / 16380MiB | 4% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 24 G /Xwayland N/A |
+-----------------------------------------------------------------------------------------+
Why does adding a single --gpus all flag to the create command make nvidia-smi available inside nvidia-container? Where does this executable actually come from, and which NVIDIA Container Toolkit components are involved? With these questions in mind, this article tries to build a deeper understanding of the NVIDIA Container Toolkit.
2. Components of the NVIDIA Container Toolkit
Still on the node with the NVIDIA Container Toolkit installed, type nvidia-c and press Tab to complete; you get the following output:
# nvidia-c
nvidia-cdi-hook nvidia-container-cli nvidia-container-runtime nvidia-container-runtime-hook nvidia-container-toolkit nvidia-ctk
These are the executable components of the NVIDIA Container Toolkit. Their source code lives in two repositories:
- https://github.com/NVIDIA/nvidia-container-toolkit : written in Go
- https://github.com/NVIDIA/libnvidia-container : written in C
Besides a project's README and official documentation, a quick way to get oriented in a GitHub codebase is the AI tool DeepWiki. The two repositories above correspond to:
https://deepwiki.com/NVIDIA/nvidia-container-toolkit
https://deepwiki.com/NVIDIA/libnvidia-container
The individual components and their basic roles:
- nvidia-cdi-hook: a Go executable; its code lives in https://github.com/NVIDIA/nvidia-container-toolkit, main entry point: https://github.com/NVIDIA/nvidia-container-toolkit/cmd/nvidia-cdi-hook/main.go. Provides the hooks used in CDI (Container Device Interface) environments; if the environment does not support CDI, nvidia-container-runtime-hook is used instead.
- nvidia-container-cli: a C executable; its code lives in https://github.com/NVIDIA/libnvidia-container, main entry point: https://github.com/NVIDIA/libnvidia-container/src/cli/main.c. The core command-line tool; it does the actual injection (mounting) of the driver libraries, nvidia-smi, and the other driver binaries into the container.
- nvidia-container-runtime: a Go executable; its code lives in https://github.com/NVIDIA/nvidia-container-toolkit, main entry point: https://github.com/NVIDIA/nvidia-container-toolkit/cmd/nvidia-container-runtime. A runtime wrapper that intercepts and extends container creation: it injects a prestart hook (for example nvidia-container-runtime-hook) into the OCI spec and then invokes the underlying runtime such as runc to create the container.
- nvidia-container-runtime-hook: a Go executable; its code lives in https://github.com/NVIDIA/nvidia-container-toolkit, main entry point: https://github.com/NVIDIA/nvidia-container-toolkit/cmd/nvidia-container-runtime-hook. The container's prestart hook, run before the container starts; it mainly assembles arguments and invokes nvidia-container-cli.
- nvidia-container-toolkit: a Go executable; its code lives in https://github.com/NVIDIA/nvidia-container-toolkit, main entry point: https://github.com/NVIDIA/nvidia-container-toolkit/tools/container/nvidia-toolkit/run.go. A helper for installing and wiring up nvidia-container-runtime (for example, updating the Docker daemon configuration and restarting Docker).
- nvidia-ctk: a Go executable; its code lives in https://github.com/NVIDIA/nvidia-container-toolkit, main entry point: https://github.com/NVIDIA/nvidia-container-toolkit/cmd/nvidia-ctk. Provides hook, runtime, cdi, config, and other subcommands; used in CDI, CSV, and graphics scenarios.
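One practical note on wiring before looking at the data flow: dockerd only ever invokes nvidia-container-runtime if it has been registered as a runtime. Running nvidia-ctk runtime configure --runtime=docker (or the nvidia-container-toolkit helper above) edits /etc/docker/daemon.json; a minimal sketch of the resulting registration looks roughly like this (key names follow Docker's daemon.json schema, and your file may contain additional settings):
{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime"
        }
    }
}
If "default-runtime": "nvidia" is also set, containers created without an explicit --runtime=nvidia flag go through this wrapper as well.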
2.1.2 Data flow diagrams
In a docker + runc environment, take the earlier command docker run -d --name normal-container nginx:latest as an example. The data flow when creating the container is as follows:
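A rough textual sketch (assuming the container is created with the default runc runtime; the shim binary name depends on the containerd version):
docker CLI → dockerd → containerd → containerd-shim-runc-v2 → runc → container process (nginx)
Nothing on this path mounts any NVIDIA driver files into the container, which is why nvidia-smi is simply absent from normal-container.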
My environment does not support CDI. When the --gpus all flag is added, only three NVIDIA Container Toolkit components take part in creating the GPU container: nvidia-container-runtime, nvidia-container-runtime-hook, and nvidia-container-cli. The data flow then becomes:
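Again as a rough sketch, assuming nvidia-container-runtime is the runtime that containerd's shim ends up invoking for this container (see the daemon.json note above):
docker CLI (--gpus all) → dockerd → containerd → containerd-shim-runc-v2 → nvidia-container-runtime
  → nvidia-container-runtime injects a prestart hook into the OCI spec, then invokes the real runc to create the container
  → at the prestart stage runc runs nvidia-container-runtime-hook
  → the hook assembles arguments and execs nvidia-container-cli configure
  → nvidia-container-cli mounts the driver libraries, nvidia-smi, and the GPU device nodes into the container rootfs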
3. Source Code Analysis of Key Components
With the flow above in mind, let's verify it from the source code.
The code below is based on https://github.com/NVIDIA/nvidia-container-toolkit@1.17.4
3.1 nvidia-container-runtime
- main entry point
// nvidia-container-toolkit/cmd/nvidia-container-runtime/main.go
func main() {
r := runtime.New()
err := r.Run(os.Args)
if err != nil {
os.Exit(1)
}
}
- The Run function
Run itself does very little: it parses the configuration file, resolves the nvidia-ctk and nvidia-container-runtime-hook settings, builds the runtime object, and calls its Exec method:
// nvidia-container-toolkit/internal/runtime/runtime.go
func (r rt) Run(argv []string) (rerr error) {
...
// Load the configuration file
cfg, err := config.GetConfig()
...
// Resolve the nvidia-container-runtime-hook path
cfg.NVIDIAContainerRuntimeHookConfig.Path = config.ResolveNVIDIAContainerRuntimeHookPath(&logger.NullLogger{}, cfg.NVIDIAContainerRuntimeHookConfig.Path)
...
driver := root.New(
root.WithLogger(r.logger),
root.WithDriverRoot(cfg.NVIDIAContainerCLIConfig.Root),
)
r.logger.Tracef("Command line arguments: %v", argv)
runtime, err := newNVIDIAContainerRuntime(r.logger, cfg, argv, driver)
if err != nil {
return fmt.Errorf("failed to create NVIDIA Container Runtime: %v", err)
}
if printVersion {
fmt.Print("\n")
}
return runtime.Exec(argv)
}
- Parsing the configuration file
The loader first checks the XDG_CONFIG_HOME environment variable. If it is non-empty, the configuration file is $XDG_CONFIG_HOME/nvidia-container-runtime/config.toml; otherwise it defaults to /etc/nvidia-container-runtime/config.toml:
// nvidia-container-toolkit/internal/config/config.go
func GetConfig() (*Config, error) {
cfg, err := New(
WithConfigFile(GetConfigFilePath()),
)
if err != nil {
return nil, err
}
return cfg.Config()
}
// nvidia-container-toolkit/internal/config/config.go
func GetConfigFilePath() string {
// configOverride = XDG_CONFIG_HOME
if XDGConfigDir := os.Getenv(configOverride); len(XDGConfigDir) != 0 {
return filepath.Join(XDGConfigDir, configFilePath) // configFilePath = nvidia-container-runtime/config.toml
}
return filepath.Join("/etc", configFilePath)
}
// nvidia-container-toolkit/internal/config/toml.go
func (t *Toml) Config() (*Config, error) {
cfg, err := t.configNoOverrides()
if err != nil {
return nil, err
}
if err := cfg.assertValid(); err != nil {
return nil, err
}
return cfg, nil
}
// nvidia-container-toolkit/internal/config/toml.go
func (t *Toml) configNoOverrides() (*Config, error) {
cfg, err := GetDefault()
if err != nil {
return nil, err
}
if t == nil {
return cfg, nil
}
if err := t.Unmarshal(cfg); err != nil {
return nil, fmt.Errorf("failed to unmarshal config: %v", err)
}
return cfg, nil
}
// nvidia-container-toolkit/internal/config/config.go
func GetDefault() (*Config, error) {
d := Config{
AcceptEnvvarUnprivileged: true,
SupportedDriverCapabilities: image.SupportedDriverCapabilities.String(),
NVIDIAContainerCLIConfig: ContainerCLIConfig{
LoadKmods: true,
Ldconfig: getLdConfigPath(),
User: getUserGroup(),
},
NVIDIACTKConfig: CTKConfig{
Path: nvidiaCTKExecutable, // nvidiaCTKExecutable = nvidia-ctk
},
NVIDIAContainerRuntimeConfig: RuntimeConfig{
DebugFilePath: "/dev/null",
LogLevel: "info",
Runtimes: []string{"docker-runc", "runc", "crun"},
Mode: "auto",
Modes: modesConfig{
CSV: csvModeConfig{
MountSpecPath: "/etc/nvidia-container-runtime/host-files-for-container.d",
},
CDI: cdiModeConfig{
DefaultKind: "nvidia.com/gpu",
AnnotationPrefixes: []string{cdi.AnnotationPrefix}, // cdi.AnnotationPrefix = cdi.k8s.io/
SpecDirs: cdi.DefaultSpecDirs,
},
},
},
NVIDIAContainerRuntimeHookConfig: RuntimeHookConfig{
Path: NVIDIAContainerRuntimeHookExecutable,
},
}
return &d, nil
}
The default configuration file looks like this:
$ cat /etc/nvidia-container-runtime/config.toml
#accept-nvidia-visible-devices-as-volume-mounts = false
#accept-nvidia-visible-devices-envvar-when-unprivileged = true
disable-require = false
supported-driver-capabilities = "compat32,compute,display,graphics,ngx,utility,video"
#swarm-resource = "DOCKER_RESOURCE_GPU"
[nvidia-container-cli]
#debug = "/var/log/nvidia-container-toolkit.log"
environment = []
#ldcache = "/etc/ld.so.cache"
ldconfig = "@/sbin/ldconfig.real"
load-kmods = true
#no-cgroups = false
#path = "/usr/bin/nvidia-container-cli"
#root = "/run/nvidia/driver"
#user = "root:video"
[nvidia-container-runtime]
#debug = "/var/log/nvidia-container-runtime.log"
log-level = "info"
mode = "auto"
runtimes = ["docker-runc", "runc", "crun"]
[nvidia-container-runtime.modes]
[nvidia-container-runtime.modes.cdi]
annotation-prefixes = ["cdi.k8s.io/"]
default-kind = "nvidia.com/gpu"
spec-dirs = ["/etc/cdi", "/var/run/cdi"]
[nvidia-container-runtime.modes.csv]
mount-spec-path = "/etc/nvidia-container-runtime/host-files-for-container.d"
[nvidia-container-runtime-hook]
path = "nvidia-container-runtime-hook"
skip-mode-detection = false
[nvidia-ctk]
path = "nvidia-ctk"
- Resolving nvidia-container-runtime-hook
The hook is looked up in $PATH and in /usr/local/sbin, /usr/local/bin, /usr/sbin, /usr/bin, /sbin, /bin; if it is not found there, the default /usr/bin/nvidia-container-runtime-hook is used:
func ResolveNVIDIAContainerRuntimeHookPath(logger logger.Interface, nvidiaContainerRuntimeHookPath string) string {
return resolveWithDefault(
logger,
"NVIDIA Container Runtime Hook",
nvidiaContainerRuntimeHookPath, // read from the config file; defaults to nvidia-container-runtime-hook
nvidiaContainerRuntimeHookDefaultPath, // nvidiaContainerRuntimeHookDefaultPath = /usr/bin/nvidia-container-runtime-hook
)
}
- Initializing the driver and the runtime
The driver object is created with cfg.NVIDIAContainerCLIConfig.Root, which defaults to "". The key function for building the runtime is newNVIDIAContainerRuntime: it first locates a lower-level runtime (my machine only has runc, so that is what it finds), then checks whether the command line contains a create subcommand. If it does not, the invocation is passed straight through to the underlying runc; otherwise NewModifyingRuntimeWrapper builds a wrapper and that wrapper's Exec method is called.
// nvidia-container-toolkit/internal/runtime/runtime.go
func (r rt) Run(argv []string) (rerr error) {
...
driver := root.New(
root.WithLogger(r.logger),
root.WithDriverRoot(cfg.NVIDIAContainerCLIConfig.Root), // read from the config file; defaults to "" (the option is commented out in the default config)
)
...
runtime, err := newNVIDIAContainerRuntime(r.logger, cfg, argv, driver)
...
}
// nvidia-container-toolkit/internal/runtime/runtime_factory.go
func newNVIDIAContainerRuntime(logger logger.Interface, cfg *config.Config, argv []string, driver *root.Driver) (oci.Runtime, error) {
// Locate the lower-level runtime: search $PATH and /usr/local/sbin, /usr/local/bin, /usr/sbin, /usr/bin, /sbin, /bin for "docker-runc", "runc", "crun"
// If one is found, wrap it and return it as a pathRuntime object
lowLevelRuntime, err := oci.NewLowLevelRuntime(logger, cfg.NVIDIAContainerRuntimeConfig.Runtimes) // read from the config file; defaults to ["docker-runc", "runc", "crun"]
if err != nil {
return nil, fmt.Errorf("error constructing low-level runtime: %v", err)
}
logger.Tracef("Using low-level runtime %v", lowLevelRuntime.String())
// Check for a create subcommand: scan argv for a standalone "create" argument, skipping any value that follows a -b/--bundle flag
if !oci.HasCreateSubcommand(argv) {
logger.Tracef("Skipping modifier for non-create subcommand")
return lowLevelRuntime, nil
}
ociSpec, err := oci.NewSpec(logger, argv)
if err != nil {
return nil, fmt.Errorf("error constructing OCI specification: %v", err)
}
specModifier, err := newSpecModifier(logger, cfg, ociSpec, driver)
if err != nil {
return nil, fmt.Errorf("failed to construct OCI spec modifier: %v", err)
}
// Create the wrapping runtime with the specified modifier.
r := oci.NewModifyingRuntimeWrapper(
logger,
lowLevelRuntime,
ociSpec,
specModifier,
)
return r, nil
}
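To make the create detection concrete: when Docker is configured with the nvidia runtime, containerd's shim invokes nvidia-container-runtime with exactly the argv it would pass to runc. A create call has roughly the following shape (the bundle path and container ID are illustrative):
nvidia-container-runtime create --bundle /run/containerd/io.containerd.runtime.v2.task/moby/<container-id> <container-id>
Other subcommands (start, kill, delete, ...) do not match HasCreateSubcommand and are forwarded to the lower-level runc unmodified.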
- The wrapper's Exec
By the time we get here, the wrapper's Exec first applies the modifications to the OCI spec and then calls the underlying runc to execute the create command.
// nvidia-container-toolkit/internal/oci/runtime_modifier.go
func (r *modifyingRuntimeWrapper) Exec(args []string) error {
if HasCreateSubcommand(args) {
r.logger.Debugf("Create command detected; applying OCI specification modifications")
err := r.modify()
if err != nil {
return fmt.Errorf("could not apply required modification to OCI specification: %w", err)
}
r.logger.Debugf("Applied required modification to OCI specification")
}
r.logger.Debugf("Forwarding command to runtime %v", r.runtime.String())
return r.runtime.Exec(args)
}
// nvidia-container-toolkit/internal/oci/runtime_modifier.go
func (r *modifyingRuntimeWrapper) modify() error {
_, err := r.ociSpec.Load()
if err != nil {
return fmt.Errorf("error loading OCI specification for modification: %v", err)
}
err = r.ociSpec.Modify(r.modifier)
if err != nil {
return fmt.Errorf("error modifying OCI spec: %v", err)
}
err = r.ociSpec.Flush()
if err != nil {
return fmt.Errorf("error writing modified OCI specification: %v", err)
}
return nil
}
- OCI Spec Modify
Modify first builds a CUDA image object via NewCUDAImageFromSpec; as its WithEnv and WithMounts options show, the information driving the injection comes only from the environment variables and the mount entries. newModeModifier then builds a mode modifier from the mode, cfg, ociSpec, and image, and depending on which modifier types the resolved mode supports, several modifiers may be returned. My environment yields "mode" (which here resolves to stableRuntimeModifier), "graphics" (GraphicsModifier), and "feature-gated" (FeatureGatedModifier), so the spec passes through these three modifiers' Modify methods.
// nvidia-container-toolkit/internal/runtime/runtime_factory.go
func newSpecModifier(logger logger.Interface, cfg *config.Config, ociSpec oci.Spec, driver *root.Driver) (oci.SpecModifier, error) {
rawSpec, err := ociSpec.Load()
...
image, err := image.NewCUDAImageFromSpec(rawSpec)
...
mode := info.ResolveAutoMode(logger, cfg.NVIDIAContainerRuntimeConfig.Mode, image)
modeModifier, err := newModeModifier(logger, mode, cfg, ociSpec, image)
...
var modifiers modifier.List
for _, modifierType := range supportedModifierTypes(mode) {
switch modifierType {
case "mode":
modifiers = append(modifiers, modeModifier)
case "graphics":
graphicsModifier, err := modifier.NewGraphicsModifier(logger, cfg, image, driver)
if err != nil {
return nil, err
}
modifiers = append(modifiers, graphicsModifier)
case "feature-gated":
featureGatedModifier, err := modifier.NewFeatureGatedModifier(logger, cfg, image)
if err != nil {
return nil, err
}
modifiers = append(modifiers, featureGatedModifier)
}
}
return modifiers, nil
}
// nvidia-container-toolkit/internal/config/image/cuda_image.go
type CUDA struct {
env map[string]string
mounts []specs.Mount
}
// nvidia-container-toolkit/internal/config/image/cuda_image.go
func NewCUDAImageFromSpec(spec *specs.Spec) (CUDA, error) {
var env []string
if spec != nil && spec.Process != nil {
env = spec.Process.Env
}
return New(
WithEnv(env),
WithMounts(spec.Mounts),
)
}
- Modify StableRuntimeModifier
stableRuntimeModifier's Modify method is simple: it just adds a prestart hook to the OCI spec. The hook path is taken from the path setting in the config file's nvidia-container-runtime-hook section and defaults to nvidia-container-runtime-hook.
// nvidia-container-toolkit/internal/modifier/stable.go
func (m stableRuntimeModifier) Modify(spec *specs.Spec) error {
// If an NVIDIA Container Runtime Hook already exists, we don't make any modifications to the spec.
if spec.Hooks != nil {
for _, hook := range spec.Hooks.Prestart {
hook := hook
if isNVIDIAContainerRuntimeHook(&hook) {
m.logger.Infof("Existing nvidia prestart hook (%v) found in OCI spec", hook.Path)
return nil
}
}
}
path := m.nvidiaContainerRuntimeHookPath
m.logger.Infof("Using prestart hook path: %v", path)
args := []string{filepath.Base(path)}
if spec.Hooks == nil {
spec.Hooks = &specs.Hooks{}
}
spec.Hooks.Prestart = append(spec.Hooks.Prestart, specs.Hook{
Path: path,
Args: append(args, "prestart"),
})
return nil
}
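After this Modify runs and the spec is flushed back to disk, the bundle's config.json contains a prestart hook entry roughly like the following (assuming the hook resolved to the default /usr/bin location):
"hooks": {
    "prestart": [
        {
            "path": "/usr/bin/nvidia-container-runtime-hook",
            "args": ["nvidia-container-runtime-hook", "prestart"]
        }
    ]
}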
- Modify GraphicsModifier
With the default NVIDIA_DRIVER_CAPABILITIES=compute,utility, no graphics or display capability is requested, so this modifier returns nil and effectively leaves the OCI spec untouched. (It would only become active if the container requested those capabilities, e.g. NVIDIA_DRIVER_CAPABILITIES=compute,utility,graphics or all.)
// nvidia-container-toolkit/internal/modifier/graphics.go
func NewGraphicsModifier(logger logger.Interface, cfg *config.Config, containerImage image.CUDA, driver *root.Driver) (oci.SpecModifier, error) {
if required, reason := requiresGraphicsModifier(containerImage); !required {
logger.Infof("No graphics modifier required: %v", reason)
return nil, nil
}
nvidiaCDIHookPath := cfg.NVIDIACTKConfig.Path
mounts, err := discover.NewGraphicsMountsDiscoverer(
logger,
driver,
nvidiaCDIHookPath,
)
...
// In standard usage, the devRoot is the same as the driver.Root.
devRoot := driver.Root
drmNodes, err := discover.NewDRMNodesDiscoverer(
logger,
containerImage.DevicesFromEnvvars(image.EnvVarNvidiaVisibleDevices),
devRoot,
nvidiaCDIHookPath,
)
...
d := discover.Merge(
drmNodes,
mounts,
)
return NewModifierFromDiscoverer(logger, d)
}
// nvidia-container-toolkit/internal/modifier/graphics.go
func requiresGraphicsModifier(cudaImage image.CUDA) (bool, string) {
if devices := cudaImage.VisibleDevicesFromEnvVar(); len(devices) == 0 {
return false, "no devices requested"
}
// DriverCapabilityGraphics = "graphics"
// DriverCapabilityDisplay = "display"
if !cudaImage.GetDriverCapabilities().Any(image.DriverCapabilityGraphics, image.DriverCapabilityDisplay) {
return false, "no required capabilities requested"
}
return true, ""
}
- Modify FeatureGatedModifier
By default none of NVIDIA_GDS, NVIDIA_MOFED, NVIDIA_NVSWITCH, or NVIDIA_GDRCOPY is set to "enabled", so no extra discoverers are added and the function falls straight through to NewModifierFromDiscoverer.
// nvidia-container-toolkit/internal/modifier/gated.go
func NewFeatureGatedModifier(logger logger.Interface, cfg *config.Config, image image.CUDA) (oci.SpecModifier, error) {
if devices := image.VisibleDevicesFromEnvVar(); len(devices) == 0 {
logger.Infof("No modification required; no devices requested")
return nil, nil
}
var discoverers []discover.Discover
driverRoot := cfg.NVIDIAContainerCLIConfig.Root
devRoot := cfg.NVIDIAContainerCLIConfig.Root
if image.Getenv("NVIDIA_GDS") == "enabled" {
d, err := discover.NewGDSDiscoverer(logger, driverRoot, devRoot)
if err != nil {
return nil, fmt.Errorf("failed to construct discoverer for GDS devices: %w", err)
}
discoverers = append(discoverers, d)
}
if image.Getenv("NVIDIA_MOFED") == "enabled" {
d, err := discover.NewMOFEDDiscoverer(logger, devRoot)
if err != nil {
return nil, fmt.Errorf("failed to construct discoverer for MOFED devices: %w", err)
}
discoverers = append(discoverers, d)
}
if image.Getenv("NVIDIA_NVSWITCH") == "enabled" {
d, err := discover.NewNvSwitchDiscoverer(logger, devRoot)
if err != nil {
return nil, fmt.Errorf("failed to construct discoverer for NVSWITCH devices: %w", err)
}
discoverers = append(discoverers, d)
}
if image.Getenv("NVIDIA_GDRCOPY") == "enabled" {
d, err := discover.NewGDRCopyDiscoverer(logger, devRoot)
if err != nil {
return nil, fmt.Errorf("failed to construct discoverer for GDRCopy devices: %w", err)
}
discoverers = append(discoverers, d)
}
return NewModifierFromDiscoverer(logger, discover.Merge(discoverers...))
}
// nvidia-container-toolkit/internal/modifier/discover.go
func NewModifierFromDiscoverer(logger logger.Interface, d discover.Discover) (oci.SpecModifier, error) {
m := discoverModifier{
logger: logger,
discoverer: d,
}
return &m, nil
}
// nvidia-container-toolkit/internal/modifier/discover.go
func (m discoverModifier) Modify(spec *specs.Spec) error {
specEdits, err := edits.NewSpecEdits(m.logger, m.discoverer)
if err != nil {
return fmt.Errorf("failed to get required container edits: %v", err)
}
return specEdits.Modify(spec)
}
// nvidia-container-toolkit/internal/edits/edits.go
func FromDiscoverer(d discover.Discover) (*cdi.ContainerEdits, error) {
devices, err := d.Devices()
if err != nil {
return nil, fmt.Errorf("failed to discover devices: %v", err)
}
mounts, err := d.Mounts()
if err != nil {
return nil, fmt.Errorf("failed to discover mounts: %v", err)
}
hooks, err := d.Hooks()
if err != nil {
return nil, fmt.Errorf("failed to discover hooks: %v", err)
}
c := NewContainerEdits()
for _, d := range devices {
edits, err := device(d).toEdits()
if err != nil {
return nil, fmt.Errorf("failed to created container edits for device: %v", err)
}
c.Append(edits)
}
for _, m := range mounts {
c.Append(mount(m).toEdits())
}
for _, h := range hooks {
c.Append(hook(h).toEdits())
}
return c, nil
}
// nvidia-container-toolkit/internal/edits/edits.go
func (e *edits) Modify(spec *ociSpecs.Spec) error {
...
// Apply is implemented in the third-party package: tags.cncf.io/container-device-interface/pkg/cdi/container-edits.go
return e.Apply(spec)
}
3.2 nvidia-container-runtime-hook
The main entry point is shown below; its main job is the container's prestart work:
// nvidia-container-toolkit/cmd/nvidia-container-runtime-hook/main.go
func main() {
...
switch args[0] {
case "prestart":
doPrestart()
os.Exit(0)
...
}
}
doPrestart is straightforward: it locates nvidia-container-cli, assembles the relevant arguments, and execs it (note that a configure argument is always appended):
func doPrestart() {
var err error
defer exit()
log.SetFlags(0)
hook, err := getHookConfig()
if err != nil || hook == nil {
log.Panicln("error getting hook config:", err)
}
cli := hook.NVIDIAContainerCLIConfig
container := hook.getContainerConfig()
nvidia := container.Nvidia
if nvidia == nil {
// Not a GPU container, nothing to do.
return
}
if !hook.NVIDIAContainerRuntimeHookConfig.SkipModeDetection && info.ResolveAutoMode(&logInterceptor{}, hook.NVIDIAContainerRuntimeConfig.Mode, container.Image) != "legacy" {
log.Panicln("invoking the NVIDIA Container Runtime Hook directly (e.g. specifying the docker --gpus flag) is not supported. Please use the NVIDIA Container Runtime (e.g. specify the --runtime=nvidia flag) instead.")
}
rootfs := getRootfsPath(container)
args := []string{getCLIPath(cli)}
if cli.Root != "" {
args = append(args, fmt.Sprintf("--root=%s", cli.Root))
}
if cli.LoadKmods {
args = append(args, "--load-kmods")
}
if hook.Features.DisableImexChannelCreation.IsEnabled() {
args = append(args, "--no-create-imex-channels")
}
if cli.NoPivot {
args = append(args, "--no-pivot")
}
if *debugflag {
args = append(args, "--debug=/dev/stderr")
} else if cli.Debug != "" {
args = append(args, fmt.Sprintf("--debug=%s", cli.Debug))
}
if cli.Ldcache != "" {
args = append(args, fmt.Sprintf("--ldcache=%s", cli.Ldcache))
}
if cli.User != "" {
args = append(args, fmt.Sprintf("--user=%s", cli.User))
}
args = append(args, "configure")
if !hook.Features.AllowCUDACompatLibsFromContainer.IsEnabled() {
args = append(args, "--no-cntlibs")
}
if ldconfigPath := cli.NormalizeLDConfigPath(); ldconfigPath != "" {
args = append(args, fmt.Sprintf("--ldconfig=%s", ldconfigPath))
}
if cli.NoCgroups {
args = append(args, "--no-cgroups")
}
if devicesString := strings.Join(nvidia.Devices, ","); len(devicesString) > 0 {
args = append(args, fmt.Sprintf("--device=%s", devicesString))
}
if len(nvidia.MigConfigDevices) > 0 {
args = append(args, fmt.Sprintf("--mig-config=%s", nvidia.MigConfigDevices))
}
if len(nvidia.MigMonitorDevices) > 0 {
args = append(args, fmt.Sprintf("--mig-monitor=%s", nvidia.MigMonitorDevices))
}
if imexString := strings.Join(nvidia.ImexChannels, ","); len(imexString) > 0 {
args = append(args, fmt.Sprintf("--imex-channel=%s", imexString))
}
for _, cap := range strings.Split(nvidia.DriverCapabilities, ",") {
if len(cap) == 0 {
break
}
args = append(args, capabilityToCLI(cap))
}
for _, req := range nvidia.Requirements {
args = append(args, fmt.Sprintf("--require=%s", req))
}
args = append(args, fmt.Sprintf("--pid=%s", strconv.FormatUint(uint64(container.Pid), 10)))
args = append(args, rootfs)
env := append(os.Environ(), cli.Environment...)
//nolint:gosec // TODO: Can we harden this so that there is less risk of command injection?
err = syscall.Exec(args[0], args, env)
log.Panicln("exec failed:", err)
}
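Putting the argument assembly together: with the default config.toml shown earlier and a container started with --gpus all (NVIDIA_VISIBLE_DEVICES=all, default capabilities compute,utility, no MIG or IMEX settings), the exec'd command line ends up looking roughly like this (the PID and rootfs path are illustrative):
/usr/bin/nvidia-container-cli --load-kmods configure --no-cntlibs --ldconfig=@/sbin/ldconfig.real --device=all --compute --utility --pid=<container-pid> /var/lib/docker/overlay2/<id>/merged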
3.3 nvidia-container-cli
The code below is based on https://github.com/NVIDIA/libnvidia-container@v1.17.7
The github.com/NVIDIA/libnvidia-container project should be read as two parts: one builds the nvidia-container-cli executable, the other builds the libnvidia-container.so shared library.
The library functions are prefixed with nvc_, for example nvc_driver_mount, while the nvidia-container-cli side calls them through the libnvc table with the nvc_ prefix dropped, for example driver_mount. Keeping this in mind helps a lot when reading the source.
The main entry point is shown below: it first loads the libnvc function table, then parses the arguments and runs the requested command:
// github.com/NVIDIA/libnvidia-container/src/cli/main.c
int
main(int argc, char *argv[])
{
struct context ctx = {.uid = (uid_t)-1, .gid = (gid_t)-1};
int rv;
if ((rv = load_libnvc()) != 0)
goto fail;
argp_parse(&usage, argc, argv, ARGP_IN_ORDER, NULL, &ctx);
rv = ctx.command->func(&ctx);
fail:
free(ctx.devices);
free(ctx.init_flags);
free(ctx.container_flags);
free(ctx.mig_config);
free(ctx.mig_monitor);
free(ctx.imex_channels);
free(ctx.driver_opts);
return (rv);
}
load_libnvc comes in v0 and v1 variants; my environment takes the v1 path, i.e. load_libnvc_v1:
// github.com/NVIDIA/libnvidia-container/src/cli/libnvc.c
int
load_libnvc(void)
{
if (is_tegra() && !nvml_available())
return load_libnvc_v0();
return load_libnvc_v1();
}
load_libnvc_v1:
// github.com/NVIDIA/libnvidia-container/src/cli/libnvc.c
static int
load_libnvc_v1(void)
{
#define load_libnvc_func(func) \
libnvc.func = nvc_##func
load_libnvc_func(config_free);
load_libnvc_func(config_new);
load_libnvc_func(container_config_free);
load_libnvc_func(container_config_new);
load_libnvc_func(container_free);
load_libnvc_func(container_new);
load_libnvc_func(context_free);
load_libnvc_func(context_new);
load_libnvc_func(device_info_free);
load_libnvc_func(device_info_new);
load_libnvc_func(device_mount);
load_libnvc_func(driver_info_free);
load_libnvc_func(driver_info_new);
load_libnvc_func(driver_mount);
load_libnvc_func(error);
load_libnvc_func(init);
load_libnvc_func(ldcache_update);
load_libnvc_func(shutdown);
load_libnvc_func(version);
load_libnvc_func(nvcaps_style);
load_libnvc_func(nvcaps_device_from_proc_path);
load_libnvc_func(mig_device_access_caps_mount);
load_libnvc_func(mig_config_global_caps_mount);
load_libnvc_func(mig_monitor_global_caps_mount);
load_libnvc_func(device_mig_caps_mount);
load_libnvc_func(imex_channel_mount);
return (0);
}
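The load_libnvc_func macro simply points each entry of the libnvc function table at the in-process nvc_-prefixed implementation; for example, load_libnvc_func(driver_mount) expands to:
libnvc.driver_mount = nvc_driver_mount;
This is exactly the prefix-dropping convention mentioned above: the CLI calls libnvc.driver_mount, while the implementation lives in nvc_mount.c as nvc_driver_mount.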
Next, the function behind the configure argument:
// github.com/NVIDIA/libnvidia-container/src/cli/configure.c
int
configure_command(const struct context *ctx)
{
struct nvc_context *nvc = NULL;
struct nvc_config *nvc_cfg = NULL;
struct nvc_driver_info *drv = NULL;
struct nvc_device_info *dev = NULL;
struct nvc_container *cnt = NULL;
struct nvc_container_config *cnt_cfg = NULL;
bool eval_reqs = true;
struct devices devices = {0};
struct devices mig_config_devices = {0};
struct devices mig_monitor_devices = {0};
struct error err = {0};
int rv = EXIT_FAILURE;
if (perm_set_capabilities(&err, CAP_PERMITTED, pcaps, nitems(pcaps)) < 0 ||
perm_set_capabilities(&err, CAP_INHERITABLE, NULL, 0) < 0 ||
perm_set_bounds(&err, bcaps, nitems(bcaps)) < 0) {
warnx("permission error: %s", err.msg);
return (rv);
}
/* Initialize the library and container contexts. */
int c = ctx->load_kmods ? NVC_INIT_KMODS : NVC_INIT;
if (perm_set_capabilities(&err, CAP_EFFECTIVE, ecaps[c], ecaps_size(c)) < 0) {
warnx("permission error: %s", err.msg);
goto fail;
}
if ((nvc = libnvc.context_new()) == NULL ||
(nvc_cfg = libnvc.config_new()) == NULL ||
(cnt_cfg = libnvc.container_config_new(ctx->pid, ctx->rootfs)) == NULL) {
warn("memory allocation failed");
goto fail;
}
nvc->no_pivot = ctx->no_pivot;
nvc_cfg->uid = ctx->uid;
nvc_cfg->gid = ctx->gid;
nvc_cfg->root = ctx->root;
nvc_cfg->ldcache = ctx->ldcache;
if (parse_imex_info(&err, ctx->imex_channels, &nvc_cfg->imex) < 0) {
warnx("error parsing IMEX info: %s", err.msg);
goto fail;
}
if (libnvc.init(nvc, nvc_cfg, ctx->init_flags) < 0) {
warnx("initialization error: %s", libnvc.error(nvc));
goto fail;
}
if (perm_set_capabilities(&err, CAP_EFFECTIVE, ecaps[NVC_CONTAINER], ecaps_size(NVC_CONTAINER)) < 0) {
warnx("permission error: %s", err.msg);
goto fail;
}
cnt_cfg->ldconfig = ctx->ldconfig;
if ((cnt = libnvc.container_new(nvc, cnt_cfg, ctx->container_flags)) == NULL) {
warnx("container error: %s", libnvc.error(nvc));
goto fail;
}
/* Query the driver and device information. */
if (perm_set_capabilities(&err, CAP_EFFECTIVE, ecaps[NVC_INFO], ecaps_size(NVC_INFO)) < 0) {
warnx("permission error: %s", err.msg);
goto fail;
}
if ((drv = libnvc.driver_info_new(nvc, ctx->driver_opts)) == NULL ||
(dev = libnvc.device_info_new(nvc, NULL)) == NULL) {
warnx("detection error: %s", libnvc.error(nvc));
goto fail;
}
/* Allocate space for selecting GPU devices and MIG devices */
if (new_devices(&err, dev, &devices) < 0) {
warn("memory allocation failed: %s", err.msg);
goto fail;
}
/* Allocate space for selecting which devices are available for MIG config */
if (new_devices(&err, dev, &mig_config_devices) < 0) {
warn("memory allocation failed: %s", err.msg);
goto fail;
}
/* Allocate space for selecting which devices are available for MIG monitor */
if (new_devices(&err, dev, &mig_monitor_devices) < 0) {
warn("memory allocation failed: %s", err.msg);
goto fail;
}
/* Select the visible GPU devices. */
if (dev->ngpus > 0) {
if (select_devices(&err, ctx->devices, dev, &devices) < 0) {
warnx("device error: %s", err.msg);
goto fail;
}
}
/* Select the devices available for MIG config among the visible devices. */
if (select_mig_config_devices(&err, ctx->mig_config, &devices, &mig_config_devices) < 0) {
warnx("mig-config error: %s", err.msg);
goto fail;
}
/* Select the devices available for MIG monitor among the visible devices. */
if (select_mig_monitor_devices(&err, ctx->mig_monitor, &devices, &mig_monitor_devices) < 0) {
warnx("mig-monitor error: %s", err.msg);
goto fail;
}
/*
* Check the container requirements.
* Try evaluating per visible device first, and globally otherwise.
*/
for (size_t i = 0; i < devices.ngpus; ++i) {
struct dsl_data data = {drv, devices.gpus[i]};
for (size_t j = 0; j < ctx->nreqs; ++j) {
if (dsl_evaluate(&err, ctx->reqs[j], &data, rules, nitems(rules)) < 0) {
warnx("requirement error: %s", err.msg);
goto fail;
}
}
eval_reqs = false;
}
for (size_t i = 0; i < devices.nmigs; ++i) {
struct dsl_data data = {drv, devices.migs[i]->parent};
for (size_t j = 0; j < ctx->nreqs; ++j) {
if (dsl_evaluate(&err, ctx->reqs[j], &data, rules, nitems(rules)) < 0) {
warnx("requirement error: %s", err.msg);
goto fail;
}
}
eval_reqs = false;
}
if (eval_reqs) {
struct dsl_data data = {drv, NULL};
for (size_t j = 0; j < ctx->nreqs; ++j) {
if (dsl_evaluate(&err, ctx->reqs[j], &data, rules, nitems(rules)) < 0) {
warnx("requirement error: %s", err.msg);
goto fail;
}
}
}
/* Mount the driver, visible devices, mig-configs, mig-monitors, and imex-channels. */
if (perm_set_capabilities(&err, CAP_EFFECTIVE, ecaps[NVC_MOUNT], ecaps_size(NVC_MOUNT)) < 0) {
warnx("permission error: %s", err.msg);
goto fail;
}
if (libnvc.driver_mount(nvc, cnt, drv) < 0) {
warnx("mount error: %s", libnvc.error(nvc));
goto fail;
}
for (size_t i = 0; i < devices.ngpus; ++i) {
if (libnvc.device_mount(nvc, cnt, devices.gpus[i]) < 0) {
warnx("mount error: %s", libnvc.error(nvc));
goto fail;
}
}
if (!mig_config_devices.all && !mig_monitor_devices.all) {
for (size_t i = 0; i < devices.nmigs; ++i) {
if (libnvc.mig_device_access_caps_mount(nvc, cnt, devices.migs[i]) < 0) {
warnx("mount error: %s", libnvc.error(nvc));
goto fail;
}
}
}
if (mig_config_devices.all && mig_config_devices.ngpus) {
if (libnvc.mig_config_global_caps_mount(nvc, cnt) < 0) {
warnx("mount error: %s", libnvc.error(nvc));
goto fail;
}
for (size_t i = 0; i < mig_config_devices.ngpus; ++i) {
if (libnvc.device_mig_caps_mount(nvc, cnt, mig_config_devices.gpus[i]) < 0) {
warnx("mount error: %s", libnvc.error(nvc));
goto fail;
}
}
}
if (mig_monitor_devices.all && mig_monitor_devices.ngpus) {
if (libnvc.mig_monitor_global_caps_mount(nvc, cnt) < 0) {
warnx("mount error: %s", libnvc.error(nvc));
goto fail;
}
for (size_t i = 0; i < mig_monitor_devices.ngpus; ++i) {
if (libnvc.device_mig_caps_mount(nvc, cnt, mig_monitor_devices.gpus[i]) < 0) {
warnx("mount error: %s", libnvc.error(nvc));
goto fail;
}
}
}
for (size_t i = 0; i < nvc_cfg->imex.nchans; ++i) {
if (libnvc.imex_channel_mount(nvc, cnt, &nvc_cfg->imex.chans[i]) < 0) {
warnx("mount error: %s", libnvc.error(nvc));
goto fail;
}
}
/* Update the container ldcache. */
if (perm_set_capabilities(&err, CAP_EFFECTIVE, ecaps[NVC_LDCACHE], ecaps_size(NVC_LDCACHE)) < 0) {
warnx("permission error: %s", err.msg);
goto fail;
}
if (libnvc.ldcache_update(nvc, cnt) < 0) {
warnx("ldcache error: %s", libnvc.error(nvc));
goto fail;
}
if (perm_set_capabilities(&err, CAP_EFFECTIVE, ecaps[NVC_SHUTDOWN], ecaps_size(NVC_SHUTDOWN)) < 0) {
warnx("permission error: %s", err.msg);
goto fail;
}
rv = EXIT_SUCCESS;
fail:
free(nvc_cfg->imex.chans);
free_devices(&devices);
libnvc.shutdown(nvc);
libnvc.container_free(cnt);
libnvc.device_info_free(dev);
libnvc.driver_info_free(drv);
libnvc.container_config_free(cnt_cfg);
libnvc.config_free(nvc_cfg);
libnvc.context_free(nvc);
error_reset(&err);
return (rv);
}
In libnvc.driver_info_new, lookup_paths is responsible for locating the driver libraries and executables such as nvidia-smi, while lookup_devices locates the GPU devices:
// github.com/NVIDIA/libnvidia-container/src/nvc_info.c
struct nvc_driver_info *
nvc_driver_info_new(struct nvc_context *ctx, const char *opts)
{
struct nvc_driver_info *info;
int32_t flags;
if (validate_context(ctx) < 0)
return (NULL);
if (opts == NULL)
opts = default_driver_opts;
if ((flags = options_parse(&ctx->err, opts, driver_opts, nitems(driver_opts))) < 0)
return (NULL);
log_infof("requesting driver information with '%s'", opts);
if ((info = xcalloc(&ctx->err, 1, sizeof(*info))) == NULL)
return (NULL);
if (driver_get_rm_version(&ctx->err, &info->nvrm_version) < 0)
goto fail;
if (driver_get_cuda_version(&ctx->err, &info->cuda_version) < 0)
goto fail;
if (lookup_paths(&ctx->err, &ctx->dxcore, info, ctx->cfg.root, flags, ctx->cfg.ldcache) < 0)
goto fail;
if (lookup_devices(&ctx->err, &ctx->dxcore, info, ctx->cfg.root, flags) < 0)
goto fail;
if (lookup_ipcs(&ctx->err, info, ctx->cfg.root, flags) < 0)
goto fail;
return (info);
fail:
nvc_driver_info_free(info);
return (NULL);
}
The driver mount logic:
// github.com/NVIDIA/libnvidia-container/src/nvc_mount.c
int
nvc_driver_mount(struct nvc_context *ctx, const struct nvc_container *cnt, const struct nvc_driver_info *info)
{
const char **mnt, **ptr, **tmp;
size_t nmnt;
int rv = -1;
if (validate_context(ctx) < 0)
return (-1);
if (validate_args(ctx, cnt != NULL && info != NULL) < 0)
return (-1);
if (ns_enter(&ctx->err, cnt->mnt_ns, CLONE_NEWNS) < 0)
return (-1);
nmnt = 2 + info->nbins + info->nlibs + cnt->nlibs + info->nlibs32 + info->nipcs + info->ndevs + info->nfirmwares;
mnt = ptr = (const char **)array_new(&ctx->err, nmnt);
if (mnt == NULL)
goto fail;
/* Procfs mount */
if (ctx->dxcore.initialized)
log_warn("skipping procfs mount on WSL");
else if ((*ptr++ = mount_procfs(&ctx->err, ctx->cfg.root, cnt)) == NULL)
goto fail;
/* Application profile mount */
if (cnt->flags & OPT_GRAPHICS_LIBS) {
if (ctx->dxcore.initialized)
log_warn("skipping app profile mount on WSL");
else if ((*ptr++ = mount_app_profile(&ctx->err, cnt)) == NULL)
goto fail;
}
/* Host binary and library mounts */
if (info->bins != NULL && info->nbins > 0) {
if ((tmp = (const char **)mount_files(&ctx->err, ctx->cfg.root, cnt, cnt->cfg.bins_dir, info->bins, info->nbins)) == NULL)
goto fail;
ptr = array_append(ptr, tmp, array_size(tmp));
free(tmp);
}
if (info->libs != NULL && info->nlibs > 0) {
if ((tmp = (const char **)mount_files(&ctx->err, ctx->cfg.root, cnt, cnt->cfg.libs_dir, info->libs, info->nlibs)) == NULL)
goto fail;
ptr = array_append(ptr, tmp, array_size(tmp));
free(tmp);
}
if ((cnt->flags & OPT_COMPAT32) && info->libs32 != NULL && info->nlibs32 > 0) {
if ((tmp = (const char **)mount_files(&ctx->err, ctx->cfg.root, cnt, cnt->cfg.libs32_dir, info->libs32, info->nlibs32)) == NULL)
goto fail;
ptr = array_append(ptr, tmp, array_size(tmp));
free(tmp);
}
if (symlink_libraries(&ctx->err, cnt, mnt, (size_t)(ptr - mnt)) < 0)
goto fail;
/* Container library mounts */
if ((cnt->flags & OPT_CUDA_COMPAT_MODE_MOUNT) && cnt->libs != NULL && cnt->nlibs > 0) {
if ((tmp = (const char **)mount_files(&ctx->err, cnt->cfg.rootfs, cnt, cnt->cfg.libs_dir, cnt->libs, cnt->nlibs)) == NULL) {
goto fail;
}
ptr = array_append(ptr, tmp, array_size(tmp));
free(tmp);
}
/* Firmware mounts */
for (size_t i = 0; i < info->nfirmwares; ++i) {
if ((*ptr++ = mount_firmware(&ctx->err, ctx->cfg.root, cnt, info->firmwares[i])) == NULL) {
log_errf("error mounting firmware path %s", info->firmwares[i]);
goto fail;
}
}
/* IPC mounts */
for (size_t i = 0; i < info->nipcs; ++i) {
/* XXX Only utility libraries require persistenced or fabricmanager IPC, everything else is compute only. */
if (str_has_suffix(NV_PERSISTENCED_SOCKET, info->ipcs[i]) || str_has_suffix(NV_FABRICMANAGER_SOCKET, info->ipcs[i])) {
if (!(cnt->flags & OPT_UTILITY_LIBS))
continue;
} else if (!(cnt->flags & OPT_COMPUTE_LIBS))
continue;
if ((*ptr++ = mount_ipc(&ctx->err, ctx->cfg.root, cnt, info->ipcs[i])) == NULL)
goto fail;
}
/* Device mounts */
for (size_t i = 0; i < info->ndevs; ++i) {
/* On WSL2 we only mount the /dev/dxg device and as such these checks are not applicable. */
if (!ctx->dxcore.initialized) {
/* XXX Only compute libraries require specific devices (e.g. UVM). */
if (!(cnt->flags & OPT_COMPUTE_LIBS) && major(info->devs[i].id) != NV_DEVICE_MAJOR)
continue;
/* XXX Only display capability requires the modeset device. */
if (!(cnt->flags & OPT_DISPLAY) && minor(info->devs[i].id) == NV_MODESET_DEVICE_MINOR)
continue;
}
if (!(cnt->flags & OPT_NO_DEVBIND)) {
if ((*ptr++ = mount_device(&ctx->err, ctx->cfg.root, cnt, &info->devs[i])) == NULL)
goto fail;
}
if (!(cnt->flags & OPT_NO_CGROUPS)) {
if (setup_device_cgroup(&ctx->err, cnt, info->devs[i].id) < 0)
goto fail;
}
}
rv = 0;
fail:
if (rv < 0) {
for (size_t i = 0; mnt != NULL && i < nmnt; ++i)
unmount(mnt[i]);
assert_func(ns_enter_at(NULL, ctx->mnt_ns, CLONE_NEWNS));
} else {
rv = ns_enter_at(&ctx->err, ctx->mnt_ns, CLONE_NEWNS);
}
array_free((char **)mnt, nmnt);
return (rv);
}
The device mount logic:
// github.com/NVIDIA/libnvidia-container/src/nvc_mount.c
int
nvc_device_mount(struct nvc_context *ctx, const struct nvc_container *cnt, const struct nvc_device *dev)
{
int rv = -1;
if (validate_context(ctx) < 0)
return (-1);
if (validate_args(ctx, cnt != NULL && dev != NULL) < 0)
return (-1);
if (ns_enter(&ctx->err, cnt->mnt_ns, CLONE_NEWNS) < 0)
return (-1);
if (ctx->dxcore.initialized)
rv = device_mount_dxcore(ctx, cnt);
else rv = device_mount_native(ctx, cnt, dev);
if (rv < 0)
assert_func(ns_enter_at(NULL, ctx->mnt_ns, CLONE_NEWNS));
else rv = ns_enter_at(&ctx->err, ctx->mnt_ns, CLONE_NEWNS);
return (rv);
}
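This answers the question from the introduction: nvidia-smi and the driver libraries are not baked into the image at all. During the prestart hook, nvidia-container-cli bind-mounts them from the host into the container's rootfs, and libnvc.ldcache_update then refreshes the container's linker cache so the mounted libraries resolve. You can check this from the GPU container created at the beginning, for example:
# docker exec -ti nvidia-container bash -c 'ls -l /usr/bin/nvidia-smi && mount | grep nvidia'
The exact set of mounts depends on the driver version and the requested capabilities.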