How Kubernetes Pod volumes are implemented

Latest recommended article published on 2025-12-02 21:31:13


License: CC 4.0 BY-SA


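In Kubernetes, the Pod is the smallest unit of scheduling and management, so understanding how it is implemented is essential to understanding the core mechanics of Kubernetes. A Pod is also the smallest and simplest Kubernetes object: it can start a backend process and serve callers inside the cluster.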

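What is a Pod?

A Pod contains one or more containers (typically one main container plus zero or more helper containers). These containers share networking, storage, and a lifecycle.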

apiVersion: v1
kind: Pod
metadata:
  name: redis-pod
  labels:
    app: redis
spec:
  containers:
    # container name
    - name: redis
      image: redis:latest
      # image pull policy: pull from the registry only if the image is not present locally
      imagePullPolicy: IfNotPresent
      ports:
        - containerPort: 6379
          hostPort: 6379
      command:
        - redis-server
        - --appendonly yes  # enable persistence (AOF)
      resources:
        limits:
          memory: "256Mi"
          cpu: "500m"
  # container restart policy
  restartPolicy: Always

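This YAML describes the container and command a Pod runs when it starts, along with the image pull policy and the restart policy. The pull policy IfNotPresent uses the local image when there is one and pulls from the registry only when there is not. The restart policy Always, which is also the default, makes the kubelet restart a container whenever it exits, including when it fails to start.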

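Within a single Pod, a few properties deserve special attention. The containers:

share one network namespace (IP and ports)
share storage volumes (Volumes)
are scheduled atomically onto the same node
are created, run, restarted, and destroyed together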

How Pods are implemented (underlying mechanisms)

Pods are built on Linux namespaces and control groups (cgroups), and they use a pause container (the Pod infra container) to share resources.

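1. The pause container (Pod infrastructure container)

Kubernetes creates a special pause container for every Pod (for example registry.k8s.io/pause, formerly k8s.gcr.io/pause).
This container:
- is the first container started in the Pod
- consumes almost no resources (its only process blocks in pause() and reaps zombie processes)
- holds the Pod's network namespace, IPC namespace, and UTS namespace

🔍 You can think of it as the Pod's "operating system kernel": it provides the foundation of the shared environment.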

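Example:
When you create a Pod that contains nginx and fluentd, Kubernetes actually starts three containers:

pause (infrastructure container)
nginx (shares the pause container's network)
fluentd (shares the pause container's network)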

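2. Network sharing: one IP and one port space

All containers share the pause container's network namespace.
As a result:
- all containers have the same IP address
- all containers share one port space, so two containers cannot listen on the same port
- containers talk to each other over localhost (no Service or DNS needed)

🌐 For example: if nginx listens on :80, a sidecar can reach it with curl http://localhost.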

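3. Storage sharing: through Volumes

The volumes a Pod defines are prepared by the kubelet on the node and then mounted into every container that lists them under volumeMounts; they are not mounted into the pause container.
Multiple containers can share files by mounting the same volume.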

volumes:
  - name: shared-data
    emptyDir: {}

volumeMounts:
  - name: shared-data
    mountPath: /data

💡 emptyDir is the simplest kind of shared storage; its lifetime matches the Pod's.

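4. Lifecycle management: atomic operations

A Pod is largely immutable: most of its spec cannot be changed in place, so the usual way to change it is to delete and recreate the Pod.
All containers:
- are started together (initContainers can be used to control ordering)
- are terminated together
- are scheduled onto the same node as a group
If a Pod fails, Kubernetes recreates the whole Pod according to its controller (for example a Deployment).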

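5. Resource isolation and sharing

Resource	Shared?	Notes
Network	✅ Shared	Same IP; containers reach each other over localhost
Storage	✅ Shared	Through Volumes
PID namespace	❌ Not shared (default)	Each container has its own process tree; sharing can be enabled with `pod.spec.shareProcessNamespace: true`
IPC	✅ Shared	Containers use the IPC namespace held by the pause container
Host network	❌ Not shared by default	Can be enabled with `hostNetwork: true`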

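Containers

A Pod can hold two kinds of containers. An InitContainer runs while the Pod is starting and is mainly used to initialize configuration. The other kind is the regular Container that stays alive while the Pod is in the Running state; it serves requests or processes asynchronous tasks on the worker node.

The names suggest that InitContainers start before regular Containers. We can confirm this by reading the kubeGenericRuntimeManager.SyncPod method.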

// SyncPod syncs the running pod into the desired pod by executing following steps:
//
//  1. Compute sandbox and container changes.
//  2. Kill pod sandbox if necessary.
//  3. Kill any containers that should not be running.
//  4. Create sandbox if necessary.
//  5. Create ephemeral containers.
//  6. Create init containers.
//  7. Resize running containers (if InPlacePodVerticalScaling==true)
//  8. Create normal containers.
//
// Abridged excerpt: error handling, result bookkeeping, and the definition of
// the start helper closure are omitted.
func (m *kubeGenericRuntimeManager) SyncPod(ctx context.Context, pod *v1.Pod, podStatus *kubecontainer.PodStatus, pullSecrets []v1.Secret, backOff *flowcontrol.Backoff) (result kubecontainer.PodSyncResult) {
    // Step 1: Compute sandbox and container changes.
    podContainerChanges := m.computePodActions(ctx, pod, podStatus)
    // Step 2: Kill the pod if the sandbox has changed.
    if podContainerChanges.KillPod {
        // (omitted) kill the whole pod
    } else {
        // Step 3: kill any running containers in this pod which are not to keep.
        for containerID, containerInfo := range podContainerChanges.ContainersToKill {
            m.killContainer(ctx, pod, containerID, containerInfo.name, containerInfo.message, containerInfo.reason, nil, nil)
        }
    }

    // Prune terminated init containers before starting the pod's containers.
    m.pruneInitContainersBeforeStart(ctx, pod, podStatus)

    // Step 4: Create a sandbox for the pod if necessary.
    podSandboxID := podContainerChanges.SandboxID
    // Step 5: start ephemeral containers
    // These are started "prior" to init containers to allow running ephemeral containers even when there
    // are errors starting an init container. In practice init containers will start first since ephemeral
    // containers cannot be specified on pod creation.
    for _, idx := range podContainerChanges.EphemeralContainersToStart {
        start(ctx, "ephemeral container", metrics.EphemeralContainer, ephemeralContainerStartSpec(&pod.Spec.EphemeralContainers[idx]))
    }

    // Step 6: start init containers.
    for _, idx := range podContainerChanges.InitContainersToStart {
        container := &pod.Spec.InitContainers[idx]
        // Start the next init container.
        if err := start(ctx, "init container", metrics.InitContainer, containerStartSpec(container)); err != nil {
            if podutil.IsRestartableInitContainer(container) {
                continue
            }
            return
        }
    }

    // Step 7: For containers in podContainerChanges.ContainersToUpdate[CPU,Memory] list, invoke UpdateContainerResources
    if resizable, _ := allocation.IsInPlacePodVerticalScalingAllowed(pod); resizable {
        if len(podContainerChanges.ContainersToUpdate) > 0 || podContainerChanges.UpdatePodResources {
            result.SyncResults = append(result.SyncResults, m.doPodResizeAction(pod, podContainerChanges))
        }
    }

    // Step 8: start containers in podContainerChanges.ContainersToStart.
    for _, idx := range podContainerChanges.ContainersToStart {
        start(ctx, "container", metrics.Container, containerStartSpec(&pod.Spec.Containers[idx]))
    }

    return
}

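The source shows that init containers and regular containers go through the same start helper and differ only in the typeName and metricLabel passed to it. In this code path, the practical difference is the start order: init containers are started first, and if a non-restartable init container fails to start, SyncPod returns before any regular container is started.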
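The ordering can be illustrated with a small standalone model. This is not kubelet code: the `container` type, the `fail` flag, and `startAll` are hypothetical names made up for this sketch, and it ignores the fact that the real kubelet waits for each init container to finish across several sync iterations.

```go
package main

import (
	"errors"
	"fmt"
)

// container is a hypothetical stand-in for a container spec.
type container struct {
	name string
	fail bool // simulate a start failure
}

// start plays the role of the shared start helper: the only thing that
// distinguishes the container kinds here is the typeName label.
func start(typeName string, c container, log *[]string) error {
	if c.fail {
		return errors.New(typeName + " " + c.name + " failed to start")
	}
	*log = append(*log, typeName+":"+c.name)
	return nil
}

// startAll starts init containers first and returns at the first failure,
// so regular containers are only reached when every init container started.
func startAll(inits, regular []container) ([]string, error) {
	var log []string
	for _, c := range inits {
		if err := start("init container", c, &log); err != nil {
			return log, err
		}
	}
	for _, c := range regular {
		// As in the excerpt, one failing regular container does not stop the others.
		_ = start("container", c, &log)
	}
	return log, nil
}

func main() {
	log, err := startAll(
		[]container{{name: "init-config"}},
		[]container{{name: "nginx"}, {name: "fluentd"}},
	)
	fmt.Println(log, err)
}
```

Passing an init container with `fail: true` makes `startAll` return before any regular container is started, which mirrors the early `return` in Step 6.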

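Volumes

Containers in a Pod can share file directories through Volumes. Kubernetes Volumes are indeed used to share data between the containers of a Pod, but whether a Volume is persistent, that is, whether its data is kept after the Pod is recreated or updated, depends on the Volume type.

The right way to think about it: persistence depends on the Volume type

Volume type	Persistent?	Data kept after the Pod is recreated?	Notes
`emptyDir`	❌ No	❌ No	Lives on the node's disk or in memory; survives container restarts, but the data is lost when the Pod is deleted or the node fails
`hostPath`	⚠️ Limited	✅ Yes (if the node is healthy)	Data lives on the node's local disk; it is lost if the node fails and unavailable if the Pod is scheduled onto another node
`persistentVolumeClaim` (PVC)	✅ Yes	✅ Yes	Backed by external storage (such as NFS or cloud disks); truly persistent
`nfs`	✅ Yes	✅ Yes	Network file system, independent of the node
`configMap` / `secret`	✅ Configuration persists	✅ Yes	Stored in etcd; the data does not disappear with the Pod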

Why emptyDir is not persistent

One step of creating a Pod is mounting its volumes. Kubelet.SyncPod drives Pod creation on the node, and its main steps are:

If the pod is being created, record pod worker start latency
Call generateAPIPodStatus to prepare an v1.PodStatus for the pod
If the pod is being seen as running for the first time, record pod start latency
Update the status of the pod in the status manager
Stop the pod’s containers if it should not be running due to soft admission
Ensure any background tracking for a runnable pod is started
Create a mirror pod if the pod is a static pod, and does not already have a mirror pod
Create the data directories for the pod if they do not exist
Wait for volumes to attach/mount
Fetch the pull secrets for the pod
Call the container runtime’s SyncPod callback
Update the traffic shaping for the pod’s ingress and egress limits

WaitForAttachAndMount waits for the volumes to be mounted; the mounting itself is done asynchronously in another goroutine.

func (kl *Kubelet) SyncPod(ctx context.Context, updateType kubetypes.SyncPodType, pod, mirrorPod *v1.Pod, podStatus *kubecontainer.PodStatus) (isTerminal bool, err error) {  
    
...
    // Wait for volumes to attach/mount  
    if err := kl.volumeManager.WaitForAttachAndMount(ctx, pod); err != nil {  
       if !wait.Interrupted(err) {  
          kl.recorder.Eventf(pod, v1.EventTypeWarning, events.FailedMountVolume, "Unable to attach or mount volumes: %v", err)  
          klog.ErrorS(err, "Unable to attach or mount volumes for pod; skipping pod", "pod", klog.KObj(pod))  
       }  
       return false, err  
    }  
...
    return false, result.Error()  
}

  
func (vm *volumeManager) WaitForAttachAndMount(ctx context.Context, pod *v1.Pod) error {  
 
  
    // Some pods expect to have Setup called over and over again to update.
    // Remount plugins for which this is true. (Atomically updating volumes,
    // like Downward API, depend on this to update the contents of the volume).
    vm.desiredStateOfWorldPopulator.ReprocessPod(uniquePodName)
  ...
    err := wait.PollUntilContextTimeout(  
       ctx,  
       podAttachAndMountRetryInterval,  
       podAttachAndMountTimeout,  
       true,  
       vm.verifyVolumesMountedFunc(uniquePodName, expectedVolumes))  
  ...
    return nil  
}

WaitForAttachAndMount is just a thin wrapper around polling; the actual check lives in verifyVolumesMountedFunc:

func (vm *volumeManager) verifyVolumesMountedFunc(podName types.UniquePodName, expectedVolumes []string) wait.ConditionWithContextFunc {  
    return func(_ context.Context) (done bool, err error) {  
    // PopPodErrors returns and clears the volume errors that desiredStateOfWorld keeps for this pod in its podErrors field.
       if errs := vm.desiredStateOfWorld.PopPodErrors(podName); len(errs) > 0 {  
          return true, errors.New(strings.Join(errs, "; "))  
       }  
       for _, expectedVolume := range expectedVolumes {
       // No error was recorded; the volume counts as mounted only if its mount state is VolumeMounted.
          _, found := vm.actualStateOfWorld.GetMountedVolumeForPodByOuterVolumeSpecName(podName, expectedVolume)  
          if !found {  
             return false, nil  
          }  
       }  
       return true, nil  
    }  
}

func (dsw *desiredStateOfWorld) PopPodErrors(podName types.UniquePodName) []string {  
    dsw.Lock()  
    defer dsw.Unlock()  
  
    if errs, found := dsw.podErrors[podName]; found {  
       delete(dsw.podErrors, podName)  
       return sets.List(errs)  
    }  
    return []string{}  
}

type desiredStateOfWorld struct {
	// volumesToMount is a map containing the set of volumes that should be
	// attached to this node and mounted to the pods referencing it. The key in
	// the map is the name of the volume and the value is a volume object
	// containing more information about the volume.
	volumesToMount map[v1.UniqueVolumeName]volumeToMount
	// volumePluginMgr is the volume plugin manager used to create volume
	// plugin objects.
	volumePluginMgr *volume.VolumePluginMgr
	// podErrors are errors caught by desiredStateOfWorldPopulator about volumes for a given pod.
	podErrors map[types.UniquePodName]sets.Set[string]
	// seLinuxTranslator translates v1.SELinuxOptions to a file SELinux label.
	seLinuxTranslator util.SELinuxLabelTranslator

	sync.RWMutex
}

A volume counts as successfully mounted only when no mount errors were recorded and its mount state for the Pod is VolumeMounted:

func (asw *actualStateOfWorld) GetMountedVolumeForPodByOuterVolumeSpecName(  
    podName volumetypes.UniquePodName, outerVolumeSpecName string) (MountedVolume, bool) {  
    asw.RLock()  
    defer asw.RUnlock()  
    for _, volumeObj := range asw.attachedVolumes {  
       if podObj, hasPod := volumeObj.mountedPods[podName]; hasPod { 
       // Check that the mount state is VolumeMounted.
          if podObj.volumeMountStateForPod == operationexecutor.VolumeMounted && podObj.outerVolumeSpecName == outerVolumeSpecName {  
             return getMountedVolume(&podObj, &volumeObj), true  
          }  
       }  
    }  
  
    return MountedVolume{}, false  
}

const (
    // VolumeMounted means volume has been mounted in pod's local path
    VolumeMounted VolumeMountState = "VolumeMounted"

    // VolumeMountUncertain means volume may or may not be mounted in pods' local path
    VolumeMountUncertain VolumeMountState = "VolumeMountUncertain"

    // VolumeNotMounted means volume has not be mounted in pod's local path
    VolumeNotMounted VolumeMountState = "VolumeNotMounted"
)
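The interplay of the two checks (pop any recorded errors first, then require every expected volume to be mounted) can be modeled in a few lines. The `state` type below is a hypothetical simplification that merges both caches into one struct; `addPodError` and `markMounted` are made-up helpers standing in for whatever populates the real caches.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
	"strings"
	"sync"
)

// state is a hypothetical merged stand-in for the two caches: podErrors plays
// the role of desiredStateOfWorld.podErrors, mounted the role of the
// per-pod mount records in actualStateOfWorld.
type state struct {
	mu        sync.Mutex
	podErrors map[string]map[string]struct{} // pod -> set of error messages
	mounted   map[string]map[string]bool     // pod -> volume name -> mounted
}

func newState() *state {
	return &state{podErrors: map[string]map[string]struct{}{}, mounted: map[string]map[string]bool{}}
}

func (s *state) addPodError(pod, msg string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.podErrors[pod] == nil {
		s.podErrors[pod] = map[string]struct{}{}
	}
	s.podErrors[pod][msg] = struct{}{}
}

func (s *state) markMounted(pod, volume string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.mounted[pod] == nil {
		s.mounted[pod] = map[string]bool{}
	}
	s.mounted[pod][volume] = true
}

// popPodErrors returns the recorded errors for a pod and clears them,
// so each error is reported to a waiter at most once.
func (s *state) popPodErrors(pod string) []string {
	s.mu.Lock()
	defer s.mu.Unlock()
	errs := make([]string, 0, len(s.podErrors[pod]))
	for e := range s.podErrors[pod] {
		errs = append(errs, e)
	}
	delete(s.podErrors, pod)
	sort.Strings(errs)
	return errs
}

// volumesMounted has the same shape as the polled condition: an error ends the
// wait, a missing volume means "try again", and only a full match is success.
func (s *state) volumesMounted(pod string, expected []string) (bool, error) {
	if errs := s.popPodErrors(pod); len(errs) > 0 {
		return true, errors.New(strings.Join(errs, "; "))
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	for _, v := range expected {
		if !s.mounted[pod][v] {
			return false, nil
		}
	}
	return true, nil
}

func main() {
	s := newState()
	fmt.Println(s.volumesMounted("redis-pod", []string{"shared-data"}))
	s.markMounted("redis-pod", "shared-data")
	fmt.Println(s.volumesMounted("redis-pod", []string{"shared-data"}))
}
```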

The actual mounting is carried out by a goroutine started in VolumeManager.Run:

func (vm *volumeManager) Run(ctx context.Context, sourcesReady config.SourcesReady) {  
 ...
    klog.InfoS("Starting Kubelet Volume Manager")  
    // goroutine that mounts volumes
    go vm.reconciler.Run(ctx.Done())  
  ...
}

func (rc *reconciler) Run(stopCh <-chan struct{}) {  
    rc.reconstructVolumes()  
    klog.InfoS("Reconciler: start to sync state")  
    wait.Until(rc.reconcile, rc.loopSleepDuration, stopCh)  
}

// reconstructVolumes tries to reconstruct the actual state of world by scanning all pods' volume
// directories from the disk. For the volumes that cannot support or fail reconstruction, it will
// put the volumes to volumesFailedReconstruction to be cleaned up later when DesiredStateOfWorld
// is populated.
func (rc *reconciler) reconstructVolumes() {  
    // Get volumes information by reading the pod's directory  
    podVolumes, err := getVolumesFromPodDir(rc.kubeletPodsDir)  
    if err != nil {  
       klog.ErrorS(err, "Cannot get volumes from disk, skip sync states for volume reconstruction")  
       return  
    }  
    reconstructedVolumes := make(map[v1.UniqueVolumeName]*globalVolumeInfo)  
    reconstructedVolumeNames := []v1.UniqueVolumeName{}  
    for _, volume := range podVolumes {  
       if rc.actualStateOfWorld.VolumeExistsWithSpecName(volume.podName, volume.volumeSpecName) {  
          klog.V(4).InfoS("Volume exists in actual state, skip cleaning up mounts", "podName", volume.podName, "volumeSpecName", volume.volumeSpecName)  
          // There is nothing to reconstruct  
          continue  
       }  
       reconstructedVolume, err := rc.reconstructVolume(volume)  
       if err != nil {  
          klog.InfoS("Could not construct volume information", "podName", volume.podName, "volumeSpecName", volume.volumeSpecName, "err", err)  
          // We can't reconstruct the volume. Remember to check DSW after it's fully populated and force unmount the volume when it's orphaned.  
          rc.volumesFailedReconstruction = append(rc.volumesFailedReconstruction, volume)  
          continue  
       }  
       klog.V(4).InfoS("Adding reconstructed volume to actual state and node status", "podName", volume.podName, "volumeSpecName", volume.volumeSpecName)  
       gvl := &globalVolumeInfo{  
          volumeName:        reconstructedVolume.volumeName,  
          volumeSpec:        reconstructedVolume.volumeSpec,  
          devicePath:        reconstructedVolume.devicePath,  
          deviceMounter:     reconstructedVolume.deviceMounter,  
          blockVolumeMapper: reconstructedVolume.blockVolumeMapper,  
          mounter:           reconstructedVolume.mounter,  
       }  
       if cachedInfo, ok := reconstructedVolumes[reconstructedVolume.volumeName]; ok {  
          gvl = cachedInfo  
       }  
       gvl.addPodVolume(reconstructedVolume)  
  
       reconstructedVolumeNames = append(reconstructedVolumeNames, reconstructedVolume.volumeName)  
       reconstructedVolumes[reconstructedVolume.volumeName] = gvl  
    }  
  
    if len(reconstructedVolumes) > 0 {  
       // Add the volumes to ASW  
       rc.updateStates(reconstructedVolumes)  
  
       // Remember to update devicePath from node.status.volumesAttached  
       rc.volumesNeedUpdateFromNodeStatus = reconstructedVolumeNames  
    }  
    klog.V(2).InfoS("Volume reconstruction finished")  
}

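This code implements volume reconstruction, which restores the actual state of volumes after the kubelet restarts. It does the following:

Scan the volume information on disk
- getVolumesFromPodDir scans the volume directories of every Pod under rc.kubeletPodsDir
- This yields the volumes that currently exist on disk
Reconstruct the volume state
- Iterate over the scanned volumes and check whether each one already exists in actualStateOfWorld
- For volumes that do not, call rc.reconstructVolume to rebuild them
- Record volumes that cannot be reconstructed in rc.volumesFailedReconstruction for later cleanup
Update the actual state
- For volumes that were reconstructed, call rc.updateStates to add them to actualStateOfWorld
- These volumes are marked as uncertain, because the kubelet does not yet know their real mount state
Prepare the device path update
- Volumes whose device path needs updating are recorded in rc.volumesNeedUpdateFromNodeStatus
- updateReconstructedFromNodeStatus later fetches the correct device path from the node status

The main purpose is to let the kubelet resume managing existing volumes after a restart, so that volumes still in use are not removed by mistake.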

Once volume state reconstruction is finished, the volumes that are not yet mounted have to be mounted. This happens in reconciler.reconcile:

func (rc *reconciler) reconcile() {  
...
    // Next we mount required volumes. This function could also trigger  
    // attach if kubelet is responsible for attaching volumes.
    // If underlying PVC was resized while in-use then this function also handles volume
    // resizing.
    rc.mountOrAttachVolumes()  
...
}

From there, the call chain down to the emptyDir plugin is:

mountOrAttachVolumes -> mountAttachedVolumes -> MountVolume -> GenerateMountVolumeFunc -> SetUp -> SetUpAt

// SetUpAt creates new directory.
func (ed *emptyDir) SetUpAt(dir string, mounterArgs volume.MounterArgs) error {
	notMnt, err := ed.mounter.IsLikelyNotMountPoint(dir)
	// Getting an os.IsNotExist err from is a contingency; the directory
	// may not exist yet, in which case, setup should run.
	if err != nil && !os.IsNotExist(err) {
		return err
	}

	// If the plugin readiness file is present for this volume, and the
	// storage medium is the default, then the volume is ready.  If the
	// medium is memory, and a mountpoint is present, then the volume is
	// ready.
	readyDir := ed.getMetaDir()
	if volumeutil.IsReady(readyDir) {
		if ed.medium == v1.StorageMediumMemory && !notMnt {
			return nil
		} else if ed.medium == v1.StorageMediumDefault {
			// Further check dir exists
			if _, err := os.Stat(dir); err == nil {
				klog.V(6).InfoS("Dir exists, so check and assign quota if the underlying medium supports quotas", "dir", dir)
				err = ed.assignQuota(dir, mounterArgs.DesiredSize)
				return err
			}
			// This situation should not happen unless user manually delete volume dir.
			// In this case, delete ready file and print a warning for it.
			klog.Warningf("volume ready file dir %s exist, but volume dir %s does not. Remove ready dir", readyDir, dir)
			if err := os.RemoveAll(readyDir); err != nil && !os.IsNotExist(err) {
				klog.Warningf("failed to remove ready dir [%s]: %v", readyDir, err)
			}
		}
	}

	switch {
	case ed.medium == v1.StorageMediumDefault:
		err = ed.setupDir(dir)
	case ed.medium == v1.StorageMediumMemory:
		err = ed.setupTmpfs(dir)
	case v1helper.IsHugePageMedium(ed.medium):
		err = ed.setupHugepages(dir)
	default:
		err = fmt.Errorf("unknown storage medium %q", ed.medium)
	}

	ownershipChanger := volume.NewVolumeOwnership(ed, dir, mounterArgs.FsGroup, nil /*fsGroupChangePolicy*/, volumeutil.FSGroupCompleteHook(ed.plugin, nil))
	_ = ownershipChanger.ChangePermissions()

	// If setting up the quota fails, just log a message but don't actually error out.
	// We'll use the old du mechanism in this case, at least until we support
	// enforcement.
	if err == nil {
		volumeutil.SetReady(ed.getMetaDir())
		err = ed.assignQuota(dir, mounterArgs.DesiredSize)
	}
	return err
}

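SetUpAt sets up an emptyDir volume at the given directory. Its main steps are:

Check the mount point state
- ed.mounter.IsLikelyNotMountPoint(dir) checks whether the directory is already a mount point
- A missing directory is a normal case; setup simply continues
Check whether the volume is already ready
- ed.getMetaDir() returns the metadata directory, which is checked for the ready marker
- If the volume is ready, its medium is memory (StorageMediumMemory), and it is already mounted, return immediately
- If the medium is the default (StorageMediumDefault) and the directory exists, assign the quota and return
Set up the volume according to the storage medium
- StorageMediumDefault: call ed.setupDir(dir) to create a plain directory
- StorageMediumMemory: call ed.setupTmpfs(dir) to set up a tmpfs in-memory file system
- HugePages: call ed.setupHugepages(dir) to set up a huge-page file system
- Anything else: return an error
Set file permissions and ownership
- volume.NewVolumeOwnership creates the ownership changer
- ChangePermissions() changes the directory's permissions and group ownership
Assign the quota and mark the volume ready
- volumeutil.SetReady() marks the volume as ready
- ed.assignQuota() assigns a storage quota on file systems that support quotas

Together these steps make up the full setup flow of an emptyDir volume and ensure it is created and configured at the given path.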

Since the default medium is what is normally used in practice, let's look at how the default, disk-backed volume is set up:

// setupDir creates the directory with the default permissions specified by the perm constant.
func (ed *emptyDir) setupDir(dir string) error {  
    // Create the directory if it doesn't already exist.  
    if err := os.MkdirAll(dir, perm); err != nil {  
       return err  
    }  
  
    // stat the directory to read permission bits  
    fileinfo, err := os.Lstat(dir)  
    if err != nil {  
       return err  
    }  
  
    if fileinfo.Mode().Perm() != perm.Perm() {  
       // If the permissions on the created directory are wrong, the  
       // kubelet is probably running with a umask set.  In order to
       // avoid clearing the umask for the entire process or locking
       // the thread, clearing the umask, creating the dir, restoring
       // the umask, and unlocking the thread, we do a chmod to set
       // the specific bits we need.
       err := os.Chmod(dir, perm)
       if err != nil {  
          return err  
       }  
  
       fileinfo, err = os.Lstat(dir)  
       if err != nil {  
          return err  
       }  
  
       if fileinfo.Mode().Perm() != perm.Perm() {  
          klog.Errorf("Expected directory %q permissions to be: %s; got: %s", dir, perm.Perm(), fileinfo.Mode().Perm())  
       }  
    }  
  
    return nil  
}

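setupDir creates the directory for an emptyDir volume and sets its permissions. With the default medium, an emptyDir is therefore nothing more than a plain directory that the kubelet creates on the node, which is why its data cannot outlive the Pod on that node. The function does the following:

Create the directory
- os.MkdirAll creates the directory at the given path; if it already exists, nothing is created again
- The directory is created with the predefined permission perm (0777)
Verify and correct the directory permissions
- os.Lstat reads the directory's file info so the actual permissions can be checked
- If the actual permissions differ from the expected perm.Perm():
  - os.Chmod forces the directory permissions to the expected value
  - The permissions are checked again
  - If they are still wrong, an error is logged
Handle the effect of umask
- When the kubelet runs with a umask set, the created directory may end up with the wrong permissions
- To avoid changing the umask of the whole process, the function calls os.Chmod directly to make sure the directory has the right permissions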