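1. Background
Troubleshooting a k8s cluster can be a real pain.
Today a colleague came to me: nodes were reporting "PLEG is not healthy", and some nodes in the cluster had gone NotReady. What causes this?
2. Kubernetes source code analysis
"PLEG is not healthy" is a problem that comes up fairly often.
PLEG stands for Pod Lifecycle Event Generator.
First, note that the PLEG code lives in kubelet. Here is the comment from the kubelet source: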
// GenericPLEG is an extremely simple generic PLEG that relies solely on
// periodic listing to discover container changes. It should be used
// as temporary replacement for container runtimes do not support a proper
// event generator yet.
//
// Note that GenericPLEG assumes that a container would not be created,
// terminated, and garbage collected within one relist period. If such an
// incident happens, GenenricPLEG would miss all events regarding this
// container. In the case of relisting failure, the window may become longer.
// Note that this assumption is not unique -- many kubelet internal components
// rely on terminated containers as tombstones for bookkeeping purposes. The
// garbage collector is implemented to work with such situations. However, to
// guarantee that kubelet can handle missing container events, it is
// recommended to set the relist period short and have an auxiliary, longer
// periodic sync in kubelet as the safety net.
type GenericPLEG struct {
	// The period for relisting.
	relistPeriod time.Duration
	// The container runtime.
	runtime kubecontainer.Runtime
	// The channel from which the subscriber listens events.
	eventChannel chan *PodLifecycleEvent
	// The internal cache for pod/container information.
	podRecords podRecords
	// Time of the last relisting.
	relistTime atomic.Value
	// Cache for storing the runtime states required for syncing pods.
	cache kubecontainer.Cache
	// For testability.
	clock clock.Clock
	// Pods that failed to have their status retrieved during a relist. These pods will be
	// retried during the next relisting.
	podsToReinspect map[types.UID]*kubecontainer.Pod
}
In other words, kubelet periodically pulls the list of pods from the container runtime and records the result.
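To make the "list periodically, then compare" idea concrete, here is a minimal sketch in Go. It is not kubelet's implementation: the Snapshot, Event, and diffSnapshots names and the event type strings are hypothetical and exist only for illustration. It shows how two consecutive listings can be compared to derive lifecycle events.

```go
package main

import (
	"fmt"
	"sort"
)

// ContainerState is a simplified stand-in for the state a runtime reports.
type ContainerState string

const (
	StateRunning ContainerState = "running"
	StateExited  ContainerState = "exited"
)

// Event is a hypothetical, simplified lifecycle event
// (not kubelet's PodLifecycleEvent).
type Event struct {
	ContainerID string
	Type        string
}

// Snapshot maps a container ID to the state seen in one listing.
type Snapshot map[string]ContainerState

// diffSnapshots compares the previous and current listings and returns
// one event per container whose state changed or that disappeared.
func diffSnapshots(old, cur Snapshot) []Event {
	var events []Event
	for id, state := range cur {
		if prev, seen := old[id]; seen && prev == state {
			continue // unchanged since the last listing
		}
		switch state {
		case StateRunning:
			events = append(events, Event{ContainerID: id, Type: "ContainerStarted"})
		case StateExited:
			events = append(events, Event{ContainerID: id, Type: "ContainerDied"})
		}
	}
	for id := range old {
		if _, ok := cur[id]; !ok {
			events = append(events, Event{ContainerID: id, Type: "ContainerRemoved"})
		}
	}
	// Map iteration order is random, so sort for a stable result.
	sort.Slice(events, func(i, j int) bool {
		return events[i].ContainerID < events[j].ContainerID
	})
	return events
}

func main() {
	old := Snapshot{"a": StateRunning, "b": StateRunning}
	cur := Snapshot{"a": StateRunning, "b": StateExited, "c": StateRunning}
	for _, e := range diffSnapshots(old, cur) {
		fmt.Println(e.ContainerID, e.Type)
	}
}
```

The sketch also makes the caveat in the source comment visible: a container that is created, exits, and is garbage collected between two listings never appears in either snapshot, so no event is produced for it.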
Once the PLEG is started, it runs a periodic task that calls the relist function on a timer:
// Start spawns a goroutine to relist periodically.
func (g *GenericPLEG) Start() {
	go wait.Until(g.relist, g.relistPeriod, wait.NeverStop)
}

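This article discusses the "PLEG is not healthy" problem encountered in a Kubernetes cluster. Source code analysis shows that it was caused by the container runtime (such as Docker or containerd) timing out and by too many open files. The article explains how kubelet works and how to locate and fix the problem, including checking the OpenFilesLimit setting on the CRI server side.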
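A runtime that times out shows up on the kubelet side as a relist that has not completed for too long. Below is a minimal sketch of that staleness check, assuming a three-minute threshold (to my knowledge the value upstream kubelet has used, but treat it as an assumption and check your version). The function names and error text are hypothetical, not kubelet's API.

```go
package main

import (
	"fmt"
	"time"
)

// relistThreshold is an assumed upper bound on the time since the last
// successful relist before the PLEG is considered unhealthy.
const relistThreshold = 3 * time.Minute

// checkElapsed returns an error when the time since the last relist
// exceeds the threshold, and nil otherwise.
func checkElapsed(elapsed time.Duration) error {
	if elapsed > relistThreshold {
		return fmt.Errorf("relist last completed %v ago; threshold is %v",
			elapsed, relistThreshold)
	}
	return nil
}

// healthy mirrors the idea of the relistTime field: it compares the
// recorded time of the last relist against the current time.
func healthy(lastRelist, now time.Time) error {
	if lastRelist.IsZero() {
		return fmt.Errorf("no relist has completed yet")
	}
	return checkElapsed(now.Sub(lastRelist))
}

func main() {
	now := time.Now()
	// A relist that finished 10 seconds ago is well within the threshold.
	fmt.Println("recent relist:", healthy(now.Add(-10*time.Second), now))
	// A relist stuck for 5 minutes (e.g. the runtime is hanging) is not.
	fmt.Println("stale relist:", healthy(now.Add(-5*time.Minute), now))
}
```

The point of the sketch is that the check looks only at a timestamp: any slow runtime call that keeps relist from finishing, for whatever reason, eventually trips it.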