docker hang 死阻塞 kubelet 初始化流程

背景

最近升级了一版kubelet，修复因kubelet删除Sidecar类型Pod慢导致平台删除集群超时的问题。在灰度redis隔离集群的时候，发现升级kubelet并重启服务后，少量宿主状态变成了NotReady，并且回滚kubelet至之前版本，宿主状态仍然是NotReady。查看宿主状态时提示 ‘container runtime is down’ ，显示容器运行时出了问题。

我们使用的容器运行时是docker。我们就去检查docker的状态，检测结果如下：

docker ps 查看所有容器列表，执行正常
docker inspect 查看容器详细状态，某一容器执行阻塞

典型的docker hang死行为。我们最近在升级docker版本，存量宿主docker的版本为1.13.1，并且在逐步升级至18.06.3，新宿主的docker版本都是18.06.3。docker hang死问题在1.13.1版本上表现得更彻底，在执行docker ps的时候就已经hang死了；而docker 18.06.3做了一点小小的优化，在执行docker ps时去掉了容器级别的加锁操作。但是很多docker命令在执行前都会申请容器锁，因此一旦某一个容器出现问题，并不会造成docker服务不可响应，受影响的也仅仅是该容器，无法执行操作。

至于为什么以docker ps与docker inspect为指标检查docker状态，因为kubelet就是依赖这两个docker命令获取容器状态。

所以，现在问题有二：

docker hang死的根因是什么？
docker hang死时，为什么重启kubelet，会导致宿主状态变为NotReady？

docker hang死的排查详见：docker-hang-死排查之旅。现在我们再来分析，当容器异常时，为什么重启kubelet，宿主的状态会从Ready变成NotReady。

宿主状态生成机制

在问题排查之前，我们需要先了解宿主状态的生成机制。

宿主的所有状态都是node.Status的属性，因此我们直接定位kubelet设置node.Status的代码即可。

// Run starts the kubelet reacting to config updates
func (kl *Kubelet) Run(updates <-chan kubetypes.PodUpdate) {
  ......
  if kl.kubeClient != nil {
		// Start syncing node status immediately, this may set up things the runtime needs to run.
		go wait.Until(kl.syncNodeStatus, kl.nodeStatusUpdateFrequency, wait.NeverStop)
		go kl.fastStatusUpdateOnce()
	}
  ......
}

kubelet在启动时创建了一个goroutine，周期性地向apiserver同步本宿主的状态，同步周期默认是10s。

跟踪调用链路，我们可以看到kubelet针对宿主会设置多个Condition，表明宿主当前所处的状态，比如宿主内存是否告急、线程数是否告急，以及宿主是否就绪。

// defaultNodeStatusFuncs is a factory that generates the default set of
// setNodeStatus funcs
func (kl *Kubelet) defaultNodeStatusFuncs() []func(*v1.Node) error {
   ......
   setters = append(setters,
      nodestatus.OutOfDiskCondition(kl.clock.Now, kl.recordNodeStatusEvent),
      nodestatus.MemoryPressureCondition(kl.clock.Now, kl.evictionManager.IsUnderMemoryPressure, kl.recordNodeStatusEvent),
      nodestatus.DiskPressureCondition(kl.clock.Now, kl.evictionManager.IsUnderDiskPressure, kl.recordNodeStatusEvent),
      nodestatus.PIDPressureCondition(kl.clock.Now, kl.evictionManager.IsUnderPIDPressure, kl.recordNodeStatusEvent),
      nodestatus.ReadyCondition(kl.clock.Now, kl.runtimeState.runtimeErrors, kl.runtimeState.networkErrors, validateHostFunc, kl.containerManager.Status, kl.recordNodeStatusEvent),
      nodestatus.VolumesInUse(kl.volumeManager.ReconcilerStatesHasBeenSynced, kl.volumeManager.GetVolumesInUse),
      // TODO(mtaufen): I decided not to move this setter for now, since all it does is send an event
      // and record state back to the Kubelet runtime object. In the future, I'd like to isolate
      // these side-effects by decoupling the decisions to send events and partial status recording
      // from the Node setters.
      kl.recordNodeSchedulableEvent,
   )
   return setters
}

其中Ready Condition表明宿主是否就绪，kubectl查看宿主状态时，展示的Status信息就是Ready Condition的状态，常见的状态及其含义定义如下：

Ready状态：表明宿主状态一切OK，能正常响应Pod事件
NotReady状态：表明宿主的kubelet仍在运行，但是此时已经无法处理Pod事件。NotReady绝大多数情况都是由容器运行时异常导致
Unknown状态：表明宿主上的kubelet已停止运行

kubelet定义的Ready Condition的判定条件如下：

// ReadyCondition returns a Setter that updates the v1.NodeReady condition on the node.
func ReadyCondition(
	nowFunc func() time.Time, // typically Kubelet.clock.Now
	runtimeErrorsFunc func() []string, // typically Kubelet.runtimeState.runtimeErrors
	networkErrorsFunc func() []string, // typically Kubelet.runtimeState.networkErrors
	appArmorValidateHostFunc func() error, // typically Kubelet.appArmorValidator.ValidateHost, might be nil depending on whether there was an appArmorValidator
	cmStatusFunc func() cm.Status, // typically Kubelet.containerManager.Status
	recordEventFunc func(eventType, event string), // typically Kubelet.recordNodeStatusEvent
) Setter {
	return func(node *v1.Node) error {
		......
		rs := append(runtimeErrorsFunc(), networkErrorsFunc()...)
    requiredCapacities := []v1.ResourceName{v1.ResourceCPU, v1.ResourceMemory, v1.ResourcePods}
		if utilfeature.DefaultFeatureGate.Enabled(features.LocalStorageCapacityIsolation) {
			requiredCapacities = append(requiredCapacities, v1.ResourceEphemeralStorage)
		}
		missingCapacities := []string{}
		for _, resource := range requiredCapacities {
			if _, found := node.Status.Capacity[resource]; !found {
				missingCapacities = append(missingCapacities, string(resource))
			}
		}
		if len(missingCapacities) > 0 {
			rs = append(rs, fmt.Sprintf("Missing node capacity for resources: %s", strings.Join(missingCapacities, ", ")))
		}
    if len(rs) > 0 {
			newNodeReadyCondition = v1.NodeCondition{
				Type:              v1.NodeReady,
				Status:            v1.ConditionFalse,
				Reason:            "KubeletNotReady",
				Message:           strings.Join(rs, ","),
				LastHeartbeatTime: currentTime,
			}
		}
		......
	}
}

以上代码片段显示，宿主是否Ready取决于很多条件，包含运行时判定、网络判定、基本资源判定等。

宿主状态变化定位

接下来，我们将重点放在运行时判定，分析宿主状态发生变化的原因。运行时判定条件定义如下：

func (s *runtimeState) runtimeErrors() []string {
   s.RLock()
   defer s.RUnlock()
   var ret []string
   if !s.lastBaseRuntimeSync.Add(s.baseRuntimeSyncThreshold).After(time.Now()) {  // 1
      ret = append(ret, "container runtime is down")
   }
   if s.internalError != nil {
      ret = append(ret, s.internalError.Error())
   }
   for _, hc := range s.healthChecks {                                            // 2
      if ok, err := hc.fn(); !ok {
         ret = append(ret, fmt.Sprintf("%s is not healthy: %v", hc.name, err))
      }
   }
   return ret
}

当出现如下两种状况之一时，则判定运行时检查不通过：

当前时间距最近一次运行时同步操作 (lastBaseRuntimeSync) 的时间间隔超过指定阈值（默认30s）
运行时健康检查未通过

那么，当时宿主的NotReady是由哪种状况引起的呢？结合kubelet日志分析，kubelet每隔5s就输出一条日志：

......
I0715 10:43:28.049240   16315 kubelet.go:1835] skipping pod synchronization - [container runtime is down]
I0715 10:43:33.049359   16315 kubelet.go:1835] skipping pod synchronization - [container runtime is down]
I0715 10:43:38.049492   16315 kubelet.go:1835] skipping pod synchronization - [container runtime is down]
......

因此，状况1是宿主NotReady的元凶。

我们继续分析为什么kubelet没有按照预期设置lastBaseRuntimeSync。kubelet启动时会创建一个goroutine，并在该goroutine中循环设置lastBaseRuntimeSync，循环如下：

func (kl *Kubelet) Run(updates <-chan kubetypes.PodUpdate) {
   ......
   go wait.Until(kl.updateRuntimeUp, 5*time.Second, wait.NeverStop)
   ......
}

func (kl *Kubelet) updateRuntimeUp() {
   kl.updateRuntimeMux.Lock()
   defer kl.updateRuntimeMux.Unlock()
   ......
   kl.oneTimeInitializer.Do(kl.initializeRuntimeDependentModules)
   kl.runtimeState.setRuntimeSync(kl.clock.Now())
}

func (s *runtimeState) setRuntimeSync(t time.Time) {
	 s.Lock()
	 defer s.Unlock()
	 s.lastBaseRuntimeSync = t
}

正常情况下，kubelet每隔5s会将lastBaseRuntimeSync设置为当前时间，而宿主状态异常时，这个时间戳一直未被更新。也即updateRuntimeUp一直被阻塞在设置lastBaseRuntimeSync之前的某一步。因此，我们只需逐个排查updateRuntimeUp内的函数调用即可，具体过程不再展示，最终的函数调用链路如下：

initializeRuntimeDependentModules -> kl.cadvisor.Start -> cc.Manager.Start -> self.createContainer -> m.createContainerLocked -> container.NewContainerHandler -> factory.CanHandleAndAccept -> self.client.ContainerInspect

调用链路显示，最后cadvisor执行docker inspect时被hang死，阻塞了kubelet的一个关键初始化流程。

如果容器异常发生在kubelet初始化完毕之后，则不会对宿主的Ready状态造成任何影响，因为oneTimeInitializer是sync.Once类型，也即仅仅会在kubelet启动时执行一次。那时容器异常对kubelet的影响有限，仅仅是不能处理该Pod的任何事件，包含删除、变更等，但是仍然能够处理其他Pod的事件。

这就解释了为什么kubelet重启前宿主状态是Ready，重启后状态是NotReady。

后续

可能有人会问，为什么cadvisor执行docker inspect操作不加超时控制？确实，如果添加了超时控制，重启kubelet不会引起宿主状态变更。个人觉得添加超时控制没有什么问题，不清楚是否有啥坑，待详细挖掘后再来补充。