Pod terminating troubleshooting journey (Part 2)

Background

Recently, several cases of Pod deletion failures were reported against production, with users' repeated deletion attempts all ending in failure. A failed Pod deletion hurts in two main ways:

  • For users: the user experience degrades, and subsequent release workflows cannot be executed against the Pod
  • For Elastic Cloud: the failure rate goes up and the SLA cannot be met

What's more, as the Elastic Cloud Docker upgrade (1.13 → 18.06) progressed, Pod deletion failures in production showed a subtle upward trend.

A production issue is never a small matter! Whichever way you look at it, we needed to take the production environment's pulse and figure out which system was misbehaving.

Problem diagnosis

Since these cases showed up in production fairly often, roughly every week, that actually made locating the problem easier for us ^_^.

Troubleshooting a production issue generally comes down to two steps:

  • Locate the component where the problem occurs
  • Locate the problem within that component

Locating the component

For the Elastic Cloud team, locating the faulty component already follows a standard procedure: go top-down and see which component the problem originates from. (If you are unfamiliar with how these components talk to each other, see the earlier post on troubleshooting container deletion failures.)

1) Suspect No. 1: kubelet

In the Kubernetes architecture, kubelet is the component that actually deletes Pods, so it is only fair that it is the first one hauled in for questioning.

Helpless as it may feel, kubelet is long used to this, and after taking repeated beatings from the real world it has mastered the art of shifting blame:

I0624 11:00:26.658872   21280 kubelet.go:1923] skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 3m0.439656895s ago; threshold is 3m0s]

What does that mean? Roughly: better someone else takes the fall than me.

PLEG is not healthy? PLEG (Pod Lifecycle Event Generator) is a submodule of kubelet that tracks the running state of the underlying containers in a unified way.
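The "PLEG is not healthy" line comes from kubelet's relist health check: kubelet records when the last relist (a full query of the container runtime) completed, and declares PLEG unhealthy once that timestamp is older than a fixed threshold (3 minutes, matching the log above). Below is a minimal, self-contained sketch of that check; it is an illustrative reconstruction, not kubelet's exact code:

package main

import (
    "fmt"
    "time"
)

// relistThreshold mirrors kubelet's 3-minute PLEG health threshold.
const relistThreshold = 3 * time.Minute

// plegHealthy reports whether the last relist finished recently enough.
// A relist that hangs on a stuck container runtime keeps this returning false,
// which is why kubelet skips pod synchronization and points the finger at docker.
func plegHealthy(lastRelist time.Time) (bool, error) {
    elapsed := time.Since(lastRelist)
    if elapsed > relistThreshold {
        return false, fmt.Errorf("pleg was last seen active %v ago; threshold is %v", elapsed, relistThreshold)
    }
    return true, nil
}

func main() {
    ok, err := plegHealthy(time.Now().Add(-4 * time.Minute))
    fmt.Println(ok, err)
}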

kubelet's statement: I'm innocent, don't pin this on me; it's all docker's fault!

2) Suspect No. 2: dockerd

Following kubelet's testimony, we quickly brought in the second suspect in the case: dockerd.

To prove its innocence, dockerd briskly ran through its standard drill:

# docker ps
// runs fine; dockerd breathes a sigh of relief
# docker ps -a | grep -v NAMES | awk '{print $1}' | xargs -ti docker inspect -f {{.State.Pid}} {}
// the command hangs at: docker inspect -f {{.State.Pid}} 60f253d59f26

That container happens to belong to the very Pod the user failed to delete.

You should have seen dockerd's face at that moment: ashen, muttering to itself over and over, "Could it really be my fault?"

A good few minutes later, dockerd slowly came around, collected its thoughts, and worked out its own buck-passing routine: the problem may have surfaced on my turf, but your evidence is insufficient and you cannot prove I did it with my own hands. I have a whole crew under me; maybe one of the underlings did it behind my back.

Oh, so now you have excuses. Fine, we will keep collecting evidence until the case against you is airtight.

3) Suspect No. 3: containerd

As the second-in-command under dockerd, containerd was the first we summoned. It looks honest and dependable; reading dockerd's testimony, it gave a wry smile and grumbled: after all these years following the boss, it still hasn't earned the boss's trust. (Is that why Kubernetes moved containerd into the boss's seat in version 1.20? Ha.)

containerd methodically rolled out its three standard moves:

# docker-containerd-ctr -a /var/run/docker/containerd/docker-containerd.sock -n moby c ls
// runs fine
# docker-containerd-ctr -a /var/run/docker/containerd/docker-containerd.sock -n moby t ls
// hangs; containerd's face stiffens for a moment, then it quickly recovers
# docker-containerd-ctr -a /var/run/docker/containerd/docker-containerd.sock -n moby c ls | grep -v IMAGE | awk '{print $1}' | xargs -ti docker-containerd-ctr -a /var/run/docker/containerd/docker-containerd.sock -n moby t ps {}
// the command hangs at: docker-containerd-ctr -a /var/run/docker/containerd/docker-containerd.sock -n moby t ps 60f253d59f26e1c573d4ba5f824e73b3a4b1bb1629edace85caba4c620755d4d

Looking a little uneasy, containerd said: "Sorry, officer, it was probably that good-for-nothing kid of mine; he has gotten into trouble before, too (see the earlier docker hang investigation)."

4) Suspect No. 4: containerd-shim

Before containerd could finish, its son containerd-shim, who had been standing nearby, jumped out, pointed at containerd and said defiantly: "What makes you think it was me? I haven't even brought up all the shady things you've done yourself."

Even an upright judge struggles to settle a family dispute! At this point, the available evidence really could not tell whether the father or the son committed the crime.

5) Suspect No. 5: runc

Left with no better option, we summoned the last and youngest suspect in the case: runc. All runc could do was babble: "Not me! Not me!"

# docker-runc --root /var/run/docker/runtime-runc/moby/ list
60f253d59f26e1c573d4ba5f824e73b3a4b1bb1629edace85caba4c620755d4d 0 stopped /run/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/60f253d59f26e1c573d4ba5f824e73b3a4b1bb1629edace85caba4c620755d4d 2021-05-07T12:43:02.62261156Z root

The evidence collected at the scene showed that this case really had little to do with runc: it had already left the scene by the time the crime happened, merely leaving behind a mess for someone else to clean up.

Judging from everyone's testimony, the suspects were narrowed down to containerd and containerd-shim, but which of the two it was remained hard to say.

Just as everyone was at a loss, a veteran detective pulled some new clues from the scene, and the case finally moved forward.

6) New clues

The veteran detective walked everyone through the new clues he had collected:

# sudo docker ps -a | grep PODNAME
60f253d59f26 image "/dockerinit" 6 weeks ago Up 6 weeks k8s_CNAME_PODNAME_default_013e3b0e-8d17-11eb-8ef7-246e9693e13c_10
6e6fc586dc12 pause:3.1 "/pause" 6 weeks ago Exited (0) About an hour ago k8s_POD_PODNAME_default_013e3b0e-8d17-11eb-8ef7-246e9693e13c_1


# ps -ef|grep 60f253d59f26
root 119820 3608 0 May07 ? 00:11:16 docker-containerd-shim -namespace moby -workdir /docker/docker_rt/containerd/daemon/io.containerd.runtime.v1.linux/moby/60f253d59f26e1c573d4ba5f824e73b3a4b1bb1629edace85caba4c620755d4d -address /var/run/docker/containerd/docker-containerd.sock -containerd-binary /usr/bin/docker-containerd -runtime-root /var/run/docker/runtime-runc
stupig 1629698 1599793 0 11:44 pts/0 00:00:00 grep --color=auto 60f253d59f26


# ps -ef|grep 119820
root 40825 119820 0 Jun04 ? 00:00:00 [su] <defunct>
76183 40833 119820 0 Jun04 ? 00:00:00 [bash] <defunct>
root 119820 3608 0 May07 ? 00:11:16 docker-containerd-shim -namespace moby -workdir /docker/docker_rt/containerd/daemon/io.containerd.runtime.v1.linux/moby/60f253d59f26e1c573d4ba5f824e73b3a4b1bb1629edace85caba4c620755d4d -address /var/run/docker/containerd/docker-containerd.sock -containerd-binary /usr/bin/docker-containerd -runtime-root /var/run/docker/runtime-runc
root 119886 119820 0 May07 ? 00:04:29 [dockerinit] <defunct>
root 568896 119820 0 Jun07 ? 00:00:00 [su] <defunct>
76183 568898 119820 0 Jun07 ? 00:00:00 [bash] <defunct>
root 695031 119820 0 May08 ? 00:00:00 [su] <defunct>
74647 695037 119820 0 May08 ? 00:00:00 [bash] <defunct>
root 802705 119820 0 Jun09 ? 00:00:00 [su] <defunct>
74647 802709 119820 0 Jun09 ? 00:00:00 [bash] <defunct>
root 865131 119820 0 Jun07 ? 00:00:00 [su] <defunct>
69099 865133 119820 0 Jun07 ? 00:00:00 [bash] <defunct>
root 1073407 119820 0 Jun23 ? 00:00:00 [su] <defunct>
76183 1073428 119820 0 Jun23 ? 00:00:00 [bash] <defunct>
root 1375526 119820 0 Jun22 ? 00:00:00 [su] <defunct>
69099 1375561 119820 0 Jun22 ? 00:00:00 [bash] <defunct>
root 1397568 119820 0 Jun16 ? 00:00:00 [su] <defunct>
69099 1397570 119820 0 Jun16 ? 00:00:00 [bash] <defunct>
root 1483339 119820 0 Jun23 ? 00:00:00 [su] <defunct>
76183 1483341 119820 0 Jun23 ? 00:00:00 [bash] <defunct>
stupig 1631234 1599793 0 11:44 pts/0 00:00:00 grep --color=auto 119820
root 1692888 119820 0 Jun07 ? 00:00:00 [su] <defunct>
76183 1692903 119820 0 Jun07 ? 00:00:00 [bash] <defunct>
root 1882984 119820 0 Jun21 ? 00:00:00 [su] <defunct>
76183 1882985 119820 0 Jun21 ? 00:00:00 [bash] <defunct>
root 1964311 119820 0 Jun07 ? 00:00:00 [su] <defunct>
76183 1964318 119820 0 Jun07 ? 00:00:00 [bash] <defunct>
root 2019760 119820 0 Jun07 ? 00:00:00 [su] <defunct>
76183 2019784 119820 0 Jun07 ? 00:00:00 [bash] <defunct>
root 2122420 119820 0 Jun07 ? 00:00:00 [su] <defunct>
76183 2122434 119820 0 Jun07 ? 00:00:00 [bash] <defunct>
root 2288703 119820 0 Jun09 ? 00:00:00 [su] <defunct>
69099 2288705 119820 0 Jun09 ? 00:00:00 [bash] <defunct>
root 2330164 119820 0 Jun07 ? 00:00:00 [su] <defunct>
76183 2330166 119820 0 Jun07 ? 00:00:00 [bash] <defunct>
root 2406740 119820 0 May27 ? 00:00:00 [su] <defunct>
74647 2406745 119820 0 May27 ? 00:00:00 [bash] <defunct>
root 2421050 119820 0 Jun07 ? 00:00:00 [su] <defunct>
76183 2421069 119820 0 Jun07 ? 00:00:00 [bash] <defunct>
root 2445918 119820 0 Jun22 ? 00:00:00 [su] <defunct>
76183 2445927 119820 0 Jun22 ? 00:00:00 [bash] <defunct>
root 2487600 119820 0 Jun22 ? 00:00:00 [su] <defunct>
76183 2487602 119820 0 Jun22 ? 00:00:00 [bash] <defunct>
root 2897660 119820 0 Jun07 ? 00:00:00 [su] <defunct>
76183 2897662 119820 0 Jun07 ? 00:00:00 [bash] <defunct>

Within the same Pod, the pause container had already exited, but the business container had not; moreover, the containerd-shim process associated with the business container was not reaping its child processes at all, as if it were stuck.

Confronted with this new evidence, containerd-shim broke down without warning and wailed: "Why is it always me?"

Locating the problem

Back to business: how can we quickly pinpoint the problem based on the symptoms above? There are three approaches:

  1. Ask Google with the symptoms in hand
  2. Read the code with the question in mind
  3. Dig deep into the live scene to pin down the problem

1) The Google way

When we searched Google with the observed symptoms, we actually found something directly related: Exec process may cause shim hang.

What that issue describes matches the problem we hit almost exactly. The root cause is that a channel involved in reaper.Default is sized too small; enlarging the channel works around the problem.
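For context, every subscriber gets its exit-event channel from reaper.Default.Subscribe(), and in the containerd version we were running that channel has a small fixed buffer (32, as we read the code). The sketch below shows the shape of that code; it is a simplified illustration, not a verbatim quote of containerd's source:

package reaper

import (
    "sync"

    runc "github.com/containerd/go-runc"
)

// Monitor fans reaped exit events out to subscribers (simplified sketch).
type Monitor struct {
    sync.Mutex
    subscribers map[chan runc.Exit]struct{}
}

// Subscribe hands out a buffered exit-event channel. Reap() later pushes every
// reaped exit into every subscriber channel while holding this mutex, so a
// subscriber whose 32-slot buffer fills up and is never drained is enough to
// block Reap() with the lock held, which is the situation the issue describes.
func (m *Monitor) Subscribe() chan runc.Exit {
    c := make(chan runc.Exit, 32)
    m.Lock()
    m.subscribers[c] = struct{}{}
    m.Unlock()
    return c
}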

2) Understanding the code

Even though the problem can be worked around, we still want to understand what exactly goes wrong in the code.

containerd-shim's day job is to run runc commands; its main responsibilities are to host the real container process and to expose a service through which external callers interact with the container.

containerd-shim's internal processing logic is sketched in the figure below:

(Figure: simplified containerd-shim architecture)

GRPC service: the core service of containerd-shim. It exposes a number of APIs, such as creating and starting tasks, and invokes runc to run the corresponding commands.

In addition, containerd-shim runs three goroutines (including the main goroutine) that together handle exit events for processes inside the container. First, the main goroutine, handleSignals:

func handleSignals(logger *logrus.Entry, server *ttrpc.Server, sv *shim.Service) error {
    // note: upstream splits the channel setup into a separate helper; it is inlined here for readability
    signals := make(chan os.Signal, 32)
    signal.Notify(signals, unix.SIGTERM, unix.SIGINT, unix.SIGCHLD, unix.SIGPIPE)
    runc.Monitor = reaper.Default
    // set the shim as the subreaper for all orphaned processes created by the container
    if err := system.SetSubreaper(1); err != nil {
        return err
    }
    for {
        select {
        case s := <-signals:
            switch s {
            case unix.SIGCHLD:
                if err := reaper.Reap(); err != nil {
                    logger.WithError(err).Error("reap exit status")
                }
            case unix.SIGTERM, unix.SIGINT:
                // shim shutdown handling
            }
        }
    }
}

containerd-shim calls system.SetSubreaper to make itself the reaper for processes coming out of the container. The container's PID 1 can usually reap zombie processes on its own, so what containerd-shim mostly ends up reaping are the processes injected into the container via runc exec.
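system.SetSubreaper is essentially a wrapper around the PR_SET_CHILD_SUBREAPER option of prctl(2). A minimal, standalone illustration of the same idea (demo code, not the shim's source):

package main

import (
    "fmt"

    "golang.org/x/sys/unix"
)

// Mark the current process as a "child subreaper": orphaned descendants are
// re-parented to it instead of to init, so it receives their SIGCHLD and must
// wait() on them, which is the role containerd-shim takes on for runc exec'd processes.
func main() {
    if err := unix.Prctl(unix.PR_SET_CHILD_SUBREAPER, 1, 0, 0, 0); err != nil {
        fmt.Println("prctl(PR_SET_CHILD_SUBREAPER) failed:", err)
        return
    }
    fmt.Println("this process is now a subreaper for its orphaned descendants")
}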

When zombie processes show up, the reaping logic runs:

func Reap() error {
    now := time.Now()
    exits, err := sys.Reap(false) // call the wait syscall to collect zombie processes
    Default.Lock()
    for c := range Default.subscribers { // forward the exit events to every subscriber
        for _, e := range exits {
            c <- runc.Exit{
                Timestamp: now,
                Pid:       e.Pid,
                Status:    e.Status,
            }
        }
    }
    Default.Unlock()
    return err
}

So who are the subscribers? The shim subscribes to process exit events during initialization:

func NewService(config Config, publisher events.Publisher) (*Service, error) {
    s := &Service{
        processes: make(map[string]proc.Process),
        events:    make(chan interface{}, 128),
        ec:        reaper.Default.Subscribe(), // subscribe to process exit events
    }
    go s.processExits()     // handle exit events
    go s.forward(publisher) // forward exit events
    return s, nil
}

The exit-event handling logic looks like this:

func (s *Service) processExits() {
    for e := range s.ec {
        s.checkProcesses(e)
    }
}

func (s *Service) checkProcesses(e runc.Exit) {
    s.mu.Lock()
    defer s.mu.Unlock()
    for _, p := range s.processes {
        if p.Pid() == e.Pid {
            s.events <- &eventstypes.TaskExit{}
            return
        }
    }
}

Only exit events for entries in s.processes are handled. So what ends up in s.processes? Mainly two kinds of entries:

// Create a new initial process and container with the underlying OCI runtime
func (s *Service) Create(ctx context.Context, r *shimapi.CreateTaskRequest) (*shimapi.CreateTaskResponse, error) {
    ......
    s.processes[r.ID] = process
    ......
}

// Exec an additional process inside the container
func (s *Service) Exec(ctx context.Context, r *shimapi.ExecProcessRequest) (*ptypes.Empty, error) {
    ......
    s.processes[r.ID] = process
    ......
}

That is all the code we will show here; if you are interested, feel free to read the rest yourself.

Now let's map the symptoms back onto the code. The symptoms tell us the abnormal container had accumulated quite a few zombie processes waiting to be reaped, certainly more than 32. While the shim was reaping this pile of zombies and pushing the exit events into a subscriber channel (capacity 32), the send blocked (the channel send inside Reap shown above), and at that moment it was holding the big Default.Lock.

So as soon as anyone else comes along and asks for that lock, we get a deadlock.

But who exactly would come asking for that lock? At this point it would be really helpful to see containerd-shim's goroutine stacks.

3) Analyzing the live scene

Indeed, containerd-shim starts a goroutine specifically so that users can dump goroutine stacks. Let's see whether that works for us:

func executeShim() error {
    // ...
    dump := make(chan os.Signal, 32)
    signal.Notify(dump, syscall.SIGUSR1)
    go func() {
        for range dump {
            dumpStacks(logger)
        }
    }()
    // ...
}

func dumpStacks(logger *logrus.Entry) {
    var (
        buf       []byte
        stackSize int
    )
    bufferLen := 16384
    for stackSize == len(buf) {
        buf = make([]byte, bufferLen)
        stackSize = runtime.Stack(buf, true)
        bufferLen *= 2
    }
    buf = buf[:stackSize]
    logger.Infof("=== BEGIN goroutine stack dump ===\n%s\n=== END goroutine stack dump ===", buf)
}

Everything looks fine on paper: send SIGUSR1 and we get a goroutine stack dump.

But when we actually tried it, we ended up dumping a whole lot of nothing. The root causes:

  • logger.Infof() writes to os.Stdout
  • since docker is not running in debug mode in production, containerd-shim's os.Stdout is redirected to /dev/null
COMMAND       PID USER   FD      TYPE             DEVICE SIZE/OFF       NODE NAME
docker-co 119820 root cwd DIR 0,18 120 916720892 /run/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/60f253d59f26e1c573d4ba5f824e73b3a4b1bb1629edace85caba4c620755d4d
docker-co 119820 root rtd DIR 8,3 4096 2 /
docker-co 119820 root txt REG 8,3 4173632 392525 /usr/bin/docker-containerd-shim
docker-co 119820 root 0r CHR 1,3 0t0 2052 /dev/null
docker-co 119820 root 1w CHR 1,3 0t0 2052 /dev/null
docker-co 119820 root 2w CHR 1,3 0t0 2052 /dev/null

At this point the investigation seemed stuck again! Fortunately, I had picked up a little knowledge of kernel-level troubleshooting before. Even though the production containerd-shim dumps all of its goroutine stacks into /dev/null, we still have ways to get hold of them.

We quickly asked the team's kernel expert for help and soon settled on a plan: use a kprobe so that, when the kernel writes the goroutine dump to the /dev/null device, a copy of the content is written to the kernel log. Implementing it is not complicated:

#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/kprobes.h>
#include <linux/sched.h>
#include <asm/uaccess.h>
#include <linux/slab.h>

static int pid;
module_param(pid, int, 0);

/* probe write_null(), the kernel's write handler for /dev/null */
static struct kprobe kp = {
    .symbol_name = "write_null",
};

#define SEGMENT 512

static int handler_pre(struct kprobe *p, struct pt_regs *regs)
{
    char *wbuf;
    size_t count, place = 0;

    /* only trace writes issued by the target process (the shim) */
    if (pid != current->tgid) {
        return 0;
    }

    /* write(fd, buf, count): on x86_64, buf is in %si and count in %dx */
    count = (size_t)(regs->dx);

    printk(KERN_INFO "%u call write_null count: %lu\n", current->tgid, count);

    wbuf = (char *)kmalloc(count + 1, GFP_ATOMIC);
    if (wbuf == NULL) {
        return 0;
    }
    memset(wbuf, 0x0, count + 1);

    if (copy_from_user(wbuf, (void *)regs->si, count)) {
        printk(KERN_ERR "copy_from_user fail\n");
        kfree(wbuf);
        return 0;
    }

    /* dump the buffer to the kernel log in 512-byte segments */
    while (place < count) {
        char tmp[SEGMENT + 1];
        memset(tmp, 0x0, SEGMENT + 1);

        snprintf(tmp, SEGMENT + 1, "%s", wbuf + place);
        if ((count - place) >= SEGMENT) {
            place += SEGMENT;
        } else {
            place = count;
        }
        printk(KERN_INFO "%s\n", tmp);
    }

    kfree(wbuf);

    return 0;
}

static int __init kprobe_init(void)
{
    int ret;
    kp.pre_handler = handler_pre;

    ret = register_kprobe(&kp);
    if (ret < 0) {
        printk(KERN_INFO "register_kprobe failed, returned %d\n", ret);
        return ret;
    }
    printk(KERN_INFO "Planted kprobe at %p\n", kp.addr);
    return 0;
}

static void __exit kprobe_exit(void)
{
    unregister_kprobe(&kp);
    printk(KERN_INFO "kprobe at %p unregistered\n", kp.addr);
}

module_init(kprobe_init)
module_exit(kprobe_exit)
MODULE_LICENSE("GPL");

Special thanks to Brother Rui for providing the kprobe code. If you are curious about kprobes, they are well worth studying on your own.

After deploying this kprobe kernel module in production, we sent SIGUSR1 to containerd-shim and successfully captured the goroutine stacks, finally completing the last piece of the puzzle.

P.S. Special thanks to Brother Fei (a true old hand) for the reminder that there is an even simpler way to capture the goroutine stacks: use strace to trace the process's system calls, since the dump still goes through write(2) even when the target is /dev/null.

The goroutine stacks of the production containerd-shim are shown below (a large number of unimportant goroutines have been removed and the formatting adjusted):

time="2021-06-23T16:35:07+08:00" level=info msg="=== BEGIN goroutine stack dump ===
goroutine 22 [running]:
main.dumpStacks(0xc4201c81e0)
/tmp/tmp.ma5UpKvnFZ/src/github.com/containerd/containerd/cmd/containerd-shim/main_unix.go:228 +0x8a
main.executeShim.func1(0xc42012c300, 0xc4201c81e0)
/tmp/tmp.ma5UpKvnFZ/src/github.com/containerd/containerd/cmd/containerd-shim/main_unix.go:148 +0x3d
created by main.executeShim
/tmp/tmp.ma5UpKvnFZ/src/github.com/containerd/containerd/cmd/containerd-shim/main_unix.go:146 +0x5de

goroutine 1 [chan send, 83 minutes]:
github.com/containerd/containerd/reaper.Reap(0xc420243be0, 0x1)
/tmp/tmp.ma5UpKvnFZ/src/github.com/containerd/containerd/reaper/reaper.go:44 +0x168
main.handleSignals(0xc4201c81e0, 0xc42012c240, 0xc4201d6090, 0xc4201e4000, 0xc420117ea0, 0x86)
/tmp/tmp.ma5UpKvnFZ/src/github.com/containerd/containerd/cmd/containerd-shim/main_unix.go:197 +0x2a1
main.executeShim(0x2, 0x60)
/tmp/tmp.ma5UpKvnFZ/src/github.com/containerd/containerd/cmd/containerd-shim/main_unix.go:151 +0x616
main.main()
/tmp/tmp.ma5UpKvnFZ/src/github.com/containerd/containerd/cmd/containerd-shim/main_unix.go:96 +0x81

goroutine 5 [syscall]:
os/signal.signal_recv(0x6c14c0)
/usr/local/go/src/runtime/sigqueue.go:139 +0xa6
os/signal.loop()
/usr/local/go/src/os/signal/signal_unix.go:22 +0x22
created by os/signal.init.0
/usr/local/go/src/os/signal/signal_unix.go:28 +0x41

goroutine 6 [chan receive]:
main.main.func1()
/tmp/tmp.ma5UpKvnFZ/src/github.com/containerd/containerd/cmd/containerd-shim/main_unix.go:81 +0x7b
created by main.main
/tmp/tmp.ma5UpKvnFZ/src/github.com/containerd/containerd/cmd/containerd-shim/main_unix.go:80 +0x46

goroutine 7 [select, 92427 minutes, locked to thread]:
runtime.gopark(0x6a70a0, 0x0, 0x696b74, 0x6, 0x18, 0x1)
/usr/local/go/src/runtime/proc.go:291 +0x11a
runtime.selectgo(0xc420104f50, 0xc42014a180)
/usr/local/go/src/runtime/select.go:392 +0xe50
runtime.ensureSigM.func1()
/usr/local/go/src/runtime/signal_unix.go:549 +0x1f4
runtime.goexit()
/usr/local/go/src/runtime/asm_amd64.s:2361 +0x1

goroutine 18 [semacquire, 83 minutes]:
sync.runtime_SemacquireMutex(0xc4201e4004, 0x403200)
/usr/local/go/src/runtime/sema.go:71 +0x3d
sync.(*Mutex).Lock(0xc4201e4000)
/usr/local/go/src/sync/mutex.go:134 +0x108
github.com/containerd/containerd/linux/shim.(*Service).checkProcesses(0xc4201e4000, 0xc02cd591b40305ef, 0x13af4280e2630f, 0x7fc440, 0x1c11b, 0x0)
/tmp/tmp.ma5UpKvnFZ/src/github.com/containerd/containerd/linux/shim/service.go:470 +0x45
github.com/containerd/containerd/linux/shim.(*Service).processExits(0xc4201e4000)
/tmp/tmp.ma5UpKvnFZ/src/github.com/containerd/containerd/linux/shim/service.go:465 +0xd0
created by github.com/containerd/containerd/linux/shim.NewService
/tmp/tmp.ma5UpKvnFZ/src/github.com/containerd/containerd/linux/shim/service.go:86 +0x3e9

goroutine 19 [syscall, 83 minutes]:
syscall.Syscall6(0xe8, 0x4, 0xc4201189b8, 0x80, 0xffffffffffffffff, 0x0, 0x0, 0xc420118938, 0x45b793, 0xc42047add0)
/usr/local/go/src/syscall/asm_linux_amd64.s:44 +0x5
github.com/containerd/containerd/vendor/golang.org/x/sys/unix.EpollWait(0x4, 0xc4201189b8, 0x80, 0x80, 0xffffffffffffffff, 0x5, 0x0, 0x0)
/tmp/tmp.ma5UpKvnFZ/src/github.com/containerd/containerd/vendor/golang.org/x/sys/unix/zsyscall_linux_amd64.go:1518 +0x7a
github.com/containerd/containerd/vendor/github.com/containerd/console.(*Epoller).Wait(0xc4201be060, 0xc420117aa8, 0xc420117ab0)
/tmp/tmp.ma5UpKvnFZ/src/github.com/containerd/containerd/vendor/github.com/containerd/console/console_linux.go:110 +0x7a
created by github.com/containerd/containerd/linux/shim.(*Service).initPlatform
/tmp/tmp.ma5UpKvnFZ/src/github.com/containerd/containerd/linux/shim/service_linux.go:109 +0xc6

goroutine 20 [chan receive, 83 minutes]:
github.com/containerd/containerd/linux/shim.(*Service).forward(0xc4201e4000, 0x6c0600, 0xc4201d4010)
/tmp/tmp.ma5UpKvnFZ/src/github.com/containerd/containerd/linux/shim/service.go:514 +0x62
created by github.com/containerd/containerd/linux/shim.NewService
/tmp/tmp.ma5UpKvnFZ/src/github.com/containerd/containerd/linux/shim/service.go:90 +0x49b

goroutine 21 [IO wait, 92427 minutes]:
internal/poll.runtime_pollWait(0x7f75331fcf00, 0x72, 0x0)
/usr/local/go/src/runtime/netpoll.go:173 +0x57
internal/poll.(*pollDesc).wait(0xc4201e6118, 0x72, 0xc420010100, 0x0, 0x0)
/usr/local/go/src/internal/poll/fd_poll_runtime.go:85 +0x9b
internal/poll.(*pollDesc).waitRead(0xc4201e6118, 0xffffffffffffff00, 0x0, 0x0)
/usr/local/go/src/internal/poll/fd_poll_runtime.go:90 +0x3d
internal/poll.(*FD).Accept(0xc4201e6100, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
/usr/local/go/src/internal/poll/fd_unix.go:372 +0x1a8
net.(*netFD).accept(0xc4201e6100, 0xc4201d60a0, 0xc4201d6060, 0xc4201563c0)
/usr/local/go/src/net/fd_unix.go:238 +0x42
net.(*UnixListener).accept(0xc4201d6270, 0x451c70, 0xc42011cea8, 0xc42011ceb0)
/usr/local/go/src/net/unixsock_posix.go:162 +0x32
net.(*UnixListener).Accept(0xc4201d6270, 0x6a6b10, 0xc4201563c0, 0x6c3840, 0xc420012018)
/usr/local/go/src/net/unixsock.go:253 +0x49
github.com/containerd/containerd/vendor/github.com/stevvooe/ttrpc.(*Server).Serve(0xc4201d6090, 0x6c3440, 0xc4201d6270, 0x0, 0x0)
/tmp/tmp.ma5UpKvnFZ/src/github.com/containerd/containerd/vendor/github.com/stevvooe/ttrpc/server.go:69 +0x106
main.serve.func1(0x6c3440, 0xc4201d6270, 0xc4201d6090)
/tmp/tmp.ma5UpKvnFZ/src/github.com/containerd/containerd/cmd/containerd-shim/main_unix.go:176 +0x71
created by main.serve
/tmp/tmp.ma5UpKvnFZ/src/github.com/containerd/containerd/cmd/containerd-shim/main_unix.go:174 +0x1be

goroutine 8 [select, 83 minutes]:
github.com/containerd/containerd/vendor/github.com/stevvooe/ttrpc.(*serverConn).run(0xc4201563c0, 0x6c3840, 0xc420012018)
/tmp/tmp.ma5UpKvnFZ/src/github.com/containerd/containerd/vendor/github.com/stevvooe/ttrpc/server.go:398 +0x3f0
created by github.com/containerd/containerd/vendor/github.com/stevvooe/ttrpc.(*Server).Serve
/tmp/tmp.ma5UpKvnFZ/src/github.com/containerd/containerd/vendor/github.com/stevvooe/ttrpc/server.go:109 +0x25c

goroutine 9 [IO wait, 83 minutes]:
internal/poll.runtime_pollWait(0x7f75331fce30, 0x72, 0xc42011ea48)
/usr/local/go/src/runtime/netpoll.go:173 +0x57
internal/poll.(*pollDesc).wait(0xc4201a0118, 0x72, 0xffffffffffffff00, 0x6c0a40, 0x7e11b8)
/usr/local/go/src/internal/poll/fd_poll_runtime.go:85 +0x9b
internal/poll.(*pollDesc).waitRead(0xc4201a0118, 0xc4201ed000, 0x1000, 0x1000)
/usr/local/go/src/internal/poll/fd_poll_runtime.go:90 +0x3d
internal/poll.(*FD).Read(0xc4201a0100, 0xc4201ed000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
/usr/local/go/src/internal/poll/fd_unix.go:157 +0x17d
net.(*netFD).Read(0xc4201a0100, 0xc4201ed000, 0x1000, 0x1000, 0xc42011eb30, 0x451430, 0xc420001b00)
/usr/local/go/src/net/fd_unix.go:202 +0x4f
net.(*conn).Read(0xc42000c050, 0xc4201ed000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
/usr/local/go/src/net/net.go:176 +0x6a
bufio.(*Reader).Read(0xc42012c480, 0xc4200105a0, 0xa, 0xa, 0x0, 0x2, 0x2)
/usr/local/go/src/bufio/bufio.go:216 +0x238
io.ReadAtLeast(0x6c02c0, 0xc42012c480, 0xc4200105a0, 0xa, 0xa, 0xa, 0x6c62e0, 0xc42011ef50, 0x0)
/usr/local/go/src/io/io.go:309 +0x86
io.ReadFull(0x6c02c0, 0xc42012c480, 0xc4200105a0, 0xa, 0xa, 0xc42011eef0, 0x3, 0x3)
/usr/local/go/src/io/io.go:327 +0x58
github.com/containerd/containerd/vendor/github.com/stevvooe/ttrpc.readMessageHeader(0xc4200105a0, 0xa, 0xa, 0x6c02c0, 0xc42012c480, 0xc42011ee70, 0x2, 0x2, 0xc42011eed0)
/tmp/tmp.ma5UpKvnFZ/src/github.com/containerd/containerd/vendor/github.com/stevvooe/ttrpc/channel.go:38 +0x60
github.com/containerd/containerd/vendor/github.com/stevvooe/ttrpc.(*channel).recv(0xc420010580, 0x6c3800, 0xc4200105c0, 0x0, 0xc420392940, 0x0, 0x0, 0x73, 0x0, 0x0)
/tmp/tmp.ma5UpKvnFZ/src/github.com/containerd/containerd/vendor/github.com/stevvooe/ttrpc/channel.go:86 +0x6d
github.com/containerd/containerd/vendor/github.com/stevvooe/ttrpc.(*serverConn).run.func1(0xc42014a2a0, 0xc4201563c0, 0xc42014a360, 0xc420010580, 0x6c3800, 0xc4200105c0, 0xc42014a300, 0xc42012c4e0)
/tmp/tmp.ma5UpKvnFZ/src/github.com/containerd/containerd/vendor/github.com/stevvooe/ttrpc/server.go:329 +0x1bf
created by github.com/containerd/containerd/vendor/github.com/stevvooe/ttrpc.(*serverConn).run
/tmp/tmp.ma5UpKvnFZ/src/github.com/containerd/containerd/vendor/github.com/stevvooe/ttrpc/server.go:299 +0x247

goroutine 22661144 [chan receive, 83 minutes]:
github.com/containerd/containerd/reaper.(*Monitor).Wait(0x7fb5d0, 0xc42017a580, 0xc420674360, 0x0, 0x0, 0x68c3c0)
/tmp/tmp.ma5UpKvnFZ/src/github.com/containerd/containerd/reaper/reaper.go:82 +0x52
github.com/containerd/containerd/vendor/github.com/containerd/go-runc.cmdOutput(0xc42017a580, 0x1, 0x0, 0x0, 0x0, 0x0, 0x0)
/tmp/tmp.ma5UpKvnFZ/src/github.com/containerd/containerd/vendor/github.com/containerd/go-runc/runc.go:693 +0x110
github.com/containerd/containerd/vendor/github.com/containerd/go-runc.(*Runc).runOrError(0xc42018b6c0, 0xc42017a580, 0xc4200105c0, 0xc4201d6c30)
/tmp/tmp.ma5UpKvnFZ/src/github.com/containerd/containerd/vendor/github.com/containerd/go-runc/runc.go:673 +0x19b
github.com/containerd/containerd/vendor/github.com/containerd/go-runc.(*Runc).Kill(0xc42018b6c0, 0x6c3800, 0xc4200105c0, 0xc420014480, 0x40, 0xf, 0xc42054bbc7, 0xc420393080, 0xc42012d9e0)
/tmp/tmp.ma5UpKvnFZ/src/github.com/containerd/containerd/vendor/github.com/containerd/go-runc/runc.go:320 +0x1e2
github.com/containerd/containerd/linux/proc.(*Init).kill(0xc42018c3c0, 0x6c3800, 0xc4200105c0, 0xf, 0x40, 0xc42054bc01)
/tmp/tmp.ma5UpKvnFZ/src/github.com/containerd/containerd/linux/proc/init.go:341 +0x78
github.com/containerd/containerd/linux/proc.(*runningState).Kill(0xc4201ca030, 0x6c3800, 0xc4200105c0, 0xf, 0x0, 0x0)
/tmp/tmp.ma5UpKvnFZ/src/github.com/containerd/containerd/linux/proc/init_state.go:331 +0xa5
github.com/containerd/containerd/linux/shim.(*Service).Kill(0xc4201e4000, 0x6c3800, 0xc4200105c0, 0xc42000a940, 0x0, 0x0, 0x0)
/tmp/tmp.ma5UpKvnFZ/src/github.com/containerd/containerd/linux/shim/service.go:356 +0x271
github.com/containerd/containerd/linux/shim/v1.RegisterShimService.func10(0x6c3800, 0xc4200105c0, 0xc42000a920, 0xc4201ce968, 0x4, 0xc4201be1a0, 0x0)
/tmp/tmp.ma5UpKvnFZ/src/github.com/containerd/containerd/linux/shim/v1/shim.pb.go:1670 +0xc5
github.com/containerd/containerd/vendor/github.com/stevvooe/ttrpc.(*serviceSet).dispatch(0xc4201ca008, 0x6c3800, 0xc4200105c0, 0xc4204e6570, 0x25, 0xc4201ce968, 0x4, 0xc420226c30, 0x44, 0x44, ...)
/tmp/tmp.ma5UpKvnFZ/src/github.com/containerd/containerd/vendor/github.com/stevvooe/ttrpc/services.go:71 +0x10e
github.com/containerd/containerd/vendor/github.com/stevvooe/ttrpc.(*serviceSet).call(0xc4201ca008, 0x6c3800, 0xc4200105c0, 0xc4204e6570, 0x25, 0xc4201ce968, 0x4, 0xc420226c30, 0x44, 0x44, ...)
/tmp/tmp.ma5UpKvnFZ/src/github.com/containerd/containerd/vendor/github.com/stevvooe/ttrpc/services.go:44 +0xb5
github.com/containerd/containerd/vendor/github.com/stevvooe/ttrpc.(*serverConn).run.func2(0xc4201563c0, 0x6c3800, 0xc4200105c0, 0xace455, 0xc420392900, 0xc42014a2a0, 0xc42014a360,0xc400ace455)
/tmp/tmp.ma5UpKvnFZ/src/github.com/containerd/containerd/vendor/github.com/stevvooe/ttrpc/server.go:402 +0xaa
created by github.com/containerd/containerd/vendor/github.com/stevvooe/ttrpc.(*serverConn).run
/tmp/tmp.ma5UpKvnFZ/src/github.com/containerd/containerd/vendor/github.com/stevvooe/ttrpc/server.go:401 +0x763

goroutine 22661145 [semacquire, 83 minutes]:
sync.runtime_SemacquireMutex(0xc4201e4004, 0x64b800)
/usr/local/go/src/runtime/sema.go:71 +0x3d
sync.(*Mutex).Lock(0xc4201e4000)
/usr/local/go/src/sync/mutex.go:134 +0x108
github.com/containerd/containerd/linux/shim.(*Service).State(0xc4201e4000, 0x6c3800, 0xc4200105c0, 0xc420120af0, 0x0, 0x0, 0x0)
/tmp/tmp.ma5UpKvnFZ/src/github.com/containerd/containerd/linux/shim/service.go:271 +0x59
github.com/containerd/containerd/linux/shim/v1.RegisterShimService.func1(0x6c3800, 0xc4200105c0, 0xc42000a980, 0xc4201ce988, 0x5, 0xc4201be080, 0x0)
/tmp/tmp.ma5UpKvnFZ/src/github.com/containerd/containerd/linux/shim/v1/shim.pb.go:1607 +0xc8
github.com/containerd/containerd/vendor/github.com/stevvooe/ttrpc.(*serviceSet).dispatch(0xc4201ca008, 0x6c3800, 0xc4200105c0, 0xc4204e65a0, 0x25, 0xc4201ce988, 0x5, 0xc420226c80, 0x42, 0x42, ...)
/tmp/tmp.ma5UpKvnFZ/src/github.com/containerd/containerd/vendor/github.com/stevvooe/ttrpc/services.go:71 +0x10e
github.com/containerd/containerd/vendor/github.com/stevvooe/ttrpc.(*serviceSet).call(0xc4201ca008, 0x6c3800, 0xc4200105c0, 0xc4204e65a0, 0x25, 0xc4201ce988, 0x5, 0xc420226c80, 0x42, 0x42, ...)
/tmp/tmp.ma5UpKvnFZ/src/github.com/containerd/containerd/vendor/github.com/stevvooe/ttrpc/services.go:44 +0xb5
github.com/containerd/containerd/vendor/github.com/stevvooe/ttrpc.(*serverConn).run.func2(0xc4201563c0, 0x6c3800, 0xc4200105c0, 0xace457, 0xc420392940, 0xc42014a2a0, 0xc42014a360, 0xc400ace457)
/tmp/tmp.ma5UpKvnFZ/src/github.com/containerd/containerd/vendor/github.com/stevvooe/ttrpc/server.go:402 +0xaa
created by github.com/containerd/containerd/vendor/github.com/stevvooe/ttrpc.(*serverConn).run
/tmp/tmp.ma5UpKvnFZ/src/github.com/containerd/containerd/vendor/github.com/stevvooe/ttrpc/server.go:401 +0x763

=== END goroutine stack dump ===" namespace=moby path="/run/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/60f253d59f26e1c573d4ba5f824e73b3a4b1bb1629edace85caba4c620755d4d" pid=119820

Analyzing the goroutine stacks above, we can see:

  • goroutine 1: handleSignals is indeed blocked at reaper.go:44, so subsequent processes can no longer be reaped
  • goroutine 18: checkProcesses is blocked at service.go:470 waiting for a lock, but not the big reaper.Default lock
  • goroutine 22661144: shim.(*Service).Kill is blocked at reaper.go:82

Among these, goroutine 22661144 is the most unusual; the code at its blocking point is:

func (m *Monitor) Wait(c *exec.Cmd, ec chan runc.Exit) (int, error) {
    for e := range ec { // reaper.go:82
        if e.Pid == c.Process.Pid {
            c.Wait()
            m.Unsubscribe(ec)
            return e.Status, nil
        }
    }
    return -1, ErrNoSuchProcess
}

Here, ec is one of reaper.Default's subscriber channels.

The deadlock forms as follows:

  • goroutine 22661144: waits for the exit event of its associated process to arrive, while holding the Service.mu lock
  • goroutine 18: waits to acquire the Service.mu lock before it can handle the subscribed exit events
  • goroutine 1: sends exit events to all subscribers, and blocks once a subscriber's buffer is full

Together, these three goroutines form a perfect deadlock.
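To make the shape of the deadlock concrete, here is a minimal, self-contained sketch of the same three-party pattern (purely illustrative demo code, not containerd's source): a broadcaster that fans events into small buffered channels, a consumer that must take a mutex before draining its channel, and a waiter that holds that mutex while waiting for its own event.

package main

import (
    "fmt"
    "sync"
    "time"
)

func main() {
    var mu sync.Mutex // plays the role of Service.mu

    consumerCh := make(chan int, 1) // drained by the consumer (like the Service's ec)
    waiterCh := make(chan int, 1)   // drained by the waiter (like Monitor.Wait's ec)

    // waiter: takes mu, then blocks until its own event arrives (goroutine 22661144).
    go func() {
        mu.Lock()
        defer mu.Unlock()
        <-waiterCh // never arrives: the broadcaster gets stuck before reaching waiterCh
    }()

    // consumer: needs mu before it may drain consumerCh (goroutine 18).
    go func() {
        time.Sleep(10 * time.Millisecond) // let the waiter grab mu first
        mu.Lock()                         // blocks forever: the waiter never releases mu
        defer mu.Unlock()
        <-consumerCh
    }()

    // broadcaster: sends more events than the buffers can hold (goroutine 1).
    done := make(chan struct{})
    go func() {
        for i := 0; i < 3; i++ {
            consumerCh <- i // blocks once the 1-slot buffer is full and nobody drains it
        }
        waiterCh <- 0 // never reached, so the waiter never wakes up
        close(done)
    }()

    select {
    case <-done:
        fmt.Println("no deadlock")
    case <-time.After(time.Second):
        fmt.Println("deadlocked: broadcaster, consumer and waiter are all stuck")
    }
}

In the real shim, the small buffers are the 32-slot subscriber channels, the mutex is Service.mu, and the broadcaster additionally holds reaper.Default's lock; any caller touching either lock hangs with them, which is why even a simple State request (goroutine 22661145 above) is stuck.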

Solution

With the cause understood, the fix is straightforward: enlarge the default subscriber channel. The community offers two fixes along these lines.

However, replacing containerd-shim only affects containers created after the rollout; containers created before it can still run into the problem.

Those can be handled with an alerting plus self-healing mechanism: simply kill the stuck containerd-shim process, after which all of its child processes are reaped by the init process.
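A self-healing hook along those lines can be as small as the sketch below (a hypothetical illustration; how the stuck shim's PID is detected and how the alert is wired up are left out). It relies on the fact that once the shim dies, its orphaned children are re-parented to PID 1 and reaped there:

package main

import (
    "fmt"
    "os"
    "strconv"

    "golang.org/x/sys/unix"
)

// killShim force-kills a stuck containerd-shim. Its children, including the
// pile of <defunct> processes seen earlier, are re-parented to init, which
// reaps them, and the container deletion can finally make progress.
func killShim(pid int) error {
    return unix.Kill(pid, unix.SIGKILL)
}

func main() {
    if len(os.Args) != 2 {
        fmt.Println("usage: killshim <containerd-shim-pid>")
        return
    }
    pid, err := strconv.Atoi(os.Args[1])
    if err != nil {
        fmt.Println("invalid pid:", err)
        return
    }
    if err := killShim(pid); err != nil {
        fmt.Println("kill failed:", err)
        return
    }
    fmt.Println("sent SIGKILL to containerd-shim", pid)
}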
