Troubleshooting a client-go informer stale-cache issue

Background

For a long time, the Elastic Cloud production services have been plagued by cache inconsistency.

The inconsistency usually shows up around a kube-apiserver upgrade or restart, and when it does, users notice it quickly; in severe cases it causes production incidents. Typical symptoms include:

  • Inconsistent platform data: a Pod's status flips back and forth between healthy and unhealthy
  • Lost service-management events: when a service changes, service management does not react, e.g. the service tree is not mounted or traffic is never attached

Before the root cause was found, Elastic Cloud put a number of detection and quick-mitigation measures in place:

  • Detection:
    • Manual: when kube-apiserver is upgraded or restarted, manually ask the affected teams to restart their platform services as well
    • Automatic: monitoring and alerting rules that fire when no change events for k8s objects have been received for a while (see the sketch after this list)
  • Mitigation:
    • Restart: when inconsistency occurs, restart the service so that it re-lists the latest data from kube-apiserver
    • Self-healing: in some scenarios even a restart does not fully recover the service, so self-healing logic was added to detect and handle those cases proactively
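
The "automatic" detection above boils down to an event-staleness watchdog. Below is a minimal sketch of the idea in Go; it was written for this post rather than taken from the platform's monitoring code, and every name in it is made up:

package watchdog

import (
    "log"
    "sync"
    "time"
)

// eventWatchdog remembers when the last informer event was seen and
// complains once the informer has been silent for longer than maxQuiet.
type eventWatchdog struct {
    mu        sync.Mutex
    lastEvent time.Time
    maxQuiet  time.Duration
}

// Touch is meant to be called from every informer Add/Update/Delete handler.
func (w *eventWatchdog) Touch() {
    w.mu.Lock()
    w.lastEvent = time.Now()
    w.mu.Unlock()
}

// Run checks once a minute how long the handlers have been silent and raises
// an alert (here just a log line) when the window is exceeded.
func (w *eventWatchdog) Run(stopCh <-chan struct{}) {
    ticker := time.NewTicker(time.Minute)
    defer ticker.Stop()
    for {
        select {
        case <-stopCh:
            return
        case <-ticker.C:
            w.mu.Lock()
            quiet := time.Since(w.lastEvent)
            w.mu.Unlock()
            if quiet > w.maxQuiet {
                log.Printf("no k8s change events for %s, cache may be out of sync", quiet)
            }
        }
    }
}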

These measures never truly solved the problem; they merely try to recover the service in known scenarios, and every newly discovered failure mode means the rules have to be adjusted yet again.

Locating the problem

Detection and mitigation are after-the-fact patches, like mending the fence after the sheep are gone; what we really want is a fix that eliminates the problem for good. So let us start from the root cause of the cache inconsistency.

We picked notifier to investigate. notifier is a collection of controllers in the cluster-management service, responsible mainly for:

  • mounting the service tree
  • DNS registration
  • attaching/detaching LVS traffic, etc.

We chose notifier because it is comparatively simple: it uses client-go's informer and registers handlers for the events of the core resources, and it has no complicated business logic to get in the way of debugging.
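
For context, here is a minimal sketch of how a controller like this typically wires up a client-go informer and registers event handlers. It uses the current shared-informer-factory API and a placeholder kubeconfig path; notifier's actual (and older) code is more involved, so treat this only as an illustration of the pattern:

package main

import (
    "time"

    "k8s.io/client-go/informers"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/cache"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    // Placeholder kubeconfig path, for illustration only.
    config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
    if err != nil {
        panic(err)
    }
    clientset := kubernetes.NewForConfigOrDie(config)

    // Shared informer factory with a 30s resync period.
    factory := informers.NewSharedInformerFactory(clientset, 30*time.Second)
    endpointInformer := factory.Core().V1().Endpoints().Informer()

    // Handlers for the core resource events; the bodies stand in for the
    // real work (service-tree mounting, DNS registration, LVS updates, ...).
    endpointInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
        AddFunc:    func(obj interface{}) { /* handle add */ },
        UpdateFunc: func(oldObj, newObj interface{}) { /* handle update */ },
        DeleteFunc: func(obj interface{}) { /* handle delete */ },
    })

    stopCh := make(chan struct{})
    defer close(stopCh)
    factory.Start(stopCh)
    factory.WaitForCacheSync(stopCh)
    <-stopCh
}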

Reproducing the problem

Testing in an offline environment, we found the problem reproduces reliably after a kube-apiserver restart, which made the investigation much easier. The reproduction steps are:

  • start the notifier service
  • restart the kube-apiserver service

State analysis

When the problem occurs, we first run some basic checks on the service's state:

# # is the service process still alive?
# ps -ef|grep notifier
stupig 1007922 1003335 2 13:41 pts/1 00:00:08 ./notifier -c configs/notifier-test.toml

# # which file descriptors does the service have open?
# lsof -nP -p 1007922
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
nobody 1007922 stupig 0u CHR 136,1 0t0 4 /dev/pts/1
nobody 1007922 stupig 1u CHR 136,1 0t0 4 /dev/pts/1
nobody 1007922 stupig 2u CHR 136,1 0t0 4 /dev/pts/1
nobody 1007922 stupig 3u unix 0xffff8810a3132400 0t0 4254094659 socket
nobody 1007922 stupig 4u a_inode 0,9 0 8548 [eventpoll]
nobody 1007922 stupig 5r FIFO 0,8 0t0 4253939077 pipe
nobody 1007922 stupig 6w FIFO 0,8 0t0 4253939077 pipe
nobody 1007922 stupig 8u IPv4 4254094660 0t0 UDP *:37087
nobody 1007922 stupig 9r CHR 1,9 0t0 2057 /dev/urandom
nobody 1007922 stupig 10u IPv4 4253939079 0t0 TCP *:4397 (LISTEN)
nobody 1007922 stupig 11u REG 8,17 12538653 8604570895 ./logs/notifier.stupig.log.INFO.20211127-134138.1007922
nobody 1007922 stupig 15u IPv4 4254204931 0t0 TCP 127.0.0.1:43566->127.0.0.1:2479 (ESTABLISHED) # ETCD
nobody 1007922 stupig 19u REG 8,5 252384 821 /tmp/notifier.stupig.log.ERROR.20211127-134505.1007922
nobody 1007922 stupig 20u REG 8,5 252384 822 /tmp/notifier.stupig.log.WARNING.20211127-134505.1007922
nobody 1007922 stupig 21u REG 8,17 414436 8606917935 ./logs/notifier.stupig.log.WARNING.20211127-134139.1007922
nobody 1007922 stupig 24u REG 8,17 290725 8606917936 ./logs/notifier.stupig.log.ERROR.20211127-134238.1007922
nobody 1007922 stupig 30u REG 8,5 252384 823 /tmp/notifier.stupig.log.INFO.20211127-134505.1007922

Comparing this with the service state from before the incident, one thing stands out: the connection notifier had established to kube-apiserver (at https://127.0.0.1:6443) is gone.

From that point on, notifier's data is no longer synchronized with kube-apiserver, it no longer sees any change events for the services it manages, and it ultimately loses the ability to manage them.

Log analysis

Next we turn to notifier's runtime logs, focusing on what it printed while kube-apiserver was restarting. The key lines are:

E1127 14:08:19.728515 1041482 reflector.go:251] notifier/monitor/endpointInformer.go:140: Failed to watch *v1.Endpoints: Get "https://127.0.0.1:6443/api/v1/endpoints?resourceVersion=276025109&timeoutSeconds=395&watch=true": http2: no cached connection was available
E1127 14:08:20.731407 1041482 reflector.go:134] notifier/monitor/endpointInformer.go:140: Failed to list *v1.Endpoints: Get "https://127.0.0.1:6443/api/v1/endpoints?limit=500&resourceVersion=0": http2: no cached connection was available
E1127 14:08:21.733509 1041482 reflector.go:134] notifier/monitor/endpointInformer.go:140: Failed to list *v1.Endpoints: Get "https://127.0.0.1:6443/api/v1/endpoints?limit=500&resourceVersion=0": http2: no cached connection was available
E1127 14:08:22.734679 1041482 reflector.go:134] notifier/monitor/endpointInformer.go:140: Failed to list *v1.Endpoints: Get "https://127.0.0.1:6443/api/v1/endpoints?limit=500&resourceVersion=0": http2: no cached connection was available

The logs above show the key error, http2: no cached connection was available, and the operation it is attached to is exactly the EndpointInformer's ListAndWatch.

With this clue in hand, the next step is to dig into the code and pin down the root cause.

Code analysis

How the informer machinery works is not the focus of this post; we only care about the following snippet:

// Run starts a watch and handles watch events. Will restart the watch if it is closed.
// Run will exit when stopCh is closed.
func (r *Reflector) Run(stopCh <-chan struct{}) {
    glog.V(3).Infof("Starting reflector %v (%s) from %s", r.expectedType, r.resyncPeriod, r.name)
    wait.Until(func() {
        if err := r.ListAndWatch(stopCh); err != nil {
            utilruntime.HandleError(err)
        }
    }, r.period, stopCh)
}

The informer's Reflector component runs in a dedicated goroutine and calls ListAndWatch in a loop to receive change events from kube-apiserver.

Combined with the logs, we can conclude: after kube-apiserver restarts, every ListAndWatch call issued by notifier fails with http2: no cached connection was available.

So we focus on that error message.

Searching the code, we located where the error is defined and where it is returned:

// file: vendor/golang.org/x/net/http2/transport.go:L301
var ErrNoCachedConn = errors.New("http2: no cached connection was available")

// file: vendor/golang.org/x/net/http2/client_conn_pool.go:L55~80
func (p *clientConnPool) getClientConn(req *http.Request, addr string, dialOnMiss bool) (*ClientConn, error) {
    if isConnectionCloseRequest(req) && dialOnMiss {
        // It gets its own connection.
        const singleUse = true
        cc, err := p.t.dialClientConn(addr, singleUse)
        if err != nil {
            return nil, err
        }
        return cc, nil
    }
    p.mu.Lock()
    for _, cc := range p.conns[addr] {
        if cc.CanTakeNewRequest() {
            p.mu.Unlock()
            return cc, nil
        }
    }
    if !dialOnMiss {
        p.mu.Unlock()
        return nil, ErrNoCachedConn
    }
    call := p.getStartDialLocked(addr)
    p.mu.Unlock()
    <-call.done
    return call.res, call.err
}

The code above returns ErrNoCachedConn only when both of the following hold:

  • the dialOnMiss argument is false, and
  • the p.conns pool contains no usable connection

In theory, when a request is sent and the connection pool is empty, a new connection should be dialed first and the request sent over it; the pool is also supposed to evict connections that have gone bad. So how can the problem we are looking at happen at all?
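
The reason dialOnMiss can be false here lies in how HTTP/2 is wired onto the client's transport. In the client-go version vendored by notifier, requests go through a regular *http.Transport on which HTTP/2 is enabled via x/net/http2; that wiring registers the noDialH2RoundTripper and noDialClientConnPool that appear in the stacks below, and this pool deliberately never dials on its own, leaving connection establishment to net/http. A rough sketch of that wiring, not client-go's exact code:

package k8stransport

import (
    "crypto/tls"
    "net/http"

    "golang.org/x/net/http2"
)

func newClient(tlsConfig *tls.Config) (*http.Client, error) {
    tr := &http.Transport{TLSClientConfig: tlsConfig}
    // ConfigureTransport registers an HTTP/2 RoundTripper on tr whose
    // connection pool does not dial on a cache miss; when the pool is
    // empty it returns ErrNoCachedConn and expects net/http to set up
    // the underlying TCP/TLS connection instead.
    if err := http2.ConfigureTransport(tr); err != nil {
        return nil, err
    }
    return &http.Client{Transport: tr}, nil
}

So when the HTTP/2 pool comes up empty, the error is expected to surface in net/http, which should then dial a fresh connection; whether that actually happens is decided by the call chains into getClientConn.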

There are two main call chains into getClientConn.

Stack one:

 0  0x0000000000a590b8 in notifier/vendor/golang.org/x/net/http2.(*clientConnPool).getClientConn
at ./go/src/notifier/vendor/golang.org/x/net/http2/client_conn_pool.go:55
1 0x0000000000a5aea6 in notifier/vendor/golang.org/x/net/http2.noDialClientConnPool.GetClientConn
at ./go/src/notifier/vendor/golang.org/x/net/http2/client_conn_pool.go:255
2 0x0000000000a6c4f9 in notifier/vendor/golang.org/x/net/http2.(*Transport).RoundTripOpt
at ./go/src/notifier/vendor/golang.org/x/net/http2/transport.go:345
3 0x0000000000a6bd0e in notifier/vendor/golang.org/x/net/http2.(*Transport).RoundTrip
at ./go/src/notifier/vendor/golang.org/x/net/http2/transport.go:313
4 0x0000000000a5b97e in notifier/vendor/golang.org/x/net/http2.noDialH2RoundTripper.RoundTrip
at ./go/src/notifier/vendor/golang.org/x/net/http2/configure_transport.go:75
5 0x0000000000828e45 in net/http.(*Transport).roundTrip
at /usr/local/go1.16.7/src/net/http/transport.go:537
6 0x00000000008016de in net/http.(*Transport).RoundTrip
at /usr/local/go1.16.7/src/net/http/roundtrip.go:17
7 0x00000000016a1ef8 in notifier/vendor/k8s.io/client-go/transport.(*userAgentRoundTripper).RoundTrip
at ./go/src/notifier/vendor/k8s.io/client-go/transport/round_trippers.go:162
8 0x00000000007a3aa2 in net/http.send
at /usr/local/go1.16.7/src/net/http/client.go:251
9 0x00000000007a324b in net/http.(*Client).send
at /usr/local/go1.16.7/src/net/http/client.go:175
10 0x00000000007a6ed5 in net/http.(*Client).do
at /usr/local/go1.16.7/src/net/http/client.go:717
11 0x00000000007a5d9e in net/http.(*Client).Do
at /usr/local/go1.16.7/src/net/http/client.go:585
12 0x00000000016b9487 in notifier/vendor/k8s.io/client-go/rest.(*Request).request
at ./go/src/notifier/vendor/k8s.io/client-go/rest/request.go:732
13 0x00000000016b9f2d in notifier/vendor/k8s.io/client-go/rest.(*Request).Do
at ./go/src/notifier/vendor/k8s.io/client-go/rest/request.go:804
14 0x00000000017093bb in notifier/vendor/k8s.io/client-go/kubernetes/typed/core/v1.(*endpoints).List
at ./go/src/notifier/vendor/k8s.io/client-go/kubernetes/typed/core/v1/endpoints.go:83
……

Stack two:

 0  0x0000000000a590b8 in notifier/vendor/golang.org/x/net/http2.(*clientConnPool).getClientConn
at ./go/src/notifier/vendor/golang.org/x/net/http2/client_conn_pool.go:55
1 0x0000000000a5aea6 in notifier/vendor/golang.org/x/net/http2.noDialClientConnPool.GetClientConn
at ./go/src/notifier/vendor/golang.org/x/net/http2/client_conn_pool.go:255
2 0x0000000000a6c4f9 in notifier/vendor/golang.org/x/net/http2.(*Transport).RoundTripOpt
at ./go/src/notifier/vendor/golang.org/x/net/http2/transport.go:345
3 0x0000000000a6bd0e in notifier/vendor/golang.org/x/net/http2.(*Transport).RoundTrip
at ./go/src/notifier/vendor/golang.org/x/net/http2/transport.go:313
4 0x00000000008296ed in net/http.(*Transport).roundTrip
at /usr/local/go1.16.7/src/net/http/transport.go:590
5 0x00000000008016de in net/http.(*Transport).RoundTrip
at /usr/local/go1.16.7/src/net/http/roundtrip.go:17
6 0x00000000016a1ef8 in notifier/vendor/k8s.io/client-go/transport.(*userAgentRoundTripper).RoundTrip
at ./go/src/notifier/vendor/k8s.io/client-go/transport/round_trippers.go:162
7 0x00000000007a3aa2 in net/http.send
at /usr/local/go1.16.7/src/net/http/client.go:251
8 0x00000000007a324b in net/http.(*Client).send
at /usr/local/go1.16.7/src/net/http/client.go:175
9 0x00000000007a6ed5 in net/http.(*Client).do
at /usr/local/go1.16.7/src/net/http/client.go:717
10 0x00000000007a5d9e in net/http.(*Client).Do
at /usr/local/go1.16.7/src/net/http/client.go:585
11 0x00000000016b9487 in notifier/vendor/k8s.io/client-go/rest.(*Request).request
at ./go/src/notifier/vendor/k8s.io/client-go/rest/request.go:732
12 0x00000000016b9f2d in notifier/vendor/k8s.io/client-go/rest.(*Request).Do
at ./go/src/notifier/vendor/k8s.io/client-go/rest/request.go:804
13 0x00000000017093bb in notifier/vendor/k8s.io/client-go/kubernetes/typed/core/v1.(*endpoints).List
at ./go/src/notifier/vendor/k8s.io/client-go/kubernetes/typed/core/v1/endpoints.go:83
……

Tracing both stacks, we can quickly rule out stack one:

// file: net/http/transport.go:L502~620
func (t *Transport) roundTrip(req *Request) (*Response, error) {
    if altRT := t.alternateRoundTripper(req); altRT != nil { // L537
        if resp, err := altRT.RoundTrip(req); err != ErrSkipAltProtocol {
            return resp, err
        }
        var err error
        req, err = rewindBody(req)
        if err != nil {
            return nil, err
        }
    }
    // ... (rest of the function elided)
}

// file: vendor/golang.org/x/net/http2/configure_transport.go:L74~80
func (rt noDialH2RoundTripper) RoundTrip(req *http.Request) (*http.Response, error) {
    res, err := rt.t.RoundTrip(req) // L75
    if err == ErrNoCachedConn {
        return nil, http.ErrSkipAltProtocol
    }
    return res, err
}

// file: vendor/golang.org/x/net/http2/transport.go:L312~314
func (t *Transport) RoundTrip(req *http.Request) (*http.Response, error) {
    return t.RoundTripOpt(req, RoundTripOpt{}) // L313
}

// file: vendor/golang.org/x/net/http2/transport.go:L337~379
func (t *Transport) RoundTripOpt(req *http.Request, opt RoundTripOpt) (*http.Response, error) {
    addr := authorityAddr(req.URL.Scheme, req.URL.Host)
    for retry := 0; ; retry++ {
        cc, err := t.connPool().GetClientConn(req, addr) // L345
        if err != nil {
            t.vlogf("http2: Transport failed to get client conn for %s: %v", addr, err)
            return nil, err
        }
        // ... (rest of the loop elided)
    }
}

// file: vendor/golang.org/x/net/http2/client_conn_pool.go:L254~256
func (p noDialClientConnPool) GetClientConn(req *http.Request, addr string) (*ClientConn, error) {
    return p.getClientConn(req, addr, noDialOnMiss) // L255
}

In stack one, getClientConn returns ErrNoCachedConn, which noDialH2RoundTripper.RoundTrip converts into http.ErrSkipAltProtocol; control then returns to roundTrip, which carries on with the rest of its logic and ends up on the path shown in stack two.

So we focus on the flow in stack two:

// file: net/http/transport.go:L502~620
func (t *Transport) roundTrip(req *Request) (*Response, error) {
    for {
        var resp *Response
        if pconn.alt != nil {
            // HTTP/2 path.
            t.setReqCanceler(cancelKey, nil) // not cancelable with CancelRequest
            resp, err = pconn.alt.RoundTrip(req) // L590
        }
        if err == nil {
            resp.Request = origReq
            return resp, nil
        }

        // Failed. Clean up and determine whether to retry.
        if http2isNoCachedConnError(err) {
            if t.removeIdleConn(pconn) {
                t.decConnsPerHost(pconn.cacheKey)
            }
        } else if !pconn.shouldRetryRequest(req, err) {
            return nil, err
        }
    }
}

// file: vendor/golang.org/x/net/http2/transport.go:L312~314
func (t *Transport) RoundTrip(req *http.Request) (*http.Response, error) {
    return t.RoundTripOpt(req, RoundTripOpt{}) // L313
}

// the inner frames are the same as in stack one and are not repeated here

Unlike stack one, stack two does not translate the error: ErrNoCachedConn is returned as-is, and roundTrip's error handling has a dedicated branch for exactly this kind of error. When http2isNoCachedConnError returns true, the broken connection is removed from the pool.

Everything looks perfectly reasonable, so how does the problem arise? It arises inside http2isNoCachedConnError:

// file: net/http/h2_bundle.go:L6922~6928
// isNoCachedConnError reports whether err is of type noCachedConnError
// or its equivalent renamed type in net/http2's h2_bundle.go. Both types
// may coexist in the same running program.
func http2isNoCachedConnError(err error) bool {
    _, ok := err.(interface{ IsHTTP2NoCachedConnError() })
    return ok
}

It returns true only if the err value implements the anonymous interface declaring the single method IsHTTP2NoCachedConnError; otherwise it returns false.

Does the error returned by getClientConn implement that interface? Clearly, it does not:

// file: vendor/golang.org/x/net/http2/transport.go:L301
var ErrNoCachedConn = errors.New("http2: no cached connection was available")

At this point, the root cause of the problem is essentially clear.
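
To make the mismatch concrete, here is a small standalone program, written for this post rather than taken from either library, contrasting the error value the vendored x/net returns with the one the check expects:

package main

import (
    "errors"
    "fmt"
)

// Same shape as the check in net/http's h2_bundle.go.
func isNoCachedConnError(err error) bool {
    _, ok := err.(interface{ IsHTTP2NoCachedConnError() })
    return ok
}

// What the vendored x/net returned: a plain errors.New value with no methods
// beyond Error().
var oldErr = errors.New("http2: no cached connection was available")

// What the fixed x/net returns: a concrete type carrying the marker method.
type noCachedConnError struct{}

func (noCachedConnError) IsHTTP2NoCachedConnError() {}
func (noCachedConnError) Error() string {
    return "http2: no cached connection was available"
}

func main() {
    fmt.Println(isNoCachedConnError(oldErr))              // false: the dead connection is never evicted
    fmt.Println(isNoCachedConnError(noCachedConnError{})) // true: net/http removes it and retries
}

With the old error value the check always returns false, so the stale HTTP/2 connection is never removed from the pool and every subsequent List and Watch keeps failing with the same error, which is exactly the behaviour we observed.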

The fix

Since the problem is that the error returned by getClientConn, ErrNoCachedConn, does not implement the IsHTTP2NoCachedConnError method, the natural fix is to change the error to a concrete type that implements it.

Note that this code comes from an external library we vendor. Checking the latest golang.org/x/net, we found the problem had already been fixed upstream back in January 2018... See the golang.org/x/net fix for details:

// noCachedConnError is the concrete type of ErrNoCachedConn, which
// needs to be detected by net/http regardless of whether it's its
// bundled version (in h2_bundle.go with a rewritten type name) or
// from a user's x/net/http2. As such, as it has a unique method name
// (IsHTTP2NoCachedConnError) that net/http sniffs for via func
// isNoCachedConnError.
type noCachedConnError struct{}

func (noCachedConnError) IsHTTP2NoCachedConnError() {}
func (noCachedConnError) Error() string { return "http2: no cached connection was available" }

// isNoCachedConnError reports whether err is of type noCachedConnError
// or its equivalent renamed type in net/http2's h2_bundle.go. Both types
// may coexist in the same running program.
func isNoCachedConnError(err error) bool {
_, ok := err.(interface{ IsHTTP2NoCachedConnError() })
return ok
}

var ErrNoCachedConn error = noCachedConnError{}

The version we were running in production, however, was still 1c05540f6.

So our fix becomes even simpler: bump the dependency in the vendor directory to a version that contains the upstream fix.

The notifier service in production has since been upgraded to the new dependency version and rolled out to every data center, and we have verified that a kube-apiserver restart no longer breaks it.
