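A Detailed Analysis of Scrub, with Recommendations

Preface

I have wanted to write about scrub for a long time. A while back I ran a test to see how much impact scrub really has. What I saw was heavy disk read load: once a deep-scrub starts it issues a large number of reads, and clients may see slow requests. The simple workaround at the time was to turn scrub off, but with scrub disabled there is no way to detect whether objects have become inconsistent underneath.
Whether to enable scrub in production is a matter of opinion and a choice each operator has to make. Both positions have their merits, so I will not argue the point here. This post is about how to keep scrub under control if you do want it enabled.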

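I recently saw a discussion in a Ceph chat group that went roughly like this:

scrub is a pitfall

In workloads with lots of small files you must turn scrub off
Once the number of files in a single PG reaches a certain scale, slow requests appear as soon as scrub starts
This problem cannot be solved

Is anything wrong with these statements? In the general case they are true. But can we try to solve the problem, or at least mitigate it? Let's try.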

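Tracing scrub

The tracing below does not touch the code. It only uses configuration and log observation to see what scrub actually does.

Environment setup

To make observation easy, my environment has a pool with a single PG. I put 100 objects into that PG and then ran a deep-scrub on it. deep-scrub puts more pressure on the disks than scrub, so this post mainly observes deep-scrub.

Monitoring access to the PG directory

I used inotifywait to see which requests the objects in the PG receive during a deep-scrub.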

inotifywait -m 1.0_head
1.0_head/ OPEN,ISDIR
1.0_head/ ACCESS,ISDIR
1.0_head/ ACCESS,ISDIR
1.0_head/ CLOSE_NOWRITE,CLOSE,ISDIR
1.0_head/ OPEN,ISDIR
1.0_head/ ACCESS,ISDIR
1.0_head/ ACCESS,ISDIR
1.0_head/ CLOSE_NOWRITE,CLOSE,ISDIR
1.0_head/ OPEN a16__head_8FA46F40__1
1.0_head/ ACCESS a16__head_8FA46F40__1
1.0_head/ OPEN a39__head_621FD720__1
1.0_head/ ACCESS a39__head_621FD720__1
1.0_head/ OPEN a30__head_655287E0__1
1.0_head/ ACCESS a30__head_655287E0__1
1.0_head/ OPEN a91__head_B02EE3D0__1
1.0_head/ ACCESS a91__head_B02EE3D0__1
1.0_head/ OPEN a33__head_9E9E3E30__1
1.0_head/ ACCESS a33__head_9E9E3E30__1
1.0_head/ OPEN a92__head_6AFC6B30__1
1.0_head/ ACCESS a92__head_6AFC6B30__1
1.0_head/ OPEN a22__head_AC48AAB0__1
1.0_head/ ACCESS a22__head_AC48AAB0__1
1.0_head/ OPEN a42__head_76B90AC8__1
1.0_head/ ACCESS a42__head_76B90AC8__1
1.0_head/ OPEN a5__head_E5A1A728__1
1.0_head/ ACCESS a5__head_E5A1A728__1
1.0_head/ OPEN a34__head_4D9ABA68__1
1.0_head/ ACCESS a34__head_4D9ABA68__1
1.0_head/ OPEN a69__head_7AF2B6E8__1
1.0_head/ ACCESS a69__head_7AF2B6E8__1
1.0_head/ OPEN a95__head_BD3695B8__1
1.0_head/ ACCESS a95__head_BD3695B8__1
1.0_head/ OPEN a67__head_6BCD37B8__1
1.0_head/ ACCESS a67__head_6BCD37B8__1
1.0_head/ OPEN a10__head_F0F08AF8__1
1.0_head/ ACCESS a10__head_F0F08AF8__1
1.0_head/ OPEN a3__head_88EF0BF8__1
1.0_head/ ACCESS a3__head_88EF0BF8__1
1.0_head/ OPEN a82__head_721BC094__1
1.0_head/ ACCESS a82__head_721BC094__1
1.0_head/ OPEN a48__head_27A729D4__1
1.0_head/ ACCESS a48__head_27A729D4__1
1.0_head/ OPEN a36__head_F63E6AF4__1
1.0_head/ ACCESS a36__head_F63E6AF4__1
1.0_head/ OPEN a29__head_F06D540C__1
1.0_head/ ACCESS a29__head_F06D540C__1
1.0_head/ OPEN a31__head_AC83164C__1
1.0_head/ ACCESS a31__head_AC83164C__1
1.0_head/ OPEN a59__head_884F9B6C__1
1.0_head/ ACCESS a59__head_884F9B6C__1
1.0_head/ OPEN a58__head_06954F6C__1
1.0_head/ ACCESS a58__head_06954F6C__1
1.0_head/ OPEN a55__head_2A42E61C__1
1.0_head/ ACCESS a55__head_2A42E61C__1
1.0_head/ OPEN a90__head_1B88FEDC__1
1.0_head/ ACCESS a90__head_1B88FEDC__1
1.0_head/ OPEN,ISDIR
1.0_head/ ACCESS,ISDIR
1.0_head/ ACCESS,ISDIR
1.0_head/ CLOSE_NOWRITE,CLOSE,ISDIR
1.0_head/ OPEN,ISDIR
1.0_head/ ACCESS,ISDIR
1.0_head/ ACCESS,ISDIR
1.0_head/ CLOSE_NOWRITE,CLOSE,ISDIR
1.0_head/ OPEN a100__head_C29E0C42__1
1.0_head/ ACCESS a100__head_C29E0C42__1
1.0_head/ OPEN a15__head_87123BE2__1
1.0_head/ ACCESS a15__head_87123BE2__1
1.0_head/ OPEN a23__head_AABFFB92__1
1.0_head/ ACCESS a23__head_AABFFB92__1
1.0_head/ OPEN a41__head_4EA9A5D2__1
1.0_head/ ACCESS a41__head_4EA9A5D2__1
1.0_head/ OPEN a85__head_83760D72__1
1.0_head/ ACCESS a85__head_83760D72__1
1.0_head/ OPEN a72__head_8A105D72__1
1.0_head/ ACCESS a72__head_8A105D72__1
1.0_head/ OPEN a60__head_5536480A__1
1.0_head/ ACCESS a60__head_5536480A__1
1.0_head/ OPEN a73__head_F1819D0A__1
1.0_head/ ACCESS a73__head_F1819D0A__1
1.0_head/ OPEN a78__head_6929D12A__1
1.0_head/ ACCESS a78__head_6929D12A__1
1.0_head/ OPEN a57__head_2C43153A__1
1.0_head/ ACCESS a57__head_2C43153A__1
1.0_head/ OPEN a1__head_51903B7A__1
1.0_head/ ACCESS a1__head_51903B7A__1
1.0_head/ OPEN a12__head_14D7ABC6__1
1.0_head/ ACCESS a12__head_14D7ABC6__1
1.0_head/ OPEN a63__head_9490B166__1
1.0_head/ ACCESS a63__head_9490B166__1
1.0_head/ OPEN a53__head_DF95B716__1
1.0_head/ ACCESS a53__head_DF95B716__1
1.0_head/ OPEN a13__head_E09E0896__1
1.0_head/ ACCESS a13__head_E09E0896__1
1.0_head/ OPEN a27__head_7ED31896__1
1.0_head/ ACCESS a27__head_7ED31896__1
1.0_head/ OPEN a43__head_7052A656__1
1.0_head/ ACCESS a43__head_7052A656__1
1.0_head/ OPEN a28__head_E6257CD6__1
1.0_head/ ACCESS a28__head_E6257CD6__1
1.0_head/ OPEN a35__head_ACABD736__1
1.0_head/ ACCESS a35__head_ACABD736__1
1.0_head/ OPEN a54__head_B9482876__1
1.0_head/ CLOSE_WRITE,CLOSE a12__head_14D7ABC6__1
1.0_head/ ACCESS a54__head_B9482876__1
1.0_head/ OPEN a4__head_F12ACA76__1
1.0_head/ CLOSE_WRITE,CLOSE a63__head_9490B166__1
1.0_head/ ACCESS a4__head_F12ACA76__1
1.0_head/ OPEN a84__head_B033038E__1
1.0_head/ ACCESS a84__head_B033038E__1
1.0_head/ OPEN a19__head_D6A64F9E__1
1.0_head/ ACCESS a19__head_D6A64F9E__1
1.0_head/ OPEN a93__head_F54E757E__1
1.0_head/ ACCESS a93__head_F54E757E__1
1.0_head/ OPEN a7__head_1F08F77E__1
1.0_head/ ACCESS a7__head_1F08F77E__1
1.0_head/ OPEN,ISDIR
1.0_head/ ACCESS,ISDIR
1.0_head/ ACCESS,ISDIR
1.0_head/ CLOSE_NOWRITE,CLOSE,ISDIR
1.0_head/ OPEN,ISDIR
1.0_head/ ACCESS,ISDIR
1.0_head/ ACCESS,ISDIR
1.0_head/ CLOSE_NOWRITE,CLOSE,ISDIR
1.0_head/ OPEN a9__head_635C6201__1
1.0_head/ ACCESS a9__head_635C6201__1
1.0_head/ OPEN a11__head_12780121__1
1.0_head/ ACCESS a11__head_12780121__1
1.0_head/ OPEN a50__head_5E524321__1
1.0_head/ ACCESS a50__head_5E524321__1
1.0_head/ OPEN a75__head_27E1CB21__1
1.0_head/ ACCESS a75__head_27E1CB21__1
1.0_head/ OPEN a21__head_69ACD1A1__1
1.0_head/ ACCESS a21__head_69ACD1A1__1
1.0_head/ OPEN a25__head_698E7751__1
1.0_head/ ACCESS a25__head_698E7751__1
1.0_head/ OPEN a44__head_57E29949__1
1.0_head/ ACCESS a44__head_57E29949__1
1.0_head/ OPEN a66__head_944E79C9__1
1.0_head/ ACCESS a66__head_944E79C9__1
1.0_head/ OPEN a52__head_DAC6BF29__1
1.0_head/ ACCESS a52__head_DAC6BF29__1
1.0_head/ OPEN a14__head_295EA1A9__1
1.0_head/ ACCESS a14__head_295EA1A9__1
1.0_head/ OPEN a70__head_62941259__1
1.0_head/ ACCESS a70__head_62941259__1
1.0_head/ OPEN a18__head_53B48959__1
1.0_head/ ACCESS a18__head_53B48959__1
1.0_head/ OPEN a17__head_7D103759__1
1.0_head/ ACCESS a17__head_7D103759__1
1.0_head/ OPEN a6__head_9505BEF9__1
1.0_head/ ACCESS a6__head_9505BEF9__1
1.0_head/ OPEN a77__head_88A7CC25__1
1.0_head/ ACCESS a77__head_88A7CC25__1
1.0_head/ OPEN a37__head_141AFE65__1
1.0_head/ ACCESS a37__head_141AFE65__1
1.0_head/ OPEN a74__head_90DAAD15__1
1.0_head/ ACCESS a74__head_90DAAD15__1
1.0_head/ OPEN a32__head_B7957195__1
1.0_head/ ACCESS a32__head_B7957195__1
1.0_head/ OPEN a45__head_CCCFB5D5__1
1.0_head/ ACCESS a45__head_CCCFB5D5__1
1.0_head/ OPEN a24__head_3B937275__1
1.0_head/ ACCESS a24__head_3B937275__1
1.0_head/ OPEN a26__head_2AB240F5__1
1.0_head/ ACCESS a26__head_2AB240F5__1
1.0_head/ OPEN a89__head_8E387EF5__1
1.0_head/ ACCESS a89__head_8E387EF5__1
1.0_head/ OPEN a80__head_6FEFE78D__1
1.0_head/ ACCESS a80__head_6FEFE78D__1
1.0_head/ OPEN a51__head_0BCC72CD__1
1.0_head/ ACCESS a51__head_0BCC72CD__1
1.0_head/ OPEN a71__head_88F4796D__1
1.0_head/ ACCESS a71__head_88F4796D__1
1.0_head/ OPEN,ISDIR
1.0_head/ ACCESS,ISDIR
1.0_head/ ACCESS,ISDIR
1.0_head/ CLOSE_NOWRITE,CLOSE,ISDIR
1.0_head/ OPEN,ISDIR
1.0_head/ ACCESS,ISDIR
1.0_head/ ACCESS,ISDIR
1.0_head/ CLOSE_NOWRITE,CLOSE,ISDIR
1.0_head/ OPEN a88__head_B0A64FED__1
1.0_head/ ACCESS a88__head_B0A64FED__1
1.0_head/ OPEN a8__head_F885EA9D__1
1.0_head/ ACCESS a8__head_F885EA9D__1
1.0_head/ OPEN a83__head_1322679D__1
1.0_head/ ACCESS a83__head_1322679D__1
1.0_head/ OPEN a76__head_B8285A7D__1
1.0_head/ ACCESS a76__head_B8285A7D__1
1.0_head/ OPEN a94__head_D3BBB683__1
1.0_head/ ACCESS a94__head_D3BBB683__1
1.0_head/ OPEN a46__head_E2C6C983__1
1.0_head/ ACCESS a46__head_E2C6C983__1
1.0_head/ OPEN a56__head_A1E888C3__1
1.0_head/ ACCESS a56__head_A1E888C3__1
1.0_head/ OPEN a99__head_DD3B45C3__1
1.0_head/ ACCESS a99__head_DD3B45C3__1
1.0_head/ OPEN a79__head_AC19FC13__1
1.0_head/ ACCESS a79__head_AC19FC13__1
1.0_head/ OPEN a81__head_BC0AFFF3__1
1.0_head/ ACCESS a81__head_BC0AFFF3__1
1.0_head/ OPEN a64__head_C042B84B__1
1.0_head/ ACCESS a64__head_C042B84B__1
1.0_head/ OPEN a97__head_29054B4B__1
1.0_head/ ACCESS a97__head_29054B4B__1
1.0_head/ OPEN a96__head_BAAC0DCB__1
1.0_head/ ACCESS a96__head_BAAC0DCB__1
1.0_head/ OPEN a62__head_84A40AAB__1
1.0_head/ ACCESS a62__head_84A40AAB__1
1.0_head/ OPEN a98__head_C15FD53B__1
1.0_head/ ACCESS a98__head_C15FD53B__1
1.0_head/ OPEN a87__head_12F9237B__1
1.0_head/ ACCESS a87__head_12F9237B__1
1.0_head/ OPEN a2__head_E2983C17__1
1.0_head/ ACCESS a2__head_E2983C17__1
1.0_head/ OPEN a20__head_7E477A77__1
1.0_head/ ACCESS a20__head_7E477A77__1
1.0_head/ OPEN a49__head_3ADEC577__1
1.0_head/ ACCESS a49__head_3ADEC577__1
1.0_head/ OPEN a61__head_C860ABF7__1
1.0_head/ ACCESS a61__head_C860ABF7__1
1.0_head/ OPEN a68__head_BC5C8F8F__1
1.0_head/ ACCESS a68__head_BC5C8F8F__1
1.0_head/ OPEN a38__head_78AE322F__1
1.0_head/ ACCESS a38__head_78AE322F__1
1.0_head/ OPEN a65__head_7EE57AEF__1
1.0_head/ ACCESS a65__head_7EE57AEF__1
1.0_head/ OPEN a47__head_B6C48D1F__1
1.0_head/ ACCESS a47__head_B6C48D1F__1
1.0_head/ OPEN a86__head_7FB2C85F__1
1.0_head/ ACCESS a86__head_7FB2C85F__1
1.0_head/ OPEN,ISDIR
1.0_head/ ACCESS,ISDIR
1.0_head/ ACCESS,ISDIR
1.0_head/ CLOSE_NOWRITE,CLOSE,ISDIR
1.0_head/ OPEN,ISDIR
1.0_head/ ACCESS,ISDIR
1.0_head/ ACCESS,ISDIR
1.0_head/ CLOSE_NOWRITE,CLOSE,ISDIR
1.0_head/ OPEN a40__head_5F0404DF__1
1.0_head/ ACCESS a40__head_5F0404DF__1

After enabling debug_osd=20 on osd.0, look at the chunky-related log entries:

[root@lab8106 ceph]# cat ceph-osd.0.log |grep chunky:1|grep handle_replica_op
2017-08-18 23:50:40.262448 7f2ac583c700 10 osd.0 26 handle_replica_op replica scrub(pg: 1.0,from:0'0,to:22'2696,epoch:26,start:1:00000000::::head,end:1:42307943:::a100:0,chunky:1,deep:1,seed:4294967295,version:6) v6 epoch 26
2017-08-18 23:50:40.294637 7f2ac583c700 10 osd.0 26 handle_replica_op replica scrub(pg: 1.0,from:0'0,to:22'2694,epoch:26,start:1:42307943:::a100:0,end:1:80463ac6:::a9:0,chunky:1,deep:1,seed:4294967295,version:6) v6 epoch 26
2017-08-18 23:50:40.320986 7f2ac583c700 10 osd.0 26 handle_replica_op replica scrub(pg: 1.0,from:0'0,to:22'2690,epoch:26,start:1:80463ac6:::a9:0,end:1:b7f2650d:::a88:0,chunky:1,deep:1,seed:4294967295,version:6) v6 epoch 26
2017-08-18 23:50:40.337646 7f2ac583c700 10 osd.0 26 handle_replica_op replica scrub(pg: 1.0,from:0'0,to:22'2700,epoch:26,start:1:b7f2650d:::a88:0,end:1:fb2020fa:::a40:0,chunky:1,deep:1,seed:4294967295,version:6) v6 epoch 26
2017-08-18 23:50:40.373227 7f2ac583c700 10 osd.0 26 handle_replica_op replica scrub(pg: 1.0,from:0'0,to:22'2636,epoch:26,start:1:fb2020fa:::a40:0,end:MAX,chunky:1,deep:1,seed:4294967295,version:6) v6 epoch 26

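Look at the key part of each entry, the start and end of the scrub range (the original post showed a screenshot here, labelled a100).
Now let's see where these boundary objects sit in the file access trace above: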

25:1.0_head/ ACCESS a100__head_C29E0C42__1
50:1.0_head/ ACCESS a9__head_635C6201__1
75:1.0_head/ ACCESS a88__head_B0A64FED__1
100:1.0_head/ ACCESS a40__head_5F0404DF__1

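Looks very regular, doesn't it? The range boundaries fall at positions 25, 50, 75 and 100 in the read order. Ceph has the concept of a chunk here: while scrubbing, Ceph locks the chunk being scrubbed, which is mentioned in many places. This is why slow requests appear. It is not necessarily that your disk is slow; the objects are locked, so they cannot be read.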
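The positions above can be reproduced from the two traces with a short script. This is only a sketch: the function names are mine, it assumes inotifywait's default output format and file names of the form <object>__head_<hash>__<pool> as they appear in the trace above, and the regular expression only targets the replica scrub log lines seen in this test.

```python
import re


def accessed_objects(inotify_lines):
    """Object names in the order their files were first read."""
    order = []
    for line in inotify_lines:
        parts = line.split()
        # File reads look like "1.0_head/ ACCESS a16__head_8FA46F40__1";
        # directory events ("ACCESS,ISDIR") have only two fields and are skipped.
        if len(parts) == 3 and parts[1] == "ACCESS":
            name = parts[2].split("__head_")[0]
            if name not in order:
                order.append(name)
    return order


def range_starts(scrub_log_lines):
    """Object names that start a scrub range, e.g. "start:1:42307943:::a100:0"."""
    starts = []
    for line in scrub_log_lines:
        match = re.search(r"start:[^,]*:::([^:,]+):0,", line)
        if match:
            starts.append(match.group(1))
    return starts


def boundary_positions(inotify_lines, scrub_log_lines):
    """Map each range-start object to its 1-based position in the read order."""
    order = accessed_objects(inotify_lines)
    return {name: order.index(name) + 1
            for name in range_starts(scrub_log_lines) if name in order}


# Shortened excerpts of the two traces, just to exercise the functions.
trace = [
    "1.0_head/ OPEN,ISDIR",
    "1.0_head/ ACCESS,ISDIR",
    "1.0_head/ OPEN a16__head_8FA46F40__1",
    "1.0_head/ ACCESS a16__head_8FA46F40__1",
    "1.0_head/ OPEN a39__head_621FD720__1",
    "1.0_head/ ACCESS a39__head_621FD720__1",
    "1.0_head/ CLOSE_NOWRITE,CLOSE,ISDIR",
    "1.0_head/ OPEN a100__head_C29E0C42__1",
    "1.0_head/ ACCESS a100__head_C29E0C42__1",
]
log = [
    "replica scrub(pg: 1.0,from:0'0,to:22'2696,epoch:26,"
    "start:1:00000000::::head,end:1:42307943:::a100:0,chunky:1,deep:1)",
    "replica scrub(pg: 1.0,from:0'0,to:22'2694,epoch:26,"
    "start:1:42307943:::a100:0,end:1:80463ac6:::a9:0,chunky:1,deep:1)",
]
positions = boundary_positions(trace, log)
print(positions)
```

Fed the complete inotifywait output and the five log lines from this test, the same function should report a100, a9, a88 and a40 at positions 25, 50, 75 and 100, which is what the grep output shows.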

osd scrub chunk min

Description: The minimal number of object store chunks to scrub during single operation. Ceph blocks writes to single chunk during scrub.
Type: 32-bit Integer
Default: 5

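The documentation says that writes are blocked and does not mention reads being locked. So let's verify whether deep-scrub also causes slow requests for reads.

Using the same 100 objects, I changed the size of each object to 100 MB and set the chunk size to 100 objects, so that all objects in this environment form one large chunk. Then I read an object with rados to see what happens.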

osd_scrub_chunk_min = 100
osd_scrub_chunk_max = 100

Monitor with ceph -w:

2017-08-19 00:19:26.045032 mon.0 [INF] pgmap v377: 1 pgs: 1 active+clean+scrubbing+deep; 10000 MB data, 30103 MB used, 793 GB / 822 GB avail
2017-08-19 00:19:17.540413 osd.0 [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.398705 secs
2017-08-19 00:19:17.540456 osd.0 [WRN] slow request 30.398705 seconds old, received at 2017-08-19 00:18:47.141483: replica scrub(pg: 1.0,from:0'0,to:26'5200,epoch:32,start:1:00000000::::head,end:MAX,chunky:1,deep:1,seed:4294967295,version:6) currently reached_pg

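As soon as the deep-scrub started I ran a get on object a40 (rados -p rbd get a40 a40), and it hung without returning. As long as the objects in a PG do not change, the order in which the PG is scrubbed does not change either, so I deliberately picked the object that is scrubbed last in this order. It still ran into a slow request. This supports the inference above: during a scrub, read requests for objects in the chunk being scrubbed are blocked as well. Now let's set the scrub chunk to 1 and see what happens.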
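To put a number on "hung without returning", the read can be wrapped in a small timer. The helper below is a hypothetical sketch of mine, not part of Ceph: it runs any command with a timeout and reports how long it took and whether it finished. The self-check at the bottom deliberately uses a no-op process so the snippet runs without a cluster.

```python
import subprocess
import sys
import time


def timed_run(argv, timeout_seconds):
    """Run a command and return (elapsed_seconds, finished).

    finished is False if the command was still running at the timeout,
    which is how a read blocked behind a scrub would show up.
    """
    start = time.monotonic()
    try:
        subprocess.run(argv, stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL,
                       timeout=timeout_seconds, check=False)
        finished = True
    except subprocess.TimeoutExpired:
        finished = False
    return time.monotonic() - start, finished


# Harmless self-check that needs no cluster: time a no-op Python process.
elapsed, finished = timed_run([sys.executable, "-c", "pass"], 30)
print(f"no-op finished={finished} in {elapsed:.3f}s")

# On a test cluster the same helper could wrap the read used in this post:
#   timed_run(["rados", "-p", "rbd", "get", "a40", "a40"], 60)
```

Running the wrapped get once before the deep-scrub and once right after it starts gives two latencies that can be compared directly.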

Change the configuration to:

osd_scrub_chunk_min = 1
osd_scrub_chunk_max = 1

Then run five loops that get a9 at the same time:

watch -n 1 'rados -p rbd get a9 a1'
watch -n 1 'rados -p rbd get a9 a2'
watch -n 1 'rados -p rbd get a9 a3'
watch -n 1 'rados -p rbd get a9 a4'
watch -n 1 'rados -p rbd get a9 a5'

With these five readers looping, run the deep-scrub again. This time no slow request appears.

Another important parameter

Now look at the parameter osd_scrub_sleep, which defaults to 0:

osd scrub sleep

Description: Time to sleep before scrubbing next group of chunks. Increasing this value will slow down whole scrub operation while client operations will be less impacted.
Type: Float
Default: 0

So there is also the notion of a scrub group. Judging from the measurements in this test, a group appears to be 3 chunks.
Let's set the parameter:

osd_scrub_sleep = 5

Then run the deep-scrub again and look at the log:

cat /var/log/ceph/ceph-osd.0.log |grep be_deep_scrub|awk '{print $1,$2,$28}'|less
2017-08-19 00:48:37.930455 1:02f625f1:::a16:head
2017-08-19 00:48:38.477271 1:02f625f1:::a16:head
2017-08-19 00:48:38.477367 1:04ebf846:::a39:head
2017-08-19 00:48:39.023952 1:04ebf846:::a39:head
2017-08-19 00:48:39.024084 1:07e14aa6:::a30:head
2017-08-19 00:48:39.572683 1:07e14aa6:::a30:head
2017-08-19 00:48:44.989551 1:0bc7740d:::a91:head
2017-08-19 00:48:45.556758 1:0bc7740d:::a91:head
2017-08-19 00:48:45.556857 1:0c7c7979:::a33:head
2017-08-19 00:48:46.109657 1:0c7c7979:::a33:head
2017-08-19 00:48:46.109768 1:0cd63f56:::a92:head
2017-08-19 00:48:46.657849 1:0cd63f56:::a92:head
2017-08-19 00:48:52.084712 1:0d551235:::a22:head
2017-08-19 00:48:52.614345 1:0d551235:::a22:head
2017-08-19 00:48:52.614458 1:13509d6e:::a42:head
2017-08-19 00:48:53.158826 1:13509d6e:::a42:head
2017-08-19 00:48:53.158916 1:14e585a7:::a5:head

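Each object takes a bit more than half a second to deep-scrub (about 0.55 s between its two log lines), and after 3 objects the scrub pauses for about 5 s.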
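The pause pattern can be pulled out of the timestamps automatically. The sketch below is my own helper, not a Ceph tool: it takes the three-column lines printed by the awk command above and starts a new group whenever two consecutive timestamps are at least a given number of seconds apart. The 3-second threshold is an arbitrary choice that sits between the sub-second scrub steps and the 5-second sleep.

```python
from datetime import datetime


def split_by_pauses(lines, min_gap_seconds):
    """Group object names; a gap >= min_gap_seconds starts a new group."""
    groups = []
    previous = None
    for line in lines:
        date, clock, obj = line.split()
        stamp = datetime.strptime(f"{date} {clock}", "%Y-%m-%d %H:%M:%S.%f")
        if previous is None or (stamp - previous).total_seconds() >= min_gap_seconds:
            groups.append([])
        # "1:02f625f1:::a16:head" -> "a16"
        name = obj.split(":")[-2]
        if name not in groups[-1]:
            groups[-1].append(name)
        previous = stamp
    return groups


# The log excerpt shown above, unchanged.
sample = """\
2017-08-19 00:48:37.930455 1:02f625f1:::a16:head
2017-08-19 00:48:38.477271 1:02f625f1:::a16:head
2017-08-19 00:48:38.477367 1:04ebf846:::a39:head
2017-08-19 00:48:39.023952 1:04ebf846:::a39:head
2017-08-19 00:48:39.024084 1:07e14aa6:::a30:head
2017-08-19 00:48:39.572683 1:07e14aa6:::a30:head
2017-08-19 00:48:44.989551 1:0bc7740d:::a91:head
2017-08-19 00:48:45.556758 1:0bc7740d:::a91:head
2017-08-19 00:48:45.556857 1:0c7c7979:::a33:head
2017-08-19 00:48:46.109657 1:0c7c7979:::a33:head
2017-08-19 00:48:46.109768 1:0cd63f56:::a92:head
2017-08-19 00:48:46.657849 1:0cd63f56:::a92:head
2017-08-19 00:48:52.084712 1:0d551235:::a22:head
2017-08-19 00:48:52.614345 1:0d551235:::a22:head
2017-08-19 00:48:52.614458 1:13509d6e:::a42:head
2017-08-19 00:48:53.158826 1:13509d6e:::a42:head
2017-08-19 00:48:53.158916 1:14e585a7:::a5:head
""".splitlines()

groups = split_by_pauses(sample, min_gap_seconds=3)
print(groups)
```

On this excerpt the groups come out as three objects each, which is the observation behind reading a group as 3 chunks when the chunk size is 1.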

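Default scrub behaviour compared with the tuned settings

Let's compare the behaviour before and after the change. We simulate a PG that holds 10,000 small objects. The test files are all 1 KB; you can repeat the test with your own file profile.

In a scenario with a massive number of objects, around 10,000 objects in a single PG is already fairly many, so we simulate a deep-scrub over 10,000 objects.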

cat /var/log/ceph/ceph-osd.0.log |grep be_deep_scrub|awk '{print $1,$2,$28}'|awk '{sub(/.*/,substr($2,1,8),$2); print $0}'|uniq|awk '{a[$1," ",$2]++}END{for (j in a) print j,a[j]|"sort -k 1"}'

The pipeline above counts how many objects are scrubbed in each second:
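For readers who find the awk pipeline hard to follow, here is a rough Python equivalent. It is a sketch under one assumption: its input is the three-column output of the first awk stage (date, time, object), so it does not depend on the exact field layout of the OSD log. Like uniq, it only collapses adjacent duplicates.

```python
from collections import Counter


def objects_per_second(lines):
    """Count objects per second from "date time object" lines."""
    counts = Counter()
    last_key = None
    for line in lines:
        date, clock, obj = line.split()
        second = f"{date} {clock[:8]}"  # truncate the time to HH:MM:SS
        key = (second, obj)
        if key != last_key:  # same effect as `uniq` on adjacent lines
            counts[second] += 1
            last_key = key
    return sorted(counts.items())


# A few lines in the format printed by `awk '{print $1,$2,$28}'`.
sample = """\
2017-08-19 16:12:21.927440 1:0000b488:::a5471:head
2017-08-19 16:12:21.931914 1:0000b488:::a5471:head
2017-08-19 16:12:21.932039 1:000fbbcb:::a5667:head
2017-08-19 16:12:21.933568 1:000fbbcb:::a5667:head
2017-08-19 16:12:21.933646 1:00134ebd:::a1903:head
2017-08-19 16:12:21.934972 1:00134ebd:::a1903:head
2017-08-19 16:12:24.960697 1:0018f641:::a2028:head
2017-08-19 16:12:24.966653 1:0018f641:::a2028:head
2017-08-19 16:12:24.966733 1:00197a21:::a1463:head
2017-08-19 16:12:24.967085 1:00197a21:::a1463:head
2017-08-19 16:12:24.967162 1:001cb17d:::a1703:head
2017-08-19 16:12:24.967492 1:001cb17d:::a1703:head
""".splitlines()

per_second = objects_per_second(sample)
for second, count in per_second:
    print(second, count)
```

Note that an object whose two log lines straddle a second boundary is counted in both seconds, in the awk version as well as in this one, so the per-second totals can add up to slightly more than the number of objects.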

2017-08-19 01:23:33 184
2017-08-19 01:23:34 236
2017-08-19 01:23:35 261
2017-08-19 01:23:36 263
2017-08-19 01:23:37 229
2017-08-19 01:23:38 289
2017-08-19 01:23:39 236
2017-08-19 01:23:40 258
2017-08-19 01:23:41 276
2017-08-19 01:23:42 238
2017-08-19 01:23:43 224
2017-08-19 01:23:44 282
2017-08-19 01:23:45 254
2017-08-19 01:23:46 258
2017-08-19 01:23:47 261
2017-08-19 01:23:48 233
2017-08-19 01:23:49 300
2017-08-19 01:23:50 243
2017-08-19 01:23:51 257
2017-08-19 01:23:52 252
2017-08-19 01:23:53 246
2017-08-19 01:23:54 313
2017-08-19 01:23:55 252
2017-08-19 01:23:56 276
2017-08-19 01:23:57 245
2017-08-19 01:23:58 256
2017-08-19 01:23:59 307
2017-08-19 01:24:00 276
2017-08-19 01:24:01 310
2017-08-19 01:24:02 220
2017-08-19 01:24:03 250
2017-08-19 01:24:04 313
2017-08-19 01:24:05 265
2017-08-19 01:24:06 304
2017-08-19 01:24:07 262
2017-08-19 01:24:08 308
2017-08-19 01:24:09 263
2017-08-19 01:24:10 293
2017-08-19 01:24:11 42

Roughly 250 to 300 objects are scanned per second, so the whole PG is finished in about 40 s. By default a chunk holds up to 25 objects.

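Here is an analogy. On a road 40 m long, a car drives forward at 1 m/s while people cross back and forth. With only one or two people crossing there may be no problem, but once 40 people are crossing within that stretch, you can imagine how high the chance of a collision becomes.

Or suppose the same file is requested 40 times in a row. In the analogy that is 40 people crossing at the same spot over and over. Isn't a collision then very likely?

After all of the above, you probably already know what to do.
Let's see what happens when the chunk parameters are both set to 1: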

osd_scrub_chunk_min = 1
osd_scrub_chunk_max = 1
osd_scrub_sleep = 3

Reducing the chunk size is like shortening the vehicle in the analogy above: the 25 m truck becomes a 1 m bicycle.

[root@lab8106 ceph]# cat /var/log/ceph/ceph-osd.0.log |grep be_deep_scrub|awk '{print $1,$2,$28}'
2017-08-19 16:12:21.927440 1:0000b488:::a5471:head
2017-08-19 16:12:21.931914 1:0000b488:::a5471:head
2017-08-19 16:12:21.932039 1:000fbbcb:::a5667:head
2017-08-19 16:12:21.933568 1:000fbbcb:::a5667:head
2017-08-19 16:12:21.933646 1:00134ebd:::a1903:head
2017-08-19 16:12:21.934972 1:00134ebd:::a1903:head
2017-08-19 16:12:24.960697 1:0018f641:::a2028:head
2017-08-19 16:12:24.966653 1:0018f641:::a2028:head
2017-08-19 16:12:24.966733 1:00197a21:::a1463:head
2017-08-19 16:12:24.967085 1:00197a21:::a1463:head
2017-08-19 16:12:24.967162 1:001cb17d:::a1703:head
2017-08-19 16:12:24.967492 1:001cb17d:::a1703:head
2017-08-19 16:12:27.972252 1:002d911c:::a1585:head
2017-08-19 16:12:27.976621 1:002d911c:::a1585:head
2017-08-19 16:12:27.976740 1:00301acf:::a6131:head
2017-08-19 16:12:27.977097 1:00301acf:::a6131:head
2017-08-19 16:12:27.977181 1:0039a0a8:::a1840:head
2017-08-19 16:12:27.979053 1:0039a0a8:::a1840:head
2017-08-19 16:12:30.983556 1:00484881:::a8781:head
2017-08-19 16:12:30.989098 1:00484881:::a8781:head
2017-08-19 16:12:30.989181 1:004f234f:::a4402:head
2017-08-19 16:12:30.989531 1:004f234f:::a4402:head
2017-08-19 16:12:30.989626 1:00531b36:::a5251:head
2017-08-19 16:12:30.989954 1:00531b36:::a5251:head
2017-08-19 16:12:33.994419 1:00584c30:::a3374:head
2017-08-19 16:12:34.001296 1:00584c30:::a3374:head
2017-08-19 16:12:34.001378 1:005d6aa5:::a2115:head
2017-08-19 16:12:34.002174 1:005d6aa5:::a2115:head
2017-08-19 16:12:34.002287 1:005e0dfd:::a9945:head
2017-08-19 16:12:34.002686 1:005e0dfd:::a9945:head
2017-08-19 16:12:37.005645 1:006320f9:::a5207:head
2017-08-19 16:12:37.011498 1:006320f9:::a5207:head
2017-08-19 16:12:37.011655 1:006d32b4:::a7517:head
2017-08-19 16:12:37.011998 1:006d32b4:::a7517:head
2017-08-19 16:12:37.012111 1:006dae55:::a4702:head
2017-08-19 16:12:37.012442 1:006dae55:::a4702:head

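The lines above are an excerpt from the log. What do they show? Three objects are scanned in quick succession, then the scrub rests for 3 s before moving on to the next group. Hasn't that pushed the rate down very low? In the earlier osd_scrub_sleep test each object took about half a second, so why do three objects now finish within a few milliseconds? It depends on object size: the larger the object, the longer its scrub takes. The objects in this test are 1 KB, which is very small, so three objects are scanned almost instantly, and then the scrub waits for the configured sleep before starting the next group.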

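In this environment the default settings scrub around 300 objects per second, with a lock window of up to 25 objects moving through the PG; objects inside the window can be neither written nor read. With the modified parameters, 3 objects are scrubbed per 3-second cycle, with a lock window of a single object. The number of objects locked per unit of time has dropped to a very low level. If you run a production cluster and still want scrub enabled, try lowering the chunk size and raising the sleep.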

The only cost is scan speed. If you want to scan faster, adjust the sleep parameter to control it; I won't go into that here.
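How much slower is the throttled scan? A back-of-the-envelope estimate can be written down as follows. The inputs are assumptions read off the logs in this post: about 260 objects per second with the defaults (roughly the average of the per-second counts), and 3 objects per 3-second cycle with chunk size 1 and osd_scrub_sleep = 3, ignoring the few milliseconds the scrub itself takes on 1 KB objects.

```python
def estimate_scrub_seconds(num_objects, objects_per_cycle, seconds_per_cycle):
    """Rough duration of a deep-scrub over one PG."""
    cycles = -(-num_objects // objects_per_cycle)  # ceiling division
    return cycles * seconds_per_cycle


default_seconds = estimate_scrub_seconds(10000, 260, 1.0)
tuned_seconds = estimate_scrub_seconds(10000, 3, 3.0)

print(f"default: about {default_seconds:.0f} s")
print(f"tuned:   about {tuned_seconds:.0f} s ({tuned_seconds / 3600:.1f} h)")
```

Under these assumptions a PG that is scrubbed in about 40 seconds by default would take on the order of a few hours with the tuned settings, which is the trade being made: a much longer scan in exchange for a much smaller lock window.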

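This post covers the impact of running deep-scrub on a single PG. By default, scrub starts automatically once the maximum interval has been reached. My suggestion is therefore not to rely on the built-in scheduling, but to analyze the scrub timestamps and object counts yourself and plan accordingly. For example, scan a fixed number of PGs every night, and once a full round is complete, leave a no-scrub period of whatever length you choose, so that one round happens every month or every two months. I will cover this in a separate post later.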
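As a very rough illustration of that idea, the selection step could look like the sketch below. Everything in it is hypothetical: the function name, the nightly object budget, and the sample data are mine. In practice the PG ids, last deep-scrub timestamps and object counts would have to be collected from the cluster (for example from the output of ceph pg dump), and each chosen PG would then be scrubbed with ceph pg deep-scrub <pgid>.

```python
def pick_pgs_for_tonight(pg_stats, object_budget):
    """Pick the PGs with the oldest deep-scrub stamp within an object budget.

    pg_stats is a list of (pgid, last_deep_scrub_stamp, num_objects).
    Stamps are "YYYY-MM-DD HH:MM:SS" strings, so they sort chronologically.
    At least one PG is always returned if any exist.
    """
    chosen = []
    total = 0
    for pgid, _stamp, num_objects in sorted(pg_stats, key=lambda pg: pg[1]):
        if chosen and total + num_objects > object_budget:
            break
        chosen.append(pgid)
        total += num_objects
    return chosen


# Hypothetical data for four PGs.
pg_stats = [
    ("1.0", "2017-08-01 02:00:00", 10000),
    ("1.1", "2017-07-15 02:00:00", 8000),
    ("1.2", "2017-08-10 02:00:00", 12000),
    ("1.3", "2017-07-20 02:00:00", 9000),
]
tonight = pick_pgs_for_tonight(pg_stats, object_budget=20000)
print(tonight)
```

Budgeting by object count rather than by PG count follows from the measurements above, where the scan time of a PG is driven by how many objects it holds.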

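Summary

With scrub, you need to know when it runs, how much load it puts on your OSDs once it does, how many objects it scans per second, and how to reduce its impact. Those questions are what prompted this post. Many problems can be solved through parameters; the key is knowing what the parameters actually do.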

Change log

Why Who When
Created 武汉-运维-磨渣 2017-08-19