Reconstructing the data of a cloned RBD image

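Preface

I previously wrote an article on rebuilding RBD metadata, which covered how to recover the metadata when it is lost but the data is still there. In practice, though, the more likely scenario is that the cluster has crashed, the data is still on disk, the OSDs cannot be brought up, and the data is important enough that it has to be stitched back together. This is about the lowest-level form of data recovery there is.

There are articles online about reassembling RBD images that have no snapshots, but very little about images that use snapshots. In practice, many deployments clone an original system image and then work on the clone. This post writes down all the details needed for that kind of recovery.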

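Reconstruction steps

Gathering the basic information

  • 1. Find rbd_directory, and use it to get the mapping between the name and the prefix of every rbd image in the environment
  • 2. Use the metadata in rbd_header to find the details of each rbd image:
      • the size of the rbd image
      • the object size of the rbd image
      • whether the rbd image has a parent (this tells you which snapshot it was created from)
  • 3. Check whether the rbd image has snapshots (there are two possibilities: it only has snapshots of its own, or one of its snapshots has been cloned). For an image that does not involve snapshots, simply take the head objects and concatenate them. For an image cloned from a snapshot, the state of each object has to be checked before stitching (covered later)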

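The normal case

In the normal case, the information in items 1 and 2 above can be obtained with commands. The precondition is that the OSDs holding the relevant objects can start, because this data is stored in omap. When they cannot start, the only option is to read from the lower layer. The normal case comes first; it really only takes a few commands.

The abnormal case

In the abnormal case the OSDs cannot start and the data has to be read from the lower layer. Both ways of reading, normal and abnormal, are shown below.

Getting the mapping with commands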

[root@lab102 opt]# rados  -p rbd listomapvals rbd_directory
id_5f7a6b8b4567
value (11 bytes) :
00000000 07 00 00 00 74 65 73 74 72 62 64 |....testrbd|
0000000b

id_5f856b8b4567
value (10 bytes) :
00000000 06 00 00 00 6e 65 77 72 62 64 |....newrbd|
0000000a

name_newrbd
value (16 bytes) :
00000000 0c 00 00 00 35 66 38 35 36 62 38 62 34 35 36 37 |....5f856b8b4567|
00000010

name_testrbd
value (16 bytes) :
00000000 0c 00 00 00 35 66 37 61 36 62 38 62 34 35 36 37 |....5f7a6b8b4567|
00000010

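The output above gives the mapping between rbd names and prefixes. When it cannot be obtained with commands, we look it up in the underlying rocksdb.

Getting the information with low-level commands

This has to be run on the OSD that holds the rbd_directory object, and the OSD must be stopped before querying.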

[root@lab102 opt]# ceph-kvstore-tool rocksdb /var/lib/ceph/osd/ceph-0/current/omap/ list|grep id
_USER_0000000000000065_USER_:id_5f7a6b8b4567
_USER_0000000000000065_USER_:id_5f856b8b4567

[root@lab102 opt]# ceph-kvstore-tool rocksdb /var/lib/ceph/osd/ceph-0/current/omap/ list|grep name
_USER_0000000000000065_USER_:name_newrbd
_USER_0000000000000065_USER_:name_testrbd

[root@lab102 opt]# ceph-kvstore-tool rocksdb /var/lib/ceph/osd/ceph-0/current/omap/ get _USER_0000000000000065_USER_ name_newrbd
(_USER_0000000000000065_USER_, name_newrbd)
00000000 0c 00 00 00 35 66 38 35 36 62 38 62 34 35 36 37 |....5f856b8b4567|
00000010
[root@lab102 opt]# ceph-kvstore-tool rocksdb /var/lib/ceph/osd/ceph-0/current/omap/ get _USER_0000000000000065_USER_ name_testrbd
00000000 0c 00 00 00 35 66 37 61 36 62 38 62 34 35 36 37 |....5f7a6b8b4567|
00000010

The information found at the lower layer matches what the higher-level command returned.
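The values printed above are hexdumps of a 4-byte little-endian length followed by the string itself. As a minimal sketch, the dump can be turned back into bytes and decoded like this. The helper names `hexdump_to_bytes` and `decode_string` are made up for this illustration, and the parser assumes the dump format shown above (offset, hex bytes, then an ASCII column between `|` characters).

```python
import re
import struct

def hexdump_to_bytes(dump: str) -> bytes:
    """Turn hexdump lines (offset, hex bytes, |ascii|) back into raw bytes."""
    out = bytearray()
    for line in dump.splitlines():
        fields = line.split("|", 1)[0].split()   # drop the ASCII column
        # Only lines that start with an 8-digit hex offset carry data;
        # key names and "value (N bytes) :" lines are skipped.
        if not fields or not re.fullmatch(r"[0-9a-fA-F]{8}", fields[0]):
            continue
        out.extend(int(tok, 16) for tok in fields[1:])
    return bytes(out)

def decode_string(buf: bytes, offset: int = 0):
    """Decode a 4-byte little-endian length followed by that many bytes.

    Returns the string and the offset just past it.
    """
    (length,) = struct.unpack_from("<I", buf, offset)
    start = offset + 4
    return buf[start:start + length].decode("ascii"), start + length

# Value of name_newrbd, copied from the output above.
dump = """\
00000000 0c 00 00 00 35 66 38 35 36 62 38 62 34 35 36 37 |....5f856b8b4567|
00000010
"""
image_id, _ = decode_string(hexdump_to_bytes(dump))
print(image_id)  # 5f856b8b4567
```

The same two helpers work for the `id_...` entries, where the decoded string is the image name instead of the prefix.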

Querying the rbd metadata

[root@lab102 opt]# rbd info testrbd
rbd image 'testrbd':
size 16384 kB in 4 objects
order 22 (4096 kB objects)
block_name_prefix: rbd_data.5f7a6b8b4567
format: 2
features: layering
flags:
[root@lab102 opt]# rbd info newrbd
rbd image 'newrbd':
size 16384 kB in 4 objects
order 22 (4096 kB objects)
block_name_prefix: rbd_data.5f856b8b4567
format: 2
features: layering
flags:
parent: rbd/testrbd@testrbd-write3obj
overlap: 16384 kB
[root@lab102 opt]# rbd snap ls testrbd
SNAPID NAME SIZE
10 testrbd-write3obj 16384 kB
11 writemany 16384 kB
12 writemany1 16384 kB
13 writemany2 16384 kB
14 writemany3 16384 kB
15 writemany4 16384 kB
16 writemany2a 16384 kB
17 writemany2h 16384 kB
18 writemany2l 16384 kB

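These commands show the relationship: newrbd was created by cloning a snapshot of testrbd, with snapid 10 and snapshot name testrbd-write3obj. That is how it looks in the normal case; next we get the same information from the lower layer.

We already have the prefixes of the two rbd images, so next we query the header metadata of each one.
Query commands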

[root@lab102 opt]# rados  -p rbd listomapvals rbd_header.5f7a6b8b4567
features
value (8 bytes) :
00000000 01 00 00 00 00 00 00 00 |........|
00000008

object_prefix
value (25 bytes) :
00000000 15 00 00 00 72 62 64 5f 64 61 74 61 2e 35 66 37 |....rbd_data.5f7|
00000010 61 36 62 38 62 34 35 36 37 |a6b8b4567|
00000019

order
value (1 bytes) :
00000000 16 |.|
00000001

size
value (8 bytes) :
00000000 00 00 00 01 00 00 00 00 |........|
00000008

snap_seq
value (8 bytes) :
00000000 12 00 00 00 00 00 00 00 |........|
00000008

snapshot_000000000000000a
value (94 bytes) :
00000000 04 01 58 00 00 00 0a 00 00 00 00 00 00 00 11 00 |..X.............|
00000010 00 00 74 65 73 74 72 62 64 2d 77 72 69 74 65 33 |..testrbd-write3|
00000020 6f 62 6a 00 00 00 01 00 00 00 00 01 00 00 00 00 |obj.............|
00000030 00 00 00 01 01 1c 00 00 00 ff ff ff ff ff ff ff |................|
00000040 ff 00 00 00 00 fe ff ff ff ff ff ff ff 00 00 00 |................|
00000050 00 00 00 00 00 02 00 00 00 00 00 00 00 00 |..............|
0000005e

snapshot_000000000000000b
value (86 bytes) :
00000000 04 01 50 00 00 00 0b 00 00 00 00 00 00 00 09 00 |..P.............|
00000010 00 00 77 72 69 74 65 6d 61 6e 79 00 00 00 01 00 |..writemany.....|
00000020 00 00 00 01 00 00 00 00 00 00 00 01 01 1c 00 00 |................|
00000030 00 ff ff ff ff ff ff ff ff 00 00 00 00 fe ff ff |................|
00000040 ff ff ff ff ff 00 00 00 00 00 00 00 00 00 00 00 |................|
00000050 00 00 00 00 00 00 |......|
00000056

snapshot_000000000000000c
value (87 bytes) :
00000000 04 01 51 00 00 00 0c 00 00 00 00 00 00 00 0a 00 |..Q.............|
00000010 00 00 77 72 69 74 65 6d 61 6e 79 31 00 00 00 01 |..writemany1....|
00000020 00 00 00 00 01 00 00 00 00 00 00 00 01 01 1c 00 |................|
00000030 00 00 ff ff ff ff ff ff ff ff 00 00 00 00 fe ff |................|
00000040 ff ff ff ff ff ff 00 00 00 00 00 00 00 00 00 00 |................|
00000050 00 00 00 00 00 00 00 |.......|
00000057

snapshot_000000000000000d
value (87 bytes) :
00000000 04 01 51 00 00 00 0d 00 00 00 00 00 00 00 0a 00 |..Q.............|
00000010 00 00 77 72 69 74 65 6d 61 6e 79 32 00 00 00 01 |..writemany2....|
00000020 00 00 00 00 01 00 00 00 00 00 00 00 01 01 1c 00 |................|
00000030 00 00 ff ff ff ff ff ff ff ff 00 00 00 00 fe ff |................|
00000040 ff ff ff ff ff ff 00 00 00 00 00 00 00 00 00 00 |................|
00000050 00 00 00 00 00 00 00 |.......|
00000057

snapshot_000000000000000e
value (87 bytes) :
00000000 04 01 51 00 00 00 0e 00 00 00 00 00 00 00 0a 00 |..Q.............|
00000010 00 00 77 72 69 74 65 6d 61 6e 79 33 00 00 00 01 |..writemany3....|
00000020 00 00 00 00 01 00 00 00 00 00 00 00 01 01 1c 00 |................|
00000030 00 00 ff ff ff ff ff ff ff ff 00 00 00 00 fe ff |................|
00000040 ff ff ff ff ff ff 00 00 00 00 00 00 00 00 00 00 |................|
00000050 00 00 00 00 00 00 00 |.......|
00000057

snapshot_000000000000000f
value (87 bytes) :
00000000 04 01 51 00 00 00 0f 00 00 00 00 00 00 00 0a 00 |..Q.............|
00000010 00 00 77 72 69 74 65 6d 61 6e 79 34 00 00 00 01 |..writemany4....|
00000020 00 00 00 00 01 00 00 00 00 00 00 00 01 01 1c 00 |................|
00000030 00 00 ff ff ff ff ff ff ff ff 00 00 00 00 fe ff |................|
00000040 ff ff ff ff ff ff 00 00 00 00 00 00 00 00 00 00 |................|
00000050 00 00 00 00 00 00 00 |.......|
00000057

snapshot_0000000000000010
value (88 bytes) :
00000000 04 01 52 00 00 00 10 00 00 00 00 00 00 00 0b 00 |..R.............|
00000010 00 00 77 72 69 74 65 6d 61 6e 79 32 61 00 00 00 |..writemany2a...|
00000020 01 00 00 00 00 01 00 00 00 00 00 00 00 01 01 1c |................|
00000030 00 00 00 ff ff ff ff ff ff ff ff 00 00 00 00 fe |................|
00000040 ff ff ff ff ff ff ff 00 00 00 00 00 00 00 00 00 |................|
00000050 00 00 00 00 00 00 00 00 |........|
00000058

snapshot_0000000000000011
value (88 bytes) :
00000000 04 01 52 00 00 00 11 00 00 00 00 00 00 00 0b 00 |..R.............|
00000010 00 00 77 72 69 74 65 6d 61 6e 79 32 68 00 00 00 |..writemany2h...|
00000020 01 00 00 00 00 01 00 00 00 00 00 00 00 01 01 1c |................|
00000030 00 00 00 ff ff ff ff ff ff ff ff 00 00 00 00 fe |................|
00000040 ff ff ff ff ff ff ff 00 00 00 00 00 00 00 00 00 |................|
00000050 00 00 00 00 00 00 00 00 |........|
00000058

snapshot_0000000000000012
value (88 bytes) :
00000000 04 01 52 00 00 00 12 00 00 00 00 00 00 00 0b 00 |..R.............|
00000010 00 00 77 72 69 74 65 6d 61 6e 79 32 6c 00 00 00 |..writemany2l...|
00000020 01 00 00 00 00 01 00 00 00 00 00 00 00 01 01 1c |................|
00000030 00 00 00 ff ff ff ff ff ff ff ff 00 00 00 00 fe |................|
00000040 ff ff ff ff ff ff ff 00 00 00 00 00 00 00 00 02 |................|
00000050 00 00 00 00 00 00 00 00 |........|
00000058

[root@lab102 opt]# rados  -p rbd listomapvals rbd_header.5f856b8b4567
features
value (8 bytes) :
00000000 01 00 00 00 00 00 00 00 |........|
00000008

object_prefix
value (25 bytes) :
00000000 15 00 00 00 72 62 64 5f 64 61 74 61 2e 35 66 38 |....rbd_data.5f8|
00000010 35 36 62 38 62 34 35 36 37 |56b8b4567|
00000019

order
value (1 bytes) :
00000000 16 |.|
00000001

parent
value (46 bytes) :
00000000 01 01 28 00 00 00 00 00 00 00 00 00 00 00 0c 00 |..(.............|
00000010 00 00 35 66 37 61 36 62 38 62 34 35 36 37 0a 00 |..5f7a6b8b4567..|
00000020 00 00 00 00 00 00 00 00 00 01 00 00 00 00 |..............|
0000002e

size
value (8 bytes) :
00000000 00 00 00 01 00 00 00 00 |........|
00000008

snap_seq
value (8 bytes) :
00000000 00 00 00 00 00 00 00 00 |........|
00000008

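The two results differ: one has a parent and the other does not, one has snapshots and the other does not. From this we can infer that the image with a parent was created from a snapshot. Let's parse that data.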

parent
value (46 bytes) :
00000000 01 01 28 00 00 00 00 00 00 00 00 00 00 00 0c 00 |..(.............|
00000010 00 00 35 66 37 61 36 62 38 62 34 35 36 37 0a 00 |..5f7a6b8b4567..|
00000020 00 00 00 00 00 00 00 00 00 01 00 00 00 00 |..............|
0000002e

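On the left is the value as hexadecimal bytes, and on the right is the corresponding text. The value has a fixed structure, and the leading bytes before the string can be ignored here. The bytes 35 66 37 61 36 62 38 62 34 35 36 37 correspond to the text 5f7a6b8b4567, which is an rbd prefix, and from the earlier output we know that 5f7a6b8b4567 is the prefix of testrbd. The 0a that follows is the snapshot id in hexadecimal, i.e. snapid 10. The entry for snapshot 0a is: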

snapshot_000000000000000a
value (94 bytes) :
00000000 04 01 58 00 00 00 0a 00 00 00 00 00 00 00 11 00 |..X.............|
00000010 00 00 74 65 73 74 72 62 64 2d 77 72 69 74 65 33 |..testrbd-write3|
00000020 6f 62 6a 00 00 00 01 00 00 00 00 01 00 00 00 00 |obj.............|
00000030 00 00 00 01 01 1c 00 00 00 ff ff ff ff ff ff ff |................|
00000040 ff 00 00 00 00 fe ff ff ff ff ff ff ff 00 00 00 |................|
00000050 00 00 00 00 00 02 00 00 00 00 00 00 00 00 |..............|
0000005e
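The two values analysed above can also be decoded mechanically. The following is only a sketch: the field layout (a 6-byte header, then pool id, image id, snapshot id and overlap for `parent`; a 6-byte header, then snapshot id and name for `snapshot_...`) is inferred from the dumps in this post and from the `rbd info` output, not taken from a format specification, and the function and field names are invented for this example.

```python
import struct

# "parent" value of rbd_header.5f856b8b4567, copied from the dump above.
PARENT = bytes.fromhex("""
01 01 28 00 00 00 00 00 00 00 00 00 00 00 0c 00
00 00 35 66 37 61 36 62 38 62 34 35 36 37 0a 00
00 00 00 00 00 00 00 00 00 01 00 00 00 00
""")

# First 35 bytes of snapshot_000000000000000a (up to the end of the name).
SNAPSHOT_START = bytes.fromhex("""
04 01 58 00 00 00 0a 00 00 00 00 00 00 00 11 00
00 00 74 65 73 74 72 62 64 2d 77 72 69 74 65 33
6f 62 6a
""")

HEADER_LEN = 6  # two version bytes + 4-byte payload length (assumed layout)

def decode_parent(buf: bytes) -> dict:
    """Decode the parent value under the layout assumed above."""
    off = HEADER_LEN
    (pool_id,) = struct.unpack_from("<q", buf, off)
    off += 8
    (id_len,) = struct.unpack_from("<I", buf, off)
    off += 4
    image_id = buf[off:off + id_len].decode("ascii")
    off += id_len
    snap_id, overlap = struct.unpack_from("<QQ", buf, off)
    return {"pool_id": pool_id, "image_id": image_id,
            "snap_id": snap_id, "overlap": overlap}

def decode_snapshot_id_and_name(buf: bytes):
    """Decode only the snapshot id and name; later fields are left alone."""
    (snap_id,) = struct.unpack_from("<Q", buf, HEADER_LEN)
    (name_len,) = struct.unpack_from("<I", buf, HEADER_LEN + 8)
    start = HEADER_LEN + 12
    return snap_id, buf[start:start + name_len].decode("ascii")

print(decode_parent(PARENT))
print(decode_snapshot_id_and_name(SNAPSHOT_START))
```

Under these assumptions the parent decodes to image id 5f7a6b8b4567, snapshot id 10 and an overlap of 16777216 bytes, which agrees with the `parent: rbd/testrbd@testrbd-write3obj` and `overlap: 16384 kB` lines reported by `rbd info newrbd`.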

This also matches what the commands reported. Everything above was queried with commands; now let's query it once from the lower layer.

[root@lab102 opt]# ceph-kvstore-tool rocksdb /var/lib/ceph/osd/ceph-0/current/omap/  list
_USER_0000000000000075_USER_:features
_USER_0000000000000075_USER_:object_prefix
_USER_0000000000000075_USER_:order
_USER_0000000000000075_USER_:size
_USER_0000000000000075_USER_:snap_seq
_USER_0000000000000075_USER_:snapshot_000000000000000a
_USER_0000000000000075_USER_:snapshot_000000000000000b
_USER_0000000000000075_USER_:snapshot_000000000000000c
_USER_0000000000000075_USER_:snapshot_000000000000000d
_USER_0000000000000075_USER_:snapshot_000000000000000e
_USER_0000000000000075_USER_:snapshot_000000000000000f
_USER_0000000000000075_USER_:snapshot_0000000000000010
_USER_0000000000000075_USER_:snapshot_0000000000000011
_USER_0000000000000075_USER_:snapshot_0000000000000012
_USER_0000000000000076_USER_:features
_USER_0000000000000076_USER_:object_prefix
_USER_0000000000000076_USER_:order
_USER_0000000000000076_USER_:parent
_USER_0000000000000076_USER_:size
_USER_0000000000000076_USER_:snap_seq

Some irrelevant entries are omitted above. Find the entries that have an object_prefix, then fetch the other fields under the same key prefix.

[root@lab102 opt]# ceph-kvstore-tool rocksdb /var/lib/ceph/osd/ceph-0/current/omap/  get _USER_0000000000000075_USER_ object_prefix
(_USER_0000000000000075_USER_, object_prefix)
00000000 15 00 00 00 72 62 64 5f 64 61 74 61 2e 35 66 37 |....rbd_data.5f7|
00000010 61 36 62 38 62 34 35 36 37 |a6b8b4567|
00000019
[root@lab102 opt]# ceph-kvstore-tool rocksdb /var/lib/ceph/osd/ceph-0/current/omap/ get _USER_0000000000000076_USER_ parent
(_USER_0000000000000076_USER_, parent)
00000000 01 01 28 00 00 00 00 00 00 00 00 00 00 00 0c 00 |..(.............|
00000010 00 00 35 66 37 61 36 62 38 62 34 35 36 37 0a 00 |..5f7a6b8b4567..|
00000020 00 00 00 00 00 00 00 00 00 01 00 00 00 00 |..............|
0000002e

This matches the information obtained earlier. Let's parse the remaining fields.

Getting the object size

[root@lab102 opt]# ceph-kvstore-tool rocksdb /var/lib/ceph/osd/ceph-0/current/omap/  get _USER_0000000000000076_USER_ order
(_USER_0000000000000076_USER_, order)
00000000 16 |.|
00000001

The value shown is 16, which is hexadecimal: 0x16 = 1×16 + 6 = 22. That corresponds to the 4M entry (4096 kB objects) in the table below.

order 15 (32768 bytes objects)
order 16 (64 kB objects)
order 17 (128 kB objects)
order 18 (256 kB objects)
order 19 (512 kB objects)
order 20 (1024 kB objects)
order 21 (2048 kB objects)
order 22 (4096 kB objects)
order 23 (8192 kB objects)
order 24 (16384 kB objects)
order 25 (32768 kB objects)

That gives us the object size.
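The table follows from the object size being 2 to the power of the order. A short worked example, with `object_size_for_order` as a made-up helper name:

```python
def object_size_for_order(order: int) -> int:
    """Object size in bytes for a given rbd order (size = 2 ** order)."""
    return 1 << order

order = 0x16                      # the single byte read from the omap value
object_size = object_size_for_order(order)
print(order, object_size, object_size // 1024)  # 22 4194304 4096
```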

Getting the size of the rbd image

[root@lab102 opt]# ceph-kvstore-tool rocksdb /var/lib/ceph/osd/ceph-0/current/omap/  get _USER_0000000000000076_USER_ size
(_USER_0000000000000076_USER_, size)
00000000 00 00 00 01 00 00 00 00 |........|
00000008

The size is also worked out from the hexadecimal bytes.

1B
01 00 00 00 00 00 00 00
256B
00 01 00 00 00 00 00 00
64K
00 00 01 00 00 00 00 00
16M
00 00 00 01 00 00 00 00
4G
00 00 00 00 01 00 00 00
1T
00 00 00 00 00 01 00 00
256T
00 00 00 00 00 00 01 00

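Each pair of hex digits is one byte and can represent 256 values. The lowest byte comes first (little-endian) and its unit is 1 B, which gives the values above.
The value we queried, 00 00 00 01 00 00 00 00, works out to 16M.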

[root@lab102 opt]# rbd info newrbd
rbd image 'newrbd':
size 16384 kB in 4 objects

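It matches. For any other value, do the same calculation, keeping in mind that the digits are hexadecimal. With the information above we can move on to the next step.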
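As a worked example, the size bytes can be read as a little-endian 64-bit integer and combined with the order to get the number of objects and their names. The `rbd_data.<prefix>.<16 hex digits>` naming is taken from the object file names that appear in this post; the variable names are only for illustration.

```python
import struct

size_raw = bytes.fromhex("00 00 00 01 00 00 00 00")  # the "size" omap value
(image_size,) = struct.unpack("<Q", size_raw)          # little-endian uint64

object_size = 1 << 0x16                                # order 22, 4 MiB
num_objects = -(-image_size // object_size)            # ceiling division

# Object names the image can have, one per object-size chunk.
names = [f"rbd_data.5f856b8b4567.{i:016x}" for i in range(num_objects)]

print(image_size, num_objects)  # 16777216 4
print(names[0])                 # rbd_data.5f856b8b4567.0000000000000000
```

This agrees with `size 16384 kB in 4 objects` from `rbd info newrbd`.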

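How snapshots are written

After a snapshot is taken, if the object data is not modified, the data lives only in the object whose name ends in head. If the data is modified, the old data is copied and stored under the snapshot id, and the latest data is written to the head object.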

./0.38_head/rbd\udata.5f7a6b8b4567.0000000000000001__head_D4B551B8__0
./0.38_head/rbd\udata.5f7a6b8b4567.0000000000000001__a_D4B551B8__0
./0.38_head/rbd\udata.5f7a6b8b4567.0000000000000001__b_D4B551B8__0
./0.38_head/rbd\udata.5f7a6b8b4567.0000000000000001__c_D4B551B8__0

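After cloning, when the clone modifies an object at a given index, an object with the same index is created under the clone's own prefix. If the clone has not modified it, reads go to the parent's object. So there is an order in which to look for objects.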

./0.3d_head/rbd\udata.5f856b8b4567.0000000000000000__head_538F72BD__0

Take this object as an example

If 5f856b8b4567.0000000000000000__head exists, read it
Otherwise read 5f7a6b8b4567.0000000000000000__a
If 5f7a6b8b4567.0000000000000000__a does not exist either, read 5f7a6b8b4567.0000000000000000__head
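The lookup order can be sketched as a small helper that lists the candidates for one object index. It assumes the clone and the parent use the same object index, and that the parent's snapshot copy is named with the snapshot id in hexadecimal, as in the `__a`, `__b`, `__c` files listed earlier. `lookup_order` is a hypothetical name, and the third candidate still needs the extra check discussed in the following paragraph.

```python
def lookup_order(child_id: str, parent_id: str, index: int, parent_snap_id: int):
    """Return candidate object names for one index, in the order to try them."""
    obj = f"{index:016x}"                        # 16 hex digits, as on disk
    return [
        f"{child_id}.{obj}__head",               # 1. the clone's own object
        f"{parent_id}.{obj}__{parent_snap_id:x}",  # 2. parent's snapshot copy
        f"{parent_id}.{obj}__head",              # 3. parent's head object
    ]

for name in lookup_order("5f856b8b4567", "5f7a6b8b4567", 0, 10):
    print(name)
```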

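Step 3 has a problem. If the original image did not modify the object after the snapshot, the parent's head object is there and is the right one to read. But if the object had never been written when the snapshot was taken, i.e. it was empty, and was only written after the snapshot, no object with the a suffix is created either. If we then read that newly written object, the data is wrong. So one more check is needed: whether an object is original or was written later. This can be determined from the object's extended attributes, queried as follows.

Checking the object's attributes

Get the extended attribute ceph.snapset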

attr -q -g "ceph.snapset" ./0.34_head/rbd\\udata.5f7a6b8b4567.0000000000000003__head_2F5129B4__0 > 3.txt
[root@lab102 current]# ceph-dencoder import 3.txt type SnapSet decode dump_json
{
"snap_context": {
"seq": 10,
"snaps": [
10
]
},
"head_exists": 1,
"clones": []
}
[root@lab102 current]# attr -q -g "ceph.snapset" ./0.2a_head/rbd\\udata.5f7a6b8b4567.0000000000000002__head_D519586A__0 > 2.txt
[root@lab102 current]# ceph-dencoder import 2.txt type SnapSet decode dump_json
{
"snap_context": {
"seq": 18,
"snaps": [
18,
17,
16,
15,
14,
13,
12,
11,
10
]
},
"head_exists": 1,
"clones": [
{
"snap": 15,
"size": 4194304,
"overlap": "[]"
},
{
"snap": 16,
"size": 4194304,
"overlap": "[]"
},
{
"snap": 17,
"size": 4194304,
"overlap": "[]"
},
{
"snap": 18,
"size": 4194304,
"overlap": "[]"
}
]
}

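Both objects above were written after the snapshot was taken, and therefore after the clone: the first is a newly written object, and the second has been overwritten, which is why it has clones. Now let's compare two cases: an object that existed before the snapshot and has not been touched, and an object that was newly written. The test goes like this: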

[root@lab102 current]# rbd create testrbd --size 16M
[root@lab102 current]# rbd map testrbd
/dev/rbd0
[root@lab102 current]# dd if=/dev/urandom of=/dev/rbd/rbd/testrbd bs=4M count=2
2+0 records in
2+0 records out
8388608 bytes (8.4 MB) copied, 0.111296 s, 75.4 MB/s
rbd snap create --image testrbd --snap overwrite
rbd snap protect --image testrbd --snap overwrite
rbd clone --image testrbd --snap overwrite newrbd
[root@lab102 current]# ceph-dencoder import 0.txt type SnapSet decode dump_json
{
"snap_context": {
"seq": 0,
"snaps": []
},
"head_exists": 1,
"clones": []
}

[root@lab102 current]# ceph-dencoder import 1.txt type SnapSet decode dump_json
{
"snap_context": {
"seq": 0,
"snaps": []
},
"head_exists": 1,
"clones": []
}

The objects above are original snapshot data that has not been modified. Now we write new data to the original image.

[root@lab102 current]# dd if=/dev/urandom of=/dev/rbd/rbd/testrbd bs=4M count=1 seek=2
1+0 records in
1+0 records out
4194304 bytes (4.2 MB) copied, 0.06408 s, 65.5 MB/s
[root@lab102 current]# ceph-dencoder import 2.txt type SnapSet decode dump_json
{
"snap_context": {
"seq": 32,
"snaps": [
32
]
},
"head_exists": 1,
"clones": []
}

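The newly written object is marked with snap 32, while the objects above with an empty snaps list are the original data. So when an object has no snapshot copy and we need to decide whether it is original data or new data: if snaps has entries, it is newly written data; if snaps is empty, it is old data and should be read as the original. In other words, this object was written after snapshot 32 was created, so it is a new object, and an rbd image cloned from that snapshot must not read it.

There is one more case: with multiple snapshots, an object can be old data with respect to one snapshot and new data with respect to another. How do we decide then?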

[root@lab102 current]# ceph-dencoder import 0.txt type SnapSet decode dump_json
{
"snap_context": {
"seq": 0,
"snaps": []
},
"head_exists": 1,
"clones": []
}

[root@lab102 current]# ceph-dencoder import 1.txt type SnapSet decode dump_json
{
"snap_context": {
"seq": 34,
"snaps": [
34
]
},
"head_exists": 1,
"clones": []
}

[root@lab102 current]# ceph-dencoder import 2.txt type SnapSet decode dump_json
{
"snap_context": {
"seq": 35,
"snaps": [
35,
34
]
},
"head_exists": 1,
"clones": []
}
[root@lab102 current]# rbd snap ls testrbd
SNAPID NAME SIZE
34 snap1 16384 kB
35 snap2 16384 kB

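Write the first object, take snapshot snap1, then write the second object, take snapshot snap2, and then write the third object.
The second object is then a new object with respect to snap1 and an old object with respect to snap2.
This can be read from the snaps lists above: if an object's snaps does not contain the snapshot's id, the object is original data of that snapshot; if it does contain the id, the object was written to the original image after that snapshot.
Take 1.txt: it contains snapshot 34, so for snapshot 34 it is new data and must not be read. 2.txt contains 34 and 35, so it is new data for both and must not be read for either. 0.txt has no entries, so it is original data for the snapshots and should be read.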
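The rule above can be written as a small function over the decoded SnapSet JSON. This is a sketch with made-up names (`source_for_snapshot`), not a verified implementation. It adds one assumption that the text does not spell out: when the object has clones, the copy to read for a snapshot is the clone with the smallest id that is greater than or equal to the snapshot id, so the file suffix can be larger than the snapshot id. Whether that clone really covers the snapshot should be checked on the actual cluster data before relying on it.

```python
import json

def source_for_snapshot(snapset: dict, snap_id: int):
    """Decide where the parent's data for snap_id comes from.

    Returns ("clone", id) to read the snapshot copy with that id,
    ("head", None) to read the head object, or ("skip", None) when the
    object was only written after the snapshot and must not be read.
    """
    # Assumption: the first clone with id >= snap_id holds the snapshot data.
    for clone_id in sorted(c["snap"] for c in snapset["clones"]):
        if clone_id >= snap_id:
            return ("clone", clone_id)
    # Rule from the text: the id appearing in snaps means "written after it".
    if snap_id in snapset["snap_context"]["snaps"]:
        return ("skip", None)
    return ("head", None)

# The three SnapSets from the snap1/snap2 test above (0.txt, 1.txt, 2.txt).
obj0 = json.loads('{"snap_context": {"seq": 0, "snaps": []}, "head_exists": 1, "clones": []}')
obj1 = json.loads('{"snap_context": {"seq": 34, "snaps": [34]}, "head_exists": 1, "clones": []}')
obj2 = json.loads('{"snap_context": {"seq": 35, "snaps": [35, 34]}, "head_exists": 1, "clones": []}')

for snap in (34, 35):
    print(snap, [source_for_snapshot(o, snap)[0] for o in (obj0, obj1, obj2)])
# 34 ['head', 'skip', 'skip']
# 35 ['head', 'head', 'skip']
```

The output reproduces the reasoning in the text: the second object is skipped for snapshot 34 but read from head for snapshot 35.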

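Following the process above, we can work out which object's data an image needs. With that information, create an empty image and fill in the object data. Doing this by hand, one object at a time, is not realistic for a large cluster, so it should be scripted, which is much faster. Now that the principles are clear, the script is straightforward; I will write one later.

Summary

This article analysed how to read the metadata of snapshotted images from the lower layer, and how to decide whether an object is one that a clone should read. With these techniques, even if the cluster is badly damaged, as long as the underlying data has not been deleted there is still a chance of reconstructing it.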
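The final "fill in the object data" step can be sketched as follows. This is not the script promised above, only an illustration under stated assumptions: it writes into a local raw file rather than an rbd image, `assemble_image` and its parameters are invented names, and the mapping from object index to the chosen object file is expected to come from the checks described earlier.

```python
import os
import tempfile

def assemble_image(image_path, image_size, object_size, object_files):
    """Write each chosen object file at offset index * object_size.

    object_files maps object index -> path of the file selected for it.
    Indexes with no entry are left as holes, which read back as zeros.
    """
    with open(image_path, "wb") as image:
        image.truncate(image_size)               # pre-size the sparse image
        for index, path in sorted(object_files.items()):
            with open(path, "rb") as obj:
                data = obj.read()
            offset = index * object_size
            if len(data) > object_size or offset + len(data) > image_size:
                raise ValueError(f"object {index} does not fit in the image")
            image.seek(offset)
            image.write(data)

# Tiny demonstration with made-up 4-byte objects instead of real 4 MiB ones.
workdir = tempfile.mkdtemp()
obj0_path = os.path.join(workdir, "obj0")
obj2_path = os.path.join(workdir, "obj2")
with open(obj0_path, "wb") as f:
    f.write(b"AAAA")
with open(obj2_path, "wb") as f:
    f.write(b"CC")                               # short object, rest is zeros

image_path = os.path.join(workdir, "restored.img")
assemble_image(image_path, image_size=16, object_size=4,
               object_files={0: obj0_path, 2: obj2_path})
with open(image_path, "rb") as f:
    restored = f.read()
print(restored)
```

For a real image, `image_size` and `object_size` would be the values decoded from the size and order fields, and the resulting raw file could then be loaded into a new cluster, for example with `rbd import`.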