A compile flag that caused a big Ceph performance drop

Background

While browsing the official Ceph blog recently, I came across a post describing a RocksDB performance regression on Ubuntu caused by a compile flag; the root cause appears to be that a flag in Ceph's official packaging never took effect.

Affected systems

  • Ceph versions before the P (Pacific) release
  • Running on Ubuntu
  • Only certain Ceph package builds

Several conditions have to line up, so a given environment is not necessarily affected. Let's look at which builds are affected; non-Ubuntu systems can ignore this issue.

I'm most familiar with the 15.x (Octopus) series, so I'll use it as the example.

Affected version

This is the version synced from the official Ceph repositories:

https://mirrors.tuna.tsinghua.edu.cn/ceph/debian-15.2.17/pool/main/c/ceph/

Unaffected version

https://launchpad.net/ubuntu/focal/+source/ceph

This one is packaged by Ubuntu itself.

Downloading the different versions

APT sources

root@lab103:~# cat /etc/apt/sources.list
deb http://archive.ubuntu.com/ubuntu/ focal main restricted
deb http://archive.ubuntu.com/ubuntu/ focal-updates main restricted
deb http://archive.ubuntu.com/ubuntu/ focal universe
deb http://archive.ubuntu.com/ubuntu/ focal-updates universe
deb http://archive.ubuntu.com/ubuntu/ focal multiverse
deb http://archive.ubuntu.com/ubuntu/ focal-updates multiverse
deb http://archive.ubuntu.com/ubuntu/ focal-backports main restricted universe multiverse
deb http://security.ubuntu.com/ubuntu/ focal-security main restricted
deb http://security.ubuntu.com/ubuntu/ focal-security universe
deb http://security.ubuntu.com/ubuntu/ focal-security multiverse
deb [trusted=yes] https://mirrors.tuna.tsinghua.edu.cn/ceph/debian-15.2.17 focal main

The sources above are used to install the Ceph packages. If you comment out the last line, you get Ubuntu's own build; if you keep it, you get the ceph.com build. The package version strings differ, so the two are easy to tell apart.
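A quick way to tell which build is installed is the version string itself: Ubuntu's own build carries `ubuntu` in its version, the ceph.com build does not. A minimal sketch (the version value below is hard-coded for illustration; on a live system it would come from `dpkg-query`):

```shell
# Which build is this? Ubuntu's package version looks like 15.2.17-0ubuntu0.20.04.6,
# the ceph.com build like 15.2.17-1focal. On a live system, obtain the string with:
#   ver=$(dpkg-query -W -f='${Version}' ceph)
ver="15.2.17-0ubuntu0.20.04.6"   # hard-coded example value

case "$ver" in
  *ubuntu*) echo "Ubuntu-built package (not affected)" ;;
  *)        echo "ceph.com package (affected)" ;;
esac
```

With the example value above this prints `Ubuntu-built package (not affected)`.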

Differences between the two versions

First, download Ubuntu's Debian packaging files:

wget https://launchpad.net/ubuntu/+archive/primary/+sourcefiles/ceph/15.2.17-0ubuntu0.20.04.6/ceph_15.2.17-0ubuntu0.20.04.6.debian.tar.xz

Ubuntu's package ships the debian/ directory split out on its own.

Next, download the official Ceph source package, which matches what is in git, and look at its contents:

wget https://mirrors.tuna.tsinghua.edu.cn/ceph/debian-15.2.17/pool/main/c/ceph/ceph_15.2.17.orig.tar.gz

The official Ceph tarball contains the debian/ directory, so we can inspect it directly.

What we need to compare is the contents of debian/rules.

Ubuntu's:

## Set JAVAC to prevent FTBFS due to incorrect use of 'gcj' if found (see "m4/ac_prog_javac.m4").
export JAVAC=javac

extraopts += -DWITH_OCF=ON -DWITH_NSS=ON -DWITH_PYTHON3=ON -DWITH_DEBUG=ON
extraopts += -DWITH_PYTHON2=OFF -DMGR_PYTHON_VERSION=3
extraopts += -DWITH_CEPHFS_JAVA=ON
extraopts += -DWITH_CEPHFS_SHELL=ON
extraopts += -DWITH_TESTS=OFF
extraopts += -DWITH_SYSTEM_BOOST=ON
extraopts += -DWITH_LTTNG=OFF -DWITH_EMBEDDED=OFF
extraopts += -DCMAKE_INSTALL_LIBEXECDIR=/usr/lib
extraopts += -DWITH_MGR_DASHBOARD_FRONTEND=OFF
extraopts += -DWITH_SYSTEMD=ON -DCEPH_SYSTEMD_ENV_DIR=/etc/default
extraopts += -DCMAKE_INSTALL_SYSCONFDIR=/etc
extraopts += -DCMAKE_INSTALL_SYSTEMD_SERVICEDIR=/lib/systemd/system
extraopts += -DWITH_RADOSGW_KAFKA_ENDPOINT=OFF
extraopts += -DCMAKE_BUILD_TYPE=RelWithDebInfo

ifneq (,$(filter parallel=%,$(DEB_BUILD_OPTIONS)))
NUMJOBS = $(patsubst parallel=%,%,$(filter parallel=%,$(DEB_BUILD_OPTIONS)))
extraopts += -DBOOST_J=$(NUMJOBS)
endif

Ceph's:

extraopts += -DWITH_OCF=ON -DWITH_LTTNG=ON
extraopts += -DWITH_MGR_DASHBOARD_FRONTEND=OFF
extraopts += -DWITH_PYTHON3=3
extraopts += -DWITH_CEPHFS_JAVA=ON
extraopts += -DWITH_CEPHFS_SHELL=ON
extraopts += -DWITH_SYSTEMD=ON -DCEPH_SYSTEMD_ENV_DIR=/etc/default
extraopts += -DWITH_GRAFANA=ON
# assumes that ceph is exmpt from multiarch support, so we override the libdir.
extraopts += -DCMAKE_INSTALL_LIBDIR=/usr/lib
extraopts += -DCMAKE_INSTALL_LIBEXECDIR=/usr/lib
extraopts += -DCMAKE_INSTALL_SYSCONFDIR=/etc
extraopts += -DCMAKE_INSTALL_SYSTEMD_SERVICEDIR=/lib/systemd/system
ifneq (,$(filter parallel=%,$(DEB_BUILD_OPTIONS)))
NUMJOBS = $(patsubst parallel=%,%,$(filter parallel=%,$(DEB_BUILD_OPTIONS)))
extraopts += -DBOOST_J=$(NUMJOBS)
endif

The difference is this line:

extraopts += -DCMAKE_BUILD_TYPE=RelWithDebInfo

RelWithDebInfo: a build that is both optimized and debuggable.
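For context, CMake fills in default compiler flags per build type; the crux is that with an *empty* build type none of them are added, so the code is built without `-O2`. The GCC/Clang defaults are:

```shell
# CMake's default C++ flags per build type on GCC/Clang
# (CMAKE_CXX_FLAGS_RELWITHDEBINFO etc.); an empty CMAKE_BUILD_TYPE adds none of these.
echo "RelWithDebInfo: -O2 -g -DNDEBUG"
echo "Release:        -O3 -DNDEBUG"
echo "(empty):        no optimization flags at all"
```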

What this flag brings with it is:

"-DCMAKE_CXX_FLAGS=' -Wno-deprecated-copy -Wno-pessimizing-move'"

That is, these two flags end up being passed, along with some other things being switched off; a production package is better built with this flag.

Ceph upstream controls these flags inside its own CMake, but the deb packaging path supplies its own value for this variable, so the upstream value is clobbered and never takes effect:
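The clobbering can be pictured with a small shell analogy (the variable names only mirror the CMake ones; this is not the actual build code): the inherited flags arrive in `CMAKE_CXX_FLAGS`, and the old `BuildRocksDB.cmake` replaced them instead of appending.

```shell
# Flags handed down from the build type / packaging layer:
CMAKE_CXX_FLAGS="-O2 -g"

# Old, buggy behaviour -- set(rocksdb_CXX_FLAGS -Wno-deprecated-copy)
# overwrites the variable, so -O2 never reaches the RocksDB build:
buggy="-Wno-deprecated-copy"

# Fixed behaviour -- string(APPEND rocksdb_CXX_FLAGS " -Wno-deprecated-copy")
# keeps the inherited flags:
fixed="$CMAKE_CXX_FLAGS -Wno-deprecated-copy"

echo "buggy: $buggy"   # no -O2 -> unoptimized RocksDB
echo "fixed: $fixed"   # -O2 preserved
```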

https://github.com/ceph/ceph/pull/55500

Upstream has since changed this, which should resolve it; Ubuntu controls it with the flag directly at the top level, so it never had the problem. The fix modifies cmake/modules/BuildRocksDB.cmake:

endif()
include(CheckCXXCompilerFlag)
check_cxx_compiler_flag("-Wno-deprecated-copy" HAS_WARNING_DEPRECATED_COPY)
set(rocksdb_CXX_FLAGS "${CMAKE_CXX_FLAGS}")
if(HAS_WARNING_DEPRECATED_COPY)
- set(rocksdb_CXX_FLAGS -Wno-deprecated-copy)
+ string(APPEND rocksdb_CXX_FLAGS " -Wno-deprecated-copy")
endif()
check_cxx_compiler_flag("-Wno-pessimizing-move" HAS_WARNING_PESSIMIZING_MOVE)
if(HAS_WARNING_PESSIMIZING_MOVE)
- set(rocksdb_CXX_FLAGS "${rocksdb_CXX_FLAGS} -Wno-pessimizing-move")
+ string(APPEND rocksdb_CXX_FLAGS " -Wno-pessimizing-move")
endif()
if(rocksdb_CXX_FLAGS)
list(APPEND rocksdb_CMAKE_ARGS -DCMAKE_CXX_FLAGS='${rocksdb_CXX_FLAGS}')

During packaging you can see whether the two flags made it in:

[  8%] Performing configure step for 'rocksdb_ext'
cd /ceph/ceph-15.2.17/obj-x86_64-linux-gnu/src/rocksdb && /usr/bin/cmake -DCMAKE_POSITION_INDEPENDENT_CODE=ON -DWITH_GFLAGS=OFF -DCMAKE_PREFIX_PATH= -DCMAKE_CXX_COMPILER=/usr/bin/c++ -DWITH_SNAPPY=TRUE -DWITH_LZ4=TRUE -DLZ4_INCLUDE_DIR=/usr/include -DLZ4_LIBRARIES=/usr/lib/x86_64-linux-gnu/liblz4.so -DWITH_ZLIB=TRUE -DPORTABLE=ON -DCMAKE_AR=/usr/bin/ar -DCMAKE_BUILD_TYPE=RelWithDebInfo -DFAIL_ON_WARNINGS=OFF -DUSE_RTTI=1 "-GUnix Makefiles" -DCMAKE_C_FLAGS=-Wno-stringop-truncation "-DCMAKE_CXX_FLAGS=' -Wno-deprecated-copy -Wno-pessimizing-move'" "-GUnix Makefiles" /ceph/ceph-15.2.17/src/rocksdb
-- Build files have been written to: /ceph/ceph-15.2.17/obj-x86_64-linux-gnu/src/rocksdb

Now let's put the whole picture together:

- Ceph added the flags in its code, but they were clobbered during packaging, causing the performance drop
- Ubuntu's packaging adds another flag that makes these two take effect, so its packages are fine

That's roughly it: the released packages were not built as optimized builds. A small problem, but with a fairly large impact; if you happen to be on the official Ubuntu package, you are not affected.

Performance test

Having explained where the problem comes from, let's see the performance difference for ourselves.

For an accurate comparison, I built a single-node, single-replica environment with one NVMe OSD. After creating the OSD on the host, I stopped it and mapped it into a Docker container, where I started it manually.

The point of doing it this way is to keep the OSD fixed and minimize variables: the container OS is identical, and only the Ceph packages are swapped.

Starting the container

docker run -it --privileged=true --name ceph_deb -v /dev/:/dev/ -v /var/lib/ceph/:/var/lib/ceph -v /etc/ceph:/etc/ceph --network host ubuntu:focal /bin/bash

Ceph official package

root@lab103:/# dpkg --list|grep ceph
ii ceph 15.2.17-1focal amd64 distributed storage and file system

Start the OSD manually

/usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph

Benchmark

rados -p data bench 30 write -b 4096

Other commands work too, but small 4k I/O shows the gap most clearly, and the runs are short, so both full outputs are included.

[root@lab103 zp]# rados -p data bench 30 write -b 4096
hints = 1
Maintaining 16 concurrent writes of 4096 bytes to objects of size 4096 for up to 30 seconds or 0 objects
Object prefix: benchmark_data_lab103_235759
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
0 0 0 0 0 0 - 0
1 16 6811 6795 26.5416 26.543 0.00314961 0.00234824
2 15 13061 13046 25.4781 24.418 0.00280007 0.00244734
3 16 18804 18788 24.4609 22.4297 0.00179994 0.00255073
4 16 24430 24414 23.839 21.9766 0.0027436 0.00261775
5 16 29900 29884 23.3441 21.3672 0.00184661 0.00267347
6 16 35338 35322 22.9932 21.2422 0.00376749 0.00271445
7 16 40864 40848 22.7917 21.5859 0.0023982 0.00273855
8 16 46654 46638 22.7695 22.6172 0.00265023 0.0027412
9 15 52491 52476 22.7731 22.8047 0.00237885 0.00274086
10 16 58334 58318 22.7775 22.8203 0.0032622 0.00274042
11 16 64011 63995 22.7224 22.1758 0.00224359 0.00274685
12 16 69693 69677 22.6782 22.1953 0.00282742 0.00275244
13 16 75483 75467 22.6733 22.6172 0.0024294 0.00275293
14 16 81331 81315 22.6852 22.8438 0.00243682 0.00275157
15 16 87110 87094 22.6776 22.5742 0.00249853 0.00275236
16 16 92892 92876 22.6717 22.5859 0.00234551 0.00275325
17 16 98682 98666 22.6683 22.6172 0.00279748 0.00275354
18 16 104350 104334 22.6388 22.1406 0.00247821 0.00275716
19 16 110108 110092 22.6309 22.4922 0.00265833 0.00275813
2024-03-21T18:09:10.318468+0800 min lat: 0.000901059 max lat: 0.0108369 avg lat: 0.00275815
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
20 16 115907 115891 22.6318 22.6523 0.00308136 0.00275815
21 16 121632 121616 22.6188 22.3633 0.00107606 0.00275984
22 16 127334 127318 22.6029 22.2734 0.00307552 0.00276152
23 16 133012 132996 22.5844 22.1797 0.00268477 0.00276388
24 16 138652 138636 22.5612 22.0312 0.00284827 0.00276685
25 16 144273 144257 22.5369 21.957 0.0022021 0.00276986
26 16 149771 149755 22.496 21.4766 0.00333811 0.00277466
27 16 155250 155234 22.4553 21.4023 0.00249963 0.00277992
28 16 160745 160729 22.4199 21.4648 0.00245874 0.00278427
29 16 166247 166231 22.3878 21.4922 0.00295188 0.00278836
Total time run: 30.0023
Total writes made: 171762
Write size: 4096
Object size: 4096
Bandwidth (MB/sec): 22.3631
Stddev Bandwidth: 1.02477
Max bandwidth (MB/sec): 26.543
Min bandwidth (MB/sec): 21.2422
Average IOPS: 5724
Stddev IOPS: 262.341
Max IOPS: 6795
Min IOPS: 5438
Average Latency(s): 0.00279145
Stddev Latency(s): 0.000623496
Max latency(s): 0.0108369
Min latency(s): 0.000901059
Cleaning up (deleting benchmark objects)
Removed 171762 objects
Clean up completed and total clean up time :30.6685

Ubuntu official package

root@lab103:/# dpkg --list|grep ceph
ii ceph 15.2.17-0ubuntu0.20.04.6 amd64 distributed storage and file system
ii ceph-base 15.2.17-0ubuntu0.20.04.6 amd64 common ceph daemon libraries and management tools

Start the OSD manually

/usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph

Benchmark

rados -p data bench 30 write -b 4096
[root@lab103 zp]# rados -p data bench 30 write -b 4096
hints = 1
Maintaining 16 concurrent writes of 4096 bytes to objects of size 4096 for up to 30 seconds or 0 objects
Object prefix: benchmark_data_lab103_236362
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
0 0 0 0 0 0 - 0
1 16 15303 15287 59.7107 59.7148 0.00106137 0.00104294
2 16 30338 30322 59.2169 58.7305 0.000799582 0.00105216
3 16 46691 46675 60.7683 63.8789 0.00116935 0.00102532
4 16 62548 62532 61.0597 61.9414 0.000970205 0.00102051
5 16 78056 78040 60.9618 60.5781 0.000754982 0.00102213
6 16 93210 93194 60.6659 59.1953 0.000796767 0.00102717
7 16 108491 108475 60.5256 59.6914 0.000820318 0.0010296
8 15 123545 123530 60.31 58.8086 0.00117671 0.00103331
9 15 139028 139013 60.3281 60.4805 0.0010041 0.00103303
10 16 154628 154612 60.3878 60.9336 0.00106916 0.00103203
11 16 169869 169853 60.3095 59.5352 0.00100121 0.00103337
12 16 185143 185127 60.255 59.6641 0.000874392 0.00103434
13 16 200261 200245 60.162 59.0547 0.000937596 0.00103596
14 16 212138 212122 59.1782 46.3945 0.000740705 0.00105322
15 16 227092 227076 59.1267 58.4141 0.000774189 0.00105417
16 16 241651 241635 58.9853 56.8711 0.00140325 0.00105664
17 16 257762 257746 59.217 62.9336 0.000777715 0.00105251
18 15 273338 273323 59.3072 60.8477 0.000842773 0.00105092
19 16 288882 288866 59.3808 60.7148 0.000944547 0.00104963
2024-03-21T18:13:37.381819+0800 min lat: 0.000545632 max lat: 0.037882 avg lat: 0.00104838
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
20 16 304446 304430 59.4512 60.7969 0.000774379 0.00104838
21 16 320053 320037 59.5228 60.9648 0.000891544 0.00104713
22 16 335395 335379 59.5409 59.9297 0.00105081 0.00104681
23 16 351159 351143 59.6291 61.5781 0.000691308 0.00104527
24 15 365701 365686 59.5112 56.8086 0.000910524 0.00104736
25 16 381282 381266 59.5649 60.8594 0.000934152 0.00104642
26 16 396656 396640 59.5834 60.0547 0.000904033 0.00104609
27 15 411989 411974 59.5947 59.8984 0.00273626 0.00104583
28 16 424394 424378 59.1965 48.4531 0.000757671 0.00105296
29 15 435855 435840 58.6989 44.7734 0.00237303 0.00106191
Total time run: 30.0008
Total writes made: 449321
Write size: 4096
Object size: 4096
Bandwidth (MB/sec): 58.5038
Stddev Bandwidth: 4.48913
Max bandwidth (MB/sec): 63.8789
Min bandwidth (MB/sec): 44.7734
Average IOPS: 14976
Stddev IOPS: 1149.22
Max IOPS: 16353
Min IOPS: 11462
Average Latency(s): 0.00106545
Stddev Latency(s): 0.000482816
Max latency(s): 0.0431733
Min latency(s): 0.00047586
Cleaning up (deleting benchmark objects)
Removed 449321 objects
Clean up completed and total clean up time :22.541

- Ubuntu package: 14976 IOPS (58 MB/s)
- Ceph package: 5724 IOPS (22 MB/s)

The gap is quite obvious.
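Dividing the two average IOPS figures reported by rados bench above puts a number on the regression:

```shell
# Ratio of the two average IOPS numbers from the runs above
awk 'BEGIN { printf "%.1fx\n", 14976 / 5724 }'   # prints "2.6x"
```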

If you hit this, you can either change debian/rules the way Ubuntu does, or pull in the Ceph PR and rebuild the packages. To verify the fix, check the flags in the packaging output, or simply rerun the benchmark on a single node as above.