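TiDB Database Recovery Operations

Background

TiDB is a multi-replica clustered database. Like Ceph, a three-node cluster that loses two nodes can no longer hold an election, and because the leaders of the internal data are spread across different nodes, the cluster becomes inaccessible. This post walks through a recovery for that situation.

Two of the three nodes are lost and the cluster has to be recovered
All three nodes are lost, but the data of one node was backed up

These two scenarios are handled in basically the same way.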

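Backing up the cluster data

/root/.tiup/storage/cluster/clusters/
/data/tidb-deploy/
/data/tidb-data/

The first directory holds the cluster topology information.
The second holds the database deployment and startup files.
The third is the data directory.
Backing up all three is recommended.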
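The backup of the three directories can be sketched as a small shell helper. This is only an illustration: the function name `backup_dirs` and the archive naming scheme are assumptions made here, not part of TiUP.

```shell
#!/usr/bin/env bash
# Sketch: archive each given directory into a destination folder.
# Usage: backup_dirs DEST DIR [DIR...]
# backup_dirs is a hypothetical helper, not a TiUP command.
backup_dirs() {
    local dest="$1"
    shift
    mkdir -p "$dest" || return 1
    local dir name
    for dir in "$@"; do
        if [ ! -d "$dir" ]; then
            echo "skip (not a directory): $dir" >&2
            continue
        fi
        # Turn /data/tidb-data/ into data_tidb-data for the archive name.
        name=$(printf '%s' "$dir" | sed -e 's#^/*##' -e 's#/*$##' -e 's#/#_#g')
        tar -czf "$dest/$name.tar.gz" -C "$dir" . || return 1
        echo "saved $dest/$name.tar.gz"
    done
}
```

A call might look like `backup_dirs /backup/tidb /root/.tiup/storage/cluster/clusters/ /data/tidb-deploy/ /data/tidb-data/`, where `/backup/tidb` is a placeholder for an absolute destination path. A copy taken while the components are still writing may not be consistent.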

Simulating PD failure

Simulate two of the three PD nodes failing

`tiup cluster display tidb-test`

Look up the cluster ID

`[root@lab101 data]# cat tidb-deploy/pd-2379/log/pd.log | grep "init cluster id"`
`[2025/04/09 10:16:00.265 +08:00] [INFO] [server.go:358] ["init cluster id"] [cluster-id=7491131464661495325]`

Look up the ID allocated by the PD leader

`[root@lab102 data]# cat tidb-deploy/pd-2379/log/pd.log | grep "idAllocator"`
`[2025/04/09 10:21:21.326 +08:00] [INFO] [id.go:122] ["idAllocator allocates a new id"] [alloc-id=2000]`

Each time the PD leader changes, this ID grows by 1000. The alloc-id set later only needs to be larger than the largest ID seen so far.
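The two log lookups above can be wrapped into helpers that also compute an alloc-id above the largest one found. This is a minimal sketch that assumes the log lines have exactly the format shown above; the function names and the margin value are choices made here for illustration.

```shell
#!/usr/bin/env bash
# Print the cluster id recorded in a PD log file (last occurrence wins).
pd_cluster_id() {
    grep 'init cluster id' "$1" \
        | sed -n 's/.*\[cluster-id=\([0-9]*\)\].*/\1/p' | tail -n 1
}

# Print the largest alloc-id seen in a PD log file.
pd_max_alloc_id() {
    grep 'idAllocator allocates a new id' "$1" \
        | sed -n 's/.*\[alloc-id=\([0-9]*\)\].*/\1/p' | sort -n | tail -n 1
}

# Print an alloc-id above the largest one found across several logs.
# $1: margin to add (an arbitrary safety gap), remaining args: log files.
pd_safe_alloc_id() {
    local margin="$1" max=0 id f
    shift
    for f in "$@"; do
        id=$(pd_max_alloc_id "$f")
        if [ -n "$id" ] && [ "$id" -gt "$max" ]; then
            max="$id"
        fi
    done
    echo $((max + margin))
}
```

The logs of every PD node are worth checking, since the most recent leader may not be the node that survives. With a largest ID of 2000 and a margin of 4000, `pd_safe_alloc_id` prints 6000.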

Stop all PD nodes

`tiup cluster stop tidb-test -N 192.168.0.101:2379,192.168.0.102:2379,192.168.0.103:2379`

The largest ID found so far is 2000.

Assume two nodes are broken; their data is moved away here.

First scale in the failed nodes

`tiup cluster scale-in tidb-test -N 192.168.0.102:2379,192.168.0.103:2379 --force`

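Method 1: build a brand-new PD

Delete the stale data directory on the surviving PD node

`rm -rf /data/tidb-data/pd-2379`

Edit the startup script, changing the three-node configuration to a single node

`/data/tidb-deploy/pd-2379/scripts/run_pd.sh`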
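The script edit can be sketched as follows. It assumes that `run_pd.sh` passes the member list as `--initial-cluster="name=peer-url,name=peer-url,..."` on a single line, which should be checked against the actual script first; the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Keep only the initial-cluster entries whose URL points at the given host.
# $1: comma-separated "name=peer-url" list, $2: host to keep.
keep_member() {
    printf '%s\n' "$1" | tr ',' '\n' | grep -F "//$2:" | paste -s -d, -
}

# Rewrite the --initial-cluster value of a script in place.
# A copy of the original is kept next to it with a .bak suffix.
shrink_initial_cluster() {
    local script="$1" host="$2" old new
    old=$(sed -n 's/.*--initial-cluster="\([^"]*\)".*/\1/p' "$script" | head -n 1)
    if [ -z "$old" ]; then
        echo "no --initial-cluster found in $script" >&2
        return 1
    fi
    new=$(keep_member "$old" "$host")
    if [ -z "$new" ]; then
        echo "host $host not found in: $old" >&2
        return 1
    fi
    sed -i.bak "s#--initial-cluster=\"[^\"]*\"#--initial-cluster=\"$new\"#" "$script"
}
```

For the surviving node in this post the call would be `shrink_initial_cluster /data/tidb-deploy/pd-2379/scripts/run_pd.sh 192.168.0.101`.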

Then start it

`tiup cluster start tidb-test -N 192.168.0.101:2379`

Then rewrite the ID information

`tiup pd-recover -endpoints http://192.168.0.101:2379 -cluster-id 7491131464661495325 -alloc-id 6000`

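Method 2: reuse the old data

There are two methods. Method 1 above needs the ID information from the logs; this one does not.
Keep the old data directory.
Add the following to the startup script /data/tidb-deploy/pd-2379/scripts/run_pd.sh:

`--force-new-cluster`

Then start PD.
Then run the recovery

`[root@lab101 data]# tiup pd-recover -from-old-member -endpoints http://192.168.0.101:2379`

The tool prompts for a restart

`tiup cluster restart tidb-test -N 192.168.0.101:2379`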

TiKV recovery

First stop all TiKV nodes

`tiup cluster stop tidb-test -N 192.168.0.101:20160,192.168.0.102:20160,192.168.0.103:20160`

Force scale-in

`tiup cluster scale-in tidb-test -N 192.168.0.102:20160,192.168.0.103:20160 --force`

Start the single remaining node

`tiup cluster start tidb-test -N 192.168.0.101:20160`

Pause PD scheduling

tiup ctl:v5.4.0 pd -u "http://192.168.0.101:2379" -i

» config set region-schedule-limit 0
Success!
» config set replica-schedule-limit 0
Success!
» config set merge-schedule-limit 0
Success!
» config set hot-region-schedule-limit 0
Success!
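The same four settings can also be issued without the interactive prompt, since pd-ctl accepts a single command on the command line (as the `store` query in this post does). The sketch below only prints the commands so they can be reviewed first; the function name is made up for illustration.

```shell
#!/usr/bin/env bash
# Print (do not run) one pd-ctl command per scheduling limit.
# $1: PD endpoint URL.
pause_schedule_cmds() {
    local pd_url="$1" key
    for key in region-schedule-limit replica-schedule-limit \
               merge-schedule-limit hot-region-schedule-limit; do
        echo "tiup ctl:v5.4.0 pd -u $pd_url config set $key 0"
    done
}
```

Running `pause_schedule_cmds http://192.168.0.101:2379 | sh` would execute them. Recording the current values with `config show` beforehand makes it possible to restore them once the recovery is done.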

Query the store IDs

tiup ctl:v5.4.0 pd -u "http://192.168.0.101:2379" store

Store ID 7 is on 192.168.0.101
Store ID 1 is on 192.168.0.103
Store ID 2 is on 192.168.0.102

Get the region IDs

`tiup ctl:v5.4.0 pd -u "http://192.168.0.101:2379" region | jq | grep start_key -B 1 | grep -v start_key | grep id | cut -d ":" -f 2 | cut -d , -f 1 > region.id`

Stop the TiKV node

`tiup cluster stop tidb-test -N 192.168.0.101:20160`

TiKV repair tool
`cp /root/.tiup/components/ctl/v5.4.0/tikv-ctl /sbin/`

Process the regions, removing the failed stores 1 and 2 from each one

`for id in $(cat region.id); do echo $id; tikv-ctl --data-dir /data/tidb-data/tikv-20160/ unsafe-recover remove-fail-stores -s 1,2 -r $id; done`
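Because `unsafe-recover` modifies the store, it can help to generate the per-region commands first and inspect them before anything runs. The following is a sketch of such a dry run using the same tikv-ctl arguments as above; the function name is hypothetical, and lines of the ID file that are not plain numbers are skipped.

```shell
#!/usr/bin/env bash
# Print one tikv-ctl command per region id read from a file.
# $1: file with one region id per line
# $2: comma-separated ids of the failed stores (for example 1,2)
# $3: data directory of the surviving TiKV node
remove_fail_stores_cmds() {
    local id_file="$1" stores="$2" data_dir="$3" id
    # "|| [ -n "$id" ]" also handles a last line without a trailing newline.
    while read -r id || [ -n "$id" ]; do
        case "$id" in
            ''|*[!0-9]*) continue ;;   # skip blank or non-numeric lines
        esac
        echo "tikv-ctl --data-dir $data_dir unsafe-recover remove-fail-stores -s $stores -r $id"
    done < "$id_file"
}
```

For example, `remove_fail_stores_cmds region.id 1,2 /data/tidb-data/tikv-20160/ > recover.sh` writes the commands to a file (`recover.sh` is a placeholder name) that can be reviewed and then run with `sh recover.sh`.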

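Once every region has been processed, start TiKV

`tiup cluster start tidb-test -N 192.168.0.101:20160`

Check the replica count

`[root@lab101 tidb-data]# pd-ctl -u http://192.168.0.101:2379 config show | grep max-repl`
`"max-replicas": 3,`

Once this is set, the cluster is accessible again and no longer reports region errors. The TiKV repair is complete.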

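TiDB repair

Stop all TiDB nodes

`tiup cluster stop tidb-test -N 192.168.0.101:4000,192.168.0.102:4000,192.168.0.103:4000`

Force scale-in

`tiup cluster scale-in tidb-test -N 192.168.0.102:4000,192.168.0.103:4000 --force`

Start

`tiup cluster start tidb-test -N 192.168.0.101:4000`

The TiDB node only needs to be started on its own; there is nothing to repair.

Summary

TiDB's overall architecture is somewhat similar to Ceph's. Recover the components in order as above; as long as the data is still there, the cluster can be recovered.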

https://zphj1987.com/2025/04/09/tidb数据库的恢复操作/

Author

zphj1987

Published

April 9, 2025
