Background
While troubleshooting a multipart upload failure internally, the root cause turned out to be a misconfigured parameter. Several scripts were used during the investigation, and they are recorded here.
Troubleshooting
Suspecting a 42-minute request timeout
After repeated uploads, we observed that the file would fail at around the 42-minute mark. The initial suspicion was a backend timeout parameter that kills uploads running longer than 42 minutes.
Slow-upload verification
We used s3cmd to upload a file of the same size, throttling the rate so the transfer took well over 42 minutes:
```shell
time s3cmd put --limit-rate=3m --multipart-chunk-size-mb=6 big12G s3://testput/big12G
```
The test succeeded, so the timeout theory was ruled out.
Suspecting a multi-gateway issue
Since part data is sent through multiple gateways, we suspected that metadata written via gateway A might not be visible when gateway B handled a request. We used a small program to spread the part uploads across multiple gateways.
Multi-gateway upload script:
```python
import boto3
import random
import os
from botocore.client import Config

ACCESS_KEY = 'test1'
SECRET_KEY = 'test1'
GATEWAYS = ['http://192.168.0.101:7480', 'http://192.168.0.102:7480']
BUCKET_NAME = 'testput'
LOCAL_FILE_PATH = '12G'
OBJECT_NAME = os.path.basename(LOCAL_FILE_PATH)
PART_SIZE = 6 * 1024 * 1024  # 6 MiB per part

def create_s3_client(endpoint_url):
    return boto3.client(
        's3',
        aws_access_key_id=ACCESS_KEY,
        aws_secret_access_key=SECRET_KEY,
        endpoint_url=endpoint_url,
        config=Config(signature_version='s3v4'),
        region_name='us-east-1',
    )

def upload_multipart():
    # Initiate the multipart upload on the first gateway
    init_client = create_s3_client(GATEWAYS[0])
    response = init_client.create_multipart_upload(Bucket=BUCKET_NAME, Key=OBJECT_NAME)
    upload_id = response['UploadId']
    print(f"Initiated multipart upload with ID: {upload_id}")

    parts = []
    part_number = 1
    try:
        with open(LOCAL_FILE_PATH, 'rb') as f:
            while True:
                data = f.read(PART_SIZE)
                if not data:
                    break
                # Send each part through a randomly chosen gateway
                gateway = random.choice(GATEWAYS)
                client = create_s3_client(gateway)
                print(f"Uploading part {part_number} via {gateway}...")
                part = client.upload_part(
                    Bucket=BUCKET_NAME,
                    Key=OBJECT_NAME,
                    PartNumber=part_number,
                    UploadId=upload_id,
                    Body=data
                )
                parts.append({'ETag': part['ETag'], 'PartNumber': part_number})
                print(f"Uploaded part {part_number} via {gateway}, ETag: {part['ETag']}")
                part_number += 1

        print("Completing multipart upload...")
        init_client.complete_multipart_upload(
            Bucket=BUCKET_NAME,
            Key=OBJECT_NAME,
            UploadId=upload_id,
            MultipartUpload={'Parts': parts}
        )
        print("Upload complete.")
    except Exception as e:
        print(f"Error occurred: {e}")
        print("Aborting upload...")
        init_client.abort_multipart_upload(Bucket=BUCKET_NAME, Key=OBJECT_NAME, UploadId=upload_id)

if __name__ == "__main__":
    upload_multipart()
```
This test also passed, ruling out the multi-gateway theory.
Checking the logs, the following message appeared:
```
total parts mismatch: have: 167 expected: 1872
```
This is the key clue: 1872 parts were expected, but only 167 were present.
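The expected count is easy to sanity-check: with the 6 MiB part size used above, it is just the file size divided by the part size, rounded up. A minimal sketch (`expected_parts` is a hypothetical helper; the file size here is back-computed from the reported part count rather than taken from the environment):

```python
import math

PART_SIZE = 6 * 1024 * 1024  # 6 MiB, matching --multipart-chunk-size-mb=6

def expected_parts(file_size: int, part_size: int = PART_SIZE) -> int:
    """Number of parts a multipart upload of file_size bytes produces."""
    return math.ceil(file_size / part_size)

# 1872 expected parts corresponds to a file of about 1872 * 6 MiB (~11 GiB)
file_size = 1872 * PART_SIZE
print(expected_parts(file_size))  # 1872
```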
Script to inspect the parts
We listed the parts of the multipart upload:
```python
import boto3

s3 = boto3.client(
    's3',
    endpoint_url='http://ip:7480',
    aws_access_key_id='access',
    aws_secret_access_key='key'
)

bucket = 'mybucket'
key = 'objectname'
upload_id = 'myuploadid'

# Page through all parts of the upload
part_number_marker = 0
is_truncated = True
while is_truncated:
    response = s3.list_parts(
        Bucket=bucket,
        Key=key,
        UploadId=upload_id,
        PartNumberMarker=part_number_marker
    )
    for part in response['Parts']:
        print("Part {} - Size: {} - ETag: {}".format(
            part['PartNumber'], part['Size'], part['ETag']))
    is_truncated = response.get('IsTruncated', False)
    if is_truncated:
        part_number_marker = response['NextPartNumberMarker']
```
This check confirmed the problem: part records had indeed disappeared. The part files themselves still existed in the backend, however, which means the records had been removed rather than the data. We had already checked the bucket's lifecycle configuration earlier and found nothing wrong with it.
Script to check the lifecycle
View the lifecycle configuration:
```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client(
    's3',
    endpoint_url='http://ip:7480',
    aws_access_key_id='access',
    aws_secret_access_key='key'
)

bucket_name = 'bucket'

try:
    response = s3.get_bucket_lifecycle_configuration(Bucket=bucket_name)
    print("Lifecycle Rules:")
    for rule in response['Rules']:
        print(rule)
except ClientError as e:
    if e.response['Error']['Code'] == 'NoSuchLifecycleConfiguration':
        print("Bucket '{}' has no lifecycle configuration.".format(bucket_name))
    else:
        print("Error getting lifecycle configuration: {}".format(e))
```
Output:
```
{'Status': 'Enabled', 'AbortIncompleteMultipartUpload': {'DaysAfterInitiation': 7}, 'Prefix': '', 'ID': 'abort multipart upload object clean'}
```
The rule aborts incomplete multipart uploads after 7 days, which looks fine.
Debugging the lifecycle parameter
There is a parameter that shortens the lifecycle interval for debugging. Its default value is:
```
"rgw_lc_debug_interval": "-1",
```
On this environment it had been set to:
```
"rgw_lc_debug_interval": "300",
```
With this setting, one lifecycle "day" is treated as 300 seconds, so the configured 7 days becomes 7 × 300 = 2100 seconds, roughly 35 minutes. Any incomplete multipart upload older than 35 minutes starts being reclaimed: for an upload running longer than that, the earlier parts are certain to be reclaimed, and even the part just before the current one may be reclaimed, which is exactly where the 404s came from. After adjusting the parameter, a re-test of the upload succeeded.
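The arithmetic above can be written out directly; the numbers come from the lifecycle rule and the debug parameter:

```python
# rgw_lc_debug_interval rescales one lifecycle "day" to N seconds
SECONDS_PER_DAY = 300        # the debug value found on this environment
DAYS_AFTER_INITIATION = 7    # from the lifecycle rule

effective = DAYS_AFTER_INITIATION * SECONDS_PER_DAY
print(effective)       # 2100 seconds
print(effective / 60)  # 35.0 minutes
```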
Adjusting the debug parameter
This parameter can be changed online and takes effect for the next upload:
```shell
ceph daemon /var/run/ceph/ceph-client.rgw.asok config set rgw_lc_debug_interval -1
```
Conclusion
After adjusting the parameter, multipart uploads on the environment returned to normal.