I have more than 500,000 objects on S3, and I am trying to get the size of each one. I used the following Python code for this:
import boto3

bucket = 'bucket'
prefix = 'prefix'
contents = boto3.client('s3').list_objects_v2(Bucket=bucket, MaxKeys=1000, Prefix=prefix)["Contents"]
for c in contents:
    print(c["Size"])
But this only gives me the sizes of the first 1000 objects. According to the documentation, we cannot get more than 1000 in a single call. Is there any way to get all of them?
Answer 0 (score: 40)
The built-in boto3 Paginator class is the easiest way to get around the 1000-record limit of list_objects_v2. It can be used as follows:
import boto3

s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')
pages = paginator.paginate(Bucket='bucket', Prefix='prefix')
for page in pages:
    # 'Contents' is absent from a page that matched no keys
    for obj in page.get('Contents', []):
        print(obj['Size'])
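Since the asker ultimately wants sizes for 500,000+ objects, a variation worth noting: the page iterator returned by paginate() has a search() method that accepts a JMESPath expression, so the sizes can be pulled out across all pages directly. A minimal sketch, reusing the same bucket/prefix placeholders (the None guard is for pages with no matching keys, which JMESPath resolves to null):

import boto3

s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')
pages = paginator.paginate(Bucket='bucket', Prefix='prefix')
# search() flattens 'Contents[].Size' across every page of results
total_bytes = sum(size for size in pages.search('Contents[].Size') if size is not None)
print(total_bytes)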
Answer 1 (score: 10)
If you don't need to use boto3.client, you can use boto3.resource to get the complete list of files:
import boto3

s3r = boto3.resource('s3')
bucket = s3r.Bucket('bucket_name')
files_in_bucket = list(bucket.objects.all())

Then get the sizes from the size attribute of each ObjectSummary:

sizes = [obj.size for obj in files_in_bucket]

Depending on the size of your bucket, this might take a minute.
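With 500,000+ objects, materializing the whole list can use a lot of memory. As an alternative sketch with the same resource API (bucket_name and prefix are placeholders), the collection can be iterated lazily and restricted to a prefix:

import boto3

s3r = boto3.resource('s3')
bucket = s3r.Bucket('bucket_name')
# objects.filter(Prefix=...) pages through the listing lazily, so only one
# page of ObjectSummary objects is held in memory at a time
total = sum(obj.size for obj in bucket.objects.filter(Prefix='prefix'))
print(total)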
Answer 2 (score: 2)
Use the ContinuationToken returned in the response as a parameter for the next call, until the IsTruncated value in the response is false.

This can be factored out into a neat generator function:
def get_all_s3_objects(s3, **base_kwargs):
    continuation_token = None
    while True:
        list_kwargs = dict(MaxKeys=1000, **base_kwargs)
        if continuation_token:
            list_kwargs['ContinuationToken'] = continuation_token
        response = s3.list_objects_v2(**list_kwargs)
        yield from response.get('Contents', [])
        if not response.get('IsTruncated'):  # At the end of the list?
            break
        continuation_token = response.get('NextContinuationToken')
for file in get_all_s3_objects(boto3.client('s3'), Bucket=bucket, Prefix=prefix):
    print(file['Size'])
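For the original goal of sizing 500,000+ objects, the generator composes naturally with aggregation. A small usage sketch, assuming the same bucket and prefix variables as the question:

import boto3

bucket = 'bucket'
prefix = 'prefix'

total_size = 0
object_count = 0
for obj in get_all_s3_objects(boto3.client('s3'), Bucket=bucket, Prefix=prefix):
    total_size += obj['Size']
    object_count += 1
print(f'{object_count} objects, {total_size} bytes in total')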