Question

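I am trying to write a script that walks through my directory and its subdirectories and lists the number of files in each size range, for example 0 KB-1 KB: 3, 1 KB-4 KB: 4, 4 KB-16 KB: 4, 16 KB-64 KB: 11, continuing in multiples of 4. I am able to get the file counts, show sizes in human-readable format, and find the number of files in each size group. But I feel my code is very messy and nowhere near standard. I need help refurbishing the code.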

import os
suffixes = ['B', 'KB', 'MB', 'GB', 'TB', 'PB']
route = raw_input('Enter a location')


def human_Readable(nbytes):
        if nbytes == 0: return '0 B'
        i = 0
        while nbytes >= 1024 and i < len(suffixes)-1:
                nbytes /= 1024.
                i += 1
        f = ('%.2f' % nbytes).rstrip('0').rstrip('.')
        return '%s %s' % (f, suffixes[i])


def file_Dist(path, start,end):
        counter = 0
        counter2 = 0
        for path, subdir, files in os.walk(path):
                for r in files:
                        if os.path.getsize(os.path.join(path,r)) > start and os.path.getsize(os.path.join(path,r)) < end:
                                counter += 1
        #print "Number of files less than %s:" %(human_Readable(end)),  counter
        print "Number of files greater than %s less than %s:" %(human_Readable(start), human_Readable(end)),  counter
file_Dist(route, 0, 1024)
file_Dist(route,1024,4095)
file_Dist(route, 4096, 16383)
file_Dist(route, 16384, 65535)
file_Dist(route, 65536, 262143)
file_Dist(route, 262144, 1048576)
file_Dist(route, 1048577, 4194304)
file_Dist(route, 4194305, 16777216)

Answer 1

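Here are some suggestions for improvement.

It is usually more useful to supply information as a command-line argument than to prompt for it.
Counting all files in a single traversal of the directory tree is more efficient than traversing the tree again for each size group.
Since the size limits form a regular sequence, they can be computed and need not be written out individually.
Your program does not count files whose size equals a group limit; although it states this correctly by saying "greater than" and "less than", I would find it more useful not to omit these files.
os.path.getsize() fails on broken symbolic links; I use os.lstat().st_size, which yields the correct in-tree size of a link file.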
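
To illustrate the last point, here is a minimal sketch of the difference between the two calls on a dangling symlink. It assumes Python 3 and a POSIX-like platform where unprivileged symlink creation works; the names `safe_size`, `dangling` and `missing-target` are made up for the example.

```python
import os
import tempfile


def safe_size(path):
    """Size of the directory entry itself; symlinks are not followed."""
    return os.lstat(path).st_size


with tempfile.TemporaryDirectory() as d:
    link = os.path.join(d, "dangling")
    # Create a symlink whose target does not exist (assumed scenario).
    os.symlink(os.path.join(d, "missing-target"), link)

    try:
        os.path.getsize(link)       # follows the link, so the lookup fails
        getsize_failed = False
    except OSError:
        getsize_failed = True

    link_size = safe_size(link)     # size of the link entry itself

print("getsize failed:", getsize_failed)
print("lstat size:", link_size)
```

On typical Linux file systems the lstat size of a symlink is the length of the target path it stores, which is why it can be counted even when the target is gone.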

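Here is a version of the program that implements the suggestions above. Note that it still ignores files of 16 MiB or larger - this could also be improved.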

#!/usr/bin/env python
import math
import os
import sys
route = sys.argv[1]

suffixes = ['B', 'KB', 'MB', 'GB', 'TB', 'PB']
def human_Readable(nbytes):
        if nbytes == 0: return '0 B'
        i = 0
        while nbytes >= 1024 and i < len(suffixes)-1:
                nbytes /= 1024.
                i += 1
        f = ('%.2f' % nbytes).rstrip('0').rstrip('.')
        return '%s %s' % (f, suffixes[i])

counter = [0]*8             # count files with size up to 4**(8-1) KB
for path, subdir, files in os.walk(route):
    for r in files:
        size = os.lstat(os.path.join(path, r)).st_size
        # Relies on Python 2 integer division: size/1024 and .../2 both truncate.
        group = (math.frexp(size/1024)[1]+1)/2
        if group < len(counter):
            counter[group] += 1
start = 0
for g in range(len(counter)):
    end = 1024*4**g
    print "Number of files at least %s less than %s:" \
          %(human_Readable(start), human_Readable(end)), counter[g]
    start = end
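
One possible way to address the remaining gap (files of 16 MiB or larger) is to clamp the index so that the last counter acts as an overflow group. The following is only a sketch under that assumption, written for Python 3 (hence `//` for integer division); the function names `size_group`, `count_by_size` and `report` and the default of 9 groups are my own choices for illustration, not part of the program above.

```python
import math
import os


def size_group(size, ngroups):
    """Return the counter index for a file of `size` bytes.

    Group 0 holds sizes below 1 KB, group g >= 1 holds sizes in
    [4**(g-1) KB, 4**g KB); anything larger is clamped into the last group.
    """
    group = (math.frexp(size // 1024)[1] + 1) // 2
    return min(group, ngroups - 1)


def count_by_size(root, ngroups=9):
    """Walk `root` once and count files per size group."""
    counter = [0] * ngroups
    for path, _subdirs, files in os.walk(root):
        for name in files:
            size = os.lstat(os.path.join(path, name)).st_size
            counter[size_group(size, ngroups)] += 1
    return counter


def report(counter):
    """Print one line per group; the last group has no upper limit."""
    start = 0
    for g, n in enumerate(counter):
        end = 1024 * 4**g
        if g == len(counter) - 1:
            print("at least %d bytes: %d" % (start, n))
        else:
            print("at least %d bytes, less than %d bytes: %d" % (start, end, n))
        start = end
```

Calling `report(count_by_size(sys.argv[1]))` (after `import sys`) would then produce eight bounded groups as before plus one open-ended group starting at 16 MiB.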

I think the line group = (math.frexp(size/1024)[1]+1)/2, which yields the index of the counter list element corresponding to size, needs some explanation. Consider

>>> sizes = [0]+[1024*4**i for i in range(8)]
>>> sizes
[0, 1024, 4096, 16384, 65536, 262144, 1048576, 4194304, 16777216]
>>> [math.frexp(s/1024) for s in sizes]
[(0.0, 0), (0.5, 1), (0.5, 3), (0.5, 5), (0.5, 7), (0.5, 9), (0.5, 11), (0.5, 13), (0.5, 15)]
>>> [math.frexp(2*s/1024) for s in sizes]
[(0.0, 0), (0.5, 2), (0.5, 4), (0.5, 6), (0.5, 8), (0.5, 10), (0.5, 12), (0.5, 14), (0.5, 16)]
>>> [math.frexp(3*s/1024) for s in sizes]
[(0.0, 0), (0.75, 2), (0.75, 4), (0.75, 6), (0.75, 8), (0.75, 10), (0.75, 12), (0.75, 14), (0.75, 16)]

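We see that by taking the base-2 exponent of the floating-point representation of the size in KB and adjusting it a little (+1 because the mantissa lies in [0.5, 1[ rather than [1, 2[, and /2 to convert from base 2 to base 4), we can compute the correct counter list index.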
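
As a sanity check of this reasoning, the index formula can be compared with a plain loop over the group limits. This is a small sketch for Python 3 (so `//` replaces the Python 2 `/`); the helper names `group_frexp` and `group_by_comparison` are hypothetical.

```python
import math


def group_frexp(size):
    """Index computed from the base-2 exponent of the size in KB."""
    return (math.frexp(size // 1024)[1] + 1) // 2


def group_by_comparison(size):
    """Index found by stepping through the limits 1 KB, 4 KB, 16 KB, ..."""
    g = 0
    while size >= 1024 * 4**g:
        g += 1
    return g


# Probe each limit and its immediate neighbours, where off-by-one errors would show.
limits = [1024 * 4**i for i in range(8)]
probes = [0, 1] + [s + d for s in limits for d in (-1, 0, 1)]
mismatches = [s for s in probes if group_frexp(s) != group_by_comparison(s)]
print("mismatches:", mismatches)
```

An empty list of mismatches means both approaches agree on every probed size, including sizes exactly equal to a group limit.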
