Question

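I want to do 10-fold cross-validation on large files (hundreds of thousands of lines each). Each time I start reading a file I want to run "wc -l" on it, then generate a fixed number of random numbers and write the line at each of those line numbers to a separate file. I am using this: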

import os 
for i in files:
    os.system("wc -l <insert filename>")

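How do I insert the filename there? It is a variable. I looked through the documentation, but it mostly shows ls commands, which don't have this problem.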

Answer 1

Let's compare:

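from subprocess import check_output

def wc(filename):
    # wc prints "<count> <filename>"; keep only the count.
    return int(check_output(["wc", "-l", filename]).split()[0])

def native(filename):
    # Read the file in 10 MB chunks and count newlines in each chunk.
    c = 0
    with open(filename) as file:
        while True:
            chunk = file.read(10 ** 7)
            if chunk == "":
                return c
            c += chunk.count("\n")

def iterate(filename):
    # Iterate line by line; i stays 0 if the file is empty.
    i = 0
    with open(filename) as file:
        for i, line in enumerate(file, 1):
            pass
    return i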

Go go timeit!

from timeit import timeit
from sys import argv

filename = argv[1]

def testwc():
    wc(filename)

def testnative():
    native(filename)

def testiterate():
    iterate(filename)

print("wc", timeit(testwc, number=10))
print("native", timeit(testnative, number=10))
print("iterate", timeit(testiterate, number=10))

Results:

wc 1.25185894966
native 2.47028398514
iterate 2.40715694427

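So wc is about twice as fast on the file I tested, a 150 MB compressed file containing about 500 000 newlines. However, testing on a file generated with seq 3000000 >bigfile, I get the following numbers: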

wc 0.425990104675
native 0.400163888931
iterate 3.10369205475

Hey, look, Python FTW! However, with longer lines (about 70 characters):

wc 1.60881590843
native 3.24313092232
iterate 4.92839002609

So the conclusion: it depends, but wc seems to be the best bet.

Answer 2

import subprocess
for f in files:
    subprocess.call(['wc', '-l', f])

Also have a look at http://docs.python.org/library/subprocess.html#convenience-functions - for example, if you want to get the output as a string, use subprocess.check_output() instead of subprocess.call().
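As a minimal sketch of the check_output() suggestion, the count can be captured and converted to an integer. The function name wc_line_count is made up for illustration, and the sketch assumes a wc executable is available on the PATH.

```python
import subprocess

def wc_line_count(path):
    """Return the line count that `wc -l` reports for `path` (hypothetical helper)."""
    # Passing a list avoids the shell, so spaces in `path` need no quoting.
    output = subprocess.check_output(["wc", "-l", path])
    # wc prints "<count> <filename>"; the first field is the count.
    return int(output.split()[0])
```

Unlike subprocess.call(), which only returns the exit status, this gives you the number itself to work with.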

Answer 3

There is no need to use wc -l. Use the following Python function:

def file_len(fname):
    i = 0  # so that an empty file returns 0 instead of raising
    with open(fname) as f:
        for i, l in enumerate(f, 1):
            pass
    return i

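This is probably more efficient than calling an external utility (which loops over the input in a similar way).

Update

Dead wrong, wc -l is a lot faster!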

seq 10000000 > huge_file

$ time wc -l huge_file 
10000000 huge_file

real    0m0.267s
user    0m0.110s
sys     0m0.010s

$ time ./p.py 
10000000

real    0m1.583s
user    0m1.040s
sys     0m0.060s

Answer 4

os.system takes a string. Just build the string explicitly:

import os 
for i in files:
    os.system("wc -l " + i)
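Plain concatenation breaks on filenames that contain spaces or shell metacharacters. A possible variation, assuming Python 3 (where shlex.quote is available), quotes each name before building the string. The helper names wc_command and run_wc are invented for this sketch.

```python
import os
import shlex

def wc_command(path):
    """Build the shell command string for one file (hypothetical helper)."""
    # shlex.quote wraps the name in quotes only when the shell would need it.
    return "wc -l " + shlex.quote(path)

def run_wc(paths):
    """Run wc -l on each path through the shell, as os.system does."""
    for path in paths:
        # os.system returns the exit status; wc's output goes to stdout.
        os.system(wc_command(path))
```

For example, wc_command("my file.txt") produces the string wc -l 'my file.txt'.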

Answer 5

I found a Pythonic way to solve this problem:

cabal sandbox

Answer 6

My solution is very similar to lazyr's "native" function:

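import functools

def file_len2(fname):
    with open(fname, 'rb') as f:
        lines = 0
        last_wasnt_nl = False  # stays False for an empty file
        reader = functools.partial(f.read, 131072)
        # Binary reads return b'' at EOF, so the sentinel must be bytes.
        for datum in iter(reader, b''):
            lines += datum.count(b'\n')
            last_wasnt_nl = not datum.endswith(b'\n')
        return lines + last_wasnt_nl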

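Unlike wc, this counts a final line that does not end in '\n' as a separate line. If you want the same behaviour as wc, it can be written (quite unnaturally) as: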

import functools as ft, operator as op

def file_len3(fname):
    with open(fname, 'rb') as f:
        reader = ft.partial(f.read, 131072)
        counter = op.methodcaller('count', b'\n')
        # The built-in map replaces itertools.imap, which Python 3 removed.
        return sum(map(counter, iter(reader, b'')))

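Timing is comparable to wc on all the test files I generated.

Note: this works for files from Windows and POSIX machines. Old MacOS used '\r' as the line-ending character.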
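To make the two counting conventions in this answer concrete, here is a small sketch that works on in-memory bytes rather than a file. The names count_newlines and count_lines are made up for illustration.

```python
def count_newlines(data):
    """wc -l convention: count newline characters only."""
    return data.count(b"\n")

def count_lines(data):
    """Also count a final line that lacks a trailing newline."""
    n = data.count(b"\n")
    if data and not data.endswith(b"\n"):
        n += 1
    return n

sample = b"first\nsecond\nthird"  # no newline after the last line
print(count_newlines(sample), count_lines(sample))  # 2 3
```

The two functions agree whenever the data ends in a newline, and differ by one otherwise.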

Answer 7

I found a much simpler way:

import os
linux_shell='more /etc/hosts|wc -l'
linux_shell_result=os.popen(linux_shell).read()
print(linux_shell_result)

Running "wc -l <filename>" within Python code
