Question

I have to read a file line by line, where each line lists the indices of the 1s in a vector.

So for example: 1 3 9 10

represents: 0,1,0,1,0,0,0,0,0,1,1

My goal is to write a program that takes each line and prints out the full vector, with the 0s filled in.

My current program works when the file has only a few lines:

#create a sparse vector
list_line_sparse = [0] * int(num_features)

#loop over all the lines
for item in lines:
    #split the line on spaces
    zz = item.split(' ')
    #get all ints on a line
    d = [int(x.strip()) for x in zz]
    #loop over all ints and change index to 1 in sparse vector
    for i in d:
        list_line_sparse[i]=1

    out_file += (', '.join(str(item) for item in list_line_sparse))
    #change back to 0's
    for i in d:
        list_line_sparse[i]=0
    out_file +='\n'


f = open('outfile', 'w')
f.write(out_file)
f.close()

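The problem is that for files with a large number of features and lines, my program is very inefficient: it basically never finishes. Is there anything that stands out that I should change to make it more efficient (e.g. the two for loops)?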

Answer 1

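It would probably be more efficient to write each line of data to the output file as you generate it, rather than building up one huge string in memory.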
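A minimal sketch of that idea, assuming the input is an open text file and that every index is smaller than the number of features. The names expand_line and convert are made up for this illustration and do not come from the question:

```python
def expand_line(line, num_features):
    """Turn a line such as '1 3 9 10' into '0,1,0,1,0,0,0,0,0,1,1'."""
    row = ["0"] * num_features
    for token in line.split():
        row[int(token)] = "1"
    return ",".join(row)


def convert(source, sink, num_features):
    """Write one expanded row to sink for every line read from source."""
    # Iterating over a file object yields one line at a time,
    # so only the current row is held in memory.
    for line in source:
        sink.write(expand_line(line, num_features) + "\n")


# Hypothetical usage (the file names are placeholders):
# with open("input.txt") as source, open("outfile", "w") as sink:
#     convert(source, sink, num_features=11)
```

Because each row is written as soon as it is built, memory use stays roughly constant no matter how many lines the input has.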

numpy is a popular Python module that is well suited to bulk operations on numbers. If you start with:

import numpy as np
list_line_sparse = np.zeros(num_features, dtype=np.uint8)

then, with d being the list of numbers on the current line, you can do:

list_line_sparse[d] = 1

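to set all of those indexes in the array at once, with no loop required. (At least not at the Python level; there is obviously still a loop, but it is pushed down into numpy's C implementation.)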
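Putting those pieces together, here is a rough sketch of a complete loop. It assumes numpy is installed; the generator name expand_lines is invented for this example, and resetting the touched positions to 0 mirrors what the question's code does so that one array can be reused:

```python
import numpy as np


def expand_lines(lines, num_features):
    """Yield one comma-separated 0/1 row per input line."""
    row = np.zeros(num_features, dtype=np.uint8)
    for line in lines:
        d = [int(token) for token in line.split()]
        row[d] = 1  # set every listed index in one operation
        yield ",".join(map(str, row.tolist()))
        row[d] = 0  # reset only the touched positions so the array can be reused
```

Whether this beats a plain list depends on how many indices each line has; converting the array to text still costs time proportional to the number of features.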

Answer 2

It is slowing down because you are doing string concatenation. It would be better to use a list.
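As a rough illustration of the list approach (the function name build_output is an assumption, not part of the question's code), each finished row is appended to a list and the text is assembled once at the end:

```python
def build_output(lines, num_features):
    """Return the whole output text, assembled with a single join."""
    rows = []
    for line in lines:
        row = ["0"] * num_features
        for token in line.split():
            row[int(token)] = "1"
        rows.append(", ".join(row))
    # One join over the collected rows replaces the repeated `out_file +=`.
    return "".join(row + "\n" for row in rows)
```

This still keeps the whole output in memory, so it addresses the concatenation cost but not the memory footprint.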

Also, you can use csv to read the space-delimited lines and then write each row out with the commas added automatically:

import csv

num_features = 20

with open('input.txt', 'r', newline='') as f_input, open('output.txt', 'w', newline='') as f_output:    
    csv_input = csv.reader(f_input, delimiter=' ')
    csv_output = csv.writer(f_output)

    for row in csv_input:
        list_line_sparse = [0] * int(num_features)

        for v in map(int, row):
            list_line_sparse[v] = 1

        csv_output.writerow(list_line_sparse)

So if input.txt contains the following:

1 3 9 10
1 3 9 11
2 7 3 5

you get an output.txt containing:

0,1,0,1,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0
0,1,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0
0,0,1,1,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0

Answer 3

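Too many loops: first item.split(), then for x in zz, then for i in d, then for item in list_line_sparse, and then for i in d again. String concatenation is probably your most expensive part: the .join and the out_file +=. And all of this happens for every line.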

You could try parsing and writing "character by character". Something like this:

#features per line
count = int(num_features)
f = open('outfile.txt', 'w')

#loop over all lines
for item in lines:
    #reset the feature
    i = 0
    #the characters buffer
    index = ""

    #parse character by character
    for character in item:
        #if a space or end of line is found,
        #and the characters buffer (index) is not empty
        if character in (" ", "\r", "\n"):
            if index:
                #parse the characters buffer
                index = int(index)
                #if is not the first feature
                if i > 0:
                    #add the separator
                    f.write(", ")
                #add 0's until index
                while i < index:
                    f.write("0, ")
                    i += 1
                #and write 1
                f.write("1")
                i += 1
                #reset the characters buffer
                index = ""
        #if it is not a space or end of line
        else:
            #add the character to the buffer
            index += character

    #if the last line didn't end with a carriage return,
    #index could be waiting to be parsed
    if index:
        index = int(index)
        if i > 0:
            f.write(", ")
        while i < index:
            f.write("0, ")
            i += 1
        f.write("1")
        i += 1
        index = ""

    #fill with 0's
    while i < count:
        if i == 0:
            f.write("0")
        else:
            f.write(", 0")
        i += 1

    f.write("\n")

f.close()
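Note that filling zeros up to each index in turn only works when the indices on a line are in ascending order. The following sketch keeps the same gap-filling idea but sorts and de-duplicates the indices first; write_row and emit are hypothetical names, and tokens are split with str.split rather than parsed character by character:

```python
def write_row(sink, line, count):
    """Write one expanded row by filling the gaps between indices with 0s."""
    position = 0  # number of cells written so far

    def emit(value):
        nonlocal position
        # Every cell except the first is preceded by the separator.
        sink.write(value if position == 0 else ", " + value)
        position += 1

    # sorted(set(...)) makes the gap filling safe for unordered or repeated indices.
    for index in sorted(set(map(int, line.split()))):
        while position < index:
            emit("0")
        emit("1")

    # Pad the rest of the row with 0s.
    while position < count:
        emit("0")
    sink.write("\n")
```

As in the code above, an index equal to or larger than count is not rejected; it simply produces a longer row.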

Answer 4

Let's rework your code into a simpler package that makes better use of Python's features:

import sys

NUM_FEATURES = 12

with open(sys.argv[1]) as source, open(sys.argv[2], 'w') as sink:
    for line in source:
        list_line_sparse = [0] * NUM_FEATURES

        indices = map(int, line.rstrip().split())

        for index in indices:
            list_line_sparse[index] = 1

        print(*list_line_sparse, file=sink, sep=',')

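I revisited this question with "more efficient" in mind. Although the code above is more memory efficient, it is slower. I reconsidered your original and came up with a solution that is less memory efficient but about 2x faster than your code: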

import sys

NUM_FEATURES = 12

data = ''

with open(sys.argv[1]) as source:
    for line in source:
        list_line_sparse = ["0"] * NUM_FEATURES

        indices = map(int, line.rstrip().split())

        for index in indices:
            list_line_sparse[index] = "1"

        data += ",".join(list_line_sparse) + '\n'

with open(sys.argv[2], 'w') as sink:
    sink.write(data)

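Like the original solution, it stores all the data in memory and writes it out at the end, which is both a disadvantage (memory-wise) and an advantage (time-wise).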
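To check the time trade-off on your own data, something like the following sketch could be used. The function names, the time_it helper, and the sample data are all made up for this illustration, and io.StringIO stands in for the output file, so disk speed is not measured:

```python
import io
import timeit

NUM_FEATURES = 12


def stream_rows(lines, sink):
    """Print each row as soon as it is built."""
    for line in lines:
        row = [0] * NUM_FEATURES
        for index in map(int, line.split()):
            row[index] = 1
        print(*row, file=sink, sep=",")


def buffer_rows(lines, sink):
    """Build the whole text in memory, then write it once."""
    data = ""
    for line in lines:
        row = ["0"] * NUM_FEATURES
        for index in map(int, line.split()):
            row[index] = "1"
        data += ",".join(row) + "\n"
    sink.write(data)


def time_it(func, lines, repeats=5):
    """Return the total seconds taken by `repeats` runs of func."""
    return timeit.timeit(lambda: func(lines, io.StringIO()), number=repeats)


if __name__ == "__main__":
    sample = ["1 3 9 10", "1 3 9 11", "2 7 3 5"] * 10000  # made-up test data
    for func in (stream_rows, buffer_rows):
        print(func.__name__, round(time_it(func, sample), 3), "seconds")
```

The numbers will vary with the Python version, the number of features, and the file size, so they are best treated as a guide rather than a fixed ratio.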

input.txt

1 3 9 10
1 3 9 11
2 7 3 5

USAGE

% python3 test.py input.txt output.txt

output.txt

0,1,0,1,0,0,0,0,0,1,1,0
0,1,0,1,0,0,0,0,0,1,0,1
0,0,1,1,0,1,0,1,0,0,0,0

Is there a more efficient way to do this (converting a compressed file into a sparse one)?
