Efficient way in Python to parse a file of strings into lists of floats

Asked: 2018-11-23 17:09:28

Tags: python performance parsing

I have a document with one word followed by thousands of floats per line, and I want to convert it into a dictionary keyed by the word, with a vector of all the floats as the value. This is how I am doing it, but because of the size of the files (each has about 20k lines, each line with about 10k values) the process is taking a while. I have not found a more efficient way to parse it, only alternatives that are not guaranteed to reduce the run time.

with open("googlenews.word2vec.300d.txt") as g_file:
  i = 0;
  #dict of words: [lots of floats]
  google_words = {}

  for line in g_file:
    google_words[line.split()[0]] = [float(line.split()[i]) for i in range(1, len(line.split()))]

3 Answers:

Answer 0 (score: 5)

In your solution, the slow line.split() is called over and over for every line: once for the key, and once for each value in the list comprehension. Consider the following modification:

with open("googlenews.word2vec.300d.txt") as g_file:
    i = 0;
    #dict of words: [lots of floats]
    google_words = {}

    for line in g_file:
        word, *numbers = line.split()
        google_words[word] = [float(number) for number in numbers]

One advanced concept I used here is "unpacking": word, *numbers = line.split()

Python allows unpacking the values of an iterable into multiple variables:

a, b, c = [1, 2, 3]
# This is practically equivalent to
a = 1
b = 2
c = 3

The * is a shortcut for "collect the leftovers into a list and assign that list to the name":

a, *rest = [1, 2, 3, 4]
# results in
a == 1
rest == [2, 3, 4]
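For completeness, a short sketch of how this applies to lines of the kind described in the question (the sample line here is made up for illustration; starred unpacking also works in positions other than the end):

```python
# The starred name can appear anywhere in the target list (Python 3).
first, *middle, last = [1, 2, 3, 4, 5]
# first == 1, middle == [2, 3, 4], last == 5

# Applied to a word2vec-style line (hypothetical sample data):
line = "hello 0.1 0.2 0.3"
word, *numbers = line.split()   # split once, unpack once
vector = [float(n) for n in numbers]
# word == "hello", vector == [0.1, 0.2, 0.3]
```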

Answer 1 (score: 3)

Please don't call line.split() multiple times:

with open("googlenews.word2vec.300d.txt") as g_file:
    i = 0;
    #dict of words: [lots of floats]
    google_words = {}

    for line in g_file:
        temp = line.split()
        google_words[temp[0]] = [float(temp[i]) for i in range(1, len(temp))]

Here is a simple generator for a test line of this kind:

s = "x"
for i in range (10000):
    s += " 1.2345"
print (s)

The previous version takes a noticeable amount of time; the version with only one split call is nearly instant.
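That timing claim is easy to check. A minimal benchmark sketch, using an in-memory synthetic line (built like the generator above, with 2000 values to keep the run short) rather than the real file:

```python
import timeit

# One word followed by many floats, like a line from the file.
line = "x" + " 1.2345" * 2000

def repeated_split():
    # mimics the original: line.split() is re-evaluated for every element
    return [float(line.split()[i]) for i in range(1, len(line.split()))]

def single_split():
    # split once, then index into the cached result
    temp = line.split()
    return [float(temp[i]) for i in range(1, len(temp))]

print("repeated:", timeit.timeit(repeated_split, number=1))
print("single:  ", timeit.timeit(single_split, number=1))
```

The repeated version is quadratic in the number of values per line (each split scans the whole line), while the single-split version is linear, which is why the gap grows so quickly with 10k values.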

Answer 2 (score: 1)

You could also use the csv module, which should be more efficient than what you are doing.

It would go something like this:

import csv

d = {}
with open("huge_file_so_huge.txt", "r") as g_file:
    for row in csv.reader(g_file, delimiter=" "):
        d[row[0]] = list(map(float, row[1:]))
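A quick way to try this approach without the real file is to feed csv.reader an in-memory buffer (the two sample lines below are invented for illustration):

```python
import csv
import io

# Hypothetical two-line sample standing in for the real word2vec file.
sample = io.StringIO("apple 0.1 0.2 0.3\nbanana 0.4 0.5 0.6\n")

d = {}
for row in csv.reader(sample, delimiter=" "):
    # first field is the word, the rest are the vector components
    d[row[0]] = list(map(float, row[1:]))

print(d["apple"])   # [0.1, 0.2, 0.3]
```

Note that csv.reader with delimiter=" " splits on single spaces only, so this assumes the file has exactly one space between values.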