Question

I'm doing NLP in Python 3 and trying to speed up my code. It converts a list of words into a list (or array) of numbers using a given dictionary.

For example:

mydict = {'hello': 0, 'world': 1, 'this': 2, 'is': 3, 'an': 4, 'example': 5}
word_list = ['hello', 'world']

def f(mydict, word_list):
    return [mydict[w] for w in word_list]

# f(mydict, word_list) == [0, 1]

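I'd like to speed up the function f, especially when word_list is about 100 words long. Is that possible? External libraries such as nltk, spacy, or numpy are fine.

Currently it takes about 6 us on my laptop.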

>>> %timeit f(mydict, word_list*50)
6.74 us +- 2.77 us per loop (mean +- std. dev. of 7 runs, 100000 loops each)

Answer 1

Several libraries can convert a list of strings/tokens into a vectorized representation.

For example, with gensim:

>>> import gensim
>>> from gensim.corpora import Dictionary
>>> documents = [['hello', 'world'], ['NLP', 'is', 'awesome']]
>>> dct = Dictionary(documents)  # named dct to avoid shadowing the built-in dict

# This is not necessary, but if you need to debug
# the word and attached indices, you can do:

>>> {idx: dct[idx] for idx in dct}
{0: 'hello', 1: 'world', 2: 'NLP', 3: 'awesome', 4: 'is'}

# To get the indices of the words per document, e.g.
>>> dct.doc2idx('hello world'.split())
[0, 1]
>>> dct.doc2idx('hello world is awesome'.split())
[0, 1, 4, 3]
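
If adding gensim is not an option, the plain dictionary lookup from the question can also be written in a couple of other standard-library ways. The sketch below is only an illustration: the function names f_comprehension, f_map and f_itemgetter are made up here, and whether any variant is actually faster depends on your Python version and machine, so time them yourself before switching.

```python
import timeit
from operator import itemgetter

mydict = {'hello': 0, 'world': 1, 'this': 2, 'is': 3, 'an': 4, 'example': 5}
word_list = ['hello', 'world'] * 50  # 100 tokens, as in the question


def f_comprehension(mapping, words):
    # Baseline: the list comprehension from the question.
    return [mapping[w] for w in words]


def f_map(mapping, words):
    # Passes the bound lookup method to map, so the loop runs in C.
    return list(map(mapping.__getitem__, words))


def f_itemgetter(mapping, words):
    # itemgetter returns a tuple only when given two or more keys,
    # so this variant assumes len(words) >= 2.
    return list(itemgetter(*words)(mapping))


# All three variants must agree before their timings are worth comparing.
expected = f_comprehension(mydict, word_list)
assert f_map(mydict, word_list) == expected
assert f_itemgetter(mydict, word_list) == expected

for fn in (f_comprehension, f_map, f_itemgetter):
    seconds = timeit.timeit(lambda: fn(mydict, word_list), number=2000)
    print(f"{fn.__name__}: {seconds / 2000 * 1e6:.2f} us per call")
```

All three raise KeyError on a word that is missing from the dictionary, same as the original f.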
