Dataset string replacement not sped up by threading

Asked: 2018-07-13 10:07:02

Tags: python multithreading replace dataset threadpool

I recently got into natural language processing for a university project. Given a list of words, I want to remove all of them from a dataset of strings. My dataset looks like this, only much larger:

data_set = ['Human machine interface for lab abc computer applications',
         'A survey of user opinion of computer system response time',
         'The EPS user interface management system',
         'System and human system engineering testing of EPS',
         'Relation of user perceived response time to error measurement',
         'The generation of random binary unordered trees',
         'The intersection graph of paths in trees',
         'Graph minors IV Widths of trees and well quasi ordering',
         'Graph minors A survey']

The list of words to remove looks like this, but is much longer:

to_remove = ['abc', 'of', 'quasi', 'well']

Since I could not find a Python function that removes words directly from a string, I used replace(). The program takes data_set and, for each word in to_remove, calls replace() on each of the strings in data_set. I hoped threads would speed this up, but unfortunately the threaded program takes almost the same time as the one without threads. Am I implementing the threading correctly, or am I missing something?

The code with threads:

from multiprocessing.dummy import Pool as ThreadPool

def remove_words(params):
    # params is a (string, word_list) tuple
    changed_data_set = params[0]
    for elem in params[1]:
        # prepend a space so whole words are matched, not substrings
        changed_data_set = changed_data_set.replace(' ' + elem, ' ')
    return changed_data_set

def parallel_task(params, threads=2):
    # distribute the (string, word_list) tuples across a thread pool
    pool = ThreadPool(threads)
    results = pool.map(remove_words, params)
    pool.close()
    pool.join()
    return results

parameters = []
for rows in data_set:
    parameters.append((rows, to_remove))
new_data_set = parallel_task(parameters, 8)

The code without threads:

def remove_words(data_set, to_replace):
    # use a distinct index name (not the built-in len) and the loop word
    for i in range(len(data_set)):
        for word in to_replace:
            data_set[i] = data_set[i].replace(' ' + word, ' ')
    return data_set

changed_data_set = remove_words(data_set, to_remove)

0 Answers