Question

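I have a Pandas dataframe with two columns named Potential Word and Fixed Word. The Potential Word column contains words from different languages, both misspelled and correctly spelled, and the Fixed Word column contains the correct word for each Potential Word.

Below I have shared some sample data.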

Potential Word	Fixed Word
Exemple	Example
pipol	People
pimple	Pimple
Iunik	unique

My vocab dataframe contains 600K unique rows.

My solution:

key = given_word
glob_match_value = 0
potential_fixed_word = ''
match_threshold = 0.65
for each in df['Potential Word']:
    # match is a function that returns a similarity value of two strings
    match_value = match(each, key)
    if match_value > glob_match_value and match_value > match_threshold:
        glob_match_value = match_value
        potential_fixed_word = each

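Problem

The problem with my code is that it takes a lot of time to fix each word, because the loop runs through the whole vocabulary. When a word is missing from the vocabulary, fixing a sentence of 10 ~ 12 words takes almost 5 or 6 seconds. The match function itself performs decently, so the objective is to optimize the search over the vocabulary.

I need an optimized solution. Please help me here.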

Answer 1

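From an Information Retrieval (IR) perspective, you need to reduce the search space. Matching given_word (as key) against all Potential Words is definitely inefficient. Instead, you need to match against a reasonable number of candidates.

To find such candidates, you need to index the Potential Words and Fixed Words.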

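import os

from whoosh.analysis import StandardAnalyzer
from whoosh.fields import Schema, TEXT
from whoosh.index import create_in

# create_in expects the index directory to exist already
os.makedirs("indexdir", exist_ok=True)

ix = create_in("indexdir", Schema(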
    potential=TEXT(analyzer=StandardAnalyzer(stoplist=None), stored=True),
    fixed=TEXT(analyzer=StandardAnalyzer(stoplist=None), stored=True)
))
writer = ix.writer()
writer.add_document(potential='E x e m p l e', fixed='Example')
writer.add_document(potential='p i p o l', fixed='People')
writer.add_document(potential='p i m p l e', fixed='Pimple')
writer.add_document(potential='l u n i k', fixed='unique')
writer.commit()

With this index, you can search for some candidates.

from whoosh.qparser import SimpleParser

with ix.searcher() as searcher:
    results = searcher.search(SimpleParser('potential', ix.schema).parse('p i p o l'))
    for result in results[:2]:
        print(result)

The output is

<Hit {'fixed': 'People', 'potential': 'p i p o l'}>
<Hit {'fixed': 'Pimple', 'potential': 'p i m p l e'}>

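Now you can match given_word against only a few candidates instead of all 600K.

It is not perfect, but this is an unavoidable trade-off and it is how IR essentially works. Try it with different numbers of candidates.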
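If you would rather not add a dependency, the same idea can be sketched with the standard library only. This is a minimal illustration, not the Whoosh approach itself: it builds an inverted index of character bigrams, picks the vocabulary words that share the most bigrams with given_word, and scores only those. The names ngrams, build_index, best_match and top_k are hypothetical, and difflib.SequenceMatcher is only a stand-in for your own match function.

```python
from collections import Counter, defaultdict
from difflib import SequenceMatcher


def ngrams(word, n=2):
    """Return the set of character n-grams of a lowercased, padded word."""
    padded = f"^{word.lower()}$"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def build_index(words):
    """Map every n-gram to the set of vocabulary words that contain it."""
    index = defaultdict(set)
    for word in words:
        for gram in ngrams(word):
            index[gram].add(word)
    return index


def best_match(given_word, index, fixed_of, top_k=50, match_threshold=0.65):
    """Score only the top_k words sharing the most n-grams with given_word."""
    shared = Counter()
    for gram in ngrams(given_word):
        for word in index.get(gram, ()):
            shared[word] += 1
    best_word, best_score = None, match_threshold
    for candidate, _ in shared.most_common(top_k):
        # Stand-in for the question's match(); swap in your own function.
        score = SequenceMatcher(None, given_word.lower(), candidate.lower()).ratio()
        if score > best_score:
            best_word, best_score = candidate, score
    return fixed_of.get(best_word)  # None when nothing clears the threshold


# Sample rows from the question: Potential Word -> Fixed Word
vocab = {"Exemple": "Example", "pipol": "People", "pimple": "Pimple", "Iunik": "unique"}
index = build_index(vocab)
print(best_match("pipoll", index, vocab))  # People
```

The index is built once; each lookup then touches only the words that share at least one bigram with the input, and top_k controls how many of them reach the expensive match call.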

Answer 2

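I would not change much in your implementation, since I believe it is to some extent necessary to iterate over the list of potential words for each word.

My intention here is not to optimize the match function itself but to use multiple threads to search in parallel.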

import concurrent.futures
import time
from concurrent.futures.thread import ThreadPoolExecutor
from typing import Any, Union, Iterator

import pandas as pd

# Replace your dataframe here for testing this

df = pd.DataFrame({'Potential Word': ["a", "b", "c"], "Fixed Word": ["a", "c", "b"]})

# Replace by your match function

def match(w1, w2):
    # Simulate some work is happening here
    time.sleep(1)
    return 1

# This is mostly your function itself
# Using index to recreate the sentence from the returned values
def matcher(idx, given_word):
    key = given_word
    glob_match_value = 0
    # Default when nothing clears the threshold, you might want to change this
    potential_fixed_word = ''
    match_threshold = 0.65
    for each in df['Potential Word']:
        # match is a function that returns a similarity value of two strings
        match_value = match(each, key)
        if match_value > glob_match_value and match_value > match_threshold:
            glob_match_value = match_value
            potential_fixed_word = each
    # Return only after the whole vocabulary has been scanned
    return idx, potential_fixed_word


sentence = "match is a function that returns a similarity value of two strings match is a function that returns a " \
           "similarity value of two strings"

start = time.time()

# Using a threadpool executor 
# You can increase or decrease the max_workers based on your machine
executor: Union[ThreadPoolExecutor, Any]
with concurrent.futures.ThreadPoolExecutor(max_workers=24) as executor:
    futures: Iterator[Union[str, Any]] = executor.map(matcher, list(range(len(sentence.split()))), sentence.split())

# Joining back the input sentence
out_sentence = " ".join(x[1] for x in sorted(futures, key=lambda x: x[0]))
print(out_sentence)
print(time.time() - start)

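Note that the runtime of this depends on

The maximum time taken by a single match call
The number of words in the sentence
The number of worker threads (tip: try to see if you can use as many as there are words in the sentence)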
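As a rough way to reason about these three factors, the sketch below estimates the wall-clock time of the thread-pool approach. The function estimated_wall_time and its parameters are hypothetical, and the estimate assumes that every match call costs the same and that the threads really run concurrently. That holds for time.sleep or I/O, but pure-Python CPU-bound code is serialized by the GIL on standard CPython builds, so treat the result as a best case.

```python
import math


def estimated_wall_time(n_words, n_workers, vocab_size, match_us):
    """Best-case wall-clock estimate, in seconds, for the thread-pool approach.

    match_us is the assumed cost of one match() call in microseconds.
    """
    rounds = math.ceil(n_words / n_workers)  # batches of words handled in turn
    per_word_us = vocab_size * match_us      # one full scan of the vocabulary
    return rounds * per_word_us / 1e6


# 12-word sentence, 600K vocabulary, 10 microseconds per match call
print(estimated_wall_time(12, 4, 600_000, 10))   # 18.0 (three rounds of 6 s)
print(estimated_wall_time(12, 12, 600_000, 10))  # 6.0 (one round)
```

Once there are as many workers as words, adding more does not help: the remaining cost is one full scan of the vocabulary per word.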

Answer 3

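I would use the sortedcollections module. In general, the access time for a SortedList or SortedDict is O(log(n)) instead of O(n); in your case that is about 19.1946 (log2 of 600,000) if/then checks versus 600,000 if/then checks.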

from sortedcollections import SortedDict
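The answer stops at the import, so here is a minimal sketch of what an O(log(n)) lookup looks like. It uses the standard library's bisect on a list that is sorted once, as a stand-in for SortedList or SortedDict, and the name lookup is hypothetical. Note the assumption: binary search finds exact keys only, so a misspelled word that is not in the vocabulary would still need a similarity search.

```python
from bisect import bisect_left

# Sample rows from the question, sorted once by Potential Word
rows = sorted([
    ("Exemple", "Example"),
    ("pipol", "People"),
    ("pimple", "Pimple"),
    ("Iunik", "unique"),
])
keys = [potential for potential, _ in rows]


def lookup(word):
    """Binary-search the sorted vocabulary for an exact Potential Word."""
    i = bisect_left(keys, word)
    if i < len(keys) and keys[i] == word:
        return rows[i][1]  # the matching Fixed Word
    return None


print(lookup("pipol"))   # People
print(lookup("pipoll"))  # None
```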
