Question

I have two lists of names (strings) that look like this:

executives = ['Brian Olsavsky', 'Some Guy', 'Some Lady']

analysts = ['Justin Post', 'Some Dude', 'Some Chick']

I need to find where those names occur in a list of strings that looks like this:

str = ['Justin Post - Bank of America',
 "Great. Thank you for taking my question. I guess the big one is the deceleration in unit growth or online stores.", 
"I know it's a tough 3Q comp, but could you comment a little bit about that?",
 'Brian Olsavsky - Amazon.com',
 "Thank you, Justin. Yeah, let me just remind you a couple of things from last year.", 
"We had two reactions on our Super Saver Shipping threshold in the first half." ,
 "I'll just remind you that the units  those do not count",
 "In-stock is very strong, especially as we head into the holiday period.",
 'Dave Fildes - Amazon.com',
"And, Justin, this is Dave. Just to add on to that. You mentioned the online stores."]

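The reason I need to do this is so that I can concatenate the conversation strings together (separated by the names). How would I go about doing this efficiently?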

I looked at some similar questions and tried their solutions to no avail, such as this:

if any(x in str for x in executives):
    print('yes')

And this...

match = next((x for x in executives if x in str), False)
match

Answer 1

I'm not sure whether this is what you are looking for:

executives = ['Brian Olsavsky', 'Some Guy', 'Some Lady']
text = ['Justin Post - Bank of America',
 "Great. Thank you for taking my question. I guess the big one is the deceleration in unit growth or online stores.", 
"I know it's a tough 3Q comp, but could you comment a little bit about that?",
 'Brian Olsavsky - Amazon.com',
 "Thank you, Justin. Yeah, let me just remind you a couple of things from last year.", 
"We had two reactions on our Super Saver Shipping threshold in the first half." ,
 "I'll just remind you that the units  those do not count",
 "In-stock is very strong, especially as we head into the holiday period.",
 'Dave Fildes - Amazon.com',
"And, Justin, this is Dave. Just to add on to that. You mentioned the online stores."]

result = [s for s in text if any(ex in s for ex in executives)]
print(result)

Output: ['Brian Olsavsky - Amazon.com']
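If the end goal is to join each speaker's lines together, the same kind of test can drive a grouping pass. A minimal sketch, assuming every speaker turn starts with a header line of the form 'Name - Company' and that the name is in one of the known lists; group_by_speaker is a hypothetical helper, not something from the question:

```python
def group_by_speaker(lines, names):
    """Join the lines that follow each speaker header into one string."""
    turns = []  # (speaker, [lines]) pairs, in order of appearance
    for line in lines:
        # A header is a line that starts with a known name followed by ' - '.
        speaker = next((n for n in names if line.startswith(n + " - ")), None)
        if speaker is not None:
            turns.append((speaker, []))
        elif turns:
            # Ordinary line: attach it to the most recent speaker.
            turns[-1][1].append(line)
    return [(speaker, " ".join(parts)) for speaker, parts in turns]


executives = ['Brian Olsavsky', 'Some Guy', 'Some Lady']
analysts = ['Justin Post', 'Some Dude', 'Some Chick']
text = ['Justin Post - Bank of America',
        "Great. Thank you for taking my question.",
        "Could you comment a little bit about that?",
        'Brian Olsavsky - Amazon.com',
        "Thank you, Justin.",
        "In-stock is very strong."]

turns = group_by_speaker(text, executives + analysts)
for speaker, speech in turns:
    print(speaker, "->", speech)
```

Note that a header whose name is missing from both lists (for example 'Dave Fildes' in the question's data) would be treated as ordinary text and appended to the previous speaker, so the name lists need to be complete for this to work.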

Answer 2

Also, if you need the exact position, you can use the following:

str = ['Justin Post - Bank of America',
 "Great. Thank you for taking my question. I guess the big one is the deceleration in unit growth or online stores.", 
"I know it's a tough 3Q comp, but could you comment a little bit about that?",
 'Brian Olsavsky - Amazon.com',
 "Thank you, Justin. Yeah, let me just remind you a couple of things from last year.", 
"We had two reactions on our Super Saver Shipping threshold in the first half." ,
 "I'll just remind you that the units  those do not count",
 "In-stock is very strong, especially as we head into the holiday period.",
 'Dave Fildes - Amazon.com',
"And, Justin, this is Dave. Just to add on to that. You mentioned the online stores"]

executives = ['Brian Olsavsky', 'Justin', 'Some Guy', 'Some Lady']

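The following prints, for each match, the name, the index of the matching string in the list, and the position of the name within that string: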

print([[i, str.index(q), q.index(i)] for i in executives for q in str if i in q ])
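One caveat with str.index(q): it returns the index of the first element equal to q, so a line that appears twice is always reported at its first position, and every call rescans the list. A sketch of the same idea using enumerate; find_positions, lines and names are hypothetical names chosen for illustration:

```python
def find_positions(lines, names):
    """Return (name, line index, offset within the line) for every match."""
    return [(name, i, line.index(name))
            for name in names
            for i, line in enumerate(lines)
            if name in line]


lines = ['Justin Post - Bank of America',
         'Thank you, Justin.',
         'Brian Olsavsky - Amazon.com',
         'Thank you, Justin.']  # deliberately repeats an earlier line
names = ['Brian Olsavsky', 'Justin']

positions = find_positions(lines, names)
print(positions)
```

With enumerate the repeated line is reported at index 3 as well as index 1, whereas list.index would report index 1 both times.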

Answer 3

TLDR

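This answer focuses on efficiency. If efficiency is not a key issue, use the other answers. If it is, build a dict from the corpus you are searching in, then use this dict to look up what you are looking for.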

#import stuff we need later

import string
import random
import numpy as np
import time
import matplotlib.pyplot as plt

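Creating an example corpus

First, we create a list of strings to search in.

We create random words, by which I mean random character sequences whose length is drawn from a Poisson distribution, using the following function: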

def poissonlength_words(lam_word): #generating words, length chosen from a Poisson distrib
    return ''.join([random.choice(string.ascii_lowercase) for _ in range(np.random.poisson(lam_word))])

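(lam_word is the parameter of the Poisson distribution.)

Let's create number_of_sentences variable-length sentences from these words (by sentence I mean a list of randomly generated words separated by spaces).

The length of each sentence is also drawn from a Poisson distribution.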

lam_word=5
lam_sentence=1000
number_of_sentences = 10000

sentences = [' '.join([poissonlength_words(lam_word) for _ in range(np.random.poisson(lam_sentence))])
             for x in range(number_of_sentences)]

sentences[0] will now start like this:

tptt lxnwf iem fedg wbfdq qaa aqrys szwx zkmukc ...

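Let's also create the names we will search for. Let these names be bigrams. The first name (i.e. the first element of the bigram) will be n characters long, the last name (the second element) will be m characters long, and both will consist of random characters: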

def bigramgen(n,m):
    return ''.join([random.choice(string.ascii_lowercase) for _ in range(n)])+' '+\
           ''.join([random.choice(string.ascii_lowercase) for _ in range(m)])

The task

Suppose we want to find the sentences in which bigrams such as ab c occur. We do not want to find dab c or ab cd, only the places where ab c stands alone.
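To make the "stands alone" requirement concrete, here is a small sketch of a whole-word check. It pads both the sentence and the bigram with spaces so that matches at the very start or end of a sentence are also found; contains_standalone is a hypothetical helper and assumes words are separated by single spaces, as in the corpus above:

```python
def contains_standalone(sentence, bigram):
    """True if bigram occurs in sentence as whole, space-delimited words."""
    return f' {bigram} ' in f' {sentence} '


print(contains_standalone('ab c de', 'ab c'))   # True: ab c stands alone
print(contains_standalone('dab c', 'ab c'))     # False: part of dab
print(contains_standalone('ab cd', 'ab c'))     # False: part of cd
print(contains_standalone('x ab c', 'ab c'))    # True: match at the end
```

Padding only the bigram, without padding the sentence, would miss the last case, because there is no trailing space after the final word.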

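To test how fast a method is, let's search for an ever-increasing number of bigrams and measure the elapsed time. The number of bigrams we search for can be, for example: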

number_of_bigrams_we_search_for = [10,30,50,100,300,500,1000,3000,5000,10000]

Brute force method

Just loop through each bigram, loop through each sentence, and use in to find matches. Meanwhile, measure the elapsed time with time.time().

bruteforcetime=[]
for number_of_bigrams in number_of_bigrams_we_search_for:
    bigrams = [bigramgen(2,1) for _ in range(number_of_bigrams)]
    start = time.time()
    for bigram in bigrams:
        #the core of the brute force method starts here
        reslist=[]
        for sentencei, sentence in enumerate(sentences):
            if ' '+bigram+' ' in sentence:
                reslist.append([bigram,sentencei])
        #and ends here
    end = time.time()
    bruteforcetime.append(end-start)

bruteforcetime will hold the number of seconds needed to find 10, 30, 50 ... bigrams.

Warning: this may take a long time for a large number of bigrams.
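time.time() is fine for a rough comparison like this one. For timing code, the standard library also provides time.perf_counter(), a monotonic high-resolution clock. A small sketch of a reusable wrapper; timed is a hypothetical helper and not part of the measurements above:

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Example: time a simple search over a tiny list of sentences.
sentences = ['ab c de', 'x y z', 'q ab c r']
hits, seconds = timed(lambda: [i for i, s in enumerate(sentences) if ' ab c ' in s])
print(hits, seconds)
```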

The sort-your-stuff-to-make-it-faster method

Let's create an empty set for every word that appears in any of the sentences (using a dict comprehension):

worddict={word:set() for sentence in sentences for word in sentence.split(' ')}

To each of these sets, add the index of every sentence in which the word occurs:

for sentencei, sentence in enumerate(sentences):
    for wordi, word in enumerate(sentence.split(' ')):
        worddict[word].add(sentencei)

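Note that we only do this once, no matter how many bigrams we search for later.

Using this dictionary, we can look up the sentences in which each part of the bigram occurs. This is very fast, because looking up a dict value is very fast. We then take the intersection of these sets. When we search for ab c, we get the set of indices of the sentences in which both ab and c occur.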

for bigram in bigrams:
    reslist=[]
    setlist = [worddict[gram] for gram in bigram.split(' ')]
    intersection = set.intersection(*setlist)
    for candidate in intersection:
        if bigram in sentences[candidate]:
            reslist.append([bigram, candidate])
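The same idea on a corpus small enough to check by hand. This is only a sketch under some assumptions: build_index and find_bigram are hypothetical names, words missing from the corpus yield an empty result instead of a KeyError, and the final check compares adjacent words rather than using a substring test, so dab c is not mistaken for ab c:

```python
from collections import defaultdict


def build_index(sentences):
    """Map each word to the set of indices of the sentences containing it."""
    index = defaultdict(set)
    for i, sentence in enumerate(sentences):
        for word in sentence.split(' '):
            index[word].add(i)
    return index


def find_bigram(bigram, sentences, index):
    """Return sorted indices of sentences where the two words are adjacent."""
    first, second = bigram.split(' ')
    # Only sentences containing both words can possibly match.
    candidates = index.get(first, set()) & index.get(second, set())
    hits = []
    for i in sorted(candidates):
        words = sentences[i].split(' ')
        if any(a == first and b == second for a, b in zip(words, words[1:])):
            hits.append(i)
    return hits


sentences = ['ab c de', 'dab c ab', 'c ab', 'x ab c']
index = build_index(sentences)       # built once, reused for every query
print(find_bigram('ab c', sentences, index))
print(find_bigram('zz q', sentences, index))
```

The index is built once up front; each query then touches only the candidate sentences rather than the whole corpus.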

Let's put the whole thing together and measure the elapsed time:

logtime=[]
for number_of_bigrams in number_of_bigrams_we_search_for:
    
    bigrams = [bigramgen(2,1) for _ in range(number_of_bigrams)]
    
    start_time=time.time()
    
    worddict={word:set() for sentence in sentences for word in sentence.split(' ')}

    for sentencei, sentence in enumerate(sentences):
        for wordi, word in enumerate(sentence.split(' ')):
            worddict[word].add(sentencei)

    for bigram in bigrams:
        reslist=[]
        setlist = [worddict[gram] for gram in bigram.split(' ')]
        intersection = set.intersection(*setlist)
        for candidate in intersection:
            if bigram in sentences[candidate]:
                reslist.append([bigram, candidate])

    end_time=time.time()
    
    logtime.append(end_time-start_time)

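Warning: this may take a long time for a large number of bigrams, but less than the brute force method.

Results

We can plot how much time each method took.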

plt.plot(number_of_bigrams_we_search_for, bruteforcetime,label='linear')
plt.plot(number_of_bigrams_we_search_for, logtime,label='log')
plt.legend()
plt.xlabel('Number of bigrams searched')
plt.ylabel('Time elapsed (sec)')

Or, plotting the y axis on a log scale:

plt.plot(number_of_bigrams_we_search_for, bruteforcetime,label='linear')
plt.plot(number_of_bigrams_we_search_for, logtime,label='log')
plt.yscale('log')
plt.legend()
plt.xlabel('Number of bigrams searched')
plt.ylabel('Time elapsed (sec)')

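This gives us the plots of elapsed time against the number of bigrams searched.

Building the worddict dictionary takes a lot of time, which is a disadvantage when searching for a small number of names. However, there is a point at which the corpus is big enough and the number of names we search for is high enough that this time is compensated by the lookup speed, compared with the brute force method. So, if these conditions are satisfied, I recommend using this method.

(Notebook here.)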

How to efficiently search for a list of strings in another list of strings
