Question

How can I iterate over each value in one DataFrame column and check whether it contains a word from another DataFrame column?

import pandas as pd

a = pd.DataFrame({'text': ['the cat jumped over the hat', 'the pope pulled on the rope', 'i lost my dog in the fog']})
b = pd.DataFrame({'dirty_words': ['cat', 'dog', 'parakeet']})

a    
    text
0   the cat jumped over the hat
1   the pope pulled on the rope
2   i lost my dog in the fog

b
    dirty_words
0   cat
1   dog
2   parakeet

I'd like to get a new DataFrame containing only these values:

result

0   the cat jumped over the hat
1   i lost my dog in the fog

Answer 1

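You can use a list comprehension with any after splitting each string on whitespace. Because it compares whole tokens, this method won't match "catheter" just because it contains "cat".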

# For each row, split the text into tokens and check whether any dirty word is among them.
mask = [any(word in tokens for word in b['dirty_words'].values)
        for tokens in a['text'].str.split().values]

print(a[mask])

                          text
0  the cat jumped over the hat
2     i lost my dog in the fog
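
To illustrate the whole-token point, here is a minimal self-contained sketch. It is a variant of the approach above that uses a set lookup instead of the nested any, and the "catheter" row is a hypothetical addition that is not part of the question's data.

```python
import pandas as pd

a = pd.DataFrame({'text': ['the cat jumped over the hat',
                           'the catheter was replaced',  # hypothetical row, not in the question
                           'i lost my dog in the fog']})
b = pd.DataFrame({'dirty_words': ['cat', 'dog', 'parakeet']})

dirty = set(b['dirty_words'])

# A row matches when at least one whole whitespace-separated token is a dirty word.
mask = [not dirty.isdisjoint(tokens) for tokens in a['text'].str.split()]

result = a[mask]
print(result)
```

Rows 0 and 2 are kept; the "catheter" row is dropped because no token equals "cat" exactly. Note that this comparison is case-sensitive and does not strip punctuation, so "cat," or "Cat" would not match without extra normalization.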

Answer 2

I think you can use isin after str.split

a[pd.DataFrame(a.text.str.split().tolist()).isin(b.dirty_words.tolist()).any(axis=1)]
Out[380]: 
                          text
0  the cat jumped over the hat
2     i lost my dog in the fog

Answer 3

Use regex matching with str.contains.

The word boundaries ensure that you won't match "catch" just because it contains "cat" (thanks to @DSM).
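
The code for this answer did not survive on this page, so the following is only a sketch of what a word-boundary regex with str.contains could look like, not necessarily the answerer's exact expression. The variable names pattern and result, and the use of re.escape, are assumptions made for illustration.

```python
import re
import pandas as pd

a = pd.DataFrame({'text': ['the cat jumped over the hat',
                           'the pope pulled on the rope',
                           'i lost my dog in the fog']})
b = pd.DataFrame({'dirty_words': ['cat', 'dog', 'parakeet']})

# Build a single alternation wrapped in word boundaries: \b(?:cat|dog|parakeet)\b
# re.escape guards against dirty words that contain regex metacharacters.
# The group is non-capturing so pandas does not warn about match groups.
pattern = r'\b(?:' + '|'.join(map(re.escape, b['dirty_words'])) + r')\b'

result = a[a['text'].str.contains(pattern, regex=True)]
print(result)
```

On the question's data this selects rows 0 and 2. Unlike the split-based answers, \b also matches a word that is followed by punctuation, such as "cat," or "dog.".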

Check whether each value in a DataFrame column contains a word from another DataFrame column
