Finding occurrences of substrings in a pandas DataFrame - Python

Date: 2017-10-24 03:43:24

Tags: python-2.7 pandas

I have a list of words that I want to count, shown below.

word_list = ['one','two','three']

I have a column in a pandas DataFrame that contains the text below.

TEXT
-----
"Perhaps she'll be the one for me."
"Is it two or one?"
"Mayhaps it be three afterall..."
"Three times and it's a charm."
"One fish, two fish, red fish, blue fish."
"There's only one cat in the hat."
"One does not simply code into pandas."
"Two nights later..."
"Quoth the Raven... nevermore."

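The output I want is shown below: for each substring defined in word_list, I want to count how many times it appears in the strings across all rows of the DataFrame.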

Word  | Count
one     5
two     3
three   2

Is there a way to do this in Python 2.7?

3 answers:

Answer 0 (score: 2)

I would do this in vanilla Python, first joining the strings:

In [11]: long_string = "".join(df[0]).lower()

In [12]: long_string[:50]  # all the words glued up
Out[12]: "perhaps she'll be the one for me.is it two or one?"

In [13]: for w in word_list:
     ...:     print(w, long_string.count(w))
     ...:
one 5
two 3
three 2

If you want to return a Series, you can use a dict comprehension:

In [14]: pd.Series({w: long_string.count(w) for w in word_list})
Out[14]:
one      5
three    2
two      3
dtype: int64
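
For anyone who wants to run this outside an IPython session, here is a minimal self-contained sketch of the same join-and-count idea. It assumes the column is named `TEXT` as in the question (the answer above uses a column labelled `0`), and it joins with a newline rather than an empty string. That separator is my own change, made so a word cannot be matched across the boundary between two rows.

```python
import pandas as pd

word_list = ['one', 'two', 'three']

# Sample data copied from the question; the column name TEXT is an assumption.
df = pd.DataFrame({'TEXT': [
    "Perhaps she'll be the one for me.",
    "Is it two or one?",
    "Mayhaps it be three afterall...",
    "Three times and it's a charm.",
    "One fish, two fish, red fish, blue fish.",
    "There's only one cat in the hat.",
    "One does not simply code into pandas.",
    "Two nights later...",
    "Quoth the Raven... nevermore.",
]})

# Glue all rows into one lowercase string. The newline separator keeps
# the end of one row from combining with the start of the next.
long_string = "\n".join(df['TEXT']).lower()

# str.count returns the number of non-overlapping occurrences of a substring.
counts = pd.Series({w: long_string.count(w) for w in word_list})
print(counts)
```

Note that this counts raw substrings, so a word such as "someone" would also add to the count for "one".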

Answer 1 (score: 1)

Use str.extractall + value_counts

df

                                         text
0         "Perhaps she'll be the one for me."
1                         "Is it two or one?"
2           "Mayhaps it be three afterall..."
3             "Three times and it's a charm."
4  "One fish, two fish, red fish, blue fish."
5          "There's only one cat in the hat."
6     "One does not simply code into pandas."
7                       "Two nights later..."
8             "Quoth the Raven... nevermore."

rgx = '({})'.format('|'.join(word_list))
df['text'].str.lower().str.extractall(rgx).iloc[:, 0].value_counts()

one      5
two      3
three    2
Name: 0, dtype: int64

Details

rgx
'(one|two|three)'

df.text.str.lower().str.extractall(rgx).iloc[:, 0]

   match
0  0          one
1  0          two
   1          one
2  0        three
3  0        three
4  0          one
   1          two
5  0          one
6  0          one
7  0          two
Name: 0, dtype: object
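
If the words might contain regex metacharacters, the pattern can be built a little more defensively. The sketch below is a variant of the code above, not the original: `re.escape` is an addition of mine, and the three-row frame is a shortened copy of the question's data.

```python
import re
import pandas as pd

word_list = ['one', 'two', 'three']

# Shortened sample taken from the question's data.
df = pd.DataFrame({'text': [
    "Is it two or one?",
    "Three times and it's a charm.",
    "One fish, two fish, red fish, blue fish.",
]})

# Escape each word so characters like '.' or '+' match literally,
# then build a single alternation wrapped in one capture group.
rgx = '({})'.format('|'.join(re.escape(w) for w in word_list))

# extractall yields one row per match, indexed by (original row, match number).
matches = df['text'].str.lower().str.extractall(rgx)

# Column 0 holds the captured word; value_counts tallies each distinct word.
counts = matches.iloc[:, 0].value_counts()
print(counts)
```

One thing to keep in mind: the alternation is tried left to right and matches do not overlap, so if one word in the list is a prefix of another, the order of the list affects which one gets counted.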

Performance

# Zero's code 
%%timeit 
pd.Series({w: df.text.str.count(w, flags=re.IGNORECASE).sum() for w in word_list}).sort_values(ascending=False)
1000 loops, best of 3: 1.55 ms per loop
# Andy's code
%%timeit
long_string = "".join(df.iloc[:, 0]).lower()
for w in word_list:
     long_string.count(w)

10000 loops, best of 3: 132 µs per loop
# This answer's code (extractall + value_counts)
%%timeit
df['text'].str.lower().str.extractall(rgx).iloc[:, 0].value_counts()
100 loops, best of 3: 2.53 ms per loop

df = pd.concat([df] * 100000)
%%timeit 
pd.Series({w: df.text.str.count(w, flags=re.IGNORECASE).sum() for w in word_list}).sort_values(ascending=False)
1 loop, best of 3: 4.34 s per loop
%%timeit
long_string = "".join(df.iloc[:, 0]).lower()
for w in word_list:
    long_string.count(w)

10 loops, best of 3: 151 ms per loop
%%timeit 
df['text'].str.lower().str.extractall(rgx).iloc[:, 0].value_counts()
1 loop, best of 3: 4.12 s per loop
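
The numbers above come from the answerer's machine and will vary. To repeat the comparison outside IPython, something like the following sketch with the standard `timeit` module should work. The function names and the 3000-row sample frame are made up for illustration, and no timings are shown because they depend on hardware and pandas version.

```python
import re
import timeit
import pandas as pd

word_list = ['one', 'two', 'three']

# Hypothetical benchmark data: three rows from the question, repeated 1000 times.
df = pd.DataFrame({'text': [
    "Is it two or one?",
    "Three times and it's a charm.",
    "One fish, two fish, red fish, blue fish.",
] * 1000})
rgx = '({})'.format('|'.join(word_list))

def count_with_str_count():
    # Per-row regex count, summed over the column (Zero's approach).
    return pd.Series({w: df['text'].str.count(w, flags=re.IGNORECASE).sum()
                      for w in word_list})

def count_with_join():
    # One big lowercase string, counted with plain str.count (Andy's approach).
    long_string = "".join(df['text']).lower()
    return pd.Series({w: long_string.count(w) for w in word_list})

def count_with_extractall():
    # One regex pass, then a tally of the captured words.
    return df['text'].str.lower().str.extractall(rgx).iloc[:, 0].value_counts()

for func in (count_with_str_count, count_with_join, count_with_extractall):
    seconds = timeit.timeit(func, number=5)
    print('{}: {:.4f} s for 5 runs'.format(func.__name__, seconds))
```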

Answer 2 (score: 0)

Use

In [52]: pd.Series({w: df.TEXT.str.contains(w, case=False).sum() for w in word_list})
Out[52]:
one      5
three    2
two      3
dtype: int64

Or, to count multiple occurrences within each row:

In [53]: pd.Series({w: df.TEXT.str.count(w, flags=re.IGNORECASE).sum() for w in word_list})
Out[53]:
one      5
three    2
two      3
dtype: int64
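
The two outputs above are identical only because no row in the question's data contains the same word twice. A small sketch with a made-up two-row frame (not from the question) shows where they differ: `str.contains` counts rows that contain the word, while `str.count` counts every occurrence.

```python
import re
import pandas as pd

word_list = ['one', 'two', 'three']

# Hypothetical data: the first row contains "one" twice.
df = pd.DataFrame({'TEXT': [
    "One by one they left.",
    "Two or three remain.",
]})

# Number of rows in which each word appears at least once.
rows_containing = pd.Series(
    {w: df['TEXT'].str.contains(w, case=False).sum() for w in word_list})

# Total number of occurrences of each word across all rows.
total_occurrences = pd.Series(
    {w: df['TEXT'].str.count(w, flags=re.IGNORECASE).sum() for w in word_list})

print(rows_containing)
print(total_occurrences)
```

Here "one" is found in 1 row but occurs 2 times, so pick whichever of the two matches the count you actually need. Both calls treat the word as a regular expression, and the second needs `import re` for the `re.IGNORECASE` flag.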

Use sort_values to order the result by count:

In [55]: s = pd.Series({w: df.TEXT.str.count(w, flags=re.IGNORECASE).sum() for w in word_list})

In [56]: s.sort_values(ascending=False)
Out[56]:
one      5
two      3
three    2
dtype: int64