如何计算DataFrame中字符串中的单词数?

时间:2016-05-27 12:21:19

标签: python pandas dataframe

假设我们有简单的Dataframe

df = pd.DataFrame(['one apple','banana','box of oranges','pile of fruits outside', 'one banana', 'fruits'])
df.columns = ['fruits']

如何计算关键字中的字数,类似于:

1 word: 2
2 words: 2
3 words: 1
4 words: 1

2 个答案:

答案 0 :(得分:19)

IIUC然后您可以执行以下操作:

In [89]:
count = df['fruits'].str.split().apply(len).value_counts()
count.index = count.index.astype(str) + ' words:'
count.sort_index(inplace=True)
count

Out[89]:
1 words:    2
2 words:    2
3 words:    1
4 words:    1
Name: fruits, dtype: int64

在这里,我们使用向量化str.split分隔空格,然后apply len来获取元素数量的计数,然后我们可以调用value_counts来汇总频率计数。

然后我们重命名索引并对其进行排序以获得所需的输出

<强>更新

这也可以使用str.len而非apply来完成,而In [41]: count = df['fruits'].str.split().str.len() count.index = count.index.astype(str) + ' words:' count.sort_index(inplace=True) count Out[41]: 0 words: 2 1 words: 1 2 words: 3 3 words: 4 4 words: 2 5 words: 1 Name: fruits, dtype: int64 应该可以更好地扩展:

In [42]:
%timeit df['fruits'].str.split().apply(len).value_counts()
%timeit df['fruits'].str.split().str.len()

1000 loops, best of 3: 799 µs per loop
1000 loops, best of 3: 347 µs per loop

<强>计时

In [51]:
%timeit df['fruits'].str.split().apply(len).value_counts()
%timeit df['fruits'].str.split().str.len()

100 loops, best of 3: 6.3 ms per loop
100 loops, best of 3: 6 ms per loop

对于6K df:

{{1}}

答案 1 :(得分:6)

您可以使用str.count空格' '作为分隔符。

In [1716]: count = df['fruits'].str.count(' ').add(1).value_counts(sort=False)

In [1717]: count.index = count.index.astype('str') + ' words:'

In [1718]: count
Out[1718]:
1 words:    2
2 words:    2
3 words:    1
4 words:    1
Name: fruits, dtype: int64

<强>计时

str.count稍快一点

<子>

In [1724]: df.shape
Out[1724]: (6, 1)

In [1725]: %timeit df['fruits'].str.count(' ').add(1).value_counts(sort=False)
1000 loops, best of 3: 649 µs per loop

In [1726]: %timeit df['fruits'].str.split().apply(len).value_counts()
1000 loops, best of 3: 840 µs per loop

<子> 媒介

In [1728]: df.shape
Out[1728]: (6000, 1)

In [1729]: %timeit df['fruits'].str.count(' ').add(1).value_counts(sort=False)
100 loops, best of 3: 6.58 ms per loop

In [1730]: %timeit df['fruits'].str.split().apply(len).value_counts()
100 loops, best of 3: 6.99 ms per loop

<子>

In [1732]: df.shape
Out[1732]: (60000, 1)

In [1733]: %timeit df['fruits'].str.count(' ').add(1).value_counts(sort=False)
1 loop, best of 3: 57.6 ms per loop

In [1734]: %timeit df['fruits'].str.split().apply(len).value_counts()
1 loop, best of 3: 73.8 ms per loop