How can I bin data alphabetically into categories using pandas?

Time: 2019-05-16 16:48:07

Tags: python python-3.x pandas

I have a DataFrame with a column that contains a series of strings:

import pandas as pd

books = pd.DataFrame([[1,'In Search of Lost Time'],[2,'Don Quixote'],[3,'Ulysses'],[4,'The Great Gatsby'],[5,'Moby Dick']], columns = ['Book ID', 'Title'])

   Book ID                   Title
0        1  In Search of Lost Time
1        2             Don Quixote
2        3                 Ulysses
3        4        The Great Gatsby
4        5               Moby Dick

and a sorted list of boundaries:

boundaries = ['AAAAAAA','The Great Gatsby', 'zzzzzzzz']

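I would like to use these boundaries to sort the values in the DataFrame into alphabetical bins, similar to how pd.cut() works for numeric data. My desired output looks like this: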

   Book ID                   Title                          binning
0        1  In Search of Lost Time   ['AAAAAAA','The Great Gatsby')
1        2             Don Quixote   ['AAAAAAA','The Great Gatsby')
2        3                 Ulysses  ['The Great Gatsby','zzzzzzzz')
3        4        The Great Gatsby  ['The Great Gatsby','zzzzzzzz')
4        5               Moby Dick   ['AAAAAAA','The Great Gatsby')

Is this possible?

1 answer:

Answer 0: (score: 5)

import numpy as np

boundaries = np.array(['The Great Gatsby'])
bins = np.array(['[A..The Great Gatsby)', '[The Great Gatsby..Z]'])

books.assign(binning=bins[boundaries.searchsorted(books.Title)])

   Book ID                   Title                binning
0        1  In Search of Lost Time  [A..The Great Gatsby)
1        2             Don Quixote  [A..The Great Gatsby)
2        3                 Ulysses  [The Great Gatsby..Z]
3        4        The Great Gatsby  [A..The Great Gatsby)
4        5               Moby Dick  [A..The Great Gatsby)
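Note that with the default `side='left'`, a title exactly equal to a boundary falls into the bin below it. That is why 'The Great Gatsby' lands in `[A..The Great Gatsby)` above, whereas the question's desired output puts it in the upper bin. A minimal sketch of one way to get left-closed bins, assuming that is the behaviour wanted, is to pass `side='right'` to `searchsorted`. Converting the titles to a NumPy string array first is an added precaution and not part of the original answer:

```python
import numpy as np
import pandas as pd

books = pd.DataFrame(
    [[1, "In Search of Lost Time"], [2, "Don Quixote"], [3, "Ulysses"],
     [4, "The Great Gatsby"], [5, "Moby Dick"]],
    columns=["Book ID", "Title"],
)

boundaries = np.array(["The Great Gatsby"])
bins = np.array(["[A..The Great Gatsby)", "[The Great Gatsby..Z]"])

# Assumption: compare plain NumPy strings rather than a pandas object column.
titles = books["Title"].to_numpy(dtype=str)

# side="right" counts the boundaries that are <= each title, so a title equal
# to a boundary goes into the bin that starts at that boundary.
idx = boundaries.searchsorted(titles, side="right")
result = books.assign(binning=bins[idx])
print(result)
```

With this change, only the row for 'The Great Gatsby' moves; every other title keeps the bin it had before.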

To extend this to another set of boundaries, for example one bin per letter:

from string import ascii_uppercase as letters

boundaries = np.array([*letters[1:-1]])
bins = np.array([f'[{a}..{b})' for a, b in zip(letters, letters[1:])])

books.assign(binning=bins[boundaries.searchsorted(books.Title)])

   Book ID                   Title binning
0        1  In Search of Lost Time  [I..J)
1        2             Don Quixote  [D..E)
2        3                 Ulysses  [U..V)
3        4        The Great Gatsby  [T..U)
4        5               Moby Dick  [M..N)
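The same idea can be wrapped up so that it takes the question's full boundary list directly, roughly the way `pd.cut` takes its bin edges. The following is only a sketch: `cut_strings` is a hypothetical helper name, not a pandas function. It assumes the boundaries are already sorted, that bins are left-closed, and that values outside the outermost boundaries should come back as NaN.

```python
import numpy as np
import pandas as pd


def cut_strings(values, boundaries):
    """Bin strings into left-closed intervals [b[i], b[i+1]) (hypothetical helper)."""
    edges = np.asarray(boundaries, dtype=str)  # assumed to be sorted already
    labels = np.array([f"['{a}','{b}')" for a, b in zip(edges, edges[1:])])
    vals = np.asarray(values, dtype=str)

    # Only the interior edges separate bins; side="right" keeps bins left-closed.
    idx = edges[1:-1].searchsorted(vals, side="right")

    # Values outside [first edge, last edge) get no label (NaN).
    inside = (vals >= edges[0]) & (vals < edges[-1])

    index = values.index if isinstance(values, pd.Series) else None
    out = pd.Series(labels[idx], index=index, dtype=object)
    return out.where(inside)


books = pd.DataFrame(
    [[1, "In Search of Lost Time"], [2, "Don Quixote"], [3, "Ulysses"],
     [4, "The Great Gatsby"], [5, "Moby Dick"]],
    columns=["Book ID", "Title"],
)
boundaries = ["AAAAAAA", "The Great Gatsby", "zzzzzzzz"]

books = books.assign(binning=cut_strings(books["Title"], boundaries))
print(books)
```

The comparison is plain string ordering by code point, so uppercase letters sort before lowercase ones. If case should be ignored, lower-casing both the titles and the boundaries beforehand would be one option.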
