Sort the words in a list of strings by their relative frequency rather than in the usual order?

Time: 2016-09-07 17:30:39

Tags: python sorting

Suppose I have a pandas.Series object:

import pandas as pd

s = pd.Series(["hello there you would like to sort me", 
    "sorted i would like to be", "the banana does not taste like the orange", 
    "my friend said hello", "hello there amigo", "apple apple banana orange peach pear plum", 
    "orange is my favorite color"])

I want to sort the words within each row by how often each word occurs across the entire Series.

I can easily create a dictionary of word: frequency key-value pairs:

from collections import Counter

def create_word_freq_dict(series):
    return Counter(word for row in series for word in row.lower().split())

word_counts = create_word_freq_dict(s)

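Without procedurally iterating over every row in the Series, how can I sort the words in this object by relative frequency? For example, "hello" occurs more often than "friend", so it should be placed further to the left in the resulting sorted string.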

This is what I have so far:

for row in s:
    ordered_words = []
    words = row.split()
    if len(words) == 1:
        ordered_words.append(words[0])
    else:
        i = 1
        prevWord = words[0]
        prevWord_freq = word_counts[prevWord]
        while i < len(words):
            currWord = words[i]
            currWord_freq = word_counts[currWord]
            if currWord_freq > prevWord_freq:
                prevWord = currWord
                prevWord_freq = currWord_freq
                words.append(currWord)
   ...

It isn't finished yet, but is there a better way (other than recursion) to sort like this?

2 answers:

Answer 0 (score: 1)

Python 2

All you have to do is create a custom comparator based on your counter and call sorted:

s = ["hello there you would like to sort me", 
    "sorted i would like to be", "the banana does not taste like the orange", 
    "my friend said hello", "hello there amigo", "apple apple banana orange peach pear plum", 
    "orange is my favorite color"]


from collections import Counter

def create_word_freq_dict(series):
    return Counter(word for row in series for word in row.lower().split())

word_counts = create_word_freq_dict(s)

for row in s:
    print sorted(row.lower().split(), lambda x, y: word_counts[y] - word_counts[x])

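All I did here was call sorted with a custom comparison function that ignores the words themselves and instead uses the word_counts mapping to decide which one should come first.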

And the result:

['hello', 'like', 'there', 'would', 'to', 'you', 'sort', 'me']
['like', 'would', 'to', 'sorted', 'i', 'be']
['like', 'orange', 'the', 'banana', 'the', 'does', 'not', 'taste']
['hello', 'my', 'friend', 'said']
['hello', 'there', 'amigo']
['orange', 'apple', 'apple', 'banana', 'peach', 'pear', 'plum']
['orange', 'my', 'is', 'favorite', 'color']

And to show that it really is sorted by frequency:

for row in s:
    sorted_row = sorted(row.split(), lambda x, y: word_counts[y] - word_counts[x])
    print zip(sorted_row, map(lambda x: word_counts[x], sorted_row))

which produces

[('hello', 3), ('like', 3), ('there', 2), ('would', 2), ('to', 2), ('you', 1), ('sort', 1), ('me', 1)]
[('like', 3), ('would', 2), ('to', 2), ('sorted', 1), ('i', 1), ('be', 1)]
[('like', 3), ('orange', 3), ('the', 2), ('banana', 2), ('the', 2), ('does', 1), ('not', 1), ('taste', 1)]
[('hello', 3), ('my', 2), ('friend', 1), ('said', 1)]
[('hello', 3), ('there', 2), ('amigo', 1)]
[('orange', 3), ('apple', 2), ('apple', 2), ('banana', 2), ('peach', 1), ('pear', 1), ('plum', 1)]
[('orange', 3), ('my', 2), ('is', 1), ('favorite', 1), ('color', 1)]
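
The same check can also be expressed as an assertion instead of being read off the printed pairs. The following is only a Python 3 sketch: the helper name is_sorted_by_frequency is hypothetical (it does not appear in the answer), and the sort uses a key function rather than the comparator shown above.

```python
from collections import Counter

rows = [
    "hello there you would like to sort me",
    "sorted i would like to be",
    "the banana does not taste like the orange",
    "my friend said hello",
    "hello there amigo",
    "apple apple banana orange peach pear plum",
    "orange is my favorite color",
]

# Same counting step as in the question.
word_counts = Counter(word for row in rows for word in row.lower().split())

def is_sorted_by_frequency(words, counts):
    """Hypothetical check: True if the counts never increase left to right."""
    freqs = [counts[w] for w in words]
    return all(a >= b for a, b in zip(freqs, freqs[1:]))

for row in rows:
    # Negating the count sorts the most frequent words first.
    ordered = sorted(row.lower().split(), key=lambda w: -word_counts[w])
    assert is_sorted_by_frequency(ordered, word_counts)
```

If any row came out in the wrong order, the assert would raise instead of silently printing a wrong list.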

Python 3

In Python 3, sorted no longer accepts a comparison function, only a key, so you have to convert the comparator:

s = ["hello there you would like to sort me", 
    "sorted i would like to be", "the banana does not taste like the orange", 
    "my friend said hello", "hello there amigo", "apple apple banana orange peach pear plum", 
    "orange is my favorite color"]

from functools import cmp_to_key
from collections import Counter

def create_word_freq_dict(series):
    return Counter(word for row in series for word in row.lower().split())

word_counts = create_word_freq_dict(s)


for row in s:
    sorted_row = sorted(row.split(), key=cmp_to_key(lambda x, y: word_counts[y] - word_counts[x]))
    print(sorted_row)
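
As a possible alternative to cmp_to_key, the count itself can serve as the sort key. This is a minimal sketch rather than part of the original answer; the helper name sort_row_by_frequency is an assumption made for illustration, and it joins the words back into a string, which the answer above does not do.

```python
from collections import Counter

rows = [
    "hello there you would like to sort me",
    "sorted i would like to be",
    "the banana does not taste like the orange",
    "my friend said hello",
    "hello there amigo",
    "apple apple banana orange peach pear plum",
    "orange is my favorite color",
]

# Same counting step as in the question.
word_counts = Counter(word for row in rows for word in row.lower().split())

def sort_row_by_frequency(row, counts=word_counts):
    """Hypothetical helper: order the words of one row by descending count."""
    # sorted() is stable, and reverse=True keeps that stability, so words
    # with equal counts stay in their original left-to-right order.
    words = sorted(row.lower().split(), key=lambda w: counts[w], reverse=True)
    return " ".join(words)

sorted_rows = [sort_row_by_frequency(row) for row in rows]
print(sorted_rows[3])  # hello my friend said
```

With the pandas Series from the question, the same helper could presumably be passed to s.apply(sort_row_by_frequency). That still visits every row internally, but it avoids writing the loop by hand.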

Answer 1 (score: 0)

print(create_word_freq_dict(s).most_common())
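
On its own, most_common() returns the global list of (word, count) pairs, highest count first; it does not reorder the words inside each row. One possible way to build on it, sketched here under that reading of the answer, is to turn the global ordering into a rank table and use it as a sort key. The name rank is hypothetical.

```python
from collections import Counter

rows = [
    "hello there you would like to sort me",
    "sorted i would like to be",
    "the banana does not taste like the orange",
    "my friend said hello",
    "hello there amigo",
    "apple apple banana orange peach pear plum",
    "orange is my favorite color",
]

# Same counting step as in the question.
word_counts = Counter(word for row in rows for word in row.lower().split())

# most_common() with no argument lists every (word, count) pair,
# ordered from the highest count to the lowest.
ranked = word_counts.most_common()

# Hypothetical rank table: each word's position in the global ordering.
rank = {word: position for position, (word, _count) in enumerate(ranked)}

row = "my friend said hello"
ordered = sorted(row.lower().split(), key=lambda w: rank[w])
print(ordered)
```

Note that words with equal counts are then ordered by their position in the global ranking rather than by their position in the row, so ties may come out differently from the comparator-based answer.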