Counting the frequency of items in a list of tuples

Asked: 2017-12-16 07:56:57

Tags: python python-3.x list tuples generator

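I have a list of tuples, as shown below. I have to count how many items appear more than once. The code I have written so far is very slow, even with only around 10K tuples. In the example below, the string 'example' appears twice, so those are the strings I need to get. My question: what is the best way to get the counts of the strings here by iterating over a generator?

The list: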

 b_data=[('example',123),('example-one',456),('example',987),.....]

My code so far:

blockslst=[]
for line in b_data:
    blockslst.append(line[0])

blocklstgtone=[]
for item in blockslst:
    if(blockslst.count(item)>1):
        blocklstgtone.append(item)

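4 answers:

Answer 0 (score: 13)

You have the right idea in extracting the first item from each tuple. You can make your code more concise with a list/generator comprehension, as shown below.

From there, the most idiomatic way to find the frequency counts of elements is a collections.Counter object.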

  1. Extract the first element from each tuple in the list (using a comprehension)
  2. Pass this to Counter
  3. Query the count of example

    from collections import Counter
    
    counts = Counter(x[0] for x in b_data)
    print(counts['example'])
    

Of course, you can use list.count if there is only one item whose frequency you want, but in the general case a Counter is the way to go.

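The advantage of Counter is that it computes the frequency counts of all elements (not just example) in linear (O(N)) time. Suppose you also want to query the count of another element, say foo. That can be done with -
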
    print(counts['foo'])
    

If 'foo' does not exist in the list, 0 is returned.
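The zero-for-missing-keys behaviour can be checked with a small sketch. The sample list is an assumption: it is the question's data with the elided part left out. Note that the lookup of a missing key also does not add that key to the Counter.

```python
from collections import Counter

# Hypothetical sample data, modelled on the list in the question.
b_data = [('example', 123), ('example-one', 456), ('example', 987)]

counts = Counter(x[0] for x in b_data)

print(counts['example'])   # 2
print(counts['foo'])       # 0, no KeyError is raised
print('foo' in counts)     # False, the lookup did not insert the key
```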

If you want to find the most common elements, call counts.most_common -

    print(counts.most_common(n))
    

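where n is the number of elements you want to display. If you want to see everything, do not pass n.

To retrieve the most common elements together with their counts, an efficient approach is to query most_common and then use itertools.takewhile to extract all elements whose count is greater than 1.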
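A short sketch of most_common with and without n. The data is made up for illustration (three extra 'example-two' tuples are added to the question's list so that every key has a different count).

```python
from collections import Counter

# Hypothetical sample data, modelled on the list in the question.
b_data = [('example', 123), ('example-one', 456), ('example', 987),
          ('example-two', 1), ('example-two', 2), ('example-two', 3)]

counts = Counter(x[0] for x in b_data)

top_two = counts.most_common(2)      # the two most frequent keys
everything = counts.most_common()    # all keys, highest count first

print(top_two)      # [('example-two', 3), ('example', 2)]
print(everything)   # [('example-two', 3), ('example', 2), ('example-one', 1)]
```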

    from itertools import takewhile
    
    l = [1, 1, 2, 2, 3, 3, 1, 1, 5, 4, 6, 7, 7, 8, 3, 3, 2, 1]
    c = Counter(l)
    
    list(takewhile(lambda x: x[-1] > 1, c.most_common()))
    [(1, 5), (3, 4), (2, 3), (7, 2)]
    

(OP edit) Alternatively, use a list comprehension to get a list of the items with count > 1 -

    [item[0] for item in counts.most_common() if item[-1] > 1]
    

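Note that this is not as efficient as the itertools.takewhile solution. For example, if you have one item with count > 1 and a million items with count equal to 1, the list comprehension visits all one million and one entries even though it does not need to (most_common returns the frequency counts in descending order). That is not the case with takewhile, because iteration stops as soon as the condition count > 1 becomes false.

Answer 1 (score: 2)

First method: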
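The difference can be made visible with a small sketch that records how many entries each approach examines. The data and the helper counting_iter are hypothetical, introduced only for this illustration.

```python
from collections import Counter
from itertools import takewhile

# Hypothetical data: one key seen twice, 1000 keys seen once.
keys = ['dup', 'dup'] + ['single-%d' % i for i in range(1000)]
counts = Counter(keys)

def counting_iter(pairs, seen):
    """Yield pairs unchanged while recording how many were consumed."""
    for pair in pairs:
        seen[0] += 1
        yield pair

# Approach 1: list comprehension, checks every entry.
seen_lc = [0]
dups_lc = [k for k, n in counting_iter(counts.most_common(), seen_lc) if n > 1]

# Approach 2: takewhile, stops at the first entry with count <= 1.
seen_tw = [0]
dups_tw = [k for k, n in takewhile(lambda kv: kv[1] > 1,
                                   counting_iter(counts.most_common(), seen_tw))]

print(dups_lc, seen_lc[0])   # ['dup'] 1001
print(dups_tw, seen_tw[0])   # ['dup'] 2
```

Keep in mind that most_common() without an argument still sorts every entry first, so the saving applies to the filtering step only.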

  

How about doing it without a loop?

print(list(map(lambda x:x[0],b_data)).count('example'))

Output:

2

Second method:

You can count with a plain dict, without importing any external module and without making it so complicated:

b_data = [('example', 123), ('example-one', 456), ('example', 987)]

dict_1={}
for i in b_data:
    if i[0] not in dict_1:
        dict_1[i[0]]=1
    else:
        dict_1[i[0]]+=1

print(dict_1)



print(list(filter(lambda y:y!=None,(map(lambda x:(x,dict_1.get(x)) if dict_1.get(x)>1 else None,dict_1.keys())))))

Output:

[('example', 2)]
  

Test case:

b_data = [('example', 123), ('example-one', 456), ('example', 987),('example-one', 456),('example-one', 456),('example-two', 456),('example-two', 456),('example-two', 456),('example-two', 456)]

Output:

[('example-two', 4), ('example-one', 3), ('example', 2)]
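As a possible simplification of the second method, the counting loop can use dict.get with a default value, and the filter can be a dict comprehension. This is only a sketch, not the answerer's code; the names counts and repeated are made up here.

```python
b_data = [('example', 123), ('example-one', 456), ('example', 987)]

counts = {}
for key, _ in b_data:
    # get() returns 0 for a key that has not been seen yet
    counts[key] = counts.get(key, 0) + 1

# keep only the keys that occur more than once
repeated = {key: n for key, n in counts.items() if n > 1}

print(repeated)   # {'example': 2}
```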

Answer 2 (score: 2)

  

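In the time it took me to write this, ayodhyankit-paul posted the same approach. I am leaving it in nonetheless for the test-case generator code and the timings:

Creating 100001 items took about 5 seconds, counting took about 0.3 s, and filtering on the counts was too fast to measure (with datetime.now(); I did not bother with perf_counter). All in all it took less than 5.1 s from start to finish, for roughly ten times the amount of data you are working with.

I think this is similar to what Counter does in COLDSPEED's answer:

for each item in the list of tuples:

  • if item[0] is not in the dict, add it to the dict with a count of 1
  • else increment its count in the dict by 1

Code: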
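For reference, a timing sketch that uses time.perf_counter, which the answer mentions but does not use. The data is a small made-up stand-in, not the answer's test set, and the measured time will differ from machine to machine.

```python
import time
from collections import Counter

# Hypothetical stand-in data: 100000 tuples over 40 distinct keys.
data = [('key%d' % (i % 40), i) for i in range(100000)]

start = time.perf_counter()          # high-resolution clock for intervals
counts = Counter(item[0] for item in data)
repeated = [key for key, n in counts.items() if n > 1]
elapsed = time.perf_counter() - start

print(len(repeated), 'repeated keys')
print('counting and filtering took %.6f seconds' % elapsed)
```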

from collections import Counter
import random
from datetime import datetime # good enough for a long-running op


dt_datagen = datetime.now()
numberOfKeys = 100000 


# basis for testdata
textData = ["example", "pose", "text","someone"]
numData = [random.randint(100,1000) for _ in range(1,10)] # irrelevant

# create random testdata from above lists
tData = [(random.choice(textData)+str(a%10),random.choice(numData)) for a in range(numberOfKeys)] 

tData.append(("aaa",99))

dt_dictioning = datetime.now()

# create a dict
countEm = {}

# put all your data into dict, counting them
for p in tData:
    if p[0] in countEm:
        countEm[p[0]] += 1
    else:
        countEm[p[0]] = 1

dt_filtering = datetime.now()
#comparison result-wise (commented out)        
#counts = Counter(x[0] for x in tData)
#for c in sorted(counts):
#    print(c, " = ", counts[c])
#print()  
# output dict if count > 1
subList = [x for x in countEm if countEm[x] > 1] # without "aaa"

dt_printing = datetime.now()

for c in sorted(subList):
    if (countEm[c] > 1):
        print(c, " = ", countEm[c])

dt_end = datetime.now()

print( "\n\nCreating ", len(tData) , " testdataitems took:\t", (dt_dictioning-dt_datagen).total_seconds(), " seconds")
print( "Putting them into dictionary took \t", (dt_filtering-dt_dictioning).total_seconds(), " seconds")
print( "Filtering donw to those > 1 hits took \t", (dt_printing-dt_filtering).total_seconds(), " seconds")
print( "Printing all the items left took    \t", (dt_end-dt_printing).total_seconds(), " seconds")

print( "\nTotal time: \t", (dt_end- dt_datagen).total_seconds(), " seconds" )

Output:

# reformatted for brevity
example0  =  2520       example1  =  2535       example2  =  2415
example3  =  2511       example4  =  2511       example5  =  2444
example6  =  2517       example7  =  2467       example8  =  2482
example9  =  2501

pose0  =  2528          pose1  =  2449          pose2  =  2520      
pose3  =  2503          pose4  =  2531          pose5  =  2546          
pose6  =  2511          pose7  =  2452          pose8  =  2538          
pose9  =  2554

someone0  =  2498       someone1  =  2521       someone2  =  2527
someone3  =  2456       someone4  =  2399       someone5  =  2487
someone6  =  2463       someone7  =  2589       someone8  =  2404
someone9  =  2543

text0  =  2454          text1  =  2495          text2  =  2538
text3  =  2530          text4  =  2559          text5  =  2523      
text6  =  2509          text7  =  2492          text8  =  2576      
text9  =  2402


Creating  100001  testdataitems took:    4.728604  seconds
Putting them into dictionary took        0.273245  seconds
Filtering donw to those > 1 hits took    0.0  seconds
Printing all the items left took         0.031234  seconds

Total time:      5.033083  seconds 

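Answer 3 (score: 0)

Let me give you an example to help you understand. Although it is quite different from your case, I have found it very useful when solving this kind of problem.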

from collections import Counter

a = [
(0, "Hadoop"), (0, "Big Data"), (0, "HBase"), (0, "Java"),
(1, "Postgres"), (2, "Python"), (2, "scikit-learn"), (2, "scipy"),
(2, "numpy"), (2, "statsmodels"), (2, "pandas"), (3, "R"), (3, "Python"),
(3, "statistics"), (3, "regression"), (3, "probability"),
(4, "machine learning"), (4, "regression"), (4, "decision trees"),
(4, "libsvm"), (5, "Python"), (5, "R"), (5, "Java"), (5, "C++"),
(5, "Haskell"), (5, "programming languages"), (6, "statistics"),
(6, "probability"), (6, "mathematics"), (6, "theory"),
(7, "machine learning"), (7, "scikit-learn"), (7, "Mahout"),
(7, "neural networks"), (8, "neural networks"), (8, "deep learning"),
(8, "Big Data"), (8, "artificial intelligence"), (9, "Hadoop"),
(9, "Java"), (9, "MapReduce"), (9, "Big Data")
]
# 
# 1. Lowercase everything
# 2. Split it into words.
# 3. Count the results.

dictionary = Counter(word for i, j in a for word in j.lower().split())

print(dictionary)

# print every word whose count is > 1
for word, count in dictionary.most_common():
    if count > 1:
        print(word, count)

Now here is your example, solved in the same way:

from collections import Counter
a=[('example',123),('example-one',456),('example',987),('example2',987),('example3',987)]

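# named counts rather than dict, to avoid shadowing the built-in dict type
counts = Counter(word for i, j in a for word in i.lower().split())

print(counts)

# print every key whose count is > 1
for word, count in counts.most_common():
    if count > 1:
        print(word, count)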