Question

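I have some JSON Twitter data from the streaming API, and I would like to use Counter to find the most popular hashtags in this dataset. The problem I am running into is looping through tweets that contain multiple hashtags, rather than just pulling out the first hashtag and ignoring the rest.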

Question: how do I loop through a nested list inside a dict to extract all of the hashtags in a tweet, not just the first one?

In [1]: import json

In [2]: from collections import Counter

In [3]: data = []

In [4]: for line in open('DC.json'):
   ...:     try:
   ...:         data.append(json.loads(line))
   ...:     except:
   ...:         pass
   ...:     

In [5]: hashtags = []

In [6]: for i in data:
   ...:     if 'entities' in i and len(i['entities']['hashtags']) > 0:
   ...:         hashtags.append(i['entities']['hashtags']['text'])
   ...:     else:
   ...:         pass
   ...:     
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-6-66d7538509f9> in <module>()
      1 for i in data:
      2     if 'entities' in i and len(i['entities']['hashtags']) > 0:
----> 3         hashtags.append(i['entities']['hashtags']['text'])
      4     else:
      5         pass

TypeError: list indices must be integers, not str

In [7]: Counter(hashtags).most_common()[:10]

An example of i['entities']['hashtags'] containing 4 hashtags:

In [12]: i[0]['entities']['hashtags']
Out[12]: 
[{u'indices': [28, 35], u'text': u'selfie'},
 {u'indices': [82, 92], u'text': u'omg'},
 {u'indices': [93, 104], u'text': u'Champ'},
 {u'indices': [105, 117], u'text': u'FIRST'}]

Answer 1

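You said that i['entities']['hashtags'] is a list of dict, so the line:

hashtags.append(i['entities']['hashtags']['text'])

is trying to index a list with a string. That makes no sense, and it causes the error. I think you would be better off splitting this into a few steps. First, collect all of the 'hashtag' dictionaries:

hashtags = []
for i in data:
    if 'entities' in i:
        hashtags.extend(i['entities']['hashtags'])

Then extract the 'text' of each one:

hashtags = [tag['text'] for tag in hashtags]

Then dump the result into a Counter.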
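The steps described in this answer can be combined into one pass. The following is only a minimal sketch, not the answerer's exact code: the sample `data` list is made up for illustration (it mimics the shape of the parsed tweets shown in the question), and `count_hashtags` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical sample standing in for the parsed tweets in `data`;
# the structure mirrors the 'entities' -> 'hashtags' layout from the question.
data = [
    {"entities": {"hashtags": [
        {"indices": [28, 35], "text": "selfie"},
        {"indices": [82, 92], "text": "omg"},
    ]}},
    {"entities": {"hashtags": [{"indices": [0, 4], "text": "omg"}]}},
    {"entities": {"hashtags": []}},  # tweet with no hashtags
    {"limit": {"track": 1}},         # made-up message that has no 'entities' key
]


def count_hashtags(tweets):
    """Count every hashtag text across all tweets, not only the first per tweet."""
    counts = Counter()
    for tweet in tweets:
        # Fall back to empty containers when 'entities' or 'hashtags' is missing.
        tags = tweet.get("entities", {}).get("hashtags", [])
        # Each element of `tags` is a dict, so index it by 'text' one at a time.
        counts.update(tag["text"] for tag in tags)
    return counts


hashtag_counts = count_hashtags(data)
print(hashtag_counts.most_common(10))
```

Updating the Counter directly avoids building the intermediate list, but the two-step version in the answer gives the same counts.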

Counting all hashtags in a set of tweets
