How can I collect multiple tweets from multiple users with Tweepy?

Date: 2016-04-29 15:22:09

Tags: python twitter tweepy

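I know there are similar questions about this, but my project uses Tweepy for Python, so this one is a bit more specific.

I collected a thousand user IDs from the followers of Coca-Cola and Pepsi, and I then fetch each user's 20 most recent statuses to collect the hashtags they used.

I am using Tweepy's followers_ids and user_timeline API methods, but I keep getting a 401 from Twitter. If I limit the number of user IDs to 10 instead of 1000, I sometimes get the results I want, but even then I sometimes get a 401. So it works... sort of. It seems to be the large sets that cause these errors, and I don't know how to get around them.

I know Twitter limits the number of calls, but if I can fetch 1000 user IDs instantly, why can't I fetch the statuses? I realize I am trying to get 20,000 statuses, but I have tried with only 100 * 20 and even 50 * 20 and still get a 401. I have reset my system clock several times, but it only occasionally works with the 10 * 20 setting. I'm hoping someone out there has a better, more efficient approach than what I have so far. I'm new to the Twitter API and new to Python, so hopefully it's just me.

Here is the code: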

import tweepy
import pandas as pd

consumer_key = 'REDACTED'
consumer_secret = 'REDACTED'
access_token = 'REDACTED'
access_token_secret = 'REDACTED'

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.secure = True
auth.set_access_token(access_token, access_token_secret)

api = tweepy.API(auth)

pepsiUsers = []
cokeUsers = []
cur_pepsiUsers = tweepy.Cursor(api.followers_ids, screen_name='pepsi')
cur_cokeUsers = tweepy.Cursor(api.followers_ids, screen_name='CocaCola')

for user in cur_pepsiUsers.items(1000):
    pepsiUsers.append({ 'userId': user, 'hTags': [], 'favSoda': 'Pepsi' })
    for status in tweepy.Cursor(api.user_timeline, user).items(20):
        status = status._json
        hashtags = status['entities']['hashtags']
        index = len(pepsiUsers) - 1
        if len(hashtags) > 0:
            for ht in hashtags:
                pepsiUsers[index]['hTags'].append(ht['text'])

for user in cur_cokeUsers.items(1000):
    cokeUsers.append({ 'userId': user, 'hTags': [], 'favSoda': 'Coke' })
    for status in tweepy.Cursor(api.user_timeline, user).items(20):
        status = status._json
        hashtags = status['entities']['hashtags']
        index = len(cokeUsers) - 1
        if len(hashtags) > 0:
            for ht in hashtags:
                cokeUsers[index]['hTags'].append(ht['text'])

# Create a master list of Coke and Pepsi users to write to CSV.
mergedList = cokeUsers + pepsiUsers
# Turn each user's hashtag list into a single space-separated string
# (empty for users with no hashtags) for easier searching with R later.
for i in mergedList:
    i['hTags'] = ' '.join(i['hTags'])

list_df = pd.DataFrame(mergedList, columns=['userId', 'favSoda', 'hTags'])
list_df.to_csv('test.csv', index=False)

This is the error I get when I try to run the block that contains the api.user_timeline code:

---------------------------------------------------------------------------
TweepError                                Traceback (most recent call last)
<ipython-input-134-a7658ed899f3> in <module>()
      3 for user in cur_pepsiUsers.items(1000):
      4     pepsiUsers.append({ 'userId': user, 'hTags': [], 'favSoda': 'Pepsi' })
----> 5     for status in tweepy.Cursor(api.user_timeline, user).items(20):
      6         status = status._json
      7         hashtags = status['entities']['hashtags']

/Users/.../anaconda/lib/python3.5/site-packages/tweepy/cursor.py in __next__(self)
     47 
     48     def __next__(self):
---> 49         return self.next()
     50 
     51     def next(self):

/Users/.../anaconda/lib/python3.5/site-packages/tweepy/cursor.py in next(self)
    195         if self.current_page is None or self.page_index == len(self.current_page) - 1:
    196             # Reached end of current page, get the next page...
--> 197             self.current_page = self.page_iterator.next()
    198             self.page_index = -1
    199         self.page_index += 1

/Users/.../anaconda/lib/python3.5/site-packages/tweepy/cursor.py in next(self)
    106 
    107         if self.index >= len(self.results) - 1:
--> 108             data = self.method(max_id=self.max_id, parser=RawParser(), *self.args, **self.kargs)
    109 
    110             if hasattr(self.method, '__self__'):

/Users/.../anaconda/lib/python3.5/site-packages/tweepy/binder.py in _call(*args, **kwargs)
    243             return method
    244         else:
--> 245             return method.execute()
    246 
    247     # Set pagination mode

/Users/.../anaconda/lib/python3.5/site-packages/tweepy/binder.py in execute(self)
    227                     raise RateLimitError(error_msg, resp)
    228                 else:
--> 229                     raise TweepError(error_msg, resp, api_code=api_error_code)
    230 
    231             # Parse the response payload

TweepError: Twitter error response: status code = 401

2 Answers:

Answer 0: (score: 1)

Do you just need the Twitter JSON? Given the scope of your collection, you may want to try twarc: https://github.com/edsu/twarc
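If the raw JSON is all you need, pulling the hashtags out of it takes only a few lines. The sketch below assumes each tweet arrives as a parsed JSON dict with the same entities/hashtags layout that the question's code reads from status._json; the function names extract_hashtags and collect_hashtags are made up for illustration.

```python
def extract_hashtags(status_json):
    """Return the hashtag texts from one tweet given as a parsed JSON dict."""
    # Missing keys are treated as "no hashtags" rather than raising KeyError.
    entities = status_json.get("entities", {})
    return [ht["text"] for ht in entities.get("hashtags", [])]


def collect_hashtags(statuses):
    """Flatten the hashtags of several tweets into a single list."""
    tags = []
    for status in statuses:
        tags.extend(extract_hashtags(status))
    return tags


if __name__ == "__main__":
    # Illustrative data only, shaped like the fields the question's code uses.
    sample = {"entities": {"hashtags": [{"text": "cola"}, {"text": "soda"}]}}
    print(collect_hashtags([sample, {"entities": {"hashtags": []}}]))
```

Keeping the extraction separate from the download step means the same helper works whether the tweets come from Tweepy or from a file of saved JSON.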

Answer 1: (score: 0)

Try adding rate limiting when you create the API.
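The answer does not say which option it has in mind. One plausible reading, and it is only an assumption, is Tweepy's wait_on_rate_limit argument, which makes the client sleep until the rate-limit window resets instead of raising an error. A minimal sketch, with make_api as a hypothetical helper name:

```python
def make_api(consumer_key, consumer_secret, access_token, access_token_secret):
    """Build a Tweepy client that waits out rate limits instead of failing."""
    # Imported here so that defining the helper does not require Tweepy
    # to be installed; a top-level import works just as well in a script.
    import tweepy

    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    # Assumption: this is the "rate limiting" the answer refers to.
    return tweepy.API(auth, wait_on_rate_limit=True)


# Usage (credentials are placeholders):
# api = make_api('REDACTED', 'REDACTED', 'REDACTED', 'REDACTED')
```

This only helps with rate-limit responses; it does not change how other error codes, such as the 401 in the traceback, are reported.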

If that does not fully solve the problem, use try/except in Python to catch the error and wait 15 minutes before trying again.
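A rough sketch of that try/except idea follows. The helper fetch_with_retry and its parameters are hypothetical, not part of Tweepy. It gives up after a few attempts because a 401 on a single timeline can also come from a protected account, and in that case waiting will not help.

```python
import time


def fetch_with_retry(fetch, retry_on=Exception, wait_seconds=15 * 60,
                     max_attempts=3, sleep=time.sleep):
    """Call fetch(); on a matching error, wait and try again.

    Returns fetch()'s result, or None if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except retry_on:
            if attempt == max_attempts:
                return None  # give up on this user and move on
            sleep(wait_seconds)  # 15 minutes by default


# Usage with the question's code (Tweepy 3.x raises tweepy.TweepError):
# statuses = fetch_with_retry(
#     lambda: api.user_timeline(user_id=user, count=20),
#     retry_on=tweepy.TweepError,
# )
# if statuses is None:
#     continue  # skip this user
```

Passing the call as a function keeps the waiting logic separate from the Twitter-specific part, so the same helper can wrap followers_ids as well.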