Question

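I'm trying to get a list of unique domains along with a count for each one. That way, if I check a bunch of users, I get the domains that show up most often in their tweets.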

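For now I'm only testing with my own username, and every count comes back as 1. I can see from the output that twitter.com appears twice, so it doesn't seem to be working.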

I think it has something to do with the order of things? Maybe it checks the count on every pass, and then I guess it's always 1.

from tweepy import API
from tweepy import OAuthHandler
from tweepy import Cursor
from tld import get_tld
from collections import Counter

ckey = "foo"
csecret = "foo"
atoken = "foo"
asecret = "foo"

import tweepy
import re
import requests
auth = tweepy.OAuthHandler(ckey, csecret)
auth.set_access_token(atoken, asecret)
usernames = ['myname']
api = tweepy.API(auth)
for name in usernames:
    public_tweets = api.user_timeline(name, count=10)
    for tweet in public_tweets:     
        urls = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', tweet.text)
        links = []
        domains = []
        for url in urls:
            links.append(requests.get(url).url)
            for link in links:
                domains.append(get_tld(link))
                print Counter(domains)

The output looks like this:

Counter({u'businessinsider.com': 1})
Counter({u'twitter.com': 1})
Counter({u'bloomberg.com': 1})
Counter({u'mo.github.io': 1})
Counter({u'distilled.net': 1})
Counter({u't.co': 1})
Counter({u'twitter.com': 1})
Counter({u'justbuythisone.com': 1})
Counter({u'techcrunch.com': 1})
Counter({u'chriszacharias.com': 1})

Answer 1

You are resetting the lists for every tweet, so your Counter only counts the links within a single tweet, not the links across all tweets.
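
To see the effect in isolation, here is a minimal sketch with no network calls. The `tweets` data is made up for illustration (it stands in for the domains found in each tweet); only `collections.Counter` is real.

```python
from collections import Counter

# Hypothetical stand-in for the domains found in each tweet.
tweets = [["twitter.com"], ["bloomberg.com"], ["twitter.com"]]

# Pattern from the question: the list is recreated for every tweet,
# so each Counter only ever sees that one tweet's domains.
per_tweet = []
for tweet_domains in tweets:
    domains = []
    for d in tweet_domains:
        domains.append(d)
    per_tweet.append(Counter(domains))

# One Counter created before the loop accumulates across all tweets.
counts = Counter()
for tweet_domains in tweets:
    for d in tweet_domains:
        counts[d] += 1

print(per_tweet)  # three separate counters, each with a count of 1
print(counts)     # twitter.com is counted twice here
```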

You don't even need to build the lists. Just count the links directly as you resolve them:

counts = Counter()
for name in usernames:
    public_tweets = api.user_timeline(name, count=10)
    for tweet in public_tweets:     
        urls = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', tweet.text)
        for url in urls:
            link = requests.head(url, allow_redirects=True).url  # follow redirects to the end
            domain = get_tld(link)
            counts[domain] += 1
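
Since the goal is the most common domains, `Counter.most_common` is probably the next step once the loop above has finished. A small sketch, with made-up tallies standing in for the real `counts`:

```python
from collections import Counter

# Hypothetical tallies standing in for the `counts` built by the loop above.
counts = Counter({"twitter.com": 2, "bloomberg.com": 1, "t.co": 1})

# most_common(n) returns the n highest (item, count) pairs, largest first.
top = counts.most_common(1)
print(top)

# With no argument it returns every pair, ordered from most to least common.
for domain, n in counts.most_common():
    print(domain, n)
```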

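If you do want to collect all the links and domains, create the lists outside the loops, and perhaps defer the counting until all tweets have been processed.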

links = []
domains = []
for name in usernames:
    public_tweets = api.user_timeline(name, count=10)
    for tweet in public_tweets:     
        urls = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', tweet.text)
        for url in urls:
            link = requests.head(url, allow_redirects=True).url  # follow redirects to the end
            links.append(link)
            domain = get_tld(link)
            domains.append(domain)
counts = Counter(domains)
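
If you later check several users and want both per-user and overall counts, one option is to keep a Counter per user and merge them. This is only a sketch: `domains_by_user` and its contents are invented for illustration, and in practice the lists would come from a loop like the one above.

```python
from collections import Counter

# Hypothetical per-user domain lists, as the loop above might collect them.
domains_by_user = {
    "alice": ["twitter.com", "bloomberg.com", "twitter.com"],
    "bob": ["twitter.com", "techcrunch.com"],
}

# One Counter per user.
per_user = {name: Counter(domains) for name, domains in domains_by_user.items()}

# Counter.update adds to the existing counts instead of replacing them.
total = Counter()
for user_counts in per_user.values():
    total.update(user_counts)

print(per_user["alice"])
print(total)
```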
