Can't consistently fetch three fields from some identical webpages

Time: 2019-11-20 11:18:14

Tags: python python-3.x web-scraping python-requests

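I've written a script in Python to extract the username, followers, and posts of certain Instagram accounts. When I run the script, it behaves strangely. To be clearer: I tried it with three accounts.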

This is the result I got:

('backstreetboys', '2.2m Followers', '151 Posts')
('akon', '', '')
('louisnpearls', '', '080 posts')

What I expected to get:

('backstreetboys', '2.2m Followers', '151 Posts')
('akon', '6.4m followers', '1,700 posts')
('louisnpearls', '55.5k followers', '080 posts')

The script I tried:

import re
import requests
from bs4 import BeautifulSoup

urls = [
    'https://www.instagram.com/backstreetboys/',
    'https://www.instagram.com/akon/',
    'https://www.instagram.com/louisnpearls/'
]

def get_instagram_info(url):

    res = requests.get(url,headers={"User-Agent":"Mozilla/5.0"})
    soup = BeautifulSoup(res.text,"lxml")
    username = soup.select_one("meta[property='al:ios:url']").get("content").split("=")[-1]

    try:
        desc = soup.select_one("meta[property='og:description']").get("content")
    except Exception: desc = ""

    try:
        followers = re.findall(r".*(?<=Followers)",desc,re.I)[0]
    except Exception: followers = ""

    try:
        posts = re.findall(r"[^,]+(?<=Posts)",desc,re.I)[0]
    except Exception: posts = "" 

    return username,followers,posts

if __name__ == '__main__':
    for url in urls:
        print(get_instagram_info(url))

What changes should I make so that the script fetches the above fields correctly using requests?

1 answer:

Answer 0: (score: 1)

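If you look at the meta description that is extracted, the numbers you are trying to pull out are not always present in it. Your approach may work for some accounts but not for others. My approach uses the JSON data stored in the page source instead. Also, I believe there is an Instagram API you could use if you want to look into it.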
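Even when the description is present, the regex from the question can cut a number short, which would explain a result like '080 posts'. Here is a minimal sketch of that effect. The sample description string and the extract_count helper are assumptions made up for illustration; the real text Instagram serves may look different.

```python
import re

# Hypothetical og:description content (an assumption, not fetched from Instagram).
desc = "55.5k Followers, 120 Following, 1,080 Posts - See Instagram photos and videos"

# Pattern from the question: [^,]+ cannot cross a comma, so the "1," in
# "1,080" is dropped and only the part after the comma is kept.
original = re.findall(r"[^,]+(?<=Posts)", desc, re.I)[0].strip()


def extract_count(label, text):
    """Return the number directly before `label`, or "" if it is not found.

    Hypothetical helper: allows digits, dots, commas, and an optional k/m suffix.
    """
    match = re.search(r"([\d.,]+[km]?)\s+" + re.escape(label), text, re.I)
    return match.group(1) if match else ""


print(original)
print(extract_count("Followers", desc), extract_count("Posts", desc))
```

This only helps when the description actually contains the numbers; for accounts where it comes back empty, both fields still end up as empty strings.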

Code

import json
import requests
from bs4 import BeautifulSoup

urls = [
    'https://www.instagram.com/backstreetboys/',
    'https://www.instagram.com/akon/',
    'https://www.instagram.com/louisnpearls/'
]


def get_instagram_info(url):
    res = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(res.text, "lxml")

    # Find the inline script that assigns the page's JSON to window._sharedData
    script_data = [script.text for script in soup.find_all('script') if script.text.startswith('window._sharedData')][0]
    # Drop the leading "window._sharedData = " (21 characters) and the trailing ";"
    script_json = json.loads(script_data[21:-1])
    user = script_json['entry_data']['ProfilePage'][0]['graphql']['user']
    username = user['username']
    followers = user['edge_followed_by']['count']
    posts = user['edge_owner_to_timeline_media']['count']
    return username, followers, posts


if __name__ == '__main__':
    for url in urls:
        print(get_instagram_info(url))

Output

('backstreetboys', 2279332, 2152)
('akon', 6476386, 1700)
('louisnpearls', 55513, 1080)
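
The fixed slice offsets in the code above will raise an exception if the script tag is missing or formatted slightly differently. A more defensive variant might look like the sketch below. It assumes the same window._sharedData layout and key names used above, and runs against a hypothetical, heavily trimmed page source rather than a live request; parse_shared_data and sample_html are names invented for this example.

```python
import json
import re

# Hypothetical, heavily trimmed page source (an assumption for illustration).
sample_html = (
    '<script type="text/javascript">window._sharedData = '
    '{"entry_data": {"ProfilePage": [{"graphql": {"user": {'
    '"username": "example_user", '
    '"edge_followed_by": {"count": 55513}, '
    '"edge_owner_to_timeline_media": {"count": 1080}}}}]}};</script>'
)


def parse_shared_data(html):
    """Return (username, followers, posts), or None if the data is not found."""
    # Capture the object literal between "window._sharedData =" and ";</script>".
    match = re.search(
        r"window\._sharedData\s*=\s*(\{.*?\})\s*;\s*</script>", html, re.S
    )
    if match is None:
        return None
    try:
        data = json.loads(match.group(1))
        user = data["entry_data"]["ProfilePage"][0]["graphql"]["user"]
        return (
            user["username"],
            user["edge_followed_by"]["count"],
            user["edge_owner_to_timeline_media"]["count"],
        )
    except (ValueError, KeyError, IndexError, TypeError):
        # Malformed JSON or an unexpected structure (for example a login page).
        return None


print(parse_shared_data(sample_html))
print(parse_shared_data("<html></html>"))
```

Returning None instead of raising lets the calling loop skip an account and carry on with the remaining URLs.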