
Question

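I've written a script in Python to extract the username, followers, and posts of certain Instagram accounts. When I run the script, it behaves strangely. To be clearer: I tried it with three accounts, and this is the result I get: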

('backstreetboys', '2.2m Followers', '151 Posts')
('akon', '', '')
('louisnpearls', '', '080 posts')

What I expect to get:

('backstreetboys', '2.2m Followers', '151 Posts')
('akon', '6.4m followers', '1,700 posts')
('louisnpearls', '55.5k followers', '080 posts')

The script I've tried:

import re
import requests
from bs4 import BeautifulSoup

urls = [
    'https://www.instagram.com/backstreetboys/',
    'https://www.instagram.com/akon/',
    'https://www.instagram.com/louisnpearls/'
]

def get_instagram_info(url):

    res = requests.get(url,headers={"User-Agent":"Mozilla/5.0"})
    soup = BeautifulSoup(res.text,"lxml")
    username = soup.select_one("meta[property='al:ios:url']").get("content").split("=")[-1]

    try:
        desc = soup.select_one("meta[property='og:description']").get("content")
    except Exception: desc = ""

    try:
        followers = re.findall(r".*(?<=Followers)",desc,re.I)[0]
    except Exception: followers = ""

    try:
        posts = re.findall(r"[^,]+(?<=Posts)",desc,re.I)[0]
    except Exception: posts = "" 

    return username,followers,posts

if __name__ == '__main__':
    for url in urls:
        print(get_instagram_info(url))

What changes should I make so that the script reliably fetches the fields above using requests?

Answer 1

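If you look at the meta description that gets extracted, the numbers you are after are not always there. Your approach may work for some accounts but not for others. My approach uses the JSON data stored in the page source. Also, I believe you can use the Instagram API if you want to look into it.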
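To illustrate why parsing the meta description is fragile, here is a minimal sketch run against a made-up description string (the real `og:description` text varies by account and may differ from this). It shows that the question's `[^,]+` pattern cannot cross the thousands separator, which would explain a result like `'080 posts'`. The helper `extract_count` is a hypothetical alternative, not part of the original script.

```python
import re

# Hypothetical og:description content; the real text varies by account and over time.
desc = "55.5k Followers, 310 Following, 1,080 Posts - See Instagram photos and videos"

# Pattern from the question: [^,]+ cannot cross a comma,
# so "1,080" is cut at its thousands separator.
original_posts = re.findall(r"[^,]+(?<=Posts)", desc, re.I)[0]


def extract_count(label, text):
    """Return the number preceding `label` (e.g. '1,080'), or '' if not found."""
    # Allow digits, dots, commas and an optional k/m suffix before the label.
    match = re.search(r"([\d.,]+[km]?)\s+" + re.escape(label), text, re.I)
    return match.group(1) if match else ""


followers = extract_count("Followers", desc)
posts = extract_count("Posts", desc)
missing = extract_count("Posts", "")  # empty description, as when the meta tag is absent

print(original_posts, followers, posts, repr(missing))
```

Even with a better pattern, this only helps when the description actually contains the counts, which is why the JSON in the page source is the more dependable source.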

Code

import json
import requests
from bs4 import BeautifulSoup

urls = [
    'https://www.instagram.com/backstreetboys/',
    'https://www.instagram.com/akon/',
    'https://www.instagram.com/louisnpearls/'
]


def get_instagram_info(url):
    res = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(res.text, "lxml")

    # The profile data is embedded in a <script> tag as "window._sharedData = {...};"
    script_data = [script.text for script in soup.find_all('script') if script.text[:18] == 'window._sharedData'][0]
    # Strip the "window._sharedData = " prefix (21 characters) and the trailing ";"
    script_json = json.loads(script_data[21:-1])
    user = script_json['entry_data']['ProfilePage'][0]['graphql']['user']
    username = user['username']
    followers = user['edge_followed_by']['count']
    posts = user['edge_owner_to_timeline_media']['count']
    return username, followers, posts


if __name__ == '__main__':
    for url in urls:
        print(get_instagram_info(url))

Output

('backstreetboys', 2279332, 2152)
('akon', 6476386, 1700)
('louisnpearls', 55513, 1080)
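
Note that if a page does not embed this JSON, the list comprehension above yields an empty list and the `[0]` raises an `IndexError`. A more defensive variant might look like the sketch below. It assumes the page still contains a `window._sharedData = {...};` script with the same key layout as in the code above, which may no longer hold. The function name `parse_shared_data` and the `sample_html` string are hypothetical and used only so the example runs offline.

```python
import json
import re


def parse_shared_data(html):
    """Return (username, followers, posts) from the embedded JSON, or None if absent."""
    # Capture the object assigned to window._sharedData, up to the closing </script>.
    match = re.search(r"window\._sharedData\s*=\s*(\{.*?\})\s*;\s*</script>", html, re.S)
    if match is None:
        return None
    try:
        data = json.loads(match.group(1))
        user = data["entry_data"]["ProfilePage"][0]["graphql"]["user"]
        return (
            user["username"],
            user["edge_followed_by"]["count"],
            user["edge_owner_to_timeline_media"]["count"],
        )
    except (ValueError, KeyError, IndexError, TypeError):
        # Malformed JSON or a different key layout than assumed.
        return None


# Made-up page source mimicking the assumed structure.
sample_html = (
    '<html><body><script type="text/javascript">window._sharedData = '
    '{"entry_data": {"ProfilePage": [{"graphql": {"user": {'
    '"username": "example_user", "edge_followed_by": {"count": 1234}, '
    '"edge_owner_to_timeline_media": {"count": 56}}}}]}};</script></body></html>'
)

print(parse_shared_data(sample_html))
print(parse_shared_data("<html></html>"))
```

Returning `None` instead of raising lets the caller decide whether to skip the account or retry.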

Can't always fetch three fields from some identical webpages
