Scraping a script tag that contains specific text

Time: 2021-02-21 22:29:25

Tags: python selenium web-scraping beautifulsoup

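I am currently trying to scrape the "imageToken" value from a script tag in the page source below. The Python code below gets the token about 75% of the time, but the rest of the time the number of script tags must be changing, so it picks the wrong tag.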

Is there a way to search all the script tags for the one that contains "imageToken"?

Here is the code that works about 75% of the time.

    html_source = driver.page_source
    soup = BeautifulSoup(html_source, 'html.parser')
                
    scripts = soup.find_all('script')[20]
    findtoken = scripts.string.split(',')[58]
    token = findtoken.split(':')[2].strip('"')
    print(token)

I also tried this, but it returned nothing:

    html_source = driver.page_source
    soup = BeautifulSoup(html_source, 'html.parser')
                
    scripts = soup.find_all('script')
    for script in scripts:
        if 'imageToken' in script:
            print(script)

Here is the source of the script tag. There are many other scripts on the page, but this is the only one with "imageToken".

<script>

                ((data) => {
                    /* TASK: Fix this. Move away from F3.page */
                    window.F3 = window.F3 || {};
                    window.F3.page = window.F3.page || {};
                    Object.assign(window.F3.page, data);
                })({"user":{"email":"email@address.com","useFacebookPhoto":false,"joinDate":"2021-02-05T10:11:11-07:00","hasIcon":false,"confirmed":true,"disabled":false,"hasPassword":true,"ancestrySubscriber":false,"admin":false,"accountStatus":"monthly-subscriber","subscriptionStatus":"subscriber","FreeAccess":true,"accountState":{"signedOut":false,"registered":true,"subscriber":true,"expiring":false,"freeTrialSubscriber":false,"payingSubscriber":true,"bundleSubscriber":false,"newspaperSubscriber":false,"acomSubscriber":false,"formerSubscriber":false,"formerPayingSubscriber":false,"formerBundle":false,"currentSubscriptionType":"monthly","currentAccountStatus":"monthly-subscriber","oldSubscriptionStatus":"subscriber"},"passwordSerial":1,"userId":6812311,"username":"myusername"},"totalImages":585709948,"config":{"api":{"host":"http://svc.fold3.com:50000","f3Api":"http://api.fold3.com/fold31-api","path":"/fold31/api"},"app":{"canonical":"https://www.fold3.com","cookieDomain":".fold3.com","env":"live","goStack":"https://go.fold3.com","hostname":"www.fold3.com","trustedHostname":"fold3.com"},"ancestry":{"domain":"https://www.ancestry.com","internalDomain":"ancestry.int","redirectHost":"https://www.fold3.com","clientId":"60e8bf12987c2a38a1f48b3c8e41f4400d3b7eb2","redirectPath":"/auth/openid","ssoPath":"/sso/oidc/authorize"},"fold3":{"contactNumber":"1-800-613-0181"},"image":{"host":"https://img.fold3.com","hostRotating":"https://img#.fold3.com","path":"/img/"},"oldStack":{"host":"http://php.fold3.com:9090"},"regiment":{"host":"http://regiment.fold3.com","path":"/fold31-regiment/api"},"search":{"host":"http://search-es.fold3.com","path":"/fold31-search/api"}},"isMobile":false,"image":{"imageToken":"4IIROAS9p-z9rCHcF2toENYedok9hGmwdOsdlKGAfCzNNch2fNPT9HcElRYXBOL66kcnDgT7C9-aivjlk5o4Kwlgc7HB6U_MeIjtQuF2mMrfZq6dsivylzR2d30JiKv46hcMyMMwmBuRSI9_TlCelg==","imageId":692219369,"publication":{"dbid":61641,"mediaProvider":"EMS","allowDownloadDoc":true,"allowAnnotations":true,"hasOcr":false,"recordCountMode":"images","rollupImage":"NONE","lastModification":"2020-10-14T11:06:06-06:00","lastSorted":"2020-10-15T09:16:04-06:00","configuredAccessLevel":"REGISTERED","maximumAccessLevel":"REGISTERED","minimumAccessLevel":"REGISTERED","featured":false,"hashPath":"hiOcMlUzt","publicationId":1104,"contentType":"IMAGE"</script>

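2 answers:

Answer 0 (score: 0)

To search for a tag that contains specific text, you can use the :contains(<my text>) selector.

In your example, to check whether a script contains the text imageToken, use: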

    print(soup.select_one("script:contains('imageToken')"))

Note: to use selectors, call the select() method instead of find_all(), or select_one() instead of find().
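
In newer releases of Soup Sieve (the selector engine behind select()), :contains() is deprecated in favor of the equivalent :-soup-contains(), so the sketch below uses the newer spelling. It runs against a shortened, hypothetical stand-in for driver.page_source and then pulls the value out with a regular expression, which is an assumption about how you might want to extract the token rather than part of the answer above.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for driver.page_source
html_source = """
<html><body>
<script src="other.js"></script>
<script>var unrelated = 1;</script>
<script>((data) => {})({"image":{"imageToken":"abc123==","imageId":692219369}});</script>
</body></html>
"""

soup = BeautifulSoup(html_source, "html.parser")

# Select the first script tag whose text contains "imageToken",
# regardless of its position among the other script tags
script = soup.select_one('script:-soup-contains("imageToken")')

# Capture the quoted value that follows the "imageToken" key
match = re.search(r'"imageToken"\s*:\s*"([^"]+)"', script.string)
token = match.group(1) if match else None
print(token)
```

Because the selector matches on content, this does not depend on how many script tags precede the one you want.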

Answer 1 (score: 0)

Your second approach is correct, but it is missing .string:

    for script in scripts:
        # .string is None for script tags without inline content, so check it first
        if script.string and 'imageToken' in script.string:  # <== add .string
            print(script.string)
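
The original loop printed nothing because `in` applied to a Tag compares against the tag's direct children instead of searching its text. Also, .string is None for script tags that have no inline content, such as those that only carry a src attribute. The sketch below guards against that and then decodes the JSON argument with the standard library instead of splitting on commas. The page is a shortened, hypothetical stand-in, find_image_token is a hypothetical helper name, and the code assumes the JSON payload starts right after the first `})(` in the script, as in the sample from the question.

```python
import json
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for driver.page_source
html_source = """
<script src="https://example.com/lib.js"></script>
<script>window.other = {"a": 1};</script>
<script>
((data) => { Object.assign(window, data); })({"isMobile":false,"image":{"imageToken":"abc123==","imageId":1}});
</script>
"""

def find_image_token(html):
    """Return the imageToken from whichever script tag carries it, or None."""
    soup = BeautifulSoup(html, "html.parser")
    for script in soup.find_all("script"):
        text = script.string  # None for scripts without inline content
        if not text or "imageToken" not in text:
            continue
        # Assumption: the JSON argument begins right after the first "})("
        start = text.find("})(")
        if start == -1:
            continue
        # raw_decode parses one JSON value and ignores the trailing ");"
        payload, _ = json.JSONDecoder().raw_decode(text[start + 3:])
        return payload.get("image", {}).get("imageToken")
    return None

token = find_image_token(html_source)
print(token)
```

Looking the key up in the decoded object avoids depending on the order or number of fields, which the comma-split indexes in the question do.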