Scraping variants from a <ul> tag with BS4 and Python

Asked: 2019-06-18 12:48:08

Tags: python html web-scraping beautifulsoup

I want to scrape this web page: https://www.off---white.com/en/GB/men/products/omia139f198000403020# (view-source: https://www.off---white.com/en/GB/men/products/omia139f198000403020#)

I am after the variants, for example:


<div class='product-variants'>
<form class="product-cart-form js-cart-form" action="/en/GB/orders/populate.json" accept-charset="UTF-8" method="post"><input name="utf8" type="hidden" value="&#x2713;" /><input type="hidden" name="authenticity_token" value="3VeMLZA3thbrl8EtNfA6rdNcAMXa/29u87AW7KbhyNQ=" /><div class='please-select-text'>
<p>Please select a size</p>
</div>
<div class='availability preorder-product'>
<p>
Pre-order will arrive by October 15
<sup>
th
</sup>
</p>
</div>
<ul class='styled-radio'>
<li>
<input type="radio" name="variant_id" id="variant_id_113207" value="113207" />
<label for="variant_id_113207">40</label>
</li>
<li>
<input type="radio" name="variant_id" id="variant_id_113208" value="113208" />
<label for="variant_id_113208">41</label>
</li>
<li>
<input type="radio" name="variant_id" id="variant_id_113209" value="113209" />
<label for="variant_id_113209">42</label>
</li>
<li>
<input type="radio" name="variant_id" id="variant_id_113210" value="113210" />
<label for="variant_id_113210">43</label>
</li>
<li>
<input type="radio" name="variant_id" id="variant_id_113211" value="113211" />
<label for="variant_id_113211">44</label>
</li>
<li>
<input type="radio" name="variant_id" id="variant_id_113212" value="113212" />
<label for="variant_id_113212">45</label>
</li>
</ul>

My current code is:

import requests
from bs4 import BeautifulSoup as bs

s = requests.session()

def loadproduct():
    product = 'https://www.off---white.com/en/GB/men/products/omia139f198000403020#'
    getproduct = s.get(product)
    bsproduct = bs(getproduct.text, 'html.parser')
    #print(bsproduct)
    # Collect every <input> inside the size list
    allsizes = bsproduct.find('ul',{'class':'styled-radio'}).findAll('input')
    print(allsizes)
loadproduct()
x = input('d')  # keep the console window open
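
If the `<ul>` is missing from the response (for example, because the server returned a different page than the browser shows), `find` returns `None` and the chained `.findAll` call raises `AttributeError`. Below is a minimal sketch of a more forgiving lookup using a CSS selector. It runs on inline strings rather than the live site, and the helper name `variant_ids` and both sample strings are hypothetical.

```python
from bs4 import BeautifulSoup

def variant_ids(html):
    """Hypothetical helper: list the value of every variant radio input."""
    soup = BeautifulSoup(html, 'html.parser')
    # select() returns an empty list when nothing matches,
    # so there is no None result to guard against.
    inputs = soup.select('ul.styled-radio input[name="variant_id"]')
    return [tag.get('value') for tag in inputs]

# Made-up sample inputs: one with the size list, one without it.
with_list = ('<ul class="styled-radio"><li>'
             '<input type="radio" name="variant_id" value="113207" />'
             '</li></ul>')
without_list = '<p>Access denied</p>'

print(variant_ids(with_list))     # ['113207']
print(variant_ids(without_list))  # []
```

An empty result suggests the markup was not in the response that was actually received, which may be worth checking (for instance by printing `getproduct.status_code`) before switching tools.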

1 Answer:

Answer 0 (score: -1)

This page is generated by JavaScript, so you need a package such as Selenium to scrape it.

Try this snippet:

Code:

from selenium import webdriver
from bs4 import BeautifulSoup
import time

driver = webdriver.Firefox()
driver.get('https://www.off---white.com/en/GB/men/products/omia139f198000403020#')

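# Crude fixed wait to give the page's JavaScript time to render
time.sleep(5)

html = driver.page_source
driver.quit()  # close the browser once the HTML has been captured
soup = BeautifulSoup(html, 'html.parser')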

allsizes = soup.find('ul',{'class':'styled-radio'}).findAll('input')
for size in allsizes:
    print(size.get('value'))

Output:

113207
113208
113209
113210
113211
113212
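
The answer's loop prints only the variant IDs. Below is a minimal sketch of pairing each ID with its size label, assuming markup shaped like the excerpt in the question. It parses an inline string (shortened to two sizes) instead of fetching the page, and the names `SAMPLE_HTML` and `extract_variants` are hypothetical.

```python
from bs4 import BeautifulSoup

# Inline excerpt modelled on the markup in the question (two sizes only).
SAMPLE_HTML = """
<ul class='styled-radio'>
<li>
<input type="radio" name="variant_id" id="variant_id_113207" value="113207" />
<label for="variant_id_113207">40</label>
</li>
<li>
<input type="radio" name="variant_id" id="variant_id_113208" value="113208" />
<label for="variant_id_113208">41</label>
</li>
</ul>
"""

def extract_variants(html):
    """Hypothetical helper: return (variant_id, size_label) pairs."""
    soup = BeautifulSoup(html, 'html.parser')
    variant_list = soup.find('ul', {'class': 'styled-radio'})
    if variant_list is None:
        # The list is not in this HTML; return nothing rather than raise.
        return []
    variants = []
    for radio in variant_list.find_all('input', {'name': 'variant_id'}):
        # Each label points at its radio button through the "for" attribute.
        label = variant_list.find('label', {'for': radio.get('id')})
        size = label.get_text(strip=True) if label else None
        variants.append((radio.get('value'), size))
    return variants

print(extract_variants(SAMPLE_HTML))
# [('113207', '40'), ('113208', '41')]
```

The same function can be given `driver.page_source` or `getproduct.text`, whichever actually contains the list.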