Question

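I want to get smartphone prices from this website: http://tweakers.net. It is a Dutch site. The problem is that no prices are being collected from the site.

The text file 'TweakersTelefoons.txt' contains 3 entries: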

samsung-galaxy-s6-32gb-zwart

lg-nexus-5x-32gb-zwart

huawei-nexus-6p-32gb-zwart

I am using Python 2.7, and this is the code I use:

std::getline

Output:

The price of samsung-galaxy-s6-32gb-zwart is []

The price of lg-nexus-5x-32gb-zwart is []

The price of huawei-nexus-6p-32gb-zwart is []

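The prices are not shown. I tried using [^.] to get rid of the euro sign, but that did not work.

Also, in Europe we may use "," instead of "." as the decimal separator. Please help.

Thanks in advance.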

Answer 1

import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get("http://tweakers.net/categorie/215/smartphones/producten/").content, "lxml")
# print the (link, price text) pair for every product on the first page
print([(p.a["href"], p.a.text) for p in soup.find_all("p", {"class": "price"})])
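The text of each link is a display string rather than a number, so it still has to be converted before prices can be compared or summed. Here is a minimal sketch, assuming the strings look like "€ 499,-" or "€ 1.234,56" (comma as decimal separator, dot as thousands separator). parse_euro_price is a hypothetical helper written for this answer, not part of requests or BeautifulSoup:

```python
import re
from decimal import Decimal


def parse_euro_price(text):
    """Convert a Dutch-style price string to a Decimal (assumed input formats)."""
    # Keep only digits, dots, commas and dashes; this drops the euro sign and spaces.
    cleaned = re.sub(r"[^\d.,-]", "", text)
    # Assumption: ",-" marks a whole-euro amount, as in "499,-".
    cleaned = cleaned.replace(",-", ",00")
    # Assumption: dots are thousands separators and the comma is the decimal separator.
    cleaned = cleaned.replace(".", "").replace(",", ".")
    return Decimal(cleaned)


print(parse_euro_price(u"\u20ac 499,-"))     # whole-euro notation
print(parse_euro_price(u"\u20ac 1.234,56"))  # thousands dot, decimal comma
```

Decimal is used instead of float so that amounts such as 1234.56 are stored exactly.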

To get all the pages:

import requests
from bs4 import BeautifulSoup

# base url; the page number (1-69 in this case) is filled in with format()
base_url = "http://tweakers.net/categorie/215/smartphones/producten/?page={}"
soup = BeautifulSoup(requests.get("http://tweakers.net/categorie/215/smartphones/producten/").content, "lxml")

# get and store all prices and phone links from the first page
data = {1: [(p.a["href"], p.a.text) for p in soup.find_all("p", {"class": "price"})]}

# links in the pagination bar
pag = soup.find("span", attrs={"class": "pageDistribution"}).find_all("a")

# last page number
mx_pg = max(int(a.text) for a in pag if a.text.isdigit())

# get all the pages from the second to mx_pg
for i in range(2, mx_pg + 1):
    req = requests.get(base_url.format(i))
    print(req)  # shows the response status, e.g. <Response [200]>
    soup = BeautifulSoup(req.content, "lxml")
    data[i] = [(p.a["href"], p.a.text) for p in soup.find_all("p", {"class": "price"})]

You need requests and BeautifulSoup. The dict also holds the link to each phone's page, which you can visit if you want to fetch more data.
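To pick out only the phones listed in your text file, you could search the collected links for each entry. This is a rough sketch, assuming data maps each page number to a list of (href, price text) pairs and that each product link contains the entry from the file. find_phones and the sample links are made up for illustration:

```python
def find_phones(data, slugs):
    """Map each slug to the (href, price_text) pairs whose link contains it."""
    found = {slug: [] for slug in slugs}
    for page in sorted(data):
        for href, price_text in data[page]:
            for slug in slugs:
                # Assumption: the product link contains the slug from the text file.
                if slug in href:
                    found[slug].append((href, price_text))
    return found


# Hypothetical sample in the same shape as the scraped dict.
sample_data = {
    1: [("http://tweakers.net/pricewatch/1/lg-nexus-5x-32gb-zwart.html", "price A")],
    2: [("http://tweakers.net/pricewatch/2/some-other-phone.html", "price B")],
}
slugs = ["lg-nexus-5x-32gb-zwart", "huawei-nexus-6p-32gb-zwart"]
matches = find_phones(sample_data, slugs)
for slug in slugs:
    print("%s: %s" % (slug, matches[slug]))
```

With the real data, slugs would come from the file, for example [line.strip() for line in open('TweakersTelefoons.txt') if line.strip()]. An empty list for a slug means no collected link contained it.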

Answer 2

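I think your problem is that you expect the web server to resolve a wildcard in the URL, "http://tweakers.net/pricewatch/[^.]*/", and you do not check the return code, which I suspect is 404.

You need to identify the product ID, if the product IDs are fixed, or post a search request using the form's POST method.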
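Put differently, a pattern like [^.]* only means something when it is applied to text you have already downloaded. In a URL it is sent to the server literally. Here is a rough sketch of that split, assuming the price sits in a link inside an element with class="price" (the markup the other answer selects). prices_from_response and PRICE_RE are illustrative names, and a real HTML parser is more robust than this regular expression:

```python
import re

# Assumed markup: <p class="price"><a href="...">PRICE TEXT</a></p>
PRICE_RE = re.compile(r'class="price"[^>]*>\s*<a[^>]*>([^<]+)</a>')


def prices_from_response(status_code, html):
    """Return the price strings found in html, or raise if the request failed."""
    # Check the return code first; a 404 page contains no prices to match.
    if status_code != 200:
        raise ValueError("request failed with HTTP status %d" % status_code)
    # The pattern is applied to the downloaded page, never to the URL.
    return [match.strip() for match in PRICE_RE.findall(html)]


# Typical use with requests (not executed here):
#   r = requests.get("http://tweakers.net/categorie/215/smartphones/producten/")
#   prices = prices_from_response(r.status_code, r.text)
```

An empty list on a 200 response then tells you the pattern is wrong, while an exception tells you the URL is wrong.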

Python: Getting smartphone prices from a website
