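I'm looking to buy a house and have built a BeautifulSoup scraper that works like a charm: it scrapes the tags I need from our local real-estate website. Now I just need a mechanism that notifies me when the site changes.
I need it to notify me whenever the new scrape output differs from the previous one (that is, when the HTML changes).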
previous_content = ''
URL = 'whatever.com'
while True:
    content = requests.get(URL)
    soup = BeautifulSoup(content.text, 'html.parser')
    titles = soup.find_all('span', attrs={'class':['title']})
    years = soup.find_all('span', attrs={'class':['year']})
    sizes = soup.find_all('span', attrs={'class':['size']})
    prices = soup.find_all('span', attrs={'class':['price']})
    for titles, years, sizes, prices in zip(titles, years, sizes, prices):
        print('Location: ' + titles.get_text(strip="True") + '\n' + 'Year: ' + years.get_text(strip="True"), '\n' + 'Size: ' + sizes.get_text(strip="True"), '\n' 'Price: ' + prices.get_text(strip="True"))
    previous_content = new_content
    if previous_content == new_content:
        print("CONTENT NOT CHANGED. | " + str(today))
    elif previous_content != new_content:
        print("CONTENT CHANGED | " + str(today))
    time.sleep(sleeptime)
Thanks a lot!
Answer 0 (score: 0)
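I think the problem is where you assign previous_content.
You should assign previous_content at the end of each while iteration, not before testing it for equality with new_content. Otherwise the comparison will always be True.
Something like this should work (I couldn't test it):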
import time
from datetime import datetime

import requests
from bs4 import BeautifulSoup

URL = 'whatever.com'  # placeholder from the question; requests needs a full URL including the scheme
sleeptime = 60  # seconds between checks (assumed value, not given in the question)
previous_content = []

while True:
    response = requests.get(URL)
    soup = BeautifulSoup(response.text, 'html.parser')
    titles = soup.find_all('span', attrs={'class': ['title']})
    years = soup.find_all('span', attrs={'class': ['year']})
    sizes = soup.find_all('span', attrs={'class': ['size']})
    prices = soup.find_all('span', attrs={'class': ['price']})

    new_content = []  # Initialize the new_content list
    for title, year, size, price in zip(titles, years, sizes, prices):
        # Build one string per listing so listings can be sorted and compared
        listing = ('Location: ' + title.get_text(strip=True) + '\n'
                   + 'Year: ' + year.get_text(strip=True) + '\n'
                   + 'Size: ' + size.get_text(strip=True) + '\n'
                   + 'Price: ' + price.get_text(strip=True))
        print(listing)
        new_content.append(listing)

    today = datetime.now()  # timestamp for the log line
    # The lists need to be sorted as I expect the order to change but not the content
    if sorted(previous_content) == sorted(new_content):
        print("CONTENT NOT CHANGED. | " + str(today))
    else:
        print("CONTENT CHANGED | " + str(today))
    previous_content = new_content  # Assigning for next iteration of the loop
    time.sleep(sleeptime)
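Since I couldn't run the loop against the real site, here is a minimal sketch of the same compare-then-assign idea with the network part taken out, so the ordering can be checked on its own. The names `ChangeDetector`, `diff_listings` and the `Listing` tuple layout are hypothetical and are not part of the question's code. One deliberate difference: this sketch treats the very first scrape as "not changed", whereas the loop above would report a change on its first pass.

```python
from typing import Iterable, List, Optional, Tuple

# Hypothetical record layout: (location, year, size, price), all as scraped text.
Listing = Tuple[str, str, str, str]


class ChangeDetector:
    """Remembers the previous scrape and reports whether a new one differs."""

    def __init__(self) -> None:
        # None until the first scrape has been seen.
        self._previous: Optional[List[Listing]] = None

    def update(self, listings: Iterable[Listing]) -> bool:
        # Sort so that a mere reordering of listings does not count as a change.
        current = sorted(listings)
        # Compare first ...
        changed = self._previous is not None and current != self._previous
        # ... and only then overwrite the stored value for the next call.
        self._previous = current
        return changed


def diff_listings(
    old: Iterable[Listing], new: Iterable[Listing]
) -> Tuple[List[Listing], List[Listing]]:
    """Return (added, removed) listings between two scrapes."""
    old_set, new_set = set(old), set(new)
    return sorted(new_set - old_set), sorted(old_set - new_set)
```

In the loop you would build one tuple per listing from the `get_text` results, call `update(...)` once per iteration, and send your notification (email, print, whatever you prefer) when it returns True. `diff_listings` is optional and only useful if you want to see which listings appeared or disappeared.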