Get URL, title and description only if the title or description contains %string%

Time: 2015-07-28 10:47:13

Tags: python regex python-2.7 web-scraping beautifulsoup

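I have a text file containing the URLs of some RSS feeds. I want to find out which URLs have a title or description (or any other tag) containing a certain string (or any word from a list of words).

So far, I am able to get the URLs, titles and headlines (among other things). I am not sure how to continue, though. I think I would check the tags with a regex. If I check a URL's title and find a word match, how do I then get back to that URL? The URL needs to stay connected to the tag, for example in a .csv file. I'm a bit confused here. Maybe someone can point me in the right direction?

My approach so far: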

import requests
from bs4 import BeautifulSoup

rssfeed = open('input.txt')
rss_source = rssfeed.read()
rss_sources = rss_source.split()

i=0
while i<len(rss_sources):
    get_rss = requests.get(rss_sources[i])
    rss_soup = BeautifulSoup(get_rss.text, 'html.parser')
    rss_urls = rss_soup.find_all('link')
    i=i+1

    for url in rss_urls:
        rss_all_urls = url.text
        open_urls = requests.get(rss_all_urls)
        target_urls_soup = BeautifulSoup(open_urls.text, 'html.parser')
        urls_titles = target_urls_soup.title
        urls_headlines = target_urls_soup.h1
        print (rss_all_urls, urls_titles, urls_headlines)

1 answer:

Answer 0: (score: 0)

So you want a list of URLs that contains only the URLs meeting a condition: the title of the page behind the URL contains one of the strings in a given list.

So first you need your lists:

titlesToMatch = ['title1', 'title2', 'title3']
finalArrayWithURLs = []

Then, once you have rss_all_urls, urls_titles and urls_headlines for a URL, add that URL to finalArrayWithURLs only if its title contains one of the strings in titlesToMatch:

for url in rss_urls:
    rss_all_urls = url.text
    open_urls = requests.get(rss_all_urls)
    target_urls_soup = BeautifulSoup(open_urls.text, 'html.parser')
    urls_titles = target_urls_soup.title
    urls_headlines = target_urls_soup.h1

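    # .title is a Tag (or None if the page has no <title>), so test its text
    title_text = urls_titles.get_text() if urls_titles is not None else ''
    if any(item in title_text for item in titlesToMatch):
        # Python lists use append(), not push(); store the URL string, not the Tag
        finalArrayWithURLs.append(rss_all_urls)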

After that, finalArrayWithURLs will contain only the URLs whose title contains one of the strings in titlesToMatch.
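
The question also asks about matching on the description and about keeping each URL tied to its matching text, e.g. in a .csv. Here is a rough sketch of one way that could look. It reads the title, link and description straight from each RSS `<item>` with the standard library's `xml.etree.ElementTree` rather than fetching every linked page, which is a different technique from the loop above. It is written for Python 3 (the question is tagged python-2.7). The names `SAMPLE_RSS`, `matching_items` and `rows_to_csv` and the sample feed contents are made up for illustration.

```python
import csv
import io
import re
import xml.etree.ElementTree as ET

# Hypothetical sample feed, standing in for the body of a real RSS response
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <link>http://example.com/</link>
    <description>Sample channel</description>
    <item>
      <title>Python 3.5 released</title>
      <link>http://example.com/python-release</link>
      <description>A new version is out.</description>
    </item>
    <item>
      <title>Weather report</title>
      <link>http://example.com/weather</link>
      <description>Rain expected; scraping leaves is hard work.</description>
    </item>
    <item>
      <title>Sports news</title>
      <link>http://example.com/sports</link>
      <description>Nothing relevant here.</description>
    </item>
  </channel>
</rss>"""


def matching_items(rss_text, words):
    """Return (link, title, description) tuples for every <item> whose title
    or description contains one of `words` as a whole word, ignoring case."""
    if not words:
        return []
    # One regex for the whole word list; re.escape keeps the words literal
    pattern = re.compile(
        r'\b(?:' + '|'.join(re.escape(w) for w in words) + r')\b',
        re.IGNORECASE)
    rows = []
    for item in ET.fromstring(rss_text).iter('item'):
        title = item.findtext('title', default='')
        description = item.findtext('description', default='')
        link = item.findtext('link', default='')
        if pattern.search(title) or pattern.search(description):
            # The link travels with its title and description in one tuple
            rows.append((link, title, description))
    return rows


def rows_to_csv(rows):
    """Serialise the matched rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(['url', 'title', 'description'])
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == '__main__':
    matches = matching_items(SAMPLE_RSS, ['python', 'scraping'])
    print(rows_to_csv(matches))
```

Because each row carries the link together with the text that matched, there is nothing to "retrieve again" afterwards. For a real feed, passing `requests.get(feed_url).content` in place of `SAMPLE_RSS` should work for ordinary RSS 2.0 feeds, though feeds that use XML namespaces (Atom, for instance) would need different tag names.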