Using a regular expression to exclude href links from search results

Date: 2018-08-22 15:14:33

Tags: python regex python-3.x list google-api

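I am trying to exclude certain links from my Google API search results. For each domain in the links_to_exclude list I build a regular expression and test the returned link against it. This approach still outputs links I don't want.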

Some of the links returned:

https://money.cnn.com/2018/08/21/technology/facebook-disinformation-iran-russia/index.html

https://www.cnn.com/videos/politics/2018/08/22/carl-bernstein-worse-than-watergate-egregious-trump-newday-sot-vpx.cnn

https://www.nytimes.com/2018/08/13/us/politics/peter-strzok-fired-fbi.html?hp&action=click&pgtype=Homepage&clickSource=story-heading&module=first-column-region&region=top-news&WT.nav=top-news

How can I exclude these links using a regular expression?

links_to_exclude = ['cnn.com', 'nytimes.com']

for item in search_terms:
    results = google_search(item, api_key, cse_id, num=1)
    for result in results:
        rtn_link = result.get('link')
        for link in links_to_exclude:
            regex = '((http[s]?|ftp):\/)?\/?([^:\/\s]+)?({})\/([^\/]+)'.format(link)
            if re.search(regex, rtn_link):
                continue
            else:
                pprint.pprint(result.get('link'))

1 Answer:

Answer 0 (score: 1)

Your regular expression appears to be correct. I think you are just missing import re in your script.

See here: https://ideone.com/Uzcf1K

import re

links_to_exclude = ['cnn.com', 'nytimes.com']
results = ['https://foo.bar', 'https://money.cnn.com/2018/08/21/technology/facebook-disinformation-iran-russia/index.html','https://www.cnn.com/videos/politics/2018/08/22/carl-bernstein-worse-than-watergate-egregious-trump-newday-sot-vpx.cnn','https://www.nytimes.com/2018/08/13/us/politics/peter-strzok-fired-fbi.html?hp&action=click&pgtype=Homepage&clickSource=story-heading&module=first-column-region&region=top-news&WT.nav=top-news']

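for result in results:
    print("URL: " + result)
    for link in links_to_exclude:
        # Raw string so the backslashes reach the regex engine unchanged
        regex = r'((http[s]?|ftp):\/)?\/?([^:\/\s]+)?({})\/([^\/]+)'.format(link)
        # Report, per excluded domain, whether the pattern matches this URL
        if re.search(regex, result):
            print('  Matches: ' + link)
        else:
            print('  Does not match: ' + link)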

Output:

URL: https://foo.bar
  Does not match: cnn.com
  Does not match: nytimes.com
URL: https://money.cnn.com/2018/08/21/technology/facebook-disinformation-iran-russia/index.html
  Matches: cnn.com
  Does not match: nytimes.com
URL: https://www.cnn.com/videos/politics/2018/08/22/carl-bernstein-worse-than-watergate-egregious-trump-newday-sot-vpx.cnn
  Matches: cnn.com
  Does not match: nytimes.com
URL: https://www.nytimes.com/2018/08/13/us/politics/peter-strzok-fired-fbi.html?hp&action=click&pgtype=Homepage&clickSource=story-heading&module=first-column-region&region=top-news&WT.nav=top-news
  Does not match: cnn.com
  Matches: nytimes.com
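
A note on the loop in the question: the output above shows that every URL fails to match at least one of the excluded domains. Because the `continue` in the question only skips to the next entry of links_to_exclude, the `else` branch still prints a cnn.com link when it is compared against nytimes.com, and the other way round. One way to avoid that is to decide once per URL whether any excluded domain matches. The sketch below is a minimal illustration, not the asker's final code: `is_excluded`, `exclude_patterns`, `sample_links`, and the simplified pattern are names and choices introduced here, and `re.escape` is used so the dot in each domain is treated literally.

```python
import re

links_to_exclude = ['cnn.com', 'nytimes.com']

# One compiled pattern per excluded domain (simplified pattern, an assumption
# of this sketch). re.escape() makes the dot literal, and (?:[^/\s]+\.)?
# allows an optional subdomain prefix such as "money." or "www.".
exclude_patterns = [
    re.compile(r'^(?:https?|ftp)://(?:[^/\s]+\.)?' + re.escape(domain) + r'(?:[/:?#]|$)')
    for domain in links_to_exclude
]


def is_excluded(url):
    """Return True if the URL's host is an excluded domain or a subdomain of one."""
    return any(pattern.search(url) for pattern in exclude_patterns)


# URLs from the question and answer; https://foo.bar is the one that should be kept.
sample_links = [
    'https://foo.bar',
    'https://money.cnn.com/2018/08/21/technology/facebook-disinformation-iran-russia/index.html',
    'https://www.cnn.com/videos/politics/2018/08/22/carl-bernstein-worse-than-watergate-egregious-trump-newday-sot-vpx.cnn',
    'https://www.nytimes.com/2018/08/13/us/politics/peter-strzok-fired-fbi.html?hp&action=click',
]

# Keep a link only when none of the excluded domains match it.
kept_links = [url for url in sample_links if not is_excluded(url)]
print(kept_links)
```

In the original loop, the same check could replace the inner `for link in links_to_exclude` block: call `is_excluded(rtn_link)` once per result and print the link only when it returns False.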