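Checking a URL (string)

Time: 2017-10-20 15:06:03

Tags: python-3.x url

I'm having some trouble handling URLs as strings. I need to check whether a URL belongs to a domain in a whitelist, but the check fails. I'd like to understand why, and whether my code is missing something.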

whitelist = []
whitelist_file = open(whitelist_file, 'r')
url = whitelist_file.readline()
for url in whitelist_file:
    whitelist = whitelist + [str(url)]
whitelist_file.close()

test_file = open(test_file, 'r')
url_to_check = test_file.readlines()

for url in url_to_check:
    for word in whitelist:
        print(str(word), str(url), word in url)
        print("-----")

Here is the printed output of the code above (so you have a sample of the strings being checked). You can see that a2a.eu fails:

a2a.eu
 https://www.medgadget.com/2017/10/adenosine-a2a-receptor-antagonist-pipeline-insights-2017.html
 False
-----
ansa.it
 https://www.medgadget.com/2017/10/adenosine-a2a-receptor-antagonist-pipeline-insights-2017.html
 False
-----
atlantia.it
 https://www.medgadget.com/2017/10/adenosine-a2a-receptor-antagonist-pipeline-insights-2017.html
 False
-----
azimut-group.com
 https://www.medgadget.com/2017/10/adenosine-a2a-receptor-antagonist-pipeline-insights-2017.html
 False
-----
a2a.eu
 https://www.a2a.eu/en/2017-financial-calendar-a2a-spa
 False
-----
ansa.it
 https://www.a2a.eu/en/2017-financial-calendar-a2a-spa
 False
-----
atlantia.it
 https://www.a2a.eu/en/2017-financial-calendar-a2a-spa
 False
-----
azimut-group.com
 https://www.a2a.eu/en/2017-financial-calendar-a2a-spa
 False
-----
a2a.eu
 http://www.a2a.eu/en
 False
-----
ansa.it
 http://www.a2a.eu/en
 False
-----
atlantia.it
 http://www.a2a.eu/en
 False
-----
azimut-group.com
 http://www.a2a.eu/en
 False

Thanks

2 answers:

Answer 0 (score: 0)

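First, for some of the cases in your output this check should produce True. That is only a judgment based on the printed output. I suspect your url or word (in the whitelist) is not the string object you think it is; try converting both to str inside the print statement: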

  print(str(word), str(url), str(word) in str(url))

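Also, since you seem to want to check only the domain, take a look at urllib (https://docs.python.org/3/library/urllib.html), which lets you parse the URL down to its domain part and check that:

  from urllib.parse import urlparse
  print(str(word), str(url), str(word) in urlparse(str(url)).hostname)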
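A substring test against the hostname would still accept a host such as nota2a.eu, so a stricter comparison may be worth considering. The following is only a sketch of one way to do that: the function name is_whitelisted is hypothetical, and it assumes that a whitelist entry should match either the exact host or any subdomain of it.

```python
from urllib.parse import urlparse


def is_whitelisted(url, whitelist):
    """Return True if the URL's host equals a whitelisted domain or is a subdomain of one.

    Hypothetical helper, not part of the original question or answers.
    """
    # strip() removes the trailing newline left over from reading a file.
    # hostname is None when the URL has no network location (e.g. no scheme).
    host = urlparse(url.strip()).hostname
    if host is None:
        return False
    for domain in whitelist:
        domain = domain.strip().lower()
        if not domain:
            continue  # skip blank lines in the whitelist
        if host == domain or host.endswith("." + domain):
            return True
    return False


whitelist = ["a2a.eu", "ansa.it", "atlantia.it", "azimut-group.com"]
print(is_whitelisted("https://www.a2a.eu/en/2017-financial-calendar-a2a-spa", whitelist))
print(is_whitelisted(
    "https://www.medgadget.com/2017/10/adenosine-a2a-receptor-antagonist-pipeline-insights-2017.html",
    whitelist,
))
```

With the sample URLs from the question, the first call should report a match and the second should not, even though the medgadget path contains the text "a2a".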

Answer 1 (score: 0)

The url on line 5 contains a newline character. Calling strip() on it should fix the problem:

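whitelist = []
whitelist_file = open(whitelist_file, 'r')
# As in the question, the first line is read and discarded.
url = whitelist_file.readline()
for url in whitelist_file:
    # strip() removes the trailing newline that made the check fail.
    whitelist = whitelist + [str(url.strip())]
# Close the file only after the loop has finished reading it.
whitelist_file.close()

test_file = open(test_file, 'r')
url_to_check = test_file.readlines()

for url in url_to_check:
    for word in whitelist:
        print(str(word), str(url), word in url)
        print("-----")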
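To see why the trailing newline matters, it can help to print the strings with repr(), which makes the hidden character visible. The snippet below is a small illustration only: it uses io.StringIO as a stand-in for the whitelist file, and the two entries are taken from the sample output in the question.

```python
import io

# Stand-in for the whitelist file (assumption: one domain per line).
whitelist_file = io.StringIO("a2a.eu\nansa.it\n")

raw = list(whitelist_file)               # each line keeps its trailing "\n"
clean = [line.strip() for line in raw]   # trailing whitespace removed

url = "http://www.a2a.eu/en\n"

# repr() shows the newline that a plain print() hides.
print(repr(raw[0]), raw[0] in url)       # 'a2a.eu\n' False
print(repr(clean[0]), clean[0] in url)   # 'a2a.eu' True
```

The unstripped entry is followed by a newline, while in the URL the text "a2a.eu" is followed by "/en", so the substring test can never succeed until the whitelist entries are stripped.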