Question

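I have a Python script that scans a page's source and downloads every file it finds linked there.

However, the script also tries to download files that do not exist (dead links).

I did some research and found that an HTTP HEAD request can solve this: it returns the status code without downloading the file or any other content.

Basically, I want to check whether the server returns 404. If it does, the file does not exist and I do not want to download it.

I found the following code, which looks like it would work, but it needs some changes before it fits my script: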

c = httplib.HTTPConnection(<hostname>)
c.request("HEAD", <url>)
print c.getresponse().status 

urllib.urlretrieve(test, get)

<hostname> should be the website (http://google.com) and <url> should be the file (/file1.pdf).

I need this code to work when given only the full URL, e.g. http://google.com/file1.pdf.

Is there any way I can do that?

The code comes from here: How do I check the HTTP status code of an object without downloading it?

Answer 1

The code above did not seem to work :(

I managed to solve it!

import urllib

# test (the file URL), get (the local save path) and doc (the file name)
# are defined elsewhere in my script.

# Get the status code and store it in status
status = urllib.urlopen(test).getcode()
print status  # print the status, for testing purposes

if status == 200:  # 200 (OK)
    urllib.urlretrieve(test, get)  # download the file
    print 'The file:', doc, 'has been saved to:', get  # display success message
elif status == 404:  # 404 (NOT FOUND)
    print 'The file:', doc, 'could not be saved. Does not exist!!'  # display error
else:  # any other status: display an error and the status code
    print 'Unknown Error:', status
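
Note that `urllib.urlopen` issues a GET request, not a HEAD, so the check above still asks the server for the file itself. For comparison, here is a minimal Python 3 sketch of the same status check done with a real HEAD request. The function name `head_status` and the `timeout` parameter are made up for illustration and are not part of the answer above. In Python 3, `urllib.request.urlopen` raises `HTTPError` for a 404 instead of returning it, so the code reads the status from the exception.

```python
import urllib.error
import urllib.request


def head_status(url, timeout=10):
    """Return the HTTP status code for url, using a HEAD request.

    head_status is a hypothetical helper name used only in this sketch.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urlopen raises for 4xx/5xx responses; the code is on the exception.
        return err.code
```

A caller would then download only when `head_status(url) == 200` and report a dead link when it returns 404.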

Answer 2

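import httplib
from urlparse import urlparse

file_url = "http://google.com/file1.pdf"
parts = urlparse(file_url)  # netloc = "google.com", path = "/file1.pdf"

# HTTPConnection takes the host; the request takes the path on that host.
c = httplib.HTTPConnection(parts.netloc)
c.request("HEAD", parts.path)
if c.getresponse().status == 200:
    # download() stands for your own download routine,
    # e.g. urllib.urlretrieve(file_url, destination)
    download(file_url)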

Use HEAD to check whether the file exists before downloading it.
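
The same idea can be sketched in Python 3, where `httplib` is named `http.client` and `urlparse` lives in `urllib.parse`. This is only an illustration, assuming the goal is a function that takes nothing but the full URL. The names `status_via_http_client` and `download_if_exists` and the `destination` argument are hypothetical.

```python
import http.client
import urllib.request
from urllib.parse import urlsplit


def status_via_http_client(url, timeout=10):
    """Split a full URL into host and path, send HEAD, return the status code."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.hostname, parts.port, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.hostname, parts.port, timeout=timeout)
    try:
        # The request line needs the path (plus any query string), not the full URL.
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        return conn.getresponse().status
    finally:
        conn.close()


def download_if_exists(url, destination):
    """Download url to destination only if a HEAD request returns 200."""
    if status_via_http_client(url) != 200:
        return False
    urllib.request.urlretrieve(url, destination)
    return True
```

With this, `download_if_exists("http://google.com/file1.pdf", "file1.pdf")` needs only the URL and a save path. Keep in mind that `http.client` does not follow redirects, so a 301 or 302 response is reported as is and treated here as "do not download".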
