Question

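I'm writing a multithreaded script to retrieve content from a website. The site is not very stable, so every now and then an HTTP request hangs and cannot even be timed out by socket.setdefaulttimeout(). Since I have no control over the website, the only thing I can do is improve my code, but I'm running out of ideas.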

Sample code:

import socket
import urllib2
import mechanize

socket.setdefaulttimeout(150)

MechBrowser = mechanize.Browser()
Header = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8 GTB7.1 (.NET CLR 3.5.30729)'}
Url = "http://example.com"
Data = "Justatest=whatever&letstry=doit"
Request = urllib2.Request(Url, Data, Header)
Response = MechBrowser.open(Request)
Response.close()

What should I do to force the request to stop? Actually, I'd like to know why socket.setdefaulttimeout(150) is not working in the first place. Can anybody help me out?

Added: (and yes, the problem is still not solved)

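OK, I've followed tomasz's suggestion and changed the code to MechBrowser.open(Request, timeout = 60), but the same thing happens. I still get hanging requests at random; sometimes it takes a few hours, other times it can take a few days. What do I do now? Is there a way to force these hanging requests to quit?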

Answer 1

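While socket.setdefaulttimeout will set the default timeout for new sockets, the setting can easily be overwritten if you are not using the sockets directly. In particular, if the library calls socket.setblocking on its socket, it will reset the timeout.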

urllib2.urlopen has a timeout argument; however, urllib2.Request does not. As you're using mechanize, you should refer to their documentation:

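Since Python 2.6, urllib2 uses a .timeout attribute on Request objects internally. However, urllib2.Request has no timeout constructor argument, and urllib2.urlopen() ignores this parameter. mechanize.Request has a timeout constructor argument which is used to set the attribute of the same name, and mechanize.urlopen() does not ignore the timeout attribute.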

Source: http://wwwsearch.sourceforge.net/mechanize/documentation.html
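As a rough illustration of a per-request timeout, here is a minimal sketch written against Python 3's urllib.request (the successor to urllib2) rather than mechanize. The helper names fetch_with_timeout and demo, and the silent local listener, are assumptions made purely for the example. The listener completes the TCP handshake but never replies, so the read blocks until the timeout fires.

```python
import socket
import time
import urllib.error
import urllib.request


def fetch_with_timeout(url, timeout):
    """Return the response body, or None if the request fails or times out."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except (urllib.error.URLError, socket.timeout):
        # Connection errors arrive wrapped in URLError; a timeout while
        # waiting for the response surfaces as socket.timeout.
        return None


def demo(timeout=0.5):
    """Request a page from a local socket that listens but never answers."""
    silent = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    silent.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    silent.listen(1)               # connections queue up, nobody replies
    host, port = silent.getsockname()

    start = time.monotonic()
    result = fetch_with_timeout("http://%s:%d/" % (host, port), timeout)
    elapsed = time.monotonic() - start

    silent.close()
    return result, elapsed


if __name__ == "__main__":
    print(demo())
```

Note that this timeout still applies to each blocking socket operation separately, not to the request as a whole.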

--- EDIT ---

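If either socket.setdefaulttimeout or passing a timeout to mechanize works with small values but not with larger ones, the source of the problem might be completely different. One thing is that your library may open multiple connections (credit to @Cédric Julien here), so the timeout applies to every single connection attempt, and if it doesn't stop at the first failure, the whole thing can take up to timeout * num_of_conn seconds. The other thing is socket.recv: if the connection is really slow and you're unlucky enough, the whole request can take up to timeout * incoming_bytes seconds, because every socket.recv call could return a single byte, and every such call could take up to timeout seconds. You're unlikely to hit exactly this dark scenario (one byte per timeout seconds? you would have to be a very rude boy), but with a very slow connection and a very high timeout, a request can easily take ages.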
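The socket.recv effect described above can be reproduced locally. The following is a minimal Python 3 sketch, not taken from the original answer: drip_server and slow_read are hypothetical helper names, and the payload, delay and timeout values are arbitrary. The server sends one byte at a time, each well inside the socket timeout, so no timeout is ever raised even though the whole transfer takes longer than the timeout.

```python
import socket
import threading
import time


def drip_server(listener, payload, delay):
    """Accept one connection and send the payload one byte at a time."""
    conn, _ = listener.accept()
    with conn:
        for i in range(len(payload)):
            time.sleep(delay)               # pause before every single byte
            conn.sendall(payload[i:i + 1])


def slow_read(payload=b"hello", delay=0.3, timeout=1.0):
    """Read the dripped payload with a per-operation socket timeout."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))         # port 0: let the OS pick a port
    listener.listen(1)
    server = threading.Thread(target=drip_server,
                              args=(listener, payload, delay))
    server.daemon = True
    server.start()

    client = socket.create_connection(listener.getsockname(), timeout=timeout)
    start = time.monotonic()
    received = b""
    while True:
        chunk = client.recv(1024)           # each call waits at most `timeout`
        if not chunk:                       # empty bytes: server closed
            break
        received += chunk
    elapsed = time.monotonic() - start

    client.close()
    server.join()
    listener.close()
    return received, elapsed, timeout


if __name__ == "__main__":
    data, elapsed, timeout = slow_read()
    print(data, "took %.1fs with a %.1fs socket timeout" % (elapsed, timeout))
```

With the defaults, five bytes at 0.3 seconds each add up to roughly 1.5 seconds, which exceeds the 1.0 second timeout without any single recv call ever timing out.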

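The only solution you have is to force a timeout on the whole request, and that has nothing to do with sockets. If you're on Unix, you can use a simple solution based on the ALARM signal. You set the signal to be raised after timeout seconds, and your request will be interrupted (don't forget to catch the exception). Note that Python only lets the main thread install signal handlers, which matters for a multithreaded script. You might like to use a with statement to make it clean and easy to use, for example: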

import signal, time

def request(arg):
  """Your http request"""
  time.sleep(2)
  return arg

class Timeout():
  """Timeout class using ALARM signal"""
  class Timeout(Exception): pass

  def __init__(self, sec):
    self.sec = sec

  def __enter__(self):
    signal.signal(signal.SIGALRM, self.raise_timeout)
    signal.alarm(self.sec)

  def __exit__(self, *args):
    signal.alarm(0) # disable alarm

  def raise_timeout(self, *args):
    raise Timeout.Timeout()

# Run block of code with timeouts
try:
  with Timeout(3):
    print request("Request 1")
  with Timeout(1):
    print request("Request 2")
except Timeout.Timeout:
  print "Timeout"

# Prints "Request 1" and "Timeout"

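If you want something more portable than this, you have to bring out bigger guns, for example multiprocessing: you spawn a process to run your request and terminate it if it is overdue. As this is a separate process, you need something to transfer the result back to your application; multiprocessing.Pipe is one option. Here is an example: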

from multiprocessing import Process, Pipe
import time

def request(sleep, result):
  """Your http request example"""
  time.sleep(sleep)
  return result

class TimeoutWrapper():
  """Timeout wrapper using separate process"""
  def __init__(self, func, timeout):
    self.func = func
    self.timeout = timeout

  def __call__(self, *args, **kargs):
    """Run func with timeout"""
    def pmain(pipe, func, args, kargs):
      """Function to be called in separate process"""
      result = func(*args, **kargs) # call func with passed arguments
      pipe.send(result) # send result to pipe

    parent_pipe, child_pipe = Pipe() # Pipe for retrieving result of func
    p = Process(target=pmain, args=(child_pipe, self.func, args, kargs))
    p.start()
    p.join(self.timeout) # wait for process to end

    if p.is_alive():
      p.terminate() # Timeout, kill
      return None # or raise exception if None is acceptable result
    else:          
      return parent_pipe.recv() # OK, get result

print TimeoutWrapper(request, 3)(1, "OK") # prints OK
print TimeoutWrapper(request, 1)(2, "Timeout") # prints None

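You really don't have much choice if you want to force the request to terminate after a fixed number of seconds. A socket timeout applies to a single socket operation (connect/recv/send), so if there are many of them you can still end up with a very long execution time.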
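For a hard deadline on the whole operation, one alternative to the multiprocessing wrapper above is to run the work in a child interpreter through the standard subprocess module, which is a different technique from the one in the answer. This is only a sketch in Python 3: run_with_deadline is a hypothetical helper, and passing the work as a source string is an assumption made to keep the example short.

```python
import subprocess
import sys


def run_with_deadline(code, deadline):
    """Run Python source `code` in a child interpreter.

    Returns the child's stripped stdout, or None if the child is still
    running after `deadline` seconds (subprocess.run kills it in that case).
    """
    try:
        completed = subprocess.run(
            [sys.executable, "-c", code],
            stdout=subprocess.PIPE,
            timeout=deadline,       # limit for the whole child process
        )
    except subprocess.TimeoutExpired:
        return None
    return completed.stdout.decode().strip()
```

Calling run_with_deadline("print('OK')", 30) returns "OK", while a child that sleeps past its deadline yields None. Because the limit covers the entire child process, it does not matter how many socket operations the request performs.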

Answer 2

From their documentation:

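Since Python 2.6, urllib2 uses a .timeout attribute on Request objects internally. However, urllib2.Request has no timeout constructor argument, and urllib2.urlopen() ignores this parameter. mechanize.Request has a timeout constructor argument which is used to set the attribute of the same name, and mechanize.urlopen() does not ignore the timeout attribute.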

Maybe you should try replacing urllib2.Request with mechanize.Request.

Answer 3

You could try using mechanize with eventlet. It does not solve your timeout problem, but greenlets are non-blocking, so it can solve your performance problem.

Answer 4

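I suggest a simple workaround: move the request to a different process, and if it fails to finish in time, kill it from the calling process, like this: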

    from multiprocessing import Process
    import time

    # yourFunction and some_queue are placeholders for your own code
    checker = Process(target=yourFunction, args=(some_queue,))
    timeout = 150
    checker.start()
    counter = 0
    while checker.is_alive():
        time.sleep(1)
        counter += 1
        if counter > timeout:
            print "Child process consumed too much run-time. Going to kill it!"
            checker.terminate()  # stop the child process
            break

Simple, fast and effective.

What should I do if socket.setdefaulttimeout() is not working?
