Celery - queueing tasks in bulk

Date: 2016-11-18 18:29:28

Tags: python celery celery-task

I have some code that queues up a large number (thousands) of Celery tasks. For example, say it looks like this:

for x in xrange(2000):
    example_task.delay(x)

Is there a better/more efficient way to queue up a large number of tasks at once? They all have different arguments.

2 answers:

Answer 0 (score: 2)

Invoking a huge number of tasks at once may be unhealthy for your Celery workers. Also, if you are thinking of collecting the results of the invoked tasks, your code will not be optimal.

You can split the tasks into batches. Consider the example described at the link below.

http://docs.celeryproject.org/en/latest/userguide/canvas.html#chunks
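For illustration, `chunks` turns one long stream of task arguments into a small number of larger tasks (something along the lines of `example_task.chunks(((x,) for x in xrange(2000)), 100)()`; check the linked docs for the exact call). The grouping it performs can be sketched in plain Python:

```python
def chunked(args_list, n):
    """Split a list of argument tuples into chunks of size n, which is
    the grouping that celery's chunks primitive performs before dispatch."""
    return [args_list[i:i + n] for i in range(0, len(args_list), n)]

# 2000 single-argument calls become 20 tasks of 100 calls each
jobs = chunked([(x,) for x in range(2000)], 100)
```

This way the broker sees 20 messages instead of 2000, at the cost of coarser-grained parallelism.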

Answer 1 (score: 0)

We ran into this problem too, when we wanted to process millions of PDFs with Celery. Our solution was to write something we call a CeleryThrottle. Basically, you configure the throttle with the Celery queue you want to use and the number of tasks you want in it, and then you create tasks in a loop. As tasks are created, the throttle monitors the length of the actual queue. If the queue drains too quickly, it speeds up your loop for a while so that more tasks get added. If the queue grows too long, it slows your loop down and lets some tasks complete.

Here's the code:

import time
from collections import deque

from django.utils.timezone import now  # any "current time" callable works here

# get_queue_length(queue_name) is a project-specific helper that asks the
# broker (e.g. Redis LLEN) for the current depth of the named queue.


class CeleryThrottle(object):
    """A class for throttling celery."""

    def __init__(self, min_items=100, queue_name='celery'):
        """Create a throttle to prevent celery run aways.

        :param min_items: The minimum number of items that should be enqueued. 
        A maximum of 2× this number may be created. This minimum value is not 
        guaranteed and so a number slightly higher than your max concurrency 
        should be used. Note that this number includes all tasks unless you use
        a specific queue for your processing.
        """
        self.min = min_items
        self.max = self.min * 2

        # Variables used to track the queue and wait-rate
        self.last_processed_count = 0
        self.count_to_do = self.max
        self.last_measurement = None
        self.first_run = True

        # Use a fixed-length queue to hold last N rates
        self.rates = deque(maxlen=15)
        self.avg_rate = self._calculate_avg()

        # For inspections
        self.queue_name = queue_name

    def _calculate_avg(self):
        return float(sum(self.rates)) / (len(self.rates) or 1)

    def _add_latest_rate(self):
        """Calculate the rate that the queue is processing items."""
        right_now = now()
        elapsed_seconds = (right_now - self.last_measurement).total_seconds()
        self.rates.append(self.last_processed_count / elapsed_seconds)
        self.last_measurement = right_now
        self.last_processed_count = 0
        self.avg_rate = self._calculate_avg()

    def maybe_wait(self):
        """Stall the calling function or let it proceed, depending on the queue.

        The idea here is to check the length of the queue as infrequently as 
        possible while keeping the number of items in the queue as closely 
        between self.min and self.max as possible.

        We do this by immediately enqueueing self.max items. After that, we 
        monitor the queue to determine how quickly it is processing items. Using 
        that rate we wait an appropriate amount of time or immediately press on.
        """
        self.last_processed_count += 1
        if self.count_to_do > 0:
            # Do not wait. Allow process to continue.
            if self.first_run:
                self.first_run = False
                self.last_measurement = now()
            self.count_to_do -= 1
            return

        self._add_latest_rate()
        task_count = get_queue_length(self.queue_name)
        if task_count > self.min:
            # Estimate how long the surplus will take to complete and wait that
            # long + 5% to ensure we're below self.min on next iteration.
            surplus_task_count = task_count - self.min
            wait_time = (surplus_task_count / self.avg_rate) * 1.05
            time.sleep(wait_time)

            # Assume we're below self.min due to waiting; max out the queue.
            if task_count < self.max:
                self.count_to_do = self.max - self.min
            return

        elif task_count <= self.min:
            # Add more items.
            self.count_to_do = self.max - task_count
            return
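The core arithmetic of maybe_wait can be exercised on its own. Here is a minimal sketch of the two calculations (the helper names here are mine, not part of the class):

```python
from collections import deque


def rolling_avg_rate(rates):
    """Mean of the last N observed processing rates (tasks/second)."""
    return float(sum(rates)) / (len(rates) or 1)


def surplus_wait_time(task_count, min_items, avg_rate, padding=1.05):
    """Seconds to sleep so the surplus above min_items drains, plus 5%."""
    surplus = task_count - min_items
    if surplus <= 0 or avg_rate <= 0:
        return 0.0
    return (surplus / avg_rate) * padding


rates = deque([10.0, 20.0, 30.0], maxlen=15)  # same fixed-length window as self.rates
avg = rolling_avg_rate(rates)                 # 20.0 tasks/second
wait = surplus_wait_time(140, 100, avg)       # 40 surplus / 20.0 per sec * 1.05 = 2.1 s
```

The 5% padding is what nudges the queue back below min_items on the next check rather than landing exactly on it.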

We use it like this:

throttle = CeleryThrottle(min_items=30, queue_name=queue)
for item in items:
    throttle.maybe_wait()
    do_something.delay(item)

So it's very simple to use, and it does a nice job of keeping the queue at a happy length: not too long and not too short. It maintains a rolling average of the rate at which the queue drains and adjusts its own timing accordingly.