Question

我正在使用带有foreach()的R包%dopar%来并行进行长（〜天）计算。我希望能够在其中一个产生错误的情况下停止整个计算集。但是，我还没有找到实现这一目标的方法，从文档和各种论坛我发现没有迹象表明这是可能的。特别是，break()不起作用，stop()仅停止当前计算，而不是整个foreach循环。

请注意，我不能使用简单的for循环，因为最终我想使用doRNG包来并行化它。

以下是我正在尝试的内容的简化，可重现的版本（此处与%do%一起显示，但在使用doRNG和%dopar%时我遇到了同样的问题）。请注意，实际上我想并行运行此循环的所有元素（此处为10）。

library(foreach)
myfunc <- function() {
  x <- foreach(k = 1:10, .combine="cbind", .errorhandling="stop") %do% {
    cat("Element ", k, "\n")
    Sys.sleep(0.5) # just to show that stop does not cause exit from foreach
    if(is.element(k, 2:6)) {
      cat("Should stop\n")
      stop("Has stopped")
    }
    k
  }
  return(x)
}
x <- myfunc()
# stop() halts the processing of k=2:6, but it does not stop the foreach loop itself.
# x is not returned. The execution produces the error message
# Error in { : task 2 failed - "Has stopped"

我想要实现的是，在某些条件下（此处遇到stop()时）可以立即退出整个foreach循环。

我发现无法用foreach实现此目的。似乎我需要一种方法向所有其他进程发送消息以使它们也停止。

如果foreach无法实现，有没有人知道替代方案？我也试图用parallel::mclapply实现这一点，但这也不起作用。

> sessionInfo()
R version 3.0.0 (2013-04-03)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] C/UTF-8/C/C/C/C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods base

other attached packages:
[1] foreach_1.4.0

loaded via a namespace (and not attached):
[1] codetools_0.2-8 compiler_3.0.0  iterators_1.0.6

Answer 1

听起来你想要一个不耐烦的版本的“停止”错误处理。您可以通过编写自定义组合函数来实现它，并在每个结果返回后安排foreach来调用它。要做到这一点，你需要：

使用支持即时调用combine的后端，例如doMPI或doRedis
不要启用.multicombine
将.inorder设为FALSE
将.init设置为某些内容（例如NULL）

以下是一个例子：

library(foreach)
parfun <- function(errval, n) {
  abortable <- function(errfun) {
    comb <- function(x, y) {
      if (inherits(y, 'error')) {
        warning('This will leave your parallel backend in an inconsistent state')
        errfun(y)
      }
      c(x, y)
    }
    foreach(i=seq_len(n), .errorhandling='pass', .export='errval',
            .combine='comb', .inorder=FALSE, .init=NULL) %dopar% {
      if (i == errval)
        stop('testing abort')
      Sys.sleep(10)
      i
    }
  }
  callCC(abortable)
}

请注意，我还将错误处理设置为“pass”，因此foreach将使用错误对象调用combine函数。无论callCC和后端中使用的错误处理如何，foreach函数都用于从foreach循环返回。在这种情况下，callCC将调用abortable函数，向其传递一个使用强制callCC立即返回的函数对象。通过从combine函数调用该函数，当我们检测到错误对象时，我们可以从foreach循环中转义，并让callCC返回该对象。有关详细信息，请参阅?callCC。

您实际上可以在没有注册并行后端的情况下使用parfun，并在执行引发错误的任务时验证foreach循环“中断”，但这可能需要一段时间，因为任务按顺序执行。例如，如果没有注册后端，则需要20秒才能执行：

print(system.time(parfun(3, 4)))

当并行执行parfun时，我们需要做的不仅仅是突破foreach循环：我们还需要停止工作，否则他们将继续计算分配的任务。使用doMPI，可以使用mpi.abort停止工作人员：

library(doMPI)
cl <- startMPIcluster()
registerDoMPI(cl)
r <- parfun(getDoParWorkers(), getDoParWorkers())
if (inherits(r, 'error')) {
  cat(sprintf('Caught error: %s\n', conditionMessage(r)))
  mpi.abort(cl$comm)
}

请注意，在循环中止后无法使用集群对象，因为事情没有被正确清理，这就是正常的“停止”错误处理不能以这种方式工作的原因。

Answer 2

这不是您问题的直接答案，但使用when()可以避免在满足条件时进入循环：

x <- foreach(k = 1:10, .combine="cbind", .errorhandling="stop") %:%
  when( !is.element(k, 2:6) ) %do%
  {
    cat("Element ", k, "\n")
    Sys.sleep(0.5)
    k
  }

修改

我忘记了一些事情：我认为这是设计上的，你不能只停止foreach循环。如果并行运行循环，则每个回合都是独立处理的，这意味着当您停止k=2的整个循环时，如果k=1的进程已经终止或仍在运行，则无法预测。因此，使用when()条件会为您提供确定性结果。

编辑2：考虑您的评论的另一种解决方案。

shouldStop <- FALSE x <- foreach(k = 1:10, .combine="cbind", .errorhandling="stop") %do% { if( !shouldStop ){ # put your time consuming code here cat("Element ", k, "\n") Sys.sleep(0.5) shouldStop <- shouldStop || is.element(k, 2:6) k } }

使用此解决方案，在停止条件变为真时运行的进程仍将计算结束，但您可以避免在所有即将进行的进程中耗费时间。

Answer 3

我从REvolution技术支持得到的答案：“不 - foreach目前没有办法停止对任何一个错误的所有并行计算”。

Answer 4

我没有太多运气让foreach做我想做的事情，所以这是一个使用parallel包的解决方案，似乎可以做我想要的。我使用intermediate中的mcparallel()选项将我的函数do.task()的结果立即传递给函数check.res()。如果do.task()抛出错误，则会在check.res()中使用此操作来触发调用tools::pskill以明确杀死所有工作人员。这可能不是很优雅，但它的工作原理是它会立即停止所有工作。此外，我可以简单地从当前环境继承do.task()中处理所需的所有变量。（实际上do.task()是一个复杂得多的函数，需要传入许多变量。）

library(parallel)

# do.task() and check.res() inherit some variables from enclosing environment

do.task <- function(x) {
  cat("Starting task", x, "\n")
  Sys.sleep(5*x)
  if(x==stopat) { 
    stop("Error in job", x) # thrown to mccollect() which sends it to check.res()
  }
  cat("  Completed task", x, "\n")
  return(10*x)
}

check.res <- function(r) { # r is list of results so far
  cat("Called check.res\n")
  sendKill <- FALSE
  for(j in 1:Njob) { # check whether need to kill
    if(inherits(r[[j]], 'try-error')) {
      sendKill <- TRUE
    }
  }
  if(sendKill) { # then kill all
    for(j in 1:Njob) {
      cat("Killing job", job[[j]]$pid, "\n") 
      tools::pskill(job[[j]]$pid) # mckill not accessible
    }
  }
}

Tstart <- Sys.time()
stopat <- 3
Njob <- 4
job <- vector("list", length=Njob)
for(j in 1:Njob) {
  job[[j]]<- mcparallel(do.task(j))
}
res <- mccollect(job, intermediate=check.res) # res is in order 1:Njob, regardless of how long jobs took
cat("Collected\n")
Tstop <- Sys.time()
print(difftime(Tstop,Tstart))
for(j in 1:Njob) {
  if(inherits(res[[j]], 'try-error')) {
    stop("Parallel part encountered an error")
  }
}

这为变量res

提供了以下屏幕转储和结果

> source("exp5.R")
Starting task 1 
Starting task 2 
Starting task 3 
Starting task 4 
  Completed task 1 
Called check.res
Called check.res
  Completed task 2 
Called check.res
Called check.res
Called check.res
Killing job 21423 
Killing job 21424 
Killing job 21425 
Killing job 21426 
Called check.res
Killing job 21423 
Killing job 21424 
Killing job 21425 
Killing job 21426 
Called check.res
Killing job 21423 
Killing job 21424 
Killing job 21425 
Killing job 21426 
Collected
Time difference of 15.03558 secs
Error in eval(expr, envir, enclos) : Parallel part encountered an error
> res
$`21423`
[1] 10

$`21424`
[1] 20

$`21425`
[1] "Error in do.task(j) : Error in job3\n"
attr(,"class")
[1] "try-error"
attr(,"condition")
<simpleError in do.task(j): Error in job3>

$`21426`
NULL

Answer 5

我没有尝试打破循环，而是在到达终端循环时将一个小文件写入磁盘，然后根据该文件的存在跳过所有剩余的迭代。

检查文件是否存在花费了我们不到一毫秒的计算时间。

Type 'Promise<string>' is not assignable to type 'string'.

当您没有固定数量的迭代，或者您的过程可以在所有迭代完成之前完成（例如，收敛）时，这非常好

# 1.4 seconds to check if a file exists a million times
system.time(lapply(1:1e6, function(x) file.exists("checker.txt")))
   user  system elapsed 
  1.204   0.233   1.437

关于这一点的妙处在于，即使您的列表非常长，即使您只将unlist（）删除，也只能得到这些值。

library(foreach)

alist <- foreach(i = 1:5000) %dopar% { 
  if(file.exists("checker.txt")) {
    return(NULL)
  } else {
    if(i = 20) {
      write("", "checker.txt") # write an empty file
    }
    return(i)
  }
}

file.remove("checker.txt")

不要费心尝试打破，而只是“跳过其余部分”！

Answer 6

史蒂夫韦斯顿的原始答案基本上回答了这个问题。但这里是他的答案的略微修改版本，它还以我需要的方式保留了两个额外的功能：（1）随机数生成; （2）打印运行时诊断。

suppressMessages(library(doMPI))

comb <- function(x, y) {
  if(inherits(y, 'error')) {
    stop(y)
  }
  rbind(x, y) # forces the row names to be 'y'
}

myfunc <- function() {
  writeLines(text="foreach log", con="log.txt")
  foreach(i=1:12, .errorhandling='pass', .combine='comb', .inorder=FALSE, .init=NULL) %dopar% {
    set.seed(100)
    sink("log.txt", append=TRUE)
    if(i==6) {
      stop('testing abort')
    }
    Sys.sleep(10)
    cat("Completed task", i, "\n")
    sink(NULL)
    rnorm(5,mean=i)
  }
}

myerr <- function(e) {
  cat(sprintf('Caught error: %s\n', conditionMessage(e)))
  mpi.abort(cl$comm)
}

cl <- startMPIcluster(4)
registerDoMPI(cl)
r <- tryCatch(myfunc(), error=myerr)
closeCluster(cl)

当此文件来源时，它会按预期退出，并显示错误消息

> source("exp2.R")
    4 slaves are spawned successfully. 0 failed.
Caught error: testing abort
[ganges.local:16325] MPI_ABORT invoked on rank 0 in communicator  with errorcode 0

'log.txt'文件提供了直至错误点的正确诊断，然后提供其他错误信息。至关重要的是，一旦遇到foreach循环中的stop（），就会暂停执行所有任务：它不会等到整个foreach循环完成。因此，我只看到“完成任务”消息，直到i = 4。（请注意，如果Sys.sleep（）较短，则可以在处理mpi.abort（）之前启动以后的任务。）

如果我将停止条件更改为“i == 100”，则不会触发停止，因此不会触发错误。代码成功存在，没有错误消息，r是尺寸为12 * 5的2D数组。

顺便说一句，似乎我实际上并不需要.inorder = FALSE（我认为只是在发现错误的情况下才会给我一个小的速度提升）。

有没有办法打破foreach循环？

6 个答案: