Question

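I am trying to parallelize a process that loops over the rows of a matrix. Each row is a species, and for each one I want to extract and write a file (a raster) corresponding to the distribution of that species over its habitats.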

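The habitats layer is a raster file, and each species distribution is a polygon (or group of polygons) from a shapefile. I first convert the species polygons to a raster, then extract the species' habitats (stored in a matrix that matches species habitat codes to habitat raster values), and finally intersect (multiply) the distribution and the habitats.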

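In addition, I want to produce a richness (map of the number of species) file (raster). To do that, I add (sum) each final species distribution to an empty raster (with values of zero). I wrote the following function: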

extract_habitats=function(k,spp_data,spp_polygons,sep,habitat_codes,cover)
{
  #Libraries
  library(rgdal)
  library(raster)
  #raster file with zeros
  richness_cur=raster("richness_current.tif")
  #Selection of species polygons
  rows=as.numeric(which(as.character(spp_polygons@data$binomial)==
                          as.character(spp_data$binomial[k])))
  spp_poly=spp_polygons[rows,]
  #Convert polygon(s) to raster
  spp_poly=rasterize(spp_poly,cover,1,background=0)
  #Match species habitats codes with habitats raster values
  habs=as.character(spp_data$hab_code[k])
  habs=unlist(strsplit(habs, split=sep))#habitat codes are separated by a ";"
  cov_classes=as.numeric(as.character(habitat_codes[,2]#Get the hab
                                      [which(as.character(habitat_codes[,1])%in%habs)]))
  #Intersect species distributions with habitats
  cov_mask=spp_poly*cover
  #Extract species habitats
  cov_mask=Which(cov_mask%in%cov_classes)
  writeRaster(cov_mask,paste(spp_data$binomial[k]," current.tif",sep=""))
  #Sum species richness
  richness_cur=richness_cur+cov_mask
  return (richness_cur)
}

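I tried to parallelize the process using the clusterApply and foreach functions. However, I could not return a raster object from the function (something that is straightforward in a regular loop), so with neither function could I accumulate the species richness sum into that object. So, this is my first question. 1. Does anyone know how to return, from a parallelized process, an object other than a list, matrix or vector?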

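I worked around this by writing the richness file in each "iteration". However, this option makes the process slower, so it is not ideal for me. The function was then rewritten as follows: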

extract_habitats=function(k,spp_data,spp_polygons,sep,habitat_codes,cover)
{
  #Libraries
  library(rgdal)
  library(raster)
  #raster file with zeros
  richness_cur=raster("richness_current.tif")
  #Selection of species polygons
  rows=as.numeric(which(as.character(spp_polygons@data$binomial)==
                          as.character(spp_data$binomial[k])))
  spp_poly=spp_polygons[rows,]
  #Convert polygon(s) to raster
  spp_poly=rasterize(spp_poly,cover,1,background=0)
  #Match species habitats codes with habitats raster values
  habs=as.character(spp_data$hab_code[k])
  habs=unlist(strsplit(habs, split=sep))#habitat codes are separated by a ";"
  cov_classes=as.numeric(as.character(habitat_codes[,2]#Get the hab
                                      [which(as.character(habitat_codes[,1])%in%habs)]))
  #Intersect species distributions with habitats
  cov_mask=spp_poly*cover
  #Extract species habitats
  cov_mask=Which(cov_mask%in%cov_classes)
  writeRaster(cov_mask,paste(spp_data$binomial[k]," current.tif",sep=""))
  #Sum species richness
  richness_cur=richness_cur+cov_mask
  writeRaster(richness_cur,"richness_current.tif")
}

The complete code to run the parallelization is:

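#Libraries
library(doParallel) #also loads foreach and parallel
library(xlsx)       #read.xlsx2
library(rgdal)      #readOGR
library(raster)
#Number of cores
no_cores=detectCores()-1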
#Initiate cluster
cl=makeCluster(no_cores,type="PSOCK")
registerDoParallel(cl)

#Table with name and habitat information (columns) for each species (rows)
spp_data=read.xlsx2("species_file.xls",sheetIndex=1)
#Shape file with species distributions as polygons
spp_polygons=readOGR("distributions.shp")
#Separation symbol for species habitats stored in spp_data
sep=";"
#Table joining species habitat codes with habitat raster values
habitat_codes=read.xlsx2("spp_habitats_final.xls",sheetIndex=1)
#Habitats raster
cover=raster("Z:/Data/cover_2015_proj_fixed_reclas_1km.tif")

#Parallelization
foreach(k=1:nrow(spp_data)) %dopar% extract_habitats(k=k,
                                                     spp_data=spp_data,
                                                     spp_polygons=spp_polygons,sep=sep,
                                                     habitat_codes=habitat_codes,
                                                     cover=cover)
stopImplicitCluster()
stopCluster(cl)

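The parallelized process runs; however, it does not behave as I expected, because it is not using all of the cores (image: processors working). What the parallelization does is launch 39 processes, one per core (image: processes opened), but it does not write the files one by one, as I would expect from a regular loop. Instead, it suddenly writes a block of 39 files (something I can understand), but this takes a long time (it seems to be working on only a few cores), even longer than the regular loop: the regular loop writes one file every two or three minutes, whereas the parallel run writes a block of 39 files roughly once an hour.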

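So, this is my second set of questions. 2. What am I doing wrong? 3. Why is it not using all 39 processors, or, if it is using them, why is it not using them at full capacity? 4. Why does it not start another task as soon as it finishes one (I assume this is the case because it always writes files in blocks of 39)?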

Thanks in advance for your help.

Cheers,

Jaime

Answer 1

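Does anyone know how to return, from a parallelized process, an object other than a list, matrix or vector?

Your first question does not make sense to me. What kind of object do you want to return? A list can contain any R object.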
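As a minimal sketch of that point: each worker can return a whole object as one element of the result list, and the elements can then be summed once in the master process. Plain matrices stand in for rasters here so that the example runs with base R only. `fake_species_mask` is a hypothetical placeholder for a function like `extract_habitats`, under the assumption that the real function would return `cov_mask` instead of writing the richness file.

```r
library(parallel)

# Hypothetical stand-in for extract_habitats(): returns a 0/1 "mask"
# for species k. A matrix is used instead of a RasterLayer so the
# sketch needs no extra packages.
fake_species_mask <- function(k, nrow = 4, ncol = 5) {
  cells <- seq_len(nrow * ncol)
  matrix(as.integer(cells %% (k + 1) == 0), nrow = nrow, ncol = ncol)
}

cl <- makeCluster(2)  # two workers are enough for the illustration
masks <- parLapply(cl, 1:6, fake_species_mask)  # a list, one object per species
stopCluster(cl)

# Sum the per-species masks once, in the master process
richness <- Reduce(`+`, masks)
print(richness)
```

`RasterLayer` objects from the raster package also support `+`, so the same `Reduce` call should apply to a list of returned rasters. Large rasters may be backed by temporary files on the worker, so this is worth checking on a small subset of species first.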

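Why is it not using all 39 processors, or, if it is using them, why is it not using them at full capacity?

There are many potential reasons. From looking at the code, one reason could be that you are limited by disk I/O, since you are writing a large number of images to disk. Another potential reason is a memory size limit.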
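One rough way to check the memory explanation is to estimate how much data each worker receives. With a PSOCK cluster, objects passed to the workers are copied into every worker process, so the footprint grows with the number of workers. The sketch below uses a hypothetical numeric matrix in place of inputs such as `cover` or `spp_polygons`; the sizes are illustrative only.

```r
# Hypothetical stand-in for a large input such as the habitat raster:
# 1000 x 1000 doubles, roughly 8 MB
cover_like <- matrix(0, nrow = 1000, ncol = 1000)

n_workers <- 39  # number of worker processes mentioned in the question

# Size of one copy, in bytes
one_copy <- as.numeric(object.size(cover_like))

# Every PSOCK worker holds its own copy of the objects sent to it
total_mb <- one_copy * n_workers / 1024^2
cat(sprintf("one copy: %.1f MB, %d workers: %.1f MB\n",
            one_copy / 1024^2, n_workers, total_mb))
```

For a file-backed `RasterLayer`, `object.size` reports only the small in-memory object; `raster::inMemory()` tells whether the cell values are actually loaded.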

What am I doing wrong?

If you are using Linux (or any non-Windows system), you should use the mclapply function from base R's parallel package.
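A minimal sketch of that suggestion follows. `slow_square` is a hypothetical placeholder for the per-species work, and the core count of 2 is chosen only for illustration.

```r
library(parallel)

# Hypothetical placeholder for the per-species work
slow_square <- function(k) {
  Sys.sleep(0.1)  # pretend this is the rasterize/mask step
  k^2
}

# mclapply forks the current process, so it cannot run in parallel on
# Windows; fall back to a single core there
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

res <- mclapply(1:8, slow_square, mc.cores = n_cores)
print(unlist(res))
```

Because forked workers share the parent's memory, inputs such as `spp_polygons` do not need to be exported explicitly. Setting `mc.preschedule = FALSE` makes `mclapply` fork a new job for each value as workers become free, which may suit tasks of uneven length.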
