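MPI_Scatterv: segmentation fault depends on the number of processes

Time: 2019-07-14 15:14:48

Tags: python-3.x segmentation-fault mpi hpc mpi4py

I want to split a 2D NumPy array into x parts and send them to x different processes via mpi4py. When I use the Scatterv function, I get a segmentation fault on task 1 for x = 2, but it works fine for x = 4 or x = 8.

The NumPy array has shape (151789810, 9) and dtype float32, and I want to scatter it along axis 0. The node has enough memory (512 GB).

I am using: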

  • Python 3.x
  • ParaStationMPI 5.2.2
  • mpi4py 3.0.1

Some code:

This is my little "scatter method" from my little "scatter class":


def distribute(self):

    #
    #self.data is the numpy array with shape (151789810, 9) and dtype float32
    #

    #
    #Broadcast scatter params (prev. calculated)
    #

    self.displacements_input = self.comm.bcast(self.displacements_input, root=0)
    self.split_sizes_input = self.comm.bcast(self.split_sizes_input,root=0)
    self.split_shapes = self.comm.bcast(self.split_shapes,root = 0)

    #
    #Alloc. Target
    #

    self.chunk = np.zeros((self.split_shapes[self.rank]),dtype = np.float32)

    self.comm.barrier()

    #
    #Print Info
    #

    if self.rank == 0:
        print(self.data.shape)
        print(self.data.dtype)

        print("dis: ",self.displacements_input)
        print("sizes: ",self.split_sizes_input)
        print("shapes: ",self.split_shapes)

    print("Chunk of rank {} has shape {} and dtype {}".format(self.rank,self.chunk.shape,self.chunk.dtype))

    #
    #Call scatterv
    #

    #
    #And here is the segfault
    #

    self.comm.Scatterv([self.data,self.split_sizes_input, self.displacements_input,self.MPI_obj.FLOAT],self.chunk,root=0)

Output for two processes:

    #(151789810, 9)
    #float32

    #dis:  [        0 683054145]
    #sizes:  [683054145 683054145]
    #shapes:  [(75894905, 9), (75894905, 9)]

    #Chunk of rank 0 has shape (75894905, 9) and dtype float32
    #Chunk of rank 1 has shape (75894905, 9) and dtype float32

...and I get:

    #srun: error: nodename: task 0: Terminated
    #srun: error: nodename: task 1: Segmentation fault

Output for four processes:


    #(151789810, 9)
    #float32

    #dis:  [         0  341527077  683054154 1024581222]
    #sizes:  [341527077 341527077 341527068 341527068]
    #shapes:  [(37947453, 9), (37947453, 9), (37947452, 9), (37947452, 9)]

    #Chunk of rank 0 has shape (37947453, 9) and dtype float32
    #Chunk of rank 1 has shape (37947453, 9) and dtype float32
    #Chunk of rank 2 has shape (37947452, 9) and dtype float32
    #Chunk of rank 3 has shape (37947452, 9) and dtype float32

...and everything works fine.
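One check that may help narrow this down is the size of each rank's chunk in bytes, computed from the "sizes" values printed above. The sketch below only does arithmetic. The helper name `chunk_bytes` and the comparison against the 32-bit signed integer limit are illustrative assumptions; nothing in the output above confirms that this limit is the cause.

```python
# Element counts (float32 values) copied from the printed "sizes" output.
sizes_two_ranks = [683054145, 683054145]
sizes_four_ranks = [341527077, 341527077, 341527068, 341527068]

FLOAT32_BYTES = 4
INT32_MAX = 2**31 - 1  # assumption: limit of a 32-bit signed int


def chunk_bytes(counts, itemsize=FLOAT32_BYTES):
    """Hypothetical helper: number of bytes occupied by each chunk."""
    return [c * itemsize for c in counts]


for label, counts in (("2 ranks", sizes_two_ranks),
                      ("4 ranks", sizes_four_ranks)):
    for rank, nbytes in enumerate(chunk_bytes(counts)):
        status = "exceeds INT32_MAX" if nbytes > INT32_MAX else "fits"
        print(label, "rank", rank, nbytes, "bytes:", status)
```

With these counts, each two-rank chunk is 2,732,216,580 bytes, which is above 2^31 - 1, while every four-rank chunk is roughly 1.37e9 bytes and fits. The element counts themselves stay below that limit in both cases. Whether the MPI library involved tracks byte sizes or offsets in a 32-bit integer is an assumption that would have to be verified separately.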

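Edit:

Here is a small example that (hopefully) reproduces the error. The dimensions of the NumPy array are set by first_dim_n and second_dim_n. The number of processes can be changed by adjusting the srun parameters.

I know the program works for "small" arrays such as (1000, 9), but I am interested in big arrays in large memory. So please make sure you have a similar ratio of array size to memory. If it still works for you, the error could be anywhere ...

For @GillesGouaillardet, everything works. I am now looking for any useful debugging information or crash reports ...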
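One low-effort way to get some crash information is Python's standard-library `faulthandler` module, which prints the Python traceback when the interpreter receives a fatal signal such as SIGSEGV. This is a minimal sketch, assuming the lines are placed at the very top of the script on every rank. It reports Python-level frames only, so a crash inside the MPI library itself would still need a core dump or a native debugger.

```python
import faulthandler
import sys

# Dump the Python traceback of all threads to the original stderr if the
# process receives a fatal signal (SIGSEGV, SIGFPE, SIGABRT, SIGBUS, SIGILL).
faulthandler.enable(file=sys.__stderr__, all_threads=True)

print("faulthandler enabled:", faulthandler.is_enabled())
```

The same effect is available without touching the code by starting the interpreter with `python -X faulthandler`.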

reprex:

from mpi4py import MPI
comm = MPI.COMM_WORLD
size = comm.Get_size()
rank = comm.Get_rank()

import numpy as np

def get_split_sizes_and_displacements(arr, indices_or_sections, axis=0):
    # Split arr.shape[axis] rows into Nsections parts; the first `extras`
    # sections get one extra row.
    Ntotal = arr.shape[axis]
    Nsections = int(indices_or_sections)
    Neach_section, extras = divmod(Ntotal, Nsections)

    section_sizes = ([0] +
                     extras * [Neach_section + 1] +
                     (Nsections - extras) * [Neach_section])

    # Values per row; sizes and displacements below are counted in elements.
    feature_size = arr.shape[1]

    div_points = np.array(section_sizes).cumsum()
    div_points *= feature_size

    displacements = div_points[:-1]
    split_sizes = np.ediff1d(div_points)
    split_shapes = [(int(x // feature_size), feature_size) for x in split_sizes]

    return split_sizes, displacements, split_shapes

if __name__ == '__main__':

    if rank == 0:

        first_dim_n = 151789810
        second_dim_n = 9

        data = np.random.rand(first_dim_n,second_dim_n).astype(np.float32)
        split_sizes, displacements, split_shapes = get_split_sizes_and_displacements(data,size)

    else:

        split_sizes = None
        displacements = None
        split_shapes = None
        data = None

    split_sizes_input = comm.bcast(split_sizes, root = 0)
    displacements_input = comm.bcast(displacements, root = 0)
    split_shapes_input = comm.bcast(split_shapes, root = 0)

    comm.barrier()

    chunk = np.zeros(split_shapes_input[rank],dtype=np.float32)
    comm.Scatterv([data,split_sizes_input, displacements_input,MPI.FLOAT],chunk,root=0)

    if rank == 0:

        print(data.shape)

    print("rank {} has shape {}".format(rank,chunk.shape))

Output for two processes:

srun: error: nodename: task 0: Terminated
srun: error: nodename: task 1: Segmentation fault

Output for four processes:

(151789810, 9)
rank 0 has shape (37947453, 9)
rank 1 has shape (37947453, 9)
rank 2 has shape (37947452, 9)
rank 3 has shape (37947452, 9)

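Everything works fine.

Edit 2:

So apparently I need to convince you that my code does what I want it to do. Fine, I will try.

Clarification

Let us define an array with an arbitrary shape.

(7, 2)

It could look like this:

a = array([[ 0, 11],
           [13,  6],
           [ 1,  9],
           [ 3, 14],
           [ 4,  8],
           [ 9, 16],
           [ 3, 17]])

The first axis runs downward, the second to the right. We want to split the array into two parts along the first axis (downward). That is ambiguous, because 7 rows cannot be divided evenly between two processes. So we adopt the convention that higher ranks are filled later, i.e. the lower ranks receive the extra rows.

So rank 0 should get: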

array([[ 0, 11],
       [13,  6],
       [ 1,  9],
       [ 3, 14]])

Rank 1 should get:

array([[ 4,  8],
       [ 9, 16],
       [ 3, 17]])

That works. At least on my machine. And it is what I want.
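The convention described above can be checked without MPI. The following is a NumPy-only sketch: `split_counts` is a hypothetical helper that mirrors the arithmetic of the reprex, and the result is compared against `np.array_split`, which uses the same "extra rows go to the lower ranks" rule. It slices a flattened copy of the array rather than calling Scatterv, so it only validates the counts and displacements, not the communication.

```python
import numpy as np

a = np.array([[0, 11], [13, 6], [1, 9], [3, 14], [4, 8], [9, 16], [3, 17]])


def split_counts(n_rows, n_cols, n_parts):
    """Hypothetical helper: rows, element counts and displacements per rank."""
    each, extras = divmod(n_rows, n_parts)
    rows = [each + 1] * extras + [each] * (n_parts - extras)
    counts = [r * n_cols for r in rows]
    displs = [sum(counts[:i]) for i in range(n_parts)]
    return rows, counts, displs


rows, counts, displs = split_counts(a.shape[0], a.shape[1], 2)

# C-order flattening: the layout of a contiguous send buffer.
flat = a.ravel()
chunks = [flat[d:d + c].reshape(r, a.shape[1])
          for r, c, d in zip(rows, counts, displs)]

expected = np.array_split(a, 2, axis=0)
for rank, (got, want) in enumerate(zip(chunks, expected)):
    assert np.array_equal(got, want)
    print("rank", rank, "gets shape", got.shape)
```

For the 7 x 2 example this gives 4 rows to rank 0 and 3 rows to rank 1, with element counts 8 and 6 and displacements 0 and 8.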

0 answers:

No answers yet