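Xeon Phi vs. Xeon: unexplainable overheads

Time: 2018-07-10 19:23:48

Tags: parallel-processing acceleration xeon-phi offloading

I am trying to run the following code, with different sizes of n, on a Xeon Phi KNC (61 cores, 4 threads per core) and on a Xeon host (2 sockets of Xeon E5-2660 v2).

The timings I get are shown in the tables below. However, I am trying to understand why the MIC performs worse than the Xeon processors. Am I doing something wrong here, and how can I fix it (if possible)?

Thanks!

Code: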

program prog
  integer, allocatable :: arr1(:), arr2(:)
  integer :: i, n, time_start, time_end
  n=481
  do while (n .le. 481000000)
    allocate(arr1(n),arr2(n))
    call system_clock(time_start)
    !dir$ offload begin target(mic)
    !$omp SIMD 
    do i=1,n
       arr1(i) = arr1(i) + arr2(i)
    end do
    !dir$ end offload 
    call system_clock(time_end)
    write (*,*) "n=",n," time=",time_end-time_start
    deallocate(arr1,arr2)
    n = n*10
  end do
end program
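The program above prints raw differences of system_clock counts, so the unit of the "time" column depends on the compiler's clock rate. As a minimal sketch (not part of the original post), the count rate can be queried and used to convert ticks to seconds. The program name timing_units, the fixed n, and the array initialization are illustrative assumptions; the offload directives are left out so the sketch builds with any Fortran compiler.

```fortran
program timing_units
  implicit none
  integer :: time_start, time_end, count_rate
  integer :: i, n
  integer, allocatable :: arr1(:), arr2(:)

  n = 4810000                      ! one of the sizes from the question
  allocate(arr1(n), arr2(n))
  arr1 = 1                         ! assumption: initialize so the sum is well defined
  arr2 = 2

  ! count_rate is the number of system_clock ticks per second
  call system_clock(count_rate=count_rate)

  call system_clock(time_start)
  do i = 1, n
     arr1(i) = arr1(i) + arr2(i)
  end do
  call system_clock(time_end)

  ! report both the raw tick difference and the elapsed time in seconds
  write (*,*) "n=", n, " ticks=", time_end - time_start, &
              " seconds=", real(time_end - time_start) / real(count_rate)
  write (*,*) "arr1(n)=", arr1(n)  ! use the result so the loop is not optimized away
  deallocate(arr1, arr2)
end program timing_units
```

Printing the elapsed time in seconds makes the offload, native, and host tables directly comparable even if the binaries report different tick rates.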

Xeon Phi results:

 n=         481  time=        8881
 n=        4810  time=          75
 n=       48100  time=          53
 n=      481000  time=         261
 n=     4810000  time=        1991
 n=    48100000  time=       18912
 n=   481000000  time=      188203

Setup:

#!/bin/bash
#SBATCH -N 1
#SBATCH -o out_122
#SBATCH --exclusive
export MIC_KMP_AFFINITY=verbose,granularity=fine,scatter
export MIC_OMP_NUM_THREADS=122
./prog.exe

sbatch -p xphi -N 1 --exclusive run_par.sh

All the settings are in run_par.sh, and xphi is the name of the device.
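One thing to keep in mind when reading these settings: an !$omp simd directive on its own only requests vectorization of the loop and does not distribute iterations across threads, so a thread count such as MIC_OMP_NUM_THREADS only matters inside a parallel region. The following is a minimal sketch of a threaded and vectorized variant of the loop. It is an assumption about what might be worth trying, not a confirmed explanation of the timings; the program name threaded_add and the array initialization are illustrative, and the offload directives are omitted so it builds with any OpenMP 4.0 compiler (for example with an OpenMP flag such as -fopenmp or -qopenmp).

```fortran
program threaded_add
  use omp_lib                      ! provides omp_get_max_threads
  implicit none
  integer :: i, n
  integer, allocatable :: arr1(:), arr2(:)

  n = 481000                       ! one of the sizes from the question
  allocate(arr1(n), arr2(n))
  arr1 = 1                         ! assumption: initialize so the sum is well defined
  arr2 = 2

  ! Combined construct: iterations are split across threads,
  ! and each thread's chunk is vectorized.
  !$omp parallel do simd
  do i = 1, n
     arr1(i) = arr1(i) + arr2(i)
  end do
  !$omp end parallel do simd

  write (*,*) "max threads=", omp_get_max_threads(), " arr1(n)=", arr1(n)
  deallocate(arr1, arr2)
end program threaded_add
```

Running it with different values of OMP_NUM_THREADS shows whether the thread setting is actually picked up by the loop.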

It is also worth mentioning that a native run (adding !dir$ offload begin target(mic) before the !$omp SIMD) yields better results:

n= 481       time= 0 
n= 4810      time= 0 
n= 48100     time= 6 
n= 481000    time= 55 
n= 4810000   time= 455 
n= 48100000  time= 4342 
n= 481000000 time= 43322

For the native run, the setup is:

#!/bin/bash
#SBATCH -N 1
#SBATCH -o out_244_native
#SBATCH --exclusive
export SINK_LD_LIBRARY_PATH=...intel/compilers_and_libraries/linux/lib/mic:$SINK_LD_LIBRARY_PATH
micnativeloadex ./prog.exe.MIC -e "KMP_AFFINITY=verbose,granularity=fine,scatter"

Xeon results:

 n=         481         time=           0
 n=        4810         time=           0
 n=       48100         time=           2
 n=      481000         time=          19
 n=     4810000         time=          93
 n=    48100000         time=         706
 n=   481000000         time=        7006

Here is the output of the lscpu command on my Xeon machine:

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                40
On-line CPU(s) list:   0-39
Thread(s) per core:    2
Core(s) per socket:    10
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 62
Model name:            Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz
Stepping:              4
CPU MHz:               1203.382
BogoMIPS:              4405.99
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              25600K
NUMA node0 CPU(s):     0-9,20-29
NUMA node1 CPU(s):     10-19,30-39

My MIC specs are (tail of /proc/cpuinfo):

processor       : 239
vendor_id       : GenuineIntel
cpu family      : 11
model           : 1
model name      : 0b/01
stepping        : 3
cpu MHz         : 1052.630
cache size      : 512 KB
physical id     : 0
siblings        : 240
core id         : 59
cpu cores       : 60
apicid          : 239
initial apicid  : 239
fpu             : yes
fpu_exception   : yes
cpuid level     : 4
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic mtrr mca pat fxsr ht syscall nx lm nopl lahf_lm
bogomips        : 2112.44
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

0 answers:

No answers yet.