Question

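After successfully implementing OpenMP in my code, I tried to check how much it improved performance, but gprof gives me a completely different flat profile. Below is my main program, which calls all the subroutines.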

program main
  use my_module
  call inputf       !to read inputs from a file
! call echo         !to check if the inputs are read in correctly, but is muted
  call allocv       !to allocate dimension to all array variable
  call bathyf       !to read in the computational domain
  call inicon       !to setup initial conditions
  call comput       !computation from iteration 1 to n
  call deallv       !to deallocate all array variables
end program main

Below are the cpu_time and OMP_GET_WTIME() timings for the serial and parallel code. The OpenMP parallel region is inside subroutine comput.

!serial code
CPU time elapsed =   260.5080 seconds.
!parallel code
CPU time elapsed =   153.3600 seconds.
OMP time elapsed =    49.3521 seconds.

Below are the flat profiles of the serial and parallel code.

!Serial code
Flat profile:
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 96.26    227.63   227.63        1   227.63   236.45  comput_
  3.60    236.13     8.50     2001     0.00     0.00  update_
  0.08    236.32     0.19     2000     0.00     0.00  openbc_
  0.05    236.45     0.13       41     0.00     0.00  output_
  0.01    236.47     0.02        1     0.02     0.02  bathyf_
  0.01    236.49     0.02        1     0.02     0.03  inicon_
  0.00    236.50     0.01        1     0.01     0.01  opwmax_
  0.00    236.50     0.00     1001     0.00     0.00  timser_
  0.00    236.50     0.00        2     0.00     0.00  timestamp_
  0.00    236.50     0.00        1     0.00     0.00  allocv_
  0.00    236.50     0.00        1     0.00     0.00  deallv_
  0.00    236.50     0.00        1     0.00     0.00  inputf_

!Parallel code
Flat profile:
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 95.52     84.90    84.90                             openbc_
  1.68     86.39     1.49     2001     0.74     0.74  update_
  0.10     86.48     0.09       41     2.20     2.20  output_
  0.00     86.48     0.00     1001     0.00     0.00  timser_
  0.00     86.48     0.00        2     0.00     0.00  timestamp_
  0.00     86.48     0.00        1     0.00     0.00  allocv_
  0.00     86.48     0.00        1     0.00     0.00  bathyf_
  0.00     86.48     0.00        1     0.00     0.00  deallv_
  0.00     86.48     0.00        1     0.00     2.20  inicon_
  0.00     86.48     0.00        1     0.00     0.00  inputf_
  0.00     86.48     0.00        1     0.00     0.00  comput_
  0.00     86.48     0.00        1     0.00     0.00  opwmax_

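The subroutines update, openbc, output and timser are called inside subroutine comput. As you can see, subroutine comput is supposed to take most of the run time, but the flat profile of the parallel code shows otherwise. Please let me know if you need any other information.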

Answer 1

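This article says:

One problem with gprof under certain kernels (such as Linux) is that it does not behave correctly with multithreaded applications. It actually only profiles the main thread, which is quite useless.

The article also provides a workaround, but since you are not creating the threads manually and are using OpenMP instead (which creates threads transparently), you would have to modify it to make it work for you.

You could also choose a profiler that is able to work with parallel programs.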

Answer 2

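gprof is not suitable for profiling parallel programs because it does not understand the intricacies of OpenMP. You should use something like the combination of Score-P and Cube instead. The former is an instrumentation framework, while the latter is a visualization tool for hierarchical performance data. Both are open-source projects. On the commercial side, Intel VTune Amplifier can be used.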
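
Independently of which profiler is chosen, the overall gain can be checked by timing the parallel region by hand, as the question already does with cpu_time and OMP_GET_WTIME(). The following is only a minimal sketch, not the asker's code: the program name, the array a, and the loop body are hypothetical stand-ins for the work done in comput. It assumes a compiler such as gfortran with -fopenmp, where cpu_time typically reports CPU time summed over all threads (the standard leaves this processor-dependent).

```fortran
program timing_sketch
  use omp_lib                       ! provides omp_get_wtime
  implicit none
  integer, parameter :: n = 20000000   ! hypothetical problem size
  real(8), allocatable :: a(:)
  real(8) :: total, cpu0, cpu1, wall0, wall1
  integer :: i

  allocate(a(n))
  total = 0.0d0

  call cpu_time(cpu0)               ! CPU time, usually summed over threads
  wall0 = omp_get_wtime()           ! wall-clock time

  ! Stand-in for the parallel region inside comput
  !$omp parallel do reduction(+:total)
  do i = 1, n
     a(i) = sqrt(real(i, 8))
     total = total + a(i)
  end do
  !$omp end parallel do

  wall1 = omp_get_wtime()
  call cpu_time(cpu1)

  print '(a,f10.4,a)', 'CPU time elapsed = ', cpu1 - cpu0, ' seconds.'
  print '(a,f10.4,a)', 'OMP time elapsed = ', wall1 - wall0, ' seconds.'
  ! CPU time divided by wall time estimates how many threads were busy on average
  if (wall1 > wall0) then
     print '(a,f8.2)', 'CPU / wall ratio = ', (cpu1 - cpu0) / (wall1 - wall0)
  end if
  print '(a,es14.6)', 'checksum         = ', total   ! keeps the loop from being optimized away

  deallocate(a)
end program timing_sketch
```

Applied to the numbers in the question, 153.36 s of CPU time over 49.35 s of wall time gives a ratio of about 3.1, and comparing the serial 260.51 s with the parallel 49.35 s wall time suggests a speedup of roughly 5.3, assuming the serial CPU time is close to the serial wall time.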

gprof produces different flat profiles for code with and without OpenMP
