
Date: 2018-10-05 12:12:34

Tags: haskell




import Control.DeepSeq (NFData)

-- Trapezoidal rule on [ini, fin] with step dx (sequential version).
integrateT :: (Fractional a, Enum a, NFData a) => (a -> a) -> (a,a) -> a -> a
integrateT f (ini, fin) dx
  = let lst = map f [ini,ini+dx..fin]
    in sum lst * dx - 0.5 * (f ini + f fin) * dx


main :: IO ()
main = do
  print $ (integrateT (\x -> x^4 - x^3 + x^2 + x/13 + 1) (0.0,1000000.0) 0.01 :: Double)
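
For reference, the expression in integrateT can be read as the composite trapezoidal rule. The notation below is an assumption introduced here, not taken from the post: a = ini, b = fin, h = dx, x_i = a + i h, and N = (b - a)/h is assumed to be a whole number so that the last sample lands on b.

```latex
\int_a^b f(x)\,dx \;\approx\;
h\left[\tfrac{1}{2}f(x_0) + \sum_{i=1}^{N-1} f(x_i) + \tfrac{1}{2}f(x_N)\right]
\;=\; h\sum_{i=0}^{N} f(x_i) \;-\; \tfrac{1}{2}\,h\,\bigl(f(a) + f(b)\bigr)
```

The right-hand side is what the code computes: sum lst * dx is the full sum over all N + 1 samples, and the subtracted term removes the extra half weight at each endpoint.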


stack exec lab5 -- +RTS -ls -N2 -s
  18,400,147,552 bytes allocated in the heap
      20,698,168 bytes copied during GC
          66,688 bytes maximum residency (2 sample(s))
          35,712 bytes maximum slop
               3 MB total memory in use (0 MB lost due to fragmentation)

                                 Tot time (elapsed)  Avg pause  Max pause
  Gen  0     17754 colls, 17754 par    0.123s   0.105s     0.0000s    0.0011s
  Gen  1         2 colls,     1 par    0.000s   0.000s     0.0001s    0.0002s

  Parallel GC work balance: 0.27% (serial 0%, perfect 100%)

  TASKS: 6 (1 bound, 5 peak workers (5 total), using -N2)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.001s  (  0.001s elapsed)
  MUT     time    6.054s  (  5.947s elapsed)
  GC      time    0.123s  (  0.106s elapsed)
  EXIT    time    0.001s  (  0.008s elapsed)
  Total   time    6.178s  (  6.061s elapsed)

  Alloc rate    3,039,470,269 bytes per MUT second

  Productivity  98.0% of total user, 98.2% of total elapsed

gc_alloc_block_sync: 77
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0


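import Control.Parallel.Strategies (parListChunk, rdeepseq, using)

-- Same trapezoidal rule, but the mapped list is evaluated in parallel in chunks of 100 elements.
integrateT :: (Fractional a, Enum a, NFData a) => (a -> a) -> (a,a) -> a -> a
integrateT f (ini, fin) dx
  = let lst = (map f [ini,ini+dx..fin]) `using` parListChunk 100 rdeepseq
    in sum lst * dx - 0.5 * (f ini + f fin) * dx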


stack exec lab5 -- +RTS -ls -N2 -s
  59,103,320,488 bytes allocated in the heap
  17,214,458,128 bytes copied during GC
  2,787,092,160 bytes maximum residency (15 sample(s))
  43,219,264 bytes maximum slop
        5570 MB total memory in use (0 MB lost due to fragmentation)

                                 Tot time (elapsed)  Avg pause  Max pause
  Gen  0     44504 colls, 44504 par   16.907s  10.804s     0.0002s    0.0014s
  Gen  1        15 colls,    14 par    4.006s   2.991s     0.1994s    1.2954s

  Parallel GC work balance: 33.60% (serial 0%, perfect 100%)

  TASKS: 6 (1 bound, 5 peak workers (5 total), using -N2)

  SPARKS: 1000001 (1000001 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.001s  (  0.001s elapsed)
  MUT     time   14.298s  ( 12.392s elapsed)
  GC      time   20.912s  ( 13.795s elapsed)
  EXIT    time    0.000s  (  0.003s elapsed)
  Total   time   35.211s  ( 26.190s elapsed)

  Alloc rate    4,133,806,996 bytes per MUT second

  Productivity  40.6% of total user, 47.3% of total elapsed

gc_alloc_block_sync: 2304055
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 1105370


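Compared with the sequential run, the parallel version:

  • uses far more memory (5570 MB versus 3 MB in total)
  • takes longer (about 26 s versus 6 s elapsed)
  • spends much of its time in GC (about 14 s elapsed)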
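
The spark count in the parallel log can be accounted for with simple arithmetic. The sketch below uses hypothetical helper names (samplePoints, sparkCount) that are not part of the original program, and it assumes the enumeration [ini,ini+dx..fin] includes both endpoints. Under that assumption the list has 100,000,001 elements, and since parListChunk creates one spark per chunk, chunks of 100 give the 1,000,001 sparks reported by the runtime.

```haskell
-- Hypothetical helpers for illustration only; not part of the original program.

-- Number of sample points on [lo, hi] when the step is 1/stepsPerUnit,
-- assuming the enumeration includes both endpoints.
samplePoints :: Integer -> Integer -> Integer -> Integer
samplePoints lo hi stepsPerUnit = (hi - lo) * stepsPerUnit + 1

-- One spark per chunk: list length divided by chunk size, rounded up.
sparkCount :: Integer -> Integer -> Integer
sparkCount len chunkSize = (len + chunkSize - 1) `div` chunkSize

main :: IO ()
main = do
  let len = samplePoints 0 1000000 100  -- dx = 0.01, i.e. 100 steps per unit
  print len                   -- 100000001
  print (sparkCount len 100)  -- 1000001
```

Every one of those elements has to exist as a list cell before its chunk can be sparked, which is consistent with the much higher residency and GC time in the parallel log.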



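What I tried:

  • Evaluating the list with parMap, parList and a custom parListChunk' function. Every time the result was much worse than the sequential version.
  • Different chunk sizes, from very small (e.g. 5) up to half the length of the list. Every time the result was much worse than the sequential version.
  • Changing the terms of the integrand in main to use very large exponents, e.g. x^123442, using more divisions instead of multiplications, and so on. I also reduced the range of the problem. All of this produced fewer sparks, but each computation became more expensive. Here the results were close to the sequential ones: the sequential version took about 28 seconds with the new function, and the parallel run finished in 31 seconds.
  • Checking every run with Threadscope to make sure that two cores were in use when expected.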


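Questions:

  1. Parallel performance improves as the cost of computing each chunk grows (e.g. x^12345) and the number of chunks shrinks. Does that mean that with small exponents (e.g. x^4, x^3, which are fast to compute) the sequential version will always be faster? Is there a way to parallelize this successfully and get better performance?
  2. Why does the parallel version use so much memory and GC time?
  3. How can I reduce the time the parallel version spends in GC?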

1 answer:

Answer 0 (score: 3)


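Solution: have each thread run a sequential version that can fuse properly, i.e. split up the interval rather than the discretised list. For example: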

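import Control.DeepSeq (NFData)
import Control.Parallel.Strategies (parList, rdeepseq, using)

-- Sequential trapezoidal rule, unchanged from the question.
integrateT :: (Fractional a, Enum a, NFData a) => (a -> a) -> (a,a) -> a -> a
integrateT f (ini, fin) dx
  = let lst = map f [ini,ini+dx..fin]
    in sum lst * dx - 0.5 * (f ini + f fin) * dx

-- Split [l,r] into nChunks sub-intervals and integrate each one sequentially in its own spark.
integrateT_par :: (Fractional a, Enum a, NFData a) => (a -> a) -> (a,a) -> a -> a
integrateT_par f (l,r) dx
  = let chunks = [ integrateT f (l + i*wChunk, l + (i+1)*wChunk) dx
                 | i<-[0..nChunks-1] ]
               `using` parList rdeepseq
    in sum chunks
 where nChunks = 100
       wChunk = (r-l)/nChunks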
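
A minimal, self-contained sketch of the same idea as a complete program, in case it helps to try it out. The names integrateTPar and integrand, the explicit chunk-count parameter, and the smaller range (0 to 1000) are assumptions made for illustration and are not from the answer. It assumes the parallel and deepseq packages are available, that the program is built with ghc -O2 -threaded, and that the chunk width is a whole multiple of dx so that the chunk boundaries fall on sample points.

```haskell
import Control.DeepSeq (NFData)
import Control.Parallel.Strategies (parList, rdeepseq, using)

-- Sequential trapezoidal rule on [ini, fin] with step dx.
integrateT :: (Fractional a, Enum a) => (a -> a) -> (a, a) -> a -> a
integrateT f (ini, fin) dx =
  sum (map f [ini, ini + dx .. fin]) * dx - 0.5 * (f ini + f fin) * dx

-- Hypothetical variant of integrateT_par with the chunk count as a parameter.
-- Each sub-interval is integrated sequentially inside its own spark, so only
-- nChunks partial results are ever shared between threads.
integrateTPar :: (Fractional a, Enum a, NFData a)
              => Int -> (a -> a) -> (a, a) -> a -> a
integrateTPar nChunks f (l, r) dx = sum (chunks `using` parList rdeepseq)
  where
    wChunk = (r - l) / fromIntegral nChunks
    chunks =
      [ integrateT f (l + fromIntegral i * wChunk, l + fromIntegral (i + 1) * wChunk) dx
      | i <- [0 .. nChunks - 1] ]

-- The integrand from the question.
integrand :: Double -> Double
integrand x = x^4 - x^3 + x^2 + x/13 + 1

main :: IO ()
main = do
  -- Smaller range than in the question so that the check finishes quickly.
  print (integrateT integrand (0, 1000) 0.01)
  print (integrateTPar 10 integrand (0, 1000) 0.01)
```

Run it with +RTS -N2 -s as in the question. The two printed values should agree up to floating-point rounding, and the SPARKS line should show only as many sparks as there are chunks.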

