Question

Suppose I have several 200 MB+ files that I want to search through. How would I do this in Haskell?

Here is my initial program:

import Data.List
import Control.Monad
import System.IO
import System.Environment

main = do
  filename <- liftM head getArgs
  contents <- liftM lines $ readFile filename
  putStrLn . unlines . filter (isPrefixOf "import") $ contents

This reads the whole file into memory before parsing it. Then I went with this:

import Data.List
import Control.Monad
import System.IO
import System.Environment

main = do
  filename <- liftM head getArgs
  file <- (openFile filename ReadMode)
  contents <- liftM lines $ hGetContents file
  putStrLn . unlines . filter (isPrefixOf "import") $ contents

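I assumed that, because hGetContents is lazy, it would avoid reading the whole file into memory. But running both scripts under valgrind showed similar memory usage for both. So either my script is wrong or valgrind is wrong. I compile the scripts with: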

ghc --make test.hs -prof

What am I missing? Bonus question: I see a lot of mentions that lazy IO in Haskell is actually a bad thing. How and why would I use strict IO?

Update

So it looks like I was wrong in my reading of valgrind. Using +RTS -s, here is what I get:

 7,807,461,968 bytes allocated in the heap
 1,563,351,416 bytes copied during GC
       101,888 bytes maximum residency (1150 sample(s))
        45,576 bytes maximum slop
             2 MB total memory in use (0 MB lost due to fragmentation)

Generation 0: 13739 collections,     0 parallel,  2.91s,  2.95s elapsed
Generation 1:  1150 collections,     0 parallel,  0.18s,  0.18s elapsed

INIT  time    0.00s  (  0.00s elapsed)
MUT   time    2.07s  (  2.28s elapsed)
GC    time    3.09s  (  3.13s elapsed)
EXIT  time    0.00s  (  0.00s elapsed)
Total time    5.16s  (  5.41s elapsed)

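The important line is 101,888 bytes maximum residency, which says that at any given point my script held at most about 101 KB of live data. The file I was searching through was 44 MB. So I think the verdict is: readFile and hGetContents are both lazy.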

Follow-up question:

Why do I see 7 GB of memory allocated on the heap? That seems really high for a script that reads a 44 MB file.

Update to follow-up question

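It looks like a few GB of memory allocated on the heap is not atypical for Haskell, so there is no cause for concern. Using ByteStrings instead of Strings brings the allocation down a lot: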

  81,617,024 bytes allocated in the heap
      35,072 bytes copied during GC
      78,832 bytes maximum residency (1 sample(s))
      26,960 bytes maximum slop
           2 MB total memory in use (0 MB lost due to fragmentation)

Answer 1

Please don't use Strings (especially when processing >100 MB files). Just replace them with ByteStrings (or Data.Text):

{-# LANGUAGE OverloadedStrings #-}

import Control.Monad
import System.Environment
import qualified Data.ByteString.Lazy.Char8 as B

main = do
  filename <- liftM head getArgs
  contents <- liftM B.lines $ B.readFile filename
  B.putStrLn . B.unlines . filter (B.isPrefixOf "import") $ contents

I bet this will be several times faster.
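For the Data.Text option mentioned above, a minimal sketch could look like the following. It assumes the text package is available and that the file is in the locale's encoding; the helper name importLines and the usage message are made up for illustration.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import System.Environment (getArgs)
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.IO as TLIO

-- Keep only the lines that start with "import" (hypothetical helper).
importLines :: TL.Text -> TL.Text
importLines = TL.unlines . filter (TL.isPrefixOf "import") . TL.lines

main :: IO ()
main = do
  args <- getArgs
  case args of
    -- Lazy Text is read chunk by chunk, so the whole file need not be resident.
    [filename] -> TLIO.readFile filename >>= TLIO.putStr . importLines
    _          -> putStrLn "usage: test FILE"
```

Unlike ByteString, Text decodes the input, so this variant is the one to reach for when the log may contain non-ASCII characters.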

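UPD: regarding your follow-up question.
The amount of allocated memory is strongly connected to the magic speedup you get when switching to ByteStrings. Since String is just a generic list, it needs extra memory for every Char: a pointer to the next element, an object header, and so on. All of this memory has to be allocated and then collected again, which takes a lot of computational power. A lazy ByteString, on the other hand, is a list of chunks, i.e. contiguous blocks of memory (no less than 64 bytes each, I think). This greatly reduces the number of allocations and collections, and it also improves cache locality.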
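To put rough numbers on the per-Char overhead, here is a back-of-the-envelope sketch. The figures are assumptions about 64-bit GHC that the post does not state: 8-byte words, 3 words per list cell (header, pointer to the Char, pointer to the tail) and 2 words for a boxed Char. All names are made up for illustration.

```haskell
-- Assumed heap layout on 64-bit GHC (not taken from the post).
wordBytes, consCellWords, boxedCharWords :: Integer
wordBytes      = 8
consCellWords  = 3  -- header + pointer to Char + pointer to tail
boxedCharWords = 2  -- header + payload

-- Bytes per character: list cell only, and list cell plus a fresh boxed Char.
bytesPerCharMin, bytesPerCharMax :: Integer
bytesPerCharMin = consCellWords * wordBytes
bytesPerCharMax = (consCellWords + boxedCharWords) * wordBytes

-- Estimated bytes allocated to build n characters once as a String.
stringBytes :: Integer -> (Integer, Integer)
stringBytes n = (n * bytesPerCharMin, n * bytesPerCharMax)

main :: IO ()
main = print (stringBytes (44 * 1000 * 1000))
-- prints (1056000000,1760000000)
```

Under these assumptions, building a 44 MB file as a String once costs between roughly 1 and 1.8 GB of allocation. Since lines, filter and unlines each build further intermediate lists, a total of several GB allocated is plausible, even though almost none of it is live at any one time.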

Answer 2

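Both readFile and hGetContents should be lazy. Try running your program with +RTS -s to see how much memory is actually used. What makes you think the whole file is being read into memory?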

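As for the second part of your question, lazy IO is sometimes the source of unexpected space leaks or resource leaks. That is not really the fault of lazy IO in and of itself, but determining whether it leaks requires analysing how it is used.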
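For comparison, a strict line-by-line version might look like the sketch below. It is only one way to do it, assuming strict ByteStrings from the bytestring package; the names isImport and printImports are hypothetical.

```haskell
import Control.Monad (unless, when)
import System.Environment (getArgs)
import System.IO (Handle, IOMode (ReadMode), hIsEOF, withFile)
import qualified Data.ByteString.Char8 as BS

-- True for lines that start with "import".
isImport :: BS.ByteString -> Bool
isImport = BS.isPrefixOf (BS.pack "import")

-- Read one line at a time; each line is fully read before it is inspected.
printImports :: Handle -> IO ()
printImports h = do
  eof <- hIsEOF h
  unless eof $ do
    line <- BS.hGetLine h
    when (isImport line) (BS.putStrLn line)
    printImports h

main :: IO ()
main = do
  args <- getArgs
  case args of
    -- withFile closes the handle when printImports returns or throws.
    [filename] -> withFile filename ReadMode printImports
    _          -> putStrLn "usage: test FILE"
```

Here the lifetime of the file handle is explicit, so there is no question of when the file gets closed or whether an unevaluated thunk is still holding on to it.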

Parsing large log files in Haskell
