如何在写入文件时压缩输出?

时间:2016-10-20 11:06:19

标签: haskell compression gzip

我有一个计算,与其他东西一起生成一些数据(很多),我想写入一个文件。

现在构建代码的方式是(简化):

Gzip

然后在更大的计算期间定期调用此函数。它几乎就像一个日志,实际上,同时写入多个文件。

现在我希望使用outStream = new GzipOutputStream(new FileOutputStream(path)) 压缩输出文件。 在像Java这样的语言中,我会做类似的事情:

writeRecord h r = hPut h ((compressed . toByteString) r)

然后将写入包装的输出流。

在Haskell中执行此操作的方式是什么? 我想写一些像

这样的东西
ByteString

是不正确的,因为单独压缩每个小位是没有效率的(我甚至尝试过它,压缩文件的大小比未压缩的文件大)。

我也不认为我可以生成一个懒惰的compressed . fromChunks(甚至是一个块列表),然后用allHeaderFields来编写它,因为这将需要我的"发电机"在记忆中建立完整的东西。并且同时生成多个文件的事实使其更加复杂。

那么在Haskell中解决这个问题的方法是什么?写入文件并对它们进行gzip压缩?

5 个答案:

答案 0 :(得分:5)

所有流媒体库都支持压缩。如果我了解特定问题以及您考虑的方式,io-streams可能是最简单的。在这里,我在写入trumpclinton输出流之间交替,这些输出流被写为压缩文件。我接着展示迈克尔pipes计划

conduit等价物
#!/usr/bin/env stack
-- stack --resolver lts-6.21 --install-ghc runghc --package io-streams
{-# LANGUAGE OverloadedStrings #-}

import qualified System.IO.Streams as IOS
import qualified System.IO as IO
import Data.ByteString (ByteString)

analyzer :: IOS.OutputStream ByteString -> IOS.OutputStream ByteString -> IO ()
analyzer clinton trump = do 
  IOS.write (Just "This is a string\n") clinton
  IOS.write (Just "This is a string\n") trump
  IOS.write (Just "Clinton string\n") clinton
  IOS.write (Just "Trump string\n") trump   
  IOS.write (Just "Another Clinton string\n") clinton
  IOS.write (Just "Another Trump string\n") trump   
  IOS.write Nothing clinton
  IOS.write Nothing trump

main:: IO ()
main = 
  IOS.withFileAsOutput "some-file-clinton.txt.gz" $ \clinton_compressed ->
  IOS.withFileAsOutput "some-file-trump.txt.gz" $ \trump_compressed -> do
     clinton <- IOS.gzip IOS.defaultCompressionLevel clinton_compressed
     trump <- IOS.gzip IOS.defaultCompressionLevel trump_compressed
     analyzer clinton trump

显然,您可以将IO中的所有analyzer混合在两个输出流的写入行为之间 - 我只是在write中显示,可以这么说。特别是,如果analyzer被理解为依赖于输入流,write可以依赖于输入流中的read s。 Here's一个(略微!)更复杂的程序。如果我运行上面的程序,我会看到

$ stack gzip_so.hs  
$ gunzip some-file-clinton.txt.gz 
$ gunzip some-file-trump.txt.gz 
$ cat some-file-clinton.txt 
This is a string
Clinton string
Another Clinton string
$ cat some-file-trump.txt 
This is a string
Trump string
Another Trump string

管道和导管有多种方法可以达到上述效果,部件分解程度更高。然而,写入单独的文件会更加微妙。在任何情况下,管道都相当于Michael S的管道程序:

#!/usr/bin/env stack
-- stack --resolver lts-6.21 --install-ghc runghc  --package pipes-zlib 
{-# LANGUAGE OverloadedStrings #-}
import Control.Monad.IO.Class (MonadIO, liftIO)
import Data.ByteString (ByteString, hPutStr)
import System.IO  (IOMode(..), withFile, Handle)
import Pipes  
import qualified Pipes.ByteString as PB
import qualified Pipes.GZip as P

-- Some helper function you may have
someAction :: IO ByteString
someAction = return "This is a string\n"

-- Original version
producerHandle :: Handle -> IO ()
producerHandle h = do
    str <- someAction
    hPutStr h str

producerPipe :: MonadIO m => Producer ByteString m ()
producerPipe = do
    str <- liftIO someAction
    yield str

main :: IO ()
main =  withFile "some-file-pipes.txt.gz"  WriteMode $ \h -> 
     runEffect $ P.compress P.defaultCompression producerPipe  >-> PB.toHandle h 

- 编辑

这里有什么值得的东西是用管道或管道在一个线程上叠加几个生产者的另一种方式,以添加迈克尔S和danidiaz提到的不同方法:

#!/usr/bin/env stack
-- stack --resolver lts-6.21 --install-ghc runghc --package pipes-zlib
{-# LANGUAGE OverloadedStrings #-}
import Pipes
import Pipes.GZip
import qualified Pipes.Prelude as P
import qualified Pipes.ByteString as Bytes
import System.IO
import Control.Monad (replicateM_)

producer = replicateM_ 50000 $ do
    marie  "This is going to Marie\n"  -- arbitary IO can be interspersed here
    arthur "This is going to Arthur\n" -- with liftIO
    sylvia "This is going to Sylvia\n" 
  where 
    marie = yield; arthur = lift . yield; sylvia = lift . lift . yield

sinkHelper h p = runEffect (compress bestSpeed p >-> Bytes.toHandle h)

main :: IO ()
main =  
   withFile "marie.txt.gz" WriteMode $ \marie ->
   withFile "arthur.txt.gz"  WriteMode $ \arthur -> 
   withFile "sylvia.txt.gz"  WriteMode $ \sylvia ->
      sinkHelper sylvia
      $ sinkHelper arthur
      $ sinkHelper marie
      $ producer

它非常简单快速,并且可以通过明显的改动在管道中书写 - 但发现它自然涉及更高水平的“monad变压器堆栈”的购买。观点看法。从类似streaming库的角度来看,编写这样一个程序是最自然的方式。

答案 1 :(得分:3)

Doing this with conduit is fairly straightforward, though you'd need to adjust your code a bit. I've put together an example of before and after code to demonstrate it. The basic idea is:

  • Replace hPutStr h with yield
  • Add some liftIO wrappers
  • Instead of using withBinaryFile or the like, use runConduitRes, gzip, and sinkFile

Here's the example:

#!/usr/bin/env stack
-- stack --resolver lts-6.21 --install-ghc runghc --package conduit-extra
{-# LANGUAGE OverloadedStrings #-}
import Control.Monad.IO.Class (MonadIO, liftIO)
import Data.ByteString (ByteString, hPutStr)
import Data.Conduit (ConduitM, (.|), yield, runConduitRes)
import Data.Conduit.Binary (sinkFile)
import Data.Conduit.Zlib (gzip)
import System.IO (Handle)

-- Some helper function you may have
someAction :: IO ByteString
someAction = return "This is a string\n"

-- Original version
producerHandle :: Handle -> IO ()
producerHandle h = do
    str <- someAction
    hPutStr h str

-- Conduit version
producerConduit :: MonadIO m => ConduitM i ByteString m ()
producerConduit = do
    str <- liftIO someAction
    yield str

main :: IO ()
main = runConduitRes $ producerConduit
                    .| gzip
                    .| sinkFile "some-file.txt.gz"

You can learn more about conduit in the conduit tutorial.

Your Java idea is interesting, give me a few more minutes, I'll add an answer that looks more like that.

EDIT

Here's a version that's closer to your Java style approach. It relies on a SinkFunc.hs module which is available as a Gist at: https://gist.github.com/snoyberg/283154123d30ff9e201ea4436a5dd22d

#!/usr/bin/env stack
-- stack --resolver lts-6.21 --install-ghc runghc --package conduit-extra
{-# LANGUAGE OverloadedStrings #-}
{-# OPTIONS_GHC -Wall -Werror #-}
import Data.ByteString (ByteString)
import Data.Conduit ((.|))
import Data.Conduit.Binary (sinkHandle)
import Data.Conduit.Zlib (gzip)
import System.IO (withBinaryFile, IOMode (WriteMode))
import SinkFunc (withSinkFunc)

-- Some helper function you may have
someAction :: IO ByteString
someAction = return "This is a string\n"

producerFunc :: (ByteString -> IO ()) -> IO ()
producerFunc write = do
    str <- someAction
    write str

main :: IO ()
main = withBinaryFile "some-file.txt.gz" WriteMode $ \h -> do
    let sink = gzip .| sinkHandle h
    withSinkFunc sink $ \write -> producerFunc write

EDIT 2 One more for good measure, actually using ZipSink to stream the data to multiple different files. There are lots of different ways of slicing this, but this is one way that works:

#!/usr/bin/env stack
-- stack --resolver lts-6.21 --install-ghc runghc --package conduit-extra
{-# LANGUAGE OverloadedStrings #-}
import Control.Monad.Trans.Resource (MonadResource)
import Data.ByteString (ByteString)
import Data.Conduit (ConduitM, (.|), yield, runConduitRes, ZipSink (..))
import Data.Conduit.Binary (sinkFile)
import qualified Data.Conduit.List as CL
import Data.Conduit.Zlib (gzip)

data Output = Foo ByteString | Bar ByteString

fromFoo :: Output -> Maybe ByteString
fromFoo (Foo bs) = Just bs
fromFoo _ = Nothing

fromBar :: Output -> Maybe ByteString
fromBar (Bar bs) = Just bs
fromBar _ = Nothing

producer :: Monad m => ConduitM i Output m ()
producer = do
    yield $ Foo "This is going to Foo"
    yield $ Bar "This is going to Bar"

sinkHelper :: MonadResource m
           => FilePath
           -> (Output -> Maybe ByteString)
           -> ConduitM Output o m ()
sinkHelper fp f
    = CL.mapMaybe f
   .| gzip
   .| sinkFile fp

main :: IO ()
main = runConduitRes
     $ producer
    .| getZipSink
            (ZipSink (sinkHelper "foo.txt.gz" fromFoo) *>
             ZipSink (sinkHelper "bar.txt.gz" fromBar))

答案 2 :(得分:2)

对于增量压缩,我认为您可以在compressIO中使用foldCompressStream / Codec.Compression.Zlib.Internal

如果您能够以IO (Maybe a)(例如MVar拍摄或InputStream / Chan阅读)来表示您的制作人操作,那么{{1表示输入的结束,这样的事情应该起作用:

Nothing

答案 3 :(得分:1)

此解决方案类似于Michael Snoyman的 EDIT 2 ,但使用foldl管道pipes-zlibstreaming-eversion

 {-# language OverloadedStrings #-}
module Main where

-- cabal install bytestring foldl pipes pipes-zlib streaming-eversion
import Data.Foldable
import Data.ByteString
import qualified Control.Foldl as L 
import Pipes 
import qualified Pipes.Prelude
import Pipes.Zlib (compress,defaultCompression,defaultWindowBits)
import Streaming.Eversion.Pipes (transvertMIO)
import System.IO

type Tag = String

producer :: Monad m => Producer (Tag,ByteString) m ()
producer = do
    yield $ ("foo","This is going to Foo")
    yield $ ("bar","This is going to Bar")

foldForTag :: Handle -> Tag -> L.FoldM IO (Tag,ByteString) ()
foldForTag handle tag = 
      L.premapM (\(tag',bytes) -> if tag' == tag then Just bytes else Nothing)
    . L.handlesM L.folded
    . transvertMIO (compress defaultCompression defaultWindowBits)
    $ L.mapM_ (Data.ByteString.hPut handle)

main :: IO ()
main = do
    withFile "foo.txt" WriteMode $ \h1 ->
        withFile "bar.txt" WriteMode $ \h2 ->
            let multifold = traverse_ (uncurry foldForTag) [(h1,"foo"),(h2,"bar")] 
            in  L.impurely Pipes.Prelude.foldM multifold producer

答案 4 :(得分:1)

此解决方案类似于Michael Snoyman的 EDIT 2 ,但使用streamingstreaming-bytestring,管道和pipes-zlib包。

{-# language OverloadedStrings #-}
module Main where

-- cabal install bytestring streaming streaming-bytestring pipes pipes-zlib 
import Data.ByteString
import qualified Data.ByteString.Streaming as B
import Streaming
import qualified Streaming.Prelude as S
import Pipes (next)
import qualified Pipes.Prelude 
import Pipes.Zlib (compress,defaultCompression,defaultWindowBits)
import System.IO

type Tag = String

producer :: Monad m => Stream (Of (Tag,ByteString)) m ()
producer = do
    S.yield ("foo","This is going to Foo")
    S.yield ("bar","This is going to Bar")

-- I couldn't find a streaming-zlib on Hackage, took a pipes detour
compress' :: MonadIO m 
          => Stream (Of ByteString) m r -> Stream (Of ByteString) m r 
compress' = S.unfoldr Pipes.next
          . compress defaultCompression defaultWindowBits
          . Pipes.Prelude.unfoldr S.next     

keepTag :: Monad m 
        => Tag -> Stream (Of (Tag,ByteString)) m r -> Stream (Of ByteString) m r 
keepTag tag = S.map snd . S.filter ((tag==) . fst)

main :: IO ()
main = runResourceT 
     . B.writeFile "foo.txt" . B.fromChunks . compress' .  keepTag "foo"  
     . B.writeFile "bar.txt"  . B.fromChunks . compress' . keepTag "bar"  
     $ S.copy producer

我使用copy中的Streaming.Prelude功能,允许您

  

复制流的内容,以便可以对其执行两次操作   不同的方式,但没有打破流媒体。