Optimizing a Haskell XML parser

Date: 2011-04-19 08:14:52

Tags: xml performance haskell

I'm experimenting with Haskell and enjoying the experience very much, but I'm evaluating it for a real project with some fairly strict performance requirements. The first pass of my task is to process a complete (no-history) dump of Wikipedia (bzipped) - about 6 GB compressed in total. In Python, a script to do a full extraction of the raw text of every page (about 10 million in total) takes about 30 minutes on my box (for reference, a Scala implementation using a pull parser takes about 40 minutes). I've been trying to replicate this performance with Haskell and GHC, and have been struggling to match it.

I've been using Codec.Compression.BZip for decompression and hexpat for parsing. I use lazy bytestrings as the input to hexpat and strict bytestrings for the element text type. To extract the text of each page, I build a DList of pointers to the text elements and then iterate over it to dump it to stdout. The code I've just described has been through a number of profiling/refactoring iterations (I quickly moved from Strings to ByteStrings, then from string concatenation to lists of pointers to the text - and then to DLists of pointers to the text). I think I've gained about 2 orders of magnitude of speedup over my original code, but it still takes an hour and a half to parse (although it has a lovely small memory footprint). So I'm looking to the community for some inspiration to push it further. The code is below (I've broken it up into a number of small functions in order to get more detail out of the profiler). Please excuse my Haskell - I've only been coding in it for a few days (having spent a week with Real World Haskell). And thanks in advance!

import System.Exit
import Data.Maybe
import Data.List
import Data.DList (DList)
import qualified Data.DList as DList

import Data.ByteString.Char8 (ByteString)
import qualified Data.ByteString.Char8 as BS
import qualified Data.ByteString.Lazy as LazyByteString
import qualified Codec.Compression.BZip as BZip

import Text.XML.Expat.Proc
import Text.XML.Expat.Tree
import Text.XML.Expat.Format

testFile = "../data/enwiki-latest-pages-articles.xml.bz2"

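-- A page only counts if both its title and its text were found.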
validPage pageData = case pageData of
    (Just _, Just _) -> True
    (_, _) -> False

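-- Gather every text node beneath an element into a difference list.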
scanChildren :: [UNode ByteString] -> DList ByteString
scanChildren c = case c of
    h:t -> DList.append (getContent h) (scanChildren t)
    []  -> DList.fromList []

getContent :: UNode ByteString -> DList ByteString
getContent treeElement =
    case treeElement of
        (Element name attributes children)  -> scanChildren children
        (Text text)                         -> DList.fromList [text]

rawData t = ((getContent.fromJust.fst) t, (getContent.fromJust.snd) t)

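-- Navigate <page> -> <revision> -> <text>.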
extractText page = do
    revision <- findChild (BS.pack "revision") page
    text <- findChild (BS.pack "text") revision
    return text

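-- Pair each <page> element with its title and text, dropping incomplete pages.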
pageDetails tree =
    let pageNodes = filterChildren relevantChildren tree in
    let getPageData page = (findChild (BS.pack "title") page, extractText page) in
    map rawData $ filter validPage $ map getPageData pageNodes
    where
        relevantChildren node = case node of
            (Element name attributes children) -> name == (BS.pack "page")
            (Text _) -> False

outputPages pagesText = do
    let flattenedPages = map DList.toList pagesText
    mapM_ (mapM_ BS.putStr) flattenedPages

readCompressed fileName = fmap BZip.decompress (LazyByteString.readFile fileName)
parseXml byteStream = parse defaultParseOptions byteStream :: (UNode ByteString, Maybe XMLParseError)

main = do
    rawContent <- readCompressed testFile
    let (tree, mErr) = parseXml rawContent
    let pages = pageDetails tree
    let pagesText = map snd pages
    outputPages pagesText
    putStrLn "Complete!"
    exitWith ExitSuccess

1 Answer:

Answer 0 (score: 5):

After running your program, I got some strange results:

./wikiparse  +RTS -s -A5m -H5m | tail
./wikiparse +RTS -s -A5m -H5m
3,604,204,828,592 bytes allocated in the heap
  70,746,561,168 bytes copied during GC
      39,505,112 bytes maximum residency (37822 sample(s))
       2,564,716 bytes maximum slop
              83 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0: 620343 collections,     0 parallel, 15.84s, 368.69s elapsed
  Generation 1: 37822 collections,     0 parallel,  1.08s, 33.08s elapsed

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time  243.85s  (4003.81s elapsed)
  GC    time   16.92s  (401.77s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time  260.77s  (4405.58s elapsed)

  %GC time       6.5%  (9.1% elapsed)

  Alloc rate    14,780,341,336 bytes per MUT second

  Productivity  93.5% of total user, 5.5% of total elapsed

I think the total time is more than OK: 260s is much faster than the 30m for Python. I have no idea, though, why the elapsed time is so large: I really don't think reading a 6 GB file should take over an hour to complete.

I'm running your program again to check whether the results are consistent.

If that result of 4'20'' is correct, then I believe something is wrong with the machine... or there is some other strange effect at work here.

The code was compiled with GHC 7.0.2.


EDIT: I've tried various versions of this program. The most important optimization turned out to be the {-# INLINE #-} pragma and specialization of the functions. Some of them have fairly generic types, which are known to be bad for performance. OTOH, I believe inlining should be enough to trigger specialization, so you should try to experiment further with this.
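As a hypothetical illustration of the point about generic types (this sketch and the made-up function sumLen are my own addition, not part of either program): a function constrained only by a type class is compiled to dispatch through a class dictionary at runtime, whereas {-# INLINE #-} or {-# SPECIALIZE #-} lets GHC produce a monomorphic copy:

module Main where

-- Deliberately generic: with only a "Num a" constraint, each (+) and
-- fromIntegral call goes through a dictionary passed in at runtime,
-- which is costly inside a hot loop.
sumLen :: Num a => [[b]] -> a
sumLen = go 0
  where
    go acc []     = acc
    go acc (x:xs) = go (acc + fromIntegral (length x)) xs
{-# INLINE sumLen #-}
-- Alternatively, keep it out of line but ask GHC for a monomorphic
-- copy at the type actually used:
-- {-# SPECIALIZE sumLen :: [[Int]] -> Int #-}

main :: IO ()
main = print (sumLen [["a", "b"], ["c"]] :: Int)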

I didn't see any significant difference between the GHC versions I tried (6.12 .. HEAD).

The Haskell bindings to bzlib seem to have optimal performance. The following program, a near-complete reimplementation of the standard bzcat program, is as fast as or even faster than the original.

module Main where

import qualified Data.ByteString.Lazy as BSL
import qualified Codec.Compression.BZip as BZip
import System.Environment (getArgs)

readCompressed fileName = fmap (BZip.decompress) (BSL.readFile fileName)

main :: IO ()
main = do
    files <- getArgs
    mapM_ (\f -> readCompressed f >>= BSL.putStr) files                 

On my machine it takes about 1100 seconds to decompress the test file to /dev/null. The fastest version I was able to get is based on a SAX-style parser. I'm not sure whether its output matches that of the original. On small inputs the output is identical, and so is the performance. On the original file the SAX version is somewhat faster and finishes in ~2400 seconds. You can find it below.

{-# LANGUAGE OverloadedStrings #-}

import System.Exit
import Data.Maybe

import Data.ByteString.Char8 (ByteString)
import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as BSL
import qualified Codec.Compression.BZip as BZip

import System.IO

import Text.XML.Expat.SAX as SAX

type ByteStringL = BSL.ByteString
type Token = ByteString
type TokenParser = [SAXEvent Token Token] -> [[Token]]

testFile = "/tmp/enwiki-latest-pages-articles.xml.bz2"


readCompressed :: FilePath -> IO ByteStringL
readCompressed fileName = fmap (BZip.decompress) (BSL.readFile fileName)

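-- The functions below form a hand-written state machine over the SAX
-- event stream: pageStart -> titleStart -> revisionStart -> textStart
-- -> textNode, with finish skipping ahead to a given closing tag.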
{-# INLINE pageStart #-}
pageStart :: TokenParser
pageStart ((StartElement "page" _):xs) = titleStart xs
pageStart (_:xs) = pageStart xs
pageStart [] = []

{-# INLINE titleStart #-}
titleStart :: TokenParser
titleStart ((StartElement "title" _):xs) = finish "title" revisionStart xs
titleStart ((EndElement "page"):xs) = pageStart xs
titleStart (_:xs) = titleStart xs
titleStart [] = error "could not find <title>"


{-# INLINE revisionStart #-}
revisionStart :: TokenParser
revisionStart ((StartElement "revision" _):xs) = textStart xs
revisionStart ((EndElement "page"):xs) = pageStart xs
revisionStart (_:xs) = revisionStart xs
revisionStart [] = error "could not find <revision>"

{-# INLINE textStart #-}
textStart :: TokenParser
textStart ((StartElement "text" _):xs) = textNode [] xs
textStart ((EndElement "page"):xs) = pageStart xs
textStart (_:xs) = textStart xs
textStart [] = error "could not find <text>"

{-# INLINE textNode #-}
textNode :: [Token] -> TokenParser
textNode acc ((CharacterData txt):xs) = textNode (txt:acc) xs
textNode acc xs = (reverse acc) : textEnd xs

{-# INLINE textEnd #-}
textEnd {- , revisionEnd, pageEnd -} :: TokenParser
textEnd = finish "text" . finish "revision" . finish "page" $ pageStart
--revisionEnd = finish "revision" pageEnd
--pageEnd = finish "page" pageStart

{-# INLINE finish #-}
finish :: Token -> TokenParser -> TokenParser
finish tag cont ((EndElement el):xs) | el == tag = cont xs
finish tag cont (_:xs) = finish tag cont xs
finish tag _ [] = error (show (tag,("finish []" :: String)))

main :: IO ()
main = do
  rawContent <- readCompressed testFile
  let parsed = (pageStart (SAX.parse defaultParseOptions rawContent))
  mapM_ (mapM_ BS.putStr) ({- take 5000 -} parsed) -- remove comment to finish early
  putStrLn "Complete!"
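
For readers unfamiliar with hexpat's SAX interface, here is a minimal sketch of the kind of event stream the state machine above consumes (the sketch is my own addition, not part of the answer, and relies on SAXEvent's derived Show instance for printing):

module Main where

import Data.ByteString.Char8 (ByteString)
import qualified Data.ByteString.Lazy.Char8 as BSLC
import Text.XML.Expat.SAX as SAX

main :: IO ()
main = mapM_ print events
  where
    -- hexpat turns a document into a lazy list of events such as
    -- StartElement "page" [], CharacterData "body", EndElement "page";
    -- pageStart/titleStart/... above simply pattern-match on that list.
    events :: [SAXEvent ByteString ByteString]
    events = SAX.parse defaultParseOptions
        (BSLC.pack "<page><title>T</title><revision><text>body</text></revision></page>")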

In general, I'd be suspicious that the Python and Scala versions are finishing the job early. Without their source code I can't verify that claim, though.

To sum it up: inlining and specialization should give a reasonable, roughly twofold, improvement in performance.