Question

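I am trying to implement Kosaraju's graph algorithm on a 3.5 million line file, where each line holds two space-separated Ints representing a graph edge. To start, I need to build a summary data structure that holds each node together with lists of its incoming and outgoing edges. The code below does that, but it takes over a minute, whereas posts on the MOOC forum show people using other languages finishing in well under 10 s. (getLines alone takes 10 s, compared with under 1 s in benchmarks I have read about.)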

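I am new to Haskell and have implemented an accumulation approach using foldl' (the ' was the breakthrough that made it terminate at all), but it feels rather imperative in style, and I am hoping that this is why it runs slowly. I am also planning to use a similar pattern for the depth-first search, and I worry that it will all be too slow.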

I have found this presentation and this blog, which discuss this kind of problem, but at too expert a level for me.

import System.IO
import Control.Monad
import Data.Map.Strict (Map)
import qualified Data.Map.Strict as Map
import qualified Data.List as L

type NodeName = Int
type Edges = [NodeName]
type Explored = Bool

data Node = Node Explored (Edges, Edges) deriving (Show)

type Graph1 = Map NodeName Node

getLines :: FilePath -> IO [[Int]]
getLines = liftM (fmap (fmap read . words) . lines) . readFile

getLines' :: FilePath -> IO [(Int,Int)]
getLines' = liftM (fmap (tuplify2 . fmap read . words) . lines) . readFile

tuplify2 :: [a] -> (a,a)
tuplify2 [x,y] = (x,y)

main = do
    list <- getLines "testdata.txt"  -- [[Int]]
    --list <- getLines "SCC.txt"  -- [[Int]]
    let
        list' = createGraph list
    return list'

createGraph :: [[Int]] -> Graph1
createGraph xs = L.foldl' build Map.empty xs
    where
        build :: Graph1-> [Int] -> Graph1
        build acc (x:y:_) =
            -- First record y as an outgoing edge of x ...
            let tmpAcc = case Map.lookup x acc of
                    Nothing -> Map.insert x (Node False ([y],[])) acc
                    Just _  -> Map.adjust (\(Node _ (fwd, bck)) -> Node False (y:fwd, bck)) x acc
            -- ... then record x as an incoming edge of y.
            in case Map.lookup y tmpAcc of
                    Nothing -> Map.insert y (Node False ([],[x])) tmpAcc
                    Just _  -> Map.adjust (\(Node _ (fwd, bck)) -> Node False (fwd, x:bck)) y tmpAcc

Answer 1

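Using maps:

Use IntMap or HashMap when possible. Both are significantly faster than Map for Int keys. HashMap is usually faster than IntMap but uses more memory and has a less rich library.
Don't do unnecessary lookups. The containers package has a large number of specialized functions. With alter, the number of lookups can be halved compared to the createGraph implementation in the question.

Example for createGraph: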

import Data.List (foldl')
import qualified Data.IntMap.Strict as IM

type NodeName = Int
type Edges = [NodeName]
type Explored = Bool

data Node = Node Explored Edges Edges deriving (Eq, Show)
type Graph1 = IM.IntMap Node

createGraph :: [(Int, Int)] -> Graph1
createGraph xs = foldl' build IM.empty xs
    where
        addFwd y (Just (Node _ f b)) = Just (Node False (y:f) b)
        addFwd y _                   = Just (Node False [y] [])
        addBwd x (Just (Node _ f b)) = Just (Node False f (x:b))
        addBwd x _                   = Just (Node False [] [x])

        build :: Graph1 -> (Int, Int) -> Graph1
        build acc (x, y) = IM.alter (addBwd x) y $ IM.alter (addFwd y) x acc
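
To make the "halved lookups" point concrete, here is a minimal sketch that contrasts the lookup-then-update pattern from the question with a single alter call, on a plain counter map rather than the graph type. The names bumpTwoPass, bumpOnePass and countKeys are made up for this illustration.

```haskell
import Data.List (foldl')
import qualified Data.IntMap.Strict as IM

-- Two traversals per key: one to look it up, one to insert or adjust.
bumpTwoPass :: Int -> IM.IntMap Int -> IM.IntMap Int
bumpTwoPass k m = case IM.lookup k m of
    Nothing -> IM.insert k 1 m
    Just _  -> IM.adjust (+ 1) k m

-- One traversal per key: alter passes Nothing for a missing key and
-- Just v for a present one, and stores whatever the function returns.
bumpOnePass :: Int -> IM.IntMap Int -> IM.IntMap Int
bumpOnePass = IM.alter step
  where
    step Nothing  = Just 1
    step (Just n) = Just (n + 1)

-- Hypothetical helper: count how often each key occurs in a list.
countKeys :: (Int -> IM.IntMap Int -> IM.IntMap Int) -> [Int] -> IM.IntMap Int
countKeys bump = foldl' (flip bump) IM.empty
```

Both versions build the same map; only the number of traversals per element differs.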

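Using vectors:

Consider the efficient constructors (accumulators, unfolds, generate, iterate, constructN, etc.). These may use mutation behind the scenes, but they are considerably more convenient to use than actual mutable vectors.
In the more general case, use the laziness of boxed vectors to allow self-reference while constructing a vector.
Use unboxed vectors when possible.
Use unsafe functions when you are absolutely sure about the bounds.
Only use mutable vectors if there is no pure alternative. In that case, prefer the ST monad to IO. Also, avoid creating many mutable heap objects (i.e. prefer mutable vectors to immutable vectors of mutable references).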
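
A few of the points above can be sketched with the vector package as follows. The functions squares, outDegrees and fibs are illustrative names, not part of the question's code, and node ids are assumed to lie in 0 .. n-1.

```haskell
import qualified Data.Vector as V
import qualified Data.Vector.Unboxed as U

-- generate: element i is computed from its index.
squares :: Int -> U.Vector Int
squares n = U.generate n (\i -> i * i)

-- accumulate: start from zeros and fold (index, value) pairs in.
-- Here it counts the outgoing edges of every node.
outDegrees :: Int -> [(Int, Int)] -> U.Vector Int
outDegrees n edges =
    U.accumulate (+) (U.replicate n 0) (U.fromList [(from, 1) | (from, _) <- edges])

-- Laziness of boxed vectors: each element may refer to other elements
-- of the vector being defined. Unboxed vectors cannot do this.
fibs :: Int -> V.Vector Integer
fibs n = table
  where
    table = V.generate n fib
    fib 0 = 0
    fib 1 = 1
    fib i = table V.! (i - 1) + table V.! (i - 2)
```

The unboxed versions store plain machine integers, while fibs relies on the boxed vector holding unevaluated elements until they are first indexed.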

Example for createGraph:

import qualified Data.Vector as V

type NodeName = Int
type Edges = [NodeName]
type Explored = Bool

data Node = Node Explored Edges Edges deriving (Eq, Show)
type Graph1 = V.Vector Node

createGraph :: Int -> [(Int, Int)] -> Graph1
createGraph maxIndex edges = graph'' where
    -- Length is maxIndex + 1 so that maxIndex itself is a valid index.
    graph    = V.replicate (maxIndex + 1) (Node False [] [])
    graph'   = V.accum (\(Node e f b) x -> Node e (x:f) b) graph  edges
    graph''  = V.accum (\(Node e f b) x -> Node e f (x:b)) graph' (map (\(a, b) -> (b, a)) edges)

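Note that if there are gaps in the range of the node indices, it would be wise to do one of the following:

Contiguously relabel the indices before doing anything else.
Introduce an empty constructor to Node to signify a missing index.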
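
The first option, contiguous relabeling, could look roughly like the sketch below. It assumes the whole edge list fits in memory and walks it twice; denseIds and relabel are hypothetical helper names.

```haskell
import qualified Data.IntMap.Strict as IM
import qualified Data.IntSet as IS

-- Map every node id that occurs in the edge list to a dense index
-- 0 .. n-1, in increasing order of the original ids.
denseIds :: [(Int, Int)] -> IM.IntMap Int
denseIds edges = IM.fromDistinctAscList (zip (IS.toAscList ids) [0 ..])
  where
    ids = IS.fromList (concatMap (\(a, b) -> [a, b]) edges)

-- Return the number of distinct nodes together with the relabeled edges.
relabel :: [(Int, Int)] -> (Int, [(Int, Int)])
relabel edges = (IM.size table, [(table IM.! a, table IM.! b) | (a, b) <- edges])
  where
    table = denseIds edges
```

The node count returned by relabel is the vector length the graph needs, since the new ids run from 0 to that count minus one.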

Faster I/O:

Use the IO functions from Data.Text or Data.ByteString. In both cases there are also efficient functions for breaking input into lines or words.

Example:

import qualified Data.ByteString.Char8 as BS
import System.IO

getLines :: FilePath -> IO [(Int, Int)]
getLines path = do
    lines <- (map BS.words . BS.lines) `fmap` BS.readFile path
    let pairs = (map . map) (maybe (error "can't read Int") fst . BS.readInt) lines
    return [(a, b) | [a, b] <- pairs]
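
As a variation on the same idea, the parsing step can also be written without building the intermediate lists of words, by calling BS.readInt directly on the remaining input. This is only a sketch: unlike the example above, it treats the file as a flat stream of numbers and does not check that each line holds exactly two of them. nextEdge and parseEdges are made-up names.

```haskell
import qualified Data.ByteString.Char8 as BS
import Data.Char (isSpace)
import Data.List (unfoldr)

-- Read one "from to" pair from the front of the input, skipping any
-- whitespace (including newlines) before each number.
nextEdge :: BS.ByteString -> Maybe ((Int, Int), BS.ByteString)
nextEdge input = do
    (from, rest1) <- BS.readInt (BS.dropWhile isSpace input)
    (to, rest2)   <- BS.readInt (BS.dropWhile isSpace rest1)
    return ((from, to), rest2)

-- Parsing stops silently at the first position that is not an Int.
parseEdges :: BS.ByteString -> [(Int, Int)]
parseEdges = unfoldr nextEdge
```

It would be used as `parseEdges <$> BS.readFile path`.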

Benchmarking:

Always do it, unlike me in this answer. Use criterion.
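
A minimal criterion setup might look like the sketch below. It benchmarks a small stand-in function on synthetic data instead of the real file; degreeMap, ringEdges and runBenchmarks are hypothetical names chosen for this example.

```haskell
import Criterion.Main (bench, bgroup, defaultMain, nf)
import qualified Data.IntMap.Strict as IM

-- Hypothetical pure stand-in for the graph builder under test.
degreeMap :: [(Int, Int)] -> IM.IntMap Int
degreeMap = IM.fromListWith (+) . map (\(from, _) -> (from, 1))

-- Hypothetical synthetic input: a ring of n nodes.
ringEdges :: Int -> [(Int, Int)]
ringEdges n = [(i, (i + 1) `mod` n) | i <- [0 .. n - 1]]

-- nf forces the whole result on every run, so the map is really built.
runBenchmarks :: IO ()
runBenchmarks = defaultMain
    [ bgroup "graph"
        [ bench "degreeMap/10k" $ nf degreeMap (ringEdges 10000)
        ]
    ]
```

Setting `main = runBenchmarks` gives an executable that prints timing statistics. For IO actions such as reading a file, criterion provides whnfIO and nfIO in place of whnf and nf.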

Answer 2

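Based on András's suggestions, I have reduced a 113-second task to 24 seconds (measured by stopwatch, as I can't quite get Criterion to do anything yet), and then down to 10 by compiling with -O2! I attended some courses last year that talked about the challenge of optimising for large datasets, but this is the first time I have faced a problem that actually involved one, and it was as non-trivial as my instructors suggested. This is what I have now: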

import System.IO
import Control.Monad
import Criterion.Main (bench, bgroup, defaultMain, whnfIO)
import Data.List (foldl')
import qualified Data.IntMap.Strict as IM
import qualified Data.ByteString.Char8 as BS

type NodeName = Int
type Edges = [NodeName]
type Explored = Bool

data Node = Node Explored Edges Edges deriving (Eq, Show)
type Graph1 = IM.IntMap Node

-- DFS uses a stack to store next points to explore, a list can do this
type Stack = [(NodeName, NodeName)]

getBytes :: FilePath -> IO [(Int, Int)]
getBytes path = do
    lines <- (map BS.words . BS.lines) `fmap` BS.readFile path
    let
        pairs = (map . map) (maybe (error "Can't read integers") fst . BS.readInt) lines
    return [(a,b) | [a,b] <- pairs]

main :: IO ()
main = do
    list <- getBytes "SCC.txt"  -- [(Int, Int)]
    let list' = createGraph' list
    print (list' IM.! 66)

-- Criterion entry point; swap it in for main to run the benchmark.
-- whnfIO is used because the benchmarked value is an IO action.
bmark :: IO ()
bmark = defaultMain [
    bgroup "1" [
        bench "Sim test" $ whnfIO (bmark' "SCC.txt")
        ]
    ]

bmark' :: FilePath -> IO ()
bmark' path = do
    list <- getBytes path
    let
        list' = createGraph' list
    print (list' IM.! 2)


createGraph' :: [(Int, Int)] -> Graph1
createGraph' xs = foldl' build IM.empty xs
    where
        addFwd y (Just (Node _ f b)) = Just (Node False (y:f) b)
        addFwd y _                   = Just (Node False [y] [])
        addBwd x (Just (Node _ f b)) = Just (Node False f (x:b))
        addBwd x _                   = Just (Node False [] [x])

        build :: Graph1 -> (Int, Int) -> Graph1
        build acc (x, y) = IM.alter (addBwd x) y $ IM.alter (addFwd y) x acc

Now on with the rest of the exercise...

Answer 3

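This is not really an answer; I would rather comment on András Kovács's post, if I had the 50 points...

I have implemented the loading of the graph with both IntMap and MVector, in an attempt to benchmark mutability against immutability.

Both programs use Attoparsec for the parsing. There are surely more economical ways to do it, but Attoparsec is relatively fast given its high level of abstraction (the parser can fit on one line). The guideline is to avoid String and read: read is partial and slow, and [Char] is slow and not memory efficient unless properly fused.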
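
The linked code is not reproduced here, so the following is only a sketch of what a compact Attoparsec parser for this input could look like, not the parser from the Gist. It assumes exactly one space between the two numbers and no trailing whitespace on a line; edge, edgeList and parseEdgeFile are made-up names.

```haskell
import Data.Attoparsec.ByteString.Char8
    (Parser, char, decimal, endOfLine, parseOnly, sepBy)
import qualified Data.ByteString.Char8 as BS

-- One edge: two decimal numbers separated by a single space.
edge :: Parser (Int, Int)
edge = (,) <$> decimal <* char ' ' <*> decimal

-- Edges separated by line breaks.
edgeList :: Parser [(Int, Int)]
edgeList = edge `sepBy` endOfLine

-- parseOnly does not demand end of input, so anything after the last
-- well-formed edge (such as a final newline) is ignored.
parseEdgeFile :: BS.ByteString -> Either String [(Int, Int)]
parseEdgeFile = parseOnly edgeList
```

It would be applied to the result of `BS.readFile path`.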

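As András Kovács pointed out, IntMap is better than Map for Int keys. My code provides another example of alter usage. If the node identifiers are densely packed, you may also want to use Vector or Array. They allow O(1) indexing by identifier.

The mutable version grows the MVector exponentially on demand. This avoids having to know an exact upper bound on the node identifiers, but it introduces more complexity (the reference to the vector may change).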
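
The growth-on-demand idea can be sketched as follows. This is not the code from the Gist; it is a reduced example that only counts outgoing edges, assumes non-negative node ids, and uses the hypothetical names ensureIndex and outDegrees.

```haskell
import Control.Monad.ST (ST, runST)
import qualified Data.Vector.Unboxed as U
import qualified Data.Vector.Unboxed.Mutable as MU

-- Return a vector long enough to hold index i, doubling the length as
-- often as needed. The new slots are zeroed explicitly because grow
-- leaves them uninitialised.
ensureIndex :: MU.MVector s Int -> Int -> ST s (MU.MVector s Int)
ensureIndex vec i
    | i < len   = return vec
    | otherwise = do
        let newLen = until (> i) (* 2) (max 1 len)
        bigger <- MU.grow vec (newLen - len)
        mapM_ (\j -> MU.write bigger j 0) [len .. newLen - 1]
        return bigger
  where
    len = MU.length vec

-- Count outgoing edges per node without knowing the largest node id.
outDegrees :: [(Int, Int)] -> U.Vector Int
outDegrees edges = runST $ do
    start <- MU.replicate 1 0
    final <- go start edges
    U.freeze final
  where
    go vec []                 = return vec
    go vec ((from, _) : rest) = do
        vec' <- ensureIndex vec from   -- the vector reference may change here
        MU.modify vec' (+ 1) from
        go vec' rest
```

Because growing yields a new vector, the loop has to thread the current vector through every step, which is the extra complexity mentioned above.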

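I benchmarked with a file of 5M edges, with identifiers in the range [0..2^16]. The MVector version is about 2x faster than the IntMap code (12 s vs 25 s on my machine).

The code is here [Gist].

I will edit this when I have done more profiling on my side.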

Optimising Haskell data reading from file
