Most efficient data structure for finding the most frequent items

Time: 2015-07-10 18:52:02

Tags: sorting haskell data-structures word-frequency

I want to extract the most frequent words from the Google N-Grams dataset, which is about 20 GB in its uncompressed format. I don't want the whole dataset, just the 5000 most frequent words. But if I write

take 5000 $ sortBy (flip $ comparing snd) dataset
-- dataset :: IO [(word::String, frequency::Int)]
it will be an endless wait. What should I do instead?

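I know that the Data.Array.MArray package is available for in-place array computation, but I can't see any functions for modifying items on its documentation page. There is also Data.HashTable.IO, but it is an unordered data structure.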

I'd like to use a simple Data.IntMap.Strict (with its convenient lookupLE function), but I don't think it will be very efficient, because it produces a new map on every alteration. Could the ST monad improve that?

UPD: I've also posted the final version of the program on CoreReview.SX.

2 answers:

Answer 0: (score: 5)

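How about

  • using splitAt to partition the dataset into the first 5000 items and the rest,
  • sorting the first 5000 items by frequency (ascending),
  • going through the rest:
    • if an item has a higher frequency than the lowest frequency in the sorted items,
    • drop the lowest-frequency item from the sorted items, and
    • insert the new item at its proper place in the sorted items.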

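The process is then effectively linear in the size of the dataset, although the constant factor improves if the 5000 sorted elements are kept in a data structure with sub-linear delete-min and insertion.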
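The steps above can be sketched with nothing more than a sorted list from base. This is only an illustration under stated assumptions: the name `mostFreqList` is hypothetical, the result comes back in ascending order of frequency, and each replacement costs O(n) because `insertBy` walks the list, which is exactly the cost a heap avoids.

```haskell
import Data.List (foldl', insertBy, sortBy)
import Data.Ord (comparing)

-- Hypothetical helper: keep the n most frequent pairs in an ascending list.
mostFreqList :: Int -> [(String, Int)] -> [(String, Int)]
mostFreqList n dataset = foldl' step start rest
  where
    -- the first n items seed the list, the rest are streamed through
    (first, rest) = splitAt n dataset
    -- ascending by frequency, so the head is always the least frequent item
    start = sortBy (comparing snd) first
    -- nothing is kept when n == 0
    step [] _ = []
    step kept@(lowest : others) item
      -- drop the current minimum and insert the new item at its proper place
      | snd item > snd lowest = insertBy (comparing snd) item others
      | otherwise = kept
```

For instance, `mostFreqList 2 [("a",1),("b",5),("c",3),("d",4)]` keeps the pairs for "d" and "b".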

For example, using Data.Heap from the heap package:

import Data.List (foldl')
import Data.Maybe (fromJust)
import Data.Heap hiding (splitAt)

mostFreq :: Int -> [(String, Int)] -> [(String, Int)]
mostFreq n dataset = final
  where
    -- change our pairs from (String,Int) to (Int,String)
    pairs = map swap dataset
    -- get the first `n` pairs in one list, and the rest of the pairs in another
    (first, rest) = splitAt n pairs
    -- put all the first `n` pairs into a MinHeap
    start = fromList first :: MinHeap (Int, String)
    -- then run through the rest of the pairs
    stop = foldl' step start rest
    -- modifying the heap to replace its least frequent pair
    -- with the new pair if the new pair is more frequent
    step heap pair = if viewHead heap < Just pair
                       then insert pair (fromJust $ viewTail heap)
                       else heap
    -- turn our heap of (Int, String) pairs into a list of (String,Int) pairs
    final = map swap (toList stop)
    swap ~(a,b) = (b,a)

Answer 1: (score: 1)

Did you try this, or are you just guessing? Many Haskell sort functions respect laziness, so when you only ask for the top 5000 they will happily avoid sorting the rest of the elements.
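A minimal sketch of that lazy approach, using only base. The name `topBySort` is hypothetical. Note one caveat that is independent of laziness: the sort cannot emit its first element before it has consumed the whole input, so the full list still has to fit in memory; laziness only saves the work of ordering the elements that are never demanded.

```haskell
import Data.List (sortBy)
import Data.Ord (Down (..), comparing)

-- Hypothetical helper: sort by descending frequency and demand only the first k results.
topBySort :: Int -> [(String, Int)] -> [(String, Int)]
topBySort k = take k . sortBy (comparing (Down . snd))
```

With the question's numbers this would be called as `topBySort 5000 dataset`, once `dataset` has been read out of IO.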

Similarly, be very careful about "it produces a new map on each alteration". Most insertion operations on such a data structure are O(log n), with n bounded at 5000, so you might allocate ~30 new cells on the heap per alteration. That is not a particularly large cost, and certainly nowhere near 5000.

If Data.List.sort does not work well enough, what you need is:

import Data.List (foldl')
import Data.IntMap.Strict (IntMap)
import qualified Data.IntMap.Strict as IM

type Freq = Int
type Count = Int
-- Items grouped by frequency, plus cached facts about the least-frequent group.
data Summarizer x = Summ { tracking :: !(IntMap [x]), least :: !Freq
                         , size :: !Count, size_of_least :: !Count }

inserting :: x -> Maybe [x] -> Maybe [x]
inserting x Nothing = Just [x]
inserting x (Just xs) = Just (x:xs)

sizeLimit :: Summarizer x -> Summarizer x
sizeLimit skip@(Summ strs f_l tot lst) 
    | tot - lst < 5000 = skip
    | otherwise        = Summ strs' f_l' tot' lst'
        where -- deleteFindMin returns ((key, value), rest); only the value list is needed
              ((_, discarded), strs') = IM.deleteFindMin strs
              (f_l', new_least) = IM.findMin strs'
              tot' = tot - length discarded
              lst' = length new_least

addEl :: (x, Freq) -> Summarizer x -> Summarizer x
addEl (str, f) skip@(Summ strs f_l tot lst)
    | f < f_l && tot >= 5000 = skip
    | otherwise              = sizeLimit $ Summ strs' f_l' tot' lst'
        where strs' = IM.alter (inserting str) f strs
              tot' = tot + 1
              f_l' = min f_l f
              lst' = case compare f_l f of LT -> lst; EQ -> lst + 1; GT -> 1

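Notice that we store lists of strings to deal with duplicate frequencies. Most of the time we skip the update entirely; when we do update, it takes an O(log n) operation to put the new element in, sometimes (again depending on duplication) an O(log n) operation to remove the smallest elements, and an O(log n) operation to find the new smallest.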
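The same bounded-size idea can be sketched more compactly with Data.Set from the containers package. This swaps in a different technique from the IntMap-of-lists code above: the set stores (frequency, word) pairs directly, so equal frequencies are kept apart by the word instead of by a list. The name `topK` is hypothetical, and unlike the code above this sketch always performs the insert rather than skipping low-frequency items up front.

```haskell
import Data.List (foldl')
import qualified Data.Set as Set

-- Hypothetical helper: keep at most k (frequency, word) pairs,
-- evicting the smallest pair whenever the set grows past k.
topK :: Int -> [(String, Int)] -> [(String, Int)]
topK k = map swap . Set.toDescList . foldl' step Set.empty
  where
    step acc (w, f) =
      let acc' = Set.insert (f, w) acc          -- O(log k) insertion
      in if Set.size acc' > k                   -- Set.size is O(1)
           then Set.deleteMin acc'              -- O(log k) removal of the minimum
           else acc'
    swap (a, b) = (b, a)
```

The result is ordered from most to least frequent, so `topK 2 [("a",1),("b",5),("c",3),("d",4)]` yields the pairs for "b" and then "d".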