Question

此问题类似于之前在Stack Overflow上提出的Remove html character entities in a string问题。 然而，接受的答案并未解决命名HTML实体的问题，例如字符ä的{{1}}; 因此无法取消所有HTML。

我有一些遗留HTML，它使用命名HTML实体作为非ASCII字符。也就是说，ä代替ö，ö代替ä等等。维基百科上提供A full list of all named HTML entities。

我想以快速有效的方式将这些HTML实体转换为它们的角色等价物。

我有使用正则表达式在Python 3中执行此操作的代码：

ä

然而，正则表达式在Haskell中似乎不太流行，快速或易于使用。

import re import html.entities s = re.sub(r'&(\w+?);', lambda m: chr(html.entities.name2codepoint[m.group(1)]), s)（tagsoup）有一个有用的表和函数，用于映射命名实体tpo代码点。使用这个和regex-tdfa包，我在Haskell中设计了一个极慢的等价物：

Text.HTML.TagSoup.Entity

{-# LANGUAGE OverloadedStrings #-} import Data.ByteString.Lazy.Char8 as L import Data.ByteString.Lazy.UTF8 as UTF8 import Text.HTML.TagSoup.Entity (lookupEntity) import Text.Regex.TDFA ((=~~)) unescapeEntites :: L.ByteString -> L.ByteString unescapeEntites = regexReplaceBy "&#?[[:alnum:]]+;" $ lookupMatch where lookupMatch m = case lookupEntity (L.unpack . L.tail . L.init $ m) of Nothing -> m Just x -> UTF8.fromString [x] -- regex replace taken from http://mutelight.org/articles/generating-a-permalink-slug-in-haskell regexReplaceBy :: L.ByteString -> (L.ByteString -> L.ByteString) -> L.ByteString -> L.ByteString regexReplaceBy regex f text = go text [] where go str res = if L.null str then L.concat . reverse $ res else case (str =~~ regex) :: Maybe (L.ByteString, L.ByteString, L.ByteString) of Nothing -> L.concat . reverse $ (str : res) Just (bef, match , aft) -> go aft (f match : bef : res)函数的运行速度比上面的Python版慢几个数量级。 Python代码可以在7秒内转换大约130 MB，而我的Haskell版本可以运行几分钟。

我正在寻找更好的解决方案，主要是在速度方面。但是如果可能的话，我也想避免使用正则表达式（无论如何，速度和避免正则表达式似乎在Haskell中齐头并进）。

Answer 1

这是我的版本。它使用String（而不是ByteString）。

import Text.HTML.TagSoup.Entity (lookupEntity)

unescapeEntities :: String -> String
unescapeEntities [] = []
unescapeEntities ('&':xs) = 
  let (b, a) = break (== ';') xs in
  case (lookupEntity b, a) of
    (Just c, ';':as) ->  c  : unescapeEntities as    
    _                -> '&' : unescapeEntities xs
unescapeEntities (x:xs) = x : unescapeEntities xs

我猜它会更快，因为它不使用昂贵的正则表达式操作。我没有测试过。如果你需要它更快，你可以为ByteString或Data.Text调整它。

Answer 2

你可以安装web-encodings包，获取decodeHtml函数的Sourcecode并添加你需要的字符（适合我）。这就是你所需要的：

import Data.Maybe
import qualified Web.Encodings.StringLike as SL
import Web.Encodings.StringLike (StringLike)
import Data.Char (ord)

-- | Decode HTML-encoded content into plain content.
--
-- Note: this does not support all HTML entities available. It also swallows
-- all failures.
decodeHtml :: StringLike s => s -> s
decodeHtml s = case SL.uncons s of
    Nothing -> SL.empty
    Just ('&', xs) -> fromMaybe ('&' `SL.cons` decodeHtml xs) $ do
        (before, after) <- SL.breakCharMaybe ';' xs
        c <- case SL.unpack before of -- this are small enough that unpack is ok
            "lt" -> return '<'
            "gt" -> return '>'
            "amp" -> return '&'
            "quot" -> return '"'
            '#' : 'x' : hex -> readHexChar hex
            '#' : 'X' : hex -> readHexChar hex
            '#' : dec -> readDecChar dec
            _ -> Nothing -- just to shut up a warning
        return $ c `SL.cons` decodeHtml after
    Just (x, xs) -> x `SL.cons` decodeHtml xs

readHexChar :: String -> Maybe Char
readHexChar s = helper 0 s where
    helper i "" = return $ toEnum i
    helper i (c:cs) = do
        c' <- hexVal c
        helper (i * 16 + c') cs

hexVal :: Char -> Maybe Int
hexVal c
    | '0' <= c && c <= '9' = Just $ ord c - ord '0'
    | 'A' <= c && c <= 'F' = Just $ ord c - ord 'A' + 10
    | 'a' <= c && c <= 'f' = Just $ ord c - ord 'a' + 10
    | otherwise = Nothing

readDecChar :: String -> Maybe Char
readDecChar s = do
    case reads s of
        (i, _):_ -> Just $ toEnum (i :: Int)
        _ -> Nothing

我没有测试性能。但是如果你可以在没有正则表达式的情况下完成它，它可能是一个很好的样本。

取消HTML实体（包括命名的实体）

2 个答案: