JavaScript: filter words out of a string using a dictionary?

Asked: 2012-02-22 21:59:40

Tags: javascript

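I need to filter a few hundred "stop" words out of a string. Because there are so many stop words, I don't think doing something like this is a good idea: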

sentence.replace(/\b(?:the|it is|we all|an?|by|to|you|[mh]e|she|they|we...)\b/ig, '');

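How can I create something like a hash map to store the stop words? In this map the key would be the stop word itself, and the value would not matter. Filtering would then check whether each word exists in the stop word map. What data structure should I use to build such a map?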
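To make the idea concrete, here is a minimal sketch of the kind of lookup I have in mind, assuming a runtime that provides `Set` (a plain object used as a map would work the same way). The short word list and the name `removeStopWords` are placeholders for illustration only.

```javascript
// Hypothetical, shortened stop word list; the real one would hold a few hundred entries.
var stopWords = new Set(["the", "it", "is", "we", "all", "a", "an", "by", "to", "you", "me", "he", "she", "they"]);

// Split on whitespace, drop every token that is found in the set, and rejoin.
function removeStopWords(sentence) {
    return sentence
        .split(/\s+/)
        .filter(function (word) {
            // Skip empty tokens and compare case-insensitively.
            return word !== "" && !stopWords.has(word.toLowerCase());
        })
        .join(" ");
}
```

A sketch like this only matches whole whitespace-separated tokens, so a word with punctuation attached (such as "the,") or a multi-word phrase (such as "it is") would need extra handling.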

1 Answer:

Answer 0 (score: 1)

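Nothing beats regular expressions for this kind of job. However, they have two problems: they are hard to maintain (as you pointed out in your post), and very large ones can perform poorly. I don't know how many alternatives a single regex can handle, but I'd guess 20-30 will be fine in any case.

So you need some code that builds the regexes dynamically from some data structure, which can be an array or just a string. Personally, I prefer a string, since it is the easiest to maintain.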

// taken from http://www.ranks.nl/resources/stopwords.html
var stops = ""
+"a about above after again against all am an and any are aren't as  "
+"at be because been before being below between both but by can't    "
+"cannot could couldn't did didn't do does doesn't doing don't down  "
+"during each few for from further had hadn't has hasn't have        "
+"haven't having he he'd he'll he's her here here's hers herself     "
+"him himself his how how's i i'd i'll i'm i've if in into is isn't  "
+"it it's its itself let's me more most mustn't my myself no nor     "
+"not of off on once only or other ought our ours ourselves out      "
+"over own same shan't she she'd she'll she's should shouldn't so    "
+"some such than that that's the their theirs them themselves then   "
+"there there's these they they'd they'll they're they've this       "
+"those through to too under until up very was wasn't we we'd we'll  "
+"we're we've were weren't what what's when when's where where's     "
+"which while who who's whom why why's with won't would wouldn't     "
+"you you'd you'll you're you've your yours yourself yourselves      "

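// how many alternatives to put into each regex
var reSize = 20;

// build regexps: split the list into words, sort longest first so that
// e.g. "he's" is removed before "he", then join each chunk into one regex
var regexes = [];
stops = stops.match(/\S+/g).sort(function(a, b) { return b.length - a.length });
for (var n = 0; n < stops.length; n += reSize)
    regexes.push(new RegExp("\\b(" + stops.slice(n, n + reSize).join("|") + ")\\b", "gi"));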

Once you have this, the rest is obvious:

regexes.forEach(function(r) {
    text = text.replace(r, '')
})

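You will need to experiment with the reSize value to find the best balance between the length of each regex and the total number of regexes. If performance is critical, you can also run the generation part once and then cache the result (i.e. the generated regexes) somewhere.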
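One possible way to do that caching is to wrap the generation step in a function that builds the regexes once and returns a closure that reuses them. This is only a sketch: `makeStopFilter`, `filterStops`, and the short word list are hypothetical names and data, and it assumes the list contains only letters and apostrophes, so nothing needs regex escaping.

```javascript
// Builds the chunked regexes once and returns a function that reuses them.
function makeStopFilter(stopList, chunkSize) {
    // Longest words first, so "he's" is removed before "he" can match inside it.
    var words = stopList.match(/\S+/g).sort(function (a, b) {
        return b.length - a.length;
    });

    // The regexes are created here, once, and captured by the closure below.
    var cached = [];
    for (var n = 0; n < words.length; n += chunkSize) {
        cached.push(new RegExp("\\b(?:" + words.slice(n, n + chunkSize).join("|") + ")\\b", "gi"));
    }

    // Apply every cached regex in turn to the given text.
    return function (text) {
        return cached.reduce(function (result, re) {
            return result.replace(re, "");
        }, text);
    };
}

// Hypothetical short list and chunk size, just to show the call.
var filterStops = makeStopFilter("a an the he he's is it", 3);
```

Calling `makeStopFilter` with different chunk sizes and timing `filterStops` on representative text is one way to compare reSize values. Removed words leave their surrounding spaces behind, so you may want to collapse runs of whitespace afterwards.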