How to remove a list of words from a string

Time: 2010-03-31 14:17:46

Tags: string clojure stop-words

What I want to do (in Clojure):

For example, I have a vector of words that need to be removed:

(def forbidden-words [":)" "the" "." "," " " ...many more...])

...and a vector of strings:

(def strings ["the movie list" "this.is.a.string" "haha :)" ...many more...])

So every forbidden word should be removed from each string, and in this case the result would be: ["movie list" "thisisastring" "haha"].

How can I do this?

3 answers:

Answer 0: (score: 7)

(def forbidden-words [":)" "the" "." ","])
(def strings ["the movie list" "this.is.a.string" "haha :)"])
;; Quote each word so it matches literally, then join them into one alternation.
(let [pattern (->> forbidden-words
                   (map #(java.util.regex.Pattern/quote %))
                   (interpose \|)
                   (apply str))]
  (map #(.replaceAll % pattern "") strings))
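The same idea can be written with clojure.string, which makes the generated pattern easier to inspect. This is only a minimal sketch: the name remove-forbidden is hypothetical, and it assumes Clojure 1.2 or later so that clojure.string is available.

```clojure
(require '[clojure.string :as str])
(import 'java.util.regex.Pattern)

(def forbidden-words [":)" "the" "." ","])
(def strings ["the movie list" "this.is.a.string" "haha :)"])

;; Quote each word so regex metacharacters such as "." and ")" match literally,
;; then join the quoted words with "|" into a single alternation.
(def pattern
  (re-pattern (str/join "|" (map #(Pattern/quote %) forbidden-words))))

;; Hypothetical helper: delete every match of the alternation from s.
(defn remove-forbidden [s]
  (str/replace s pattern ""))

(mapv remove-forbidden strings)
;; => [" movie list" "thisisastring" "haha "]
```

Because " " is not in this word list, the spaces next to a removed word stay in the result. The pattern also matches inside longer words, so "the" would be removed from a word such as "other" as well.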

Answer 1: (score: 1)

(use 'clojure.contrib.str-utils)
(import 'java.util.regex.Pattern)
(def forbidden-words [":)" "the" "." "," " "])
(def strings ["the movie list" "this.is.a.string" "haha :)"])
(def regexes (map #(Pattern/compile % Pattern/LITERAL) forbidden-words))
(for [s strings] (reduce #(re-gsub %2 "" %1) s regexes))
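The monolithic clojure.contrib library is not available for Clojure 1.3 and later, so the snippet above will not load there. A rough equivalent, assuming clojure.string (Clojure 1.2+), might look like the following. The helper name strip-words is hypothetical; when clojure.string/replace is given a string as the match, it replaces it literally, so no Pattern/LITERAL step is needed.

```clojure
(require '[clojure.string :as str])

(def forbidden-words [":)" "the" "." "," " "])
(def strings ["the movie list" "this.is.a.string" "haha :)"])

;; Hypothetical helper: remove each word in turn, left to right,
;; threading the partially cleaned string through reduce.
(defn strip-words [words s]
  (reduce (fn [acc word] (str/replace acc word "")) s words))

(mapv #(strip-words forbidden-words %) strings)
;; => ["movielist" "thisisastring" "haha"]
```

Since " " is among the forbidden words here, all spaces disappear, which is why the first result is "movielist".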

Answer 2: (score: 0)

Using function composition and the -> macro, this can be quite simple:

(for [s strings] 
  (-> s ((apply comp 
           (for [s forbidden-words] #(.replace %1 s ""))))))

If you want to be more "idiomatic", you can use replace-str from clojure.contrib.string instead of #(.replace %1 s "").

There is no need to use regular expressions here.
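For reference, here is a self-contained sketch of the composition approach with the definitions included. The name remove-all is hypothetical, and the word list is the one from answer 0 (without " ").

```clojure
(def forbidden-words [":)" "the" "." ","])
(def strings ["the movie list" "this.is.a.string" "haha :)"])

;; Build one String -> String function per forbidden word and compose them.
;; comp applies its functions right to left, so the last word in the
;; vector is removed first.
(def remove-all
  (apply comp
         (for [word forbidden-words]
           (fn [s] (.replace s word "")))))

(mapv remove-all strings)
;; => [" movie list" "thisisastring" "haha "]
```

The right-to-left order only matters when one forbidden word contains another. With the hypothetical words ["a" "ab"] and the input "cab", the composed function removes "ab" first and returns "c", whereas a left-to-right pass would remove "a" first and leave "cb".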