Remove punctuation from text in Scala - Spark

Asked: 2015-05-06 10:27:16

Tags: regex scala apache-spark punctuation

Here is a sample of my data:

case time (especially it's purse), read manual care, follow care instructions make stays waterproof -- example, inspect rubber seals doors (especially battery/memory card door open time) 
xm "life support" picture . flip part bit flimsy guessing won't long . sound great altec speaker dock it! chance back base (xm3020) . traveling bag connect laptop extra speaker . amount paid ($25).

I want to remove all punctuation except the dot (.), and also remove words with length <= 2. For example, my expected output is:

case time especially its purse read manual care follow care instructions . make stays waterproof example inspect rubber seals doors especially batterymemory card door open time
life support picture . flip part bit flimsy guessing wont long . sound great altec speaker dock chance back base xm3020 . traveling bag connect laptop extra speaker . amount paid $25 .

This should be implemented in Scala. I tried:

replaceAll( """\\W\s""", "")
replaceAll(""""[^a-zA-Z\.]""", "")

But neither works well. Can anyone help me?

4 answers:

Answer 0 (score: 22)

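Looking at the regex javadoc (http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html), we see that the character class for punctuation is \p{Punct}, and that we can remove characters from a character class using something like [a-z&&[^def]]. From there, it is easy to define a regex that removes all punctuation except the dot: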

s.replaceAll("""[\p{Punct}&&[^.]]""", "")
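As a quick check of the intersection syntax, here is a minimal sketch. The object and method names (`KeepDots`, `stripPunctuation`) are illustrative only and not part of the answer; the input is a fragment of the question's sample data.

```scala
object KeepDots {
  // Intersection: every character in \p{Punct} that is not a dot.
  private val PunctExceptDot = """[\p{Punct}&&[^.]]"""

  def stripPunctuation(text: String): String =
    text.replaceAll(PunctExceptDot, "")

  def main(args: Array[String]): Unit =
    // prints: dock it chance back base xm3020 . amount paid 25.
    println(stripPunctuation("dock it! chance back base (xm3020) . amount paid ($25)."))
}
```

Note that the dots survive while `$` is removed along with the other punctuation.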

Removing words of length <= 2 can be done like this:

s.replaceAll("""\b\p{IsLetter}{1,2}\b""", "")

Combining the two gives:

s.replaceAll("""([\p{Punct}&&[^.]]|\b\p{IsLetter}{1,2}\b)\s*""", "")

Note that I added \s* to remove the extra whitespace.
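One caveat may be worth checking before relying on the single combined pattern: an apostrophe creates word boundaries, so in "won't" the trailing "t" looks like a one-letter word and is removed together with the following space. The sketch below compares the single pass with a two-pass variant that strips punctuation first; `PassOrder`, `singlePass` and `twoPass` are hypothetical names used only for this illustration.

```scala
object PassOrder {
  private val Punct = """[\p{Punct}&&[^.]]"""
  private val ShortWord = """\b\p{IsLetter}{1,2}\b"""

  // Single pass: punctuation and short words handled by one alternation.
  def singlePass(text: String): String =
    text.replaceAll("(" + Punct + "|" + ShortWord + ")" + """\s*""", "")

  // Two passes: strip punctuation first, then short words and the whitespace after them.
  def twoPass(text: String): String =
    text.replaceAll(Punct, "").replaceAll(ShortWord + """\s*""", "")

  def main(args: Array[String]): Unit = {
    println(singlePass("guessing won't long")) // prints: guessing wonlong
    println(twoPass("guessing won't long"))    // prints: guessing wont long
  }
}
```

The two-pass result matches the "wont" and "its" spellings in the question's expected output.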

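Also, note that the regex above removes '$' entirely, because '$' belongs to \p{Punct} (in Java's Pattern this is by default the POSIX punctuation class, which also contains symbols such as $ and +). If this is undesirable (as your expected output seems to indicate), be more precise about what you consider punctuation. For example, you might treat only the following characters as punctuation: ?!:() (the dot is left out because you want to keep it):

s.replaceAll("""([?!:()]|\b\p{IsLetter}{1,2}\b)\s*""", "")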

Alternatively, you can add '$' to your list of "not-punctuation" characters, alongside the dot:

s.replaceAll("""([\p{Punct}&&[^.$]]|\b\p{IsLetter}{1,2}\b)\s*""", "")
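To see the `$`-preserving character class at work on the sample data, here is a small sketch that applies it line by line and then collapses leftover whitespace. The names `ReviewCleaner` and `clean` are assumptions made for this example, and splitting the work into three `replaceAll` calls is a design choice of the sketch, not something the answer prescribes.

```scala
object ReviewCleaner {
  // \p{Punct} minus '.' and '$', so both characters are kept.
  private val Punct = """[\p{Punct}&&[^.$]]"""
  // Words made of one or two letters.
  private val ShortWord = """\b\p{IsLetter}{1,2}\b"""

  def clean(line: String): String =
    line
      .replaceAll(Punct, "")      // drop punctuation
      .replaceAll(ShortWord, "")  // drop short words
      .replaceAll("""\s+""", " ") // collapse runs of whitespace
      .trim

  def main(args: Array[String]): Unit = {
    val lines = Seq(
      "sound great altec speaker dock it! chance back base (xm3020) .",
      "traveling bag connect laptop extra speaker . amount paid ($25)."
    )
    lines.map(clean).foreach(println)
  }
}
```

Since `clean` is a plain `String => String` function, it could presumably be passed to `map` on a Spark RDD of strings in the same way it is passed to `Seq.map` here.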

Answer 1 (score: 1)

How about this:

replaceAll("(\\(|\\)|'|/)", "")

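Then you just add more punctuation characters to remove, separated by |, making sure to escape characters like ( and ) with a double backslash.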
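When every alternative is a single character, a character class is a possible shorthand for the alternation, and most characters (including the parentheses) need no escaping inside it. A minimal sketch, with `CharClassStrip` and `strip` as made-up names:

```scala
object CharClassStrip {
  // Inside [...] the characters ( ) ' / are literal, so no backslashes are needed.
  def strip(text: String): String = text.replaceAll("[()'/]", "")

  def main(args: Array[String]): Unit =
    // prints: especially batterymemory card door open time
    println(strip("(especially battery/memory card door open time)"))
}
```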

Answer 2 (score: 0)

You can try filtering the string, like this:

val example = "Hey there! It's me, myself and I."
example.filterNot(x => x == ',' || x == '!' || x == 'm')
 res3: String = Hey there It's e yself and I.
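The same filtering idea can be turned around: instead of listing the characters to drop, state which ones to keep. The sketch below keeps letters, digits, whitespace and the dot; the names `KeepFilter`, `keep` and `strip` are illustrative, and which characters count as "kept" is an assumption based on the question.

```scala
object KeepFilter {
  // Keep letters, digits, whitespace and the dot; every other character is dropped.
  def keep(c: Char): Boolean = c.isLetterOrDigit || c.isWhitespace || c == '.'

  def strip(text: String): String = text.filter(keep)

  def main(args: Array[String]): Unit =
    // prints: Hey there Its me myself and I.
    println(strip("Hey there! It's me, myself and I."))
}
```

This handles the punctuation part only; removing short words still needs a separate step.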

Answer 3 (score: 0)

Try this, it should work:

val str = """
  |case time (especially it's purse), read manual care, follow care instructions make stays waterproof -- example, inspect rubber seals doors (especially battery/memory card door open time) 
  |xm "life support" picture . flip part bit flimsy guessing won't long . sound great altec speaker dock it! chance back base (xm3020) . traveling bag connect laptop extra speaker . amount paid ($25).
  """.stripMargin('|')

println(str)
val pat = """[^\w\s\.\$]"""
val pat2 = """\s\w{2}\s"""
println(str.replaceAll(pat, "").replaceAll(pat2, ""))

Output:

case time especially its purse read manual care follow care instructions make stays waterproof  example inspect rubber seals doors especially batterymemory card door open time 
life support picture . flip part bit flimsy guessing wont long . sound great altec speaker dockchance back base xm3020 . traveling bag connect laptop extra speaker . amount paid $25.
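The "dockchance" in the output comes from `\s\w{2}\s` consuming the whitespace on both sides of the removed word; the pattern also only matches words of exactly two characters. One possible adjustment, sketched here under the assumption that short words are always preceded by whitespace, is to check the trailing whitespace with a lookahead so it is not consumed. `ShortWordFix`, `original` and `fixed` are hypothetical names.

```scala
object ShortWordFix {
  // Pattern from the answer: removes the word and the whitespace on both sides.
  def original(text: String): String =
    text.replaceAll("""\s\w{2}\s""", "")

  // Variant: (?=\s) requires trailing whitespace but leaves it in place,
  // and {1,2} also covers one-character words.
  def fixed(text: String): String =
    text.replaceAll("""\s\w{1,2}(?=\s)""", "")

  def main(args: Array[String]): Unit = {
    println(original("speaker dock it chance")) // prints: speaker dockchance
    println(fixed("speaker dock it chance"))    // prints: speaker dock chance
  }
}
```

A short word at the very start of the text has no preceding whitespace and is left alone by this variant.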