strsplit on all spaces and punctuation except the apostrophe

Time: 2014-03-06 20:31:19

Tags: regex r

I have already asked related questions HERE and HERE. I tried to generalize those answers, but failed.

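Basically, I have a string that I want to split into words, numbers, and any kind of punctuation, except that I want to keep apostrophes attached to their words. Here is what I tried, and I am very close (I think):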

x <- "Raptors don't like robots! I'd pay $500.00 to rid them."

strsplit(x, "(\\s+)|(?=[[:punct:]])", perl = TRUE)

## [[1]]
##  [1] "Raptors" "don"     "'"       "t"       "like"    "robots"  "!"
##  [8] ""        "I"       "'"       "d"       "pay"     "$"       "500"
## [15] "."       "00"      "to"      "rid"     "them"    "."

This is what I am after:

## [[1]]
##  [1] "Raptors" "don't"       "like"    "robots"  "!"       ""        "I'd"      
##  [8] "pay"     "$"       "500"   "."   "00"  "to"      "rid"     "them"    "."  

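Although I would prefer a base R solution, I would also like to see other approaches (I am sure someone has a solution using a string-processing package), since that makes the question more generally useful to others.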

Note: R has its own particular regex system. You need to be familiar with R to answer this question.

1 answer:

Answer 0: (score: 5)

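You can use a negative lookahead, (?!'), so that the zero-width split before a punctuation character is skipped when that character is an apostrophe: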

strsplit(x, "(\\s+)|(?!')(?=[[:punct:]])", perl = TRUE)
#  [1] "Raptors" "don't"   "like"    "robots"  "!"       ""        "I'd"     "pay"     "$"       "500"     "."       "00"      "to"      "rid"     "them"    "."
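
The empty string in that result appears wherever whitespace directly follows a punctuation token. As a minimal sketch, assuming you want those empty strings removed, the pattern can be wrapped in a small helper. The name split_keep_apostrophes and its drop_empty argument are hypothetical and not part of the original answer:

```r
# Hypothetical helper (not from the original answer): split on whitespace and
# before every punctuation character except the apostrophe.
split_keep_apostrophes <- function(x, drop_empty = TRUE) {
  # (?!') rejects positions where the next character is an apostrophe;
  # (?=[[:punct:]]) then requires the next character to be punctuation.
  pattern <- "(\\s+)|(?!')(?=[[:punct:]])"
  tokens <- strsplit(x, pattern, perl = TRUE)
  if (drop_empty) {
    # strsplit() emits "" when whitespace directly follows a punctuation token
    tokens <- lapply(tokens, function(tok) tok[nzchar(tok)])
  }
  tokens
}

x <- "Raptors don't like robots! I'd pay $500.00 to rid them."
tokens <- split_keep_apostrophes(x)[[1]]
print(tokens)
```

Calling the helper with drop_empty = FALSE should give the same tokens as the strsplit call above, including the empty string after "!".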