Question

I have the following vector in R:

x <- c("id: capture this , something: the useless chunk , otherstuff: useless , more stuff")

I want to get the string "capture this". I used this regular expression:

library(rex)
r <- rex(
  start,
  anything,
  "id: ",
  capture(anything),
  " , ", 
  anything
)
r
# > r
# > ^.*id: (.*) , .*
re_matches(x,r)

But what I get is:

> re_matches(x,r)
                                                                  1
1 capture this , something: the useless chunk , otherstuff: useless

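It captures what I want, but it also captures the rest of the string. I only want the "capture this" field. The same happens even if I use the gsub function: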

gsub("^.*id: (.*) , .*", "\\1", x)

With the same regular expression I get the same result.

And my Ubuntu version:

No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 14.04.2 LTS
Release:        14.04
Codename:       trusty

Answer 1

Are you working with YAML? If so, you may find the yaml package useful:

x <- c("id: capture this , something: the useless chunk , otherstuff: useless , more: stuff")

yaml::yaml.load(gsub(' , ', '\n', x))$id
# [1] "capture this"

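Note that I had to add a colon (changing "more stuff" to "more: stuff") for the above to work, but the advantage of this solution is that you can extract each part by its key.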
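For comparison, the same key-based lookup can be sketched in base R without the yaml package. This is only an illustration, not how yaml parses: it assumes every piece is separated by " , " and has the form "key: value", and the names `pieces`, `keys`, `values` and `fields` are made up for this example.

```r
x <- "id: capture this , something: the useless chunk , otherstuff: useless , more: stuff"

# Split the record into "key: value" pieces on the literal " , " separator
pieces <- strsplit(x, " , ", fixed = TRUE)[[1]]

# Text before the first ": " is the key, text after it is the value
keys   <- sub(": .*$", "", pieces)
values <- sub("^[^:]*: ", "", pieces)

# Named list, so each field can be looked up by its key
fields <- setNames(as.list(values), keys)

print(fields$id)
print(fields$something)
```

Any other key, such as `fields$otherstuff`, can be looked up the same way.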

Alternatively, using your original example string and no packages:

x <- c("id: capture this , something: the useless chunk , otherstuff: useless , more stuff")

gsub('id: (.*?) ,.*', '\\1', x)
# [1] "capture this"

Answer 2

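You don't necessarily need a package to get the substring you're after. With gsub, the mistake is in your regular expression: * is a greedy operator, meaning it matches as much as it can while still allowing the rest of the regular expression to match.

Use *? for a non-greedy match, meaning "zero or more, preferably as few as possible".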

gsub("^.*id: (.*?) , .*", "\\1", x)
                ^
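To make the difference concrete, here is a small side-by-side sketch of the greedy and non-greedy patterns on the string from the question. Setting `perl = TRUE` is my own choice here, so that the lazy quantifier follows PCRE semantics; the names `greedy` and `lazy` are illustrative only.

```r
x <- "id: capture this , something: the useless chunk , otherstuff: useless , more stuff"

# Greedy: (.*) takes as much as it can, so the capture runs up to the LAST " , "
greedy <- sub("^.*id: (.*) , .*", "\\1", x, perl = TRUE)

# Non-greedy: (.*?) takes as little as it can, so the capture stops at the FIRST " , "
lazy <- sub("^.*id: (.*?) , .*", "\\1", x, perl = TRUE)

print(greedy)
print(lazy)
```

The greedy version reproduces the unwanted result from the question, while the non-greedy one returns only "capture this".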

If the string starts with "id", you can drop the anchor and the initial .* token.

sub('id: (.*?) ,.*', '\\1', x)
# [1] "capture this"

Note: I used sub here because there is only one occurrence to replace.

Answer 3

    # using the rex package
    library(rex)
    x <- c("id: capture this , something: the useless chunk , otherstuff: useless , more stuff")
    r <- rex(start,"id: ",capture(non_puncts))
    re_matches(x,r)
    #1 capture this
    # note: non_puncts also matches the space before the comma, so trimws() may be needed

Answer 4

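Here is an approach that generalizes easily, using the qdapRegex package that I maintain, which can grab the "stuff" between a left and a right boundary: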

x <- c("id: capture this , something: the useless chunk , otherstuff: useless , more stuff")

library(qdapRegex)
rm_between(x, "id: ", " ,", extract=TRUE)

## [[1]]
## [1] "capture this"
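For readers without qdapRegex installed, the idea of grabbing text between two boundaries can be sketched in base R. This is not how rm_between is implemented: `extract_between` is a hypothetical helper written for this example, and it assumes the boundaries are literal strings and that only the first match per element is wanted.

```r
# Hypothetical helper: text between the first `left` and the next `right`
extract_between <- function(x, left, right) {
  # \Q...\E makes PCRE treat the boundaries as literal text
  pattern <- paste0("\\Q", left, "\\E(.*?)\\Q", right, "\\E")
  hits <- regmatches(x, regexec(pattern, x, perl = TRUE))
  # Element 2 of each hit is the capture group; NA when nothing matched
  vapply(hits, function(h) if (length(h) >= 2) h[2] else NA_character_, character(1))
}

x <- "id: capture this , something: the useless chunk , otherstuff: useless , more stuff"
print(extract_between(x, "id: ", " ,"))
```

Changing the boundaries, for example to "something: " and " ,", pulls out a different field with the same helper.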

Regular expression in R to capture a specific field
