Question

I have a data frame in which one column holds the requests made by my users. Some examples:

GET /enviro/html/tris/tris_overview.html
GET /./enviro/gif/emcilogo.gif
GET /docs/exposure/meta_exp.txt.html
GET /hrmd/
GET /icons/circle_logo_small.gif

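I want to select only the last part of each string, from the last "." onward, so that I get the page type of the request. The output for these lines should therefore be: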

.html
.gif
.html

.gif

I tried to do this with sub, but I can only select everything after the first ".". For example:

tring <- c("GET /enviro/html/tris/tris_overview.html", "GET /./enviro/gif/emcilogo.gif", "GET /docs/exposure/meta_exp.txt.html", "GET /hrmd/", "GET /icons/circle_logo_small.gif")


sub("^[^.]*", "", sapply(strsplit(tring, "\\s+"), `[`, 2))

This returns:

".html"                     
"./enviro/gif/emcilogo.gif" 
".txt.html"                 
""                          
".gif"

I wrote the following gsub code, which works for strings containing two dots:

gsub(pattern = ".*\\.", replacement = "", "GET /./enviro/gif/finds.gif", "\\s+")

This returns:

"gif"

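However, I can't seem to come up with a gsub / sub call that works for every possible input. It should read the string from right to left, stop at the first "." it sees, and return everything found after that ".".

I'm new to R and can't work out a solution. Any help would be greatly appreciated!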

Answer 1

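You can't change the direction in which a string is parsed with R regex. Instead, you can either match everything up to the last . and remove it, or match a . that has no other . characters to its right, up to the end of the string.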

string <- c('GET /enviro/html/tris/tris_overview.html','GET /./enviro/gif/emcilogo.gif','GET /docs/exposure/meta_exp.txt.html','GET /hrmd/','GET /icons/circle_logo_small.gif')
res <- regmatches(string, regexec("\\.[^.]*$", string))
res[lengths(res)==0] <- ""
unlist(res)

or

sub("^(.*(?=\\.)|.*)", "", string, perl=TRUE)

See the R online demo. Both return:

[1] ".html" ".gif"  ".html" ""      ".gif"

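Here, \.[^.]*$ matches a . and then any 0+ characters other than . up to the end of the string. The ^(.*(?=\\.)|.*) pattern used in the sub code matches the start of the string and then either any 0+ characters, as many as possible, up to the last . without consuming that dot, or, if there is no dot, just any 0+ characters, as many as possible. The match is replaced with an empty string.

See the Regex 1 and Regex 2 demos.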
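To make the role of the `res[lengths(res)==0] <- ""` step more concrete, here is a minimal sketch in base R using two of the sample requests, one with an extension and one without. The variable names `n_matches`, `ext_regmatches` and `ext_sub` are illustrative only and do not come from the answer.

```r
# Two of the sample requests: one with an extension, one without
string <- c("GET /docs/exposure/meta_exp.txt.html", "GET /hrmd/")

# regexec() + regmatches() return a list with one element per input string;
# a string with no match yields character(0)
res <- regmatches(string, regexec("\\.[^.]*$", string))
n_matches <- lengths(res)   # number of matches found per input string

# Fill the empty slots so that unlist() keeps one value per input string
res[n_matches == 0] <- ""
ext_regmatches <- unlist(res)

# The sub() variant removes everything before the last ".",
# or the whole string when there is no "." at all
ext_sub <- sub("^(.*(?=\\.)|.*)", "", string, perl = TRUE)

print(ext_regmatches)
print(ext_sub)
```

Without the fill-in step, `unlist()` would silently drop the request that has no extension, and the result would no longer line up with the rows of the data frame.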

Answer 2

Here is a solution without regex:

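a <- tring  # the request vector defined in the question

sapply(
  seq_along(a),
  function(i) {
    # fixed = TRUE treats "." as a literal character, so no regex is involved
    if (grepl(".", a[i], fixed = TRUE)) tail(strsplit(a[i], ".", fixed = TRUE)[[1]], 1) else ""
  }
)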

# [1] "html" "gif"  "html" ""     "gif"
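The output above omits the leading dot that the question asked for. A possible variant that puts it back is sketched below; `last_part` and `requests` are hypothetical names, not part of the original answer, and the sketch assumes a string ending in a bare "." should yield an empty result.

```r
# Hypothetical helper: return the last "."-delimited piece of each string,
# with the dot re-attached, or "" when the string contains no usable "."
last_part <- function(x) {
  # Literal split on "."; fixed = TRUE means no regex is involved
  parts <- strsplit(x, ".", fixed = TRUE)
  vapply(seq_along(x), function(i) {
    p <- parts[[i]]
    # A single piece (or none) means there was nothing after a "."
    if (length(p) < 2) "" else paste0(".", p[length(p)])
  }, character(1))
}

requests <- c(
  "GET /enviro/html/tris/tris_overview.html",
  "GET /./enviro/gif/emcilogo.gif",
  "GET /docs/exposure/meta_exp.txt.html",
  "GET /hrmd/",
  "GET /icons/circle_logo_small.gif"
)

print(last_part(requests))
# [1] ".html" ".gif"  ".html" ""      ".gif"
```

Using `vapply` with `character(1)` instead of `sapply` guarantees a character vector of the same length as the input, which is convenient when the result is assigned back to a data frame column.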
