R: automating multiple argument replacement

Date: 2015-12-22 23:13:49

Tags: regex r loops grep automation

Question:

Consider the data frame df:

df <- structure(list(id = 1:4, var1 = c("blissard", "Blizzard", "storm of snow", 
"DUST DEVIL/BLIZZARD")), .Names = c("id", "var1"), class = "data.frame", row.names = c(NA, 
-4L))

> df

id  var1   
1   "blissard"
2   "Blizzard"
3   "storm of snow"
4   "DUST DEVIL/BLIZZARD"

> class(df$var1)
[1] "character"

I want to make it neat and tidy, so I try to recode var1, which has four distinct entries, into a friendlier and easier-to-analyze var1_recoded, like so:

df$var1_recoded[grepl("[Bb][Ll][Ii]", df$var1)] <- "blizzard"
df$var1_recoded[grepl("[Ss][Tt][Oo]", df$var1)] <- "storm"

id  var1                  var1_recoded   
1   "blissard"            "blizzard"  
2   "Blizzard"            "blizzard"
3   "storm of snow"       "storm"
4   "DUST DEVIL/BLIZZARD" "blizzard"

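Question:

How can I create a function that automates the process described by the two statements above? Put differently: how do I generalize this to (say) 1000 replacements?

I would pass the function a vector (e.g. c("storm", "blizzard")) and then apply it as the procedure that matches and replaces the observations meeting the condition.

I found a valuable contribution here: Replace multiple arguments with gsub, but I could not translate that approach into R code programmatically. In particular, I could not build the condition that lets grep recognize the first three letters of the word to be matched.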

2 answers:

Answer 0 (score: 1)

Here is one possible approach:

Data

dat <- read.csv(text="id,  var1  
1,   blissard
2,   Blizzard
3,   storm of snow
4,   hurricane
5,   DUST DEVIL/BLIZZARD", header=T, stringsAsFactors = FALSE, strip.white=T)

x <- c("storm", "blizzard")

Solution

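if (!require("pacman")) install.packages("pacman")
pacman::p_load(stringdist, stringi)

dat[["var1_recoded"]] <- NA
tol <- .6  # largest Jaccard distance still accepted as a match

for (i in seq_len(nrow(dat))) {
    # split the entry into words and compare every word with every target
    potentials <- unlist(stri_extract_all_words(dat[["var1"]][i]))
    y <- stringdistmatrix(tolower(potentials), tolower(x), method = "jaccard")
    if (min(y) > tol) {
        # no word is close enough to any target: keep the original value
        dat[["var1_recoded"]][i] <- dat[["var1"]][i]
    } else {
        # column of the first minimum selects the target (also safe with ties)
        dat[["var1_recoded"]][i] <- x[which(y == min(y), arr.ind = TRUE)[1, "col"]]
    }
}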

##   id                var1 var1_recoded
## 1  1            blissard     blizzard
## 2  2            Blizzard     blizzard
## 3  3       storm of snow        storm
## 4  4           hurricane    hurricane
## 5  5 DUST DEVIL/BLIZZARD     blizzard

Edit: incorporated @mra68's data into the solution.
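To see why a tolerance of .6 keeps "hurricane" unchanged but maps "blissard" to "blizzard", the distances can be worked out by hand. The sketch below uses base R only and a hypothetical helper, jaccard_chars, which is not part of stringdist. It assumes that method = "jaccard" with the default q = 1 compares the sets of distinct characters in the two strings.

```r
# Hypothetical helper (not from stringdist): Jaccard distance between the
# sets of distinct characters of two strings, ignoring case.
# Assumption: this mirrors stringdist's method = "jaccard" with q = 1.
jaccard_chars <- function(a, b) {
  sa <- unique(strsplit(tolower(a), "")[[1]])
  sb <- unique(strsplit(tolower(b), "")[[1]])
  1 - length(intersect(sa, sb)) / length(union(sa, sb))
}

d_close  <- jaccard_chars("blissard", "blizzard")   # 6 shared of 8 distinct letters
d_far    <- jaccard_chars("hurricane", "blizzard")  # 3 shared of 12 distinct letters
d_far2   <- jaccard_chars("hurricane", "storm")     # 1 shared of 12 distinct letters

print(c(d_close, d_far, d_far2))
```

Under that assumption the first distance is 0.25, below the tolerance, while both distances for "hurricane" are above .6, so the loop leaves that row as it is.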

Answer 1 (score: 1)

f <- function( x )
{
  A <- c( "blizzard", "storm" )
  A3 <- sapply(A,substr,1,3)
  x <- as.character(x)
  n <- max( c( 0, which( sapply( A3, grepl, tolower(x) ) ) ) )

  if ( n==0 )
  {
    warning( "nothing found")
    return (x)
  }

  A[n]
}

df <- data.frame( id = 1:5,
                  var1 = c( "blissard", "Blizzard", "storm of snow", "DUST DEVIL/BLIZZARD", "hurricane" ) )

If neither "blizzard" nor "storm" matches, "var1" is left unchanged (with a warning). "hurricane" is an example.

> df$var1_recoded <- sapply(df$var1,f)
Warning message:
In FUN(X[[i]], ...) : nothing found
> df
  id                var1 var1_recoded
1  1            blissard     blizzard
2  2            Blizzard     blizzard
3  3       storm of snow        storm
4  4 DUST DEVIL/BLIZZARD     blizzard
5  5           hurricane    hurricane
> 
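The same first-three-letters idea can be written so that the targets are passed in as an argument, which is closer to what the question asks for when there are many replacements. This is only a sketch: recode_by_prefix and n_prefix are hypothetical names, not part of either answer, and the prefix is matched as a literal substring rather than as a regular expression.

```r
# Hypothetical helper: recode each element of x to the target whose first
# n_prefix letters occur somewhere in it (case-insensitive, literal match).
recode_by_prefix <- function(x, targets, n_prefix = 3) {
  x <- as.character(x)
  out <- x                                  # unmatched values stay unchanged
  for (target in targets) {
    prefix <- tolower(substr(target, 1, n_prefix))
    hit <- grepl(prefix, tolower(x), fixed = TRUE)
    out[hit] <- target                      # later targets win on overlap
  }
  out
}

var1 <- c("blissard", "Blizzard", "storm of snow",
          "DUST DEVIL/BLIZZARD", "hurricane")
recoded <- recode_by_prefix(var1, c("blizzard", "storm"))
print(recoded)
```

The loop runs once per target rather than once per row, so a long target vector is handled with vectorized grepl calls. A three-letter prefix is a loose condition (for example, "stop" would also be recoded to "storm"), so a longer n_prefix may be safer for large target lists.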