Question

I am trying to find the "pin" numbers of the cases with missing data for each variable of interest in my dataset.

Here is some fake data:

c <- data.frame(pin = c(1, 2, 3, 4), type = c(1, 1, 2, 2), v1 = c(1, NA, NA, 
NA), v2 = c(NA, NA, 1, 1))

I wrote a function, "m.pin", to do this:

m.pin <- function(x, data = "c", return = "$pin") {
  sect <- gsub("^.*\\[", "\\[", deparse(substitute(x)))
  vect <- eval(parse(text = paste(data, return, sect, sep = "")))
  return(vect[is.na(x)])
}

I use it like this:

m.pin(c$v1[c$type == 1])
[1] 2

I then wrote a function that applies "m.pin" over a list of variables and returns only the pins with missing data:

return.m.pin <- function(x, fun = m.pin) {
  val.list <- lapply(x, fun)
  condition <- lapply(val.list, function(x) length(x) > 0)
  val.list[unlist(condition)]
}

But when I apply it, I get this error:

l <- lst(c$v1[c$type == 1], c$v2[c$type == 2])
return.m.pin(l) 
Error in parse(text = paste(data, return, sect, sep = "")) :
  <text>:1:9: unexpected ']'
1: c$pin[i]]
            ^

How can I rewrite my functions to avoid this problem?

Many thanks!

Answer 1

See Gregor's comment for the most critical problems with your code (to add: don't use return as a variable name, since it is the name of a base R function).
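For context, here is a minimal sketch of where the stray ']' in the error message comes from. The helper show_expr and the other variable names are made up for illustration; the gsub() pattern and the paste() call are copied from the question. When lapply() calls a function, the argument expression that substitute() sees is X[[i]], not the original c$v1[...] expression, so the greedy regex leaves an unbalanced bracket.

```r
# Hypothetical helper: shows the expression substitute() captures for x.
show_expr <- function(x) deparse(substitute(x))

# Called directly, the original expression is captured.
direct <- show_expr(c$v1[c$type == 1])

# Called through lapply(), the captured expression is lapply's internal X[[i]].
via_lapply <- lapply(list(1), show_expr)[[1]]

# Same regex as in the question: keep everything from the last "[" onwards.
sect_direct <- gsub("^.*\\[", "\\[", direct)
sect_lapply <- gsub("^.*\\[", "\\[", via_lapply)

# Same string construction as in the question.
cmd_direct <- paste("c", "$pin", sect_direct, sep = "")
cmd_lapply <- paste("c", "$pin", sect_lapply, sep = "")

print(via_lapply)  # "X[[i]]"
print(cmd_direct)  # "c$pin[c$type == 1]"
print(cmd_lapply)  # "c$pin[i]]"  (unbalanced, so parse() fails)
```

The second string is exactly the text shown in the error message, which is why building code from deparsed arguments breaks as soon as the function is called indirectly.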

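It's not clear to me why you want to define a dedicated function m.pin, nor what you are ultimately trying to do, but I'm assuming it is a critical design component.

Rewriting m.pin as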

m.pin <- function(df, type, vcol) which(df[, "type"] == type & is.na(df[, vcol]))

we get

m.pin(df, 1, "v1")
#[1] 2

or, to identify the rows with NA in "v1" for every type:

lapply(unique(df$type), function(x) m.pin(df, x, "v1"))
#[[1]]
#[1] 2
#
#[[2]]
#[1] 3 4

Update

In response to Gregor's comment, perhaps this is what you're after?

by(df, df$type, function(x) list(v1 = x$pin[which(is.na(x$v1))], v2 = x$pin[which(is.na(x$v2))]))
# df$type: 1
# $v1
# [1] 2
#
# $v2
# [1] 1 2
#
# ------------------------------------------------------------
# df$type: 2
# $v1
# [1] 3 4
#
# $v2
# integer(0)

This returns, for every type, a list of the pin numbers that have NA entries in v1 and in v2.

Sample data

df is the example data frame from the question (named c there).

Answer 2

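I would suggest rewriting it like this (if this approach is to be taken at all). I'm calling your data d, because c is already the name of an extremely common function.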

# string column names, pass in the data frame as an object
# means no need for eval, parse, substitute, etc.
foo = function(data, na_col, return_col = "pin", filter_col, filter_val) {
  if (!missing(filter_col) && !missing(filter_val)) {
    data = data[data[, filter_col] == filter_val, ]
  }  
  data[is.na(data[, na_col]), return_col]
}

# working on the whole data frame
foo(d, na_col = "v1", return_col = "pin")
# [1] 2 3 4

# passing in a subset of the data
foo(d[d$type == 1, ], "v1", "pin")
# [1] 2

# using function arguments to subset the data
foo(d, "v1", "pin", filter_col = "type", filter_val = 1)
# [1] 2


# calling it with changing arguments:
# you could use `Map` or `mapply` to be fancy, but this for loop is nice and clear
inputs = data.frame(na_col = c("v1", "v2"), filter_val = c(1, 2), stringsAsFactors = FALSE)
result = list()
for (i in seq_len(nrow(inputs))) {
  result[[i]] = foo(d, na_col = inputs$na_col[i], return_col = "pin",
                    filter_col = "type", filter_val = inputs$filter_val[i])
}
result
# [[1]]
# [1] 2
# 
# [[2]]
# numeric(0)
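The comment above mentions Map as an alternative to the for loop. A minimal sketch of that variant follows, assuming d is the question's data frame and reusing foo and inputs as defined in this answer (they are repeated here so the snippet runs on its own).

```r
# Assumption: d is the example data frame from the question.
d <- data.frame(pin = c(1, 2, 3, 4), type = c(1, 1, 2, 2),
                v1 = c(1, NA, NA, NA), v2 = c(NA, NA, 1, 1))

# foo as defined in this answer.
foo <- function(data, na_col, return_col = "pin", filter_col, filter_val) {
  if (!missing(filter_col) && !missing(filter_val)) {
    data <- data[data[, filter_col] == filter_val, ]
  }
  data[is.na(data[, na_col]), return_col]
}

inputs <- data.frame(na_col = c("v1", "v2"), filter_val = c(1, 2),
                     stringsAsFactors = FALSE)

# Map walks over both columns of inputs in parallel, one call to foo per row.
result <- Map(
  function(col, val) foo(d, na_col = col, return_col = "pin",
                         filter_col = "type", filter_val = val),
  inputs$na_col, inputs$filter_val
)
result
```

Because the first vector passed to Map is an unnamed character vector, the elements of the returned list are named after it ("v1" and "v2"), which the for loop version does not do.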

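Another approach I would suggest is to melt your data into long format and keep only the rows with NA values. That gives you every combination of type and the v* columns that has an NA value in one go. You do it once, and you don't need a function to look up individual combinations.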

d_long = reshape2::melt(d, id.vars = c("pin", "type"))

library(dplyr)
d_long %>% filter(is.na(value)) %>%
  arrange(variable, type)
#   pin type variable value
# 1   2    1       v1    NA
# 2   3    2       v1    NA
# 3   4    2       v1    NA
# 4   1    1       v2    NA
# 5   2    1       v2    NA
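If reshape2 and dplyr are not available, roughly the same long listing can be sketched in base R with which(..., arr.ind = TRUE). This is an alternative technique, not part of the answer above; d is assumed to be the question's data frame, and value_cols, na_pos and na_cases are names chosen for illustration.

```r
# Assumption: d is the example data frame from the question.
d <- data.frame(pin = c(1, 2, 3, 4), type = c(1, 1, 2, 2),
                v1 = c(1, NA, NA, NA), v2 = c(NA, NA, 1, 1))

value_cols <- c("v1", "v2")

# Row/column positions of every NA in the value columns
# (ordered column by column, so all v1 rows come before v2 rows).
na_pos <- which(is.na(d[value_cols]), arr.ind = TRUE)

# One row per missing value: which pin, which type, which variable.
na_cases <- data.frame(
  pin      = d$pin[na_pos[, "row"]],
  type     = d$type[na_pos[, "row"]],
  variable = value_cols[na_pos[, "col"]],
  stringsAsFactors = FALSE
)
na_cases
```

The result lists the same five pin/type/variable combinations as the melt-based output, without the value column.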

Unexpected symbol error in parse() in a function applied to a list