Question

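I am trying to replace the blank (missing) zipcodes in the df table with zipcodes from another table called zipless, matching on name. What is the best way to do this? A for loop would probably be very slow.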

I tried something like this, but it does not work.

df$zip_new <- ifelse(df, is.na(zip_new),
                     left_join(df,zipless, by = c("contbr_nm" = "contbr_nm")),
                     zip_new)

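I was able to make it work with the following approach, but I am sure it is not the best way. I first added a new column from the lookup table, and then used it selectively in the next step where needed.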

library(dplyr)
#temporarily renaming the lookup column in the lookup table
zipless <- plyr::rename(zipless, c("zip_new"="zip_new_temp"))
#adding the lookup column to the main table
df <- left_join(df, zipless, by = c("contbr_nm" = "contbr_nm"))
#taking over the value from the lookup column zip_new_temp if the condition is met, else, do nothing.
df$zip_new  <- ifelse((df$zip_new == "") &
                              (df$contbr_nm %in% zipless$contbr_nm), 
                            df$zip_new_temp,
                            df$zip_new)

What is the proper way to do this?

Thanks a lot!

Answer 1

I would suggest using match to grab just the zips you need. Something like:

miss_zips = is.na(df$zip_new)
df$zip_new[miss_zips] = zipless$zip_new[match(
    df$contbr_nm[miss_zips], 
    zipless$contbr_nm
  )]

Without sample data I am not entirely sure of your column names, but something like this should work.
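To make the match idea concrete, here is a minimal sketch on made-up data. The sample rows are hypothetical; only the column names contbr_nm and zip_new come from the question. It also assumes, as the question's own code suggests, that a missing zip may be either NA or an empty string, and that each name appears at most once in zipless.

```r
# Hypothetical sample data; only the column names follow the question.
df <- data.frame(
  contbr_nm = c("A", "B", "C", "D", "E"),
  zip_new   = c("10001", NA, "", "60601", NA),
  stringsAsFactors = FALSE
)
zipless <- data.frame(
  contbr_nm = c("B", "C"),
  zip_new   = c("94105", "02139"),
  stringsAsFactors = FALSE
)

# Treat both NA and empty strings as missing.
miss_zips <- is.na(df$zip_new) | df$zip_new == ""

# Position of each missing row's name in the lookup table (NA if the name is absent).
idx <- match(df$contbr_nm[miss_zips], zipless$contbr_nm)

# Overwrite only the rows that have a match, so unmatched rows keep their original value.
found <- !is.na(idx)
df$zip_new[which(miss_zips)[found]] <- zipless$zip_new[idx[found]]
```

Restricting the assignment to matched rows is a design choice: without it, a row with an empty-string zip and no entry in zipless would be turned into NA.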

Answer 2

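I can only recommend the data.table package for this kind of thing, although your general approach is correct. The data.table package has a nicer syntax and is designed to handle large datasets.

In data.table it might look like this: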

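library(data.table); library(dplyr)
# assumes the lookup column in zipless was already renamed to zip_new_temp, as in the question
zipcodes <- data.table(left_join(df, zipless, by = "contbr_nm"))
zipcodes[, zip_new := ifelse(is.na(zip_new), zip_new_temp, zip_new)]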
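As an alternative sketch, data.table can also do the fill as an update join, which avoids the temporary column and the dplyr join altogether. This is not the code from the answer above, just one possible variant. The sample rows are hypothetical, and it assumes both tables are data.tables, that missing zips are NA, and that each name appears at most once in zipless.

```r
library(data.table)

# Hypothetical sample data; only the column names follow the question.
df <- data.table(
  contbr_nm = c("A", "B", "C"),
  zip_new   = c("10001", NA, NA)
)
zipless <- data.table(
  contbr_nm = "B",
  zip_new   = "94105"
)

# Update join: for rows of df whose name is found in zipless,
# fill zip_new from the lookup (i.zip_new) only where it is currently NA.
df[zipless, on = "contbr_nm",
   zip_new := fifelse(is.na(zip_new), i.zip_new, zip_new)]
```

Inside the join, the unprefixed zip_new refers to the column in df, while the i. prefix refers to the column in zipless. The update happens by reference, so no reassignment of df is needed.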
