Improving code efficiency

Time: 2015-10-30 15:39:54

Tags: r plyr

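I have been writing code to read in all sheets of an Excel workbook, where the first two columns in each sheet are "Date" and "Time", and the next two columns are either "Level" and "Temperature", or "LEVEL" and "TEMPERATURE". The code works, but I am trying to improve the clarity and efficiency of my coding, so any advice on those points would be greatly appreciated.

My function 1) reads the data into a list of data frames, 2) drops any NA columns that were accidentally read in, 3) combines "Date" and "Time" into "DateTime" for each data frame, 4) rounds "DateTime" to the nearest 5 minutes for each data frame, 5) replaces "Date" and "Time" with "DateTime" in each data frame. I am getting more comfortable with lapply, but I wonder whether I can make the code more efficient instead of having so many lapply lines.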

library(readxl)
library(plyr)

read_excel_allsheets <- function(filename) {
  sheets <- readxl::excel_sheets(filename)
  data <- lapply(sheets, function(X) readxl::read_excel(filename, sheet = X))
  names(data) <- sheets
  clean <- lapply(data, function(y) y[, colSums(is.na(y)) == 0])
  date <- lapply(clean, "[[", 1)
  time <- lapply(clean, "[[", 2)
  time <- lapply(time, function(z) format(z, format = "%H:%M"))
  datetime <- Map(paste, date, time)
  datetime <- lapply(datetime, function(a) as.POSIXct(a, format = "%Y-%m-%d %H:%M"))
  rounded <- lapply(datetime, function(b) as.POSIXlt(round(as.numeric(b)/(5*60))*(5*60),origin='1970-01-01'))
  addDateTime <- mapply(cbind, clean, "DateTime" = rounded, SIMPLIFY = F)
  final <- lapply(addDateTime, function(z) z[!(names(z) %in% c("Date", "Time"))])
  return(final)
}
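
One possible way to cut down on the chain of lapply calls, sketched under some assumptions, is to do all the per-sheet work in a single helper and call it once per sheet. The helper name `clean_sheet`, its `step` argument, and the small `demo` data frame are made up for illustration. The sketch builds the timestamp arithmetically in UTC rather than pasting and re-parsing text, so the wall-clock values should match the function above, while the attached time zone may differ from the session's local one.

```r
# Hypothetical helper: tidy one sheet that has already been read into a data frame.
clean_sheet <- function(sheet, step = 5 * 60) {
  # Drop every column that contains an NA, as the original function does.
  sheet <- sheet[, colSums(is.na(sheet)) == 0, drop = FALSE]
  # Seconds since midnight, truncated to whole minutes like format(., "%H:%M").
  secs <- floor((as.numeric(sheet$Time) %% 86400) / 60) * 60
  # Add the time of day to the date, then round to the nearest `step` seconds.
  stamp <- round((as.numeric(sheet$Date) + secs) / step) * step
  sheet$DateTime <- as.POSIXct(stamp, origin = "1970-01-01", tz = "UTC")
  # Keep everything except the original Date and Time columns.
  sheet[, !(names(sheet) %in% c("Date", "Time")), drop = FALSE]
}

# Made-up sheet shaped like the sample data in the question.
demo <- data.frame(
  Date  = as.POSIXct(rep("2011-05-16", 3), tz = "UTC"),
  Time  = as.POSIXct(c("1899-12-30 11:01:28",
                       "1899-12-30 11:06:28",
                       "1899-12-30 11:11:28"), tz = "UTC"),
  Level = c(106.9038, 106.9059, 106.89),
  Empty = NA_real_
)
cleaned <- clean_sheet(demo)
```

With readxl loaded, the helper could then be applied in a single lapply over `excel_sheets(filename)`, calling `read_excel(filename, sheet = s)` followed by `clean_sheet()` for each sheet name `s`. Working on the numeric timestamps avoids formatting to text and parsing it back, which may save run time on large sheets, although that has not been measured here.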

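Next, I want to plot all of my data. So I 1) run my code on a file, 2) combine the list of data frames into one data frame while keeping the "ID" of each data frame as a column, 3) combine the lowercase and uppercase versions of the variable columns, 4) add two new columns that split the "ID". Each ID is something like B1CC or B2CO, and I want to split the "ID" into "B1" and "CC". Now I can use ggplot very easily.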

mysheets <- read_excel_allsheets(filename)
df = ldply(mysheets)
df$Temp <- rowSums(df[, c("Temperature", "TEMPERATURE")], na.rm = T)
df$Lev <- rowSums(df[, c("Level", "LEVEL")], na.rm = T)
df <- df[!names(df) %in% c("Level", "LEVEL", "Temperature", "TEMPERATURE")]

df$exp <- gsub("^[[:alnum:]]{2}", "\\1",df$.id)
df$plot <- gsub("[[:alnum:]]{2}$", "\\1", df$.id)
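
As a hedged alternative for steps 3 and 4, the sketch below merges a pair of mixed-case columns with `ifelse()` and splits the ID with `substr()`. It assumes every ID is exactly four characters long, as in B1CC and B2CO, and the data frame here is a made-up stand-in for the output of `ldply()`. Unlike `rowSums(..., na.rm = TRUE)`, which returns 0 for a row where both columns are missing, this version leaves such a row as NA.

```r
# Made-up stand-in for the combined data frame produced by ldply().
df <- data.frame(
  .id   = c("B1CC", "B1CC", "B2CO"),
  Level = c(106.9, 106.8, NA),
  LEVEL = c(NA, NA, 117.5),
  stringsAsFactors = FALSE
)

# Take Level where it is present, otherwise fall back to LEVEL.
df$Lev <- ifelse(is.na(df$Level), df$LEVEL, df$Level)

# Fixed-width split of the ID: first two characters, then the last two.
df$plot <- substr(df$.id, 1, 2)
df$exp  <- substr(df$.id, 3, 4)
```

The column names `plot` and `exp` follow the question: `plot` holds the leading part such as "B1" and `exp` the trailing part such as "CC".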

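Below is the data for the first two data frames, but there are more than 50 of them, each relatively large, and there are many files to read in. I would therefore like to be as efficient as possible in terms of run time. Any help or advice is greatly appreciated!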

dput(head(x[[1]]))
structure(list(Date = structure(c(1305504000, 1305504000, 1305504000, 
1305504000, 1305504000, 1305504000), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), Time = structure(c(-2209121912, -2209121612, 
-2209121312, -2209121012, -2209120712, -2209120412), class = c("POSIXct", 
"POSIXt"), tzone = "UTC"), Level = c(106.9038, 106.9059, 106.89, 
106.9121, 106.8522, 106.8813), Temperature = c(6.176, 6.173, 
6.172, 6.168, 6.166, 6.165)), .Names = c("Date", "Time", "Level", 
"Temperature"), row.names = c(NA, 6L), class = c("tbl_df", "tbl", 
"data.frame"))

dput(head(x[[2]]))
structure(list(Date = structure(c(1305504000, 1305504000, 1305504000, 
1305504000, 1305504000, 1305504000), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), Time = structure(c(-2209121988, -2209121688, 
-2209121388, -2209121088, -2209120788, -2209120488), class = c("POSIXct", 
"POSIXt"), tzone = "UTC"), LEVEL = c(117.5149, 117.511, 117.5031, 
117.5272, 117.4523, 117.4524), TEMPERATURE = c(5.661, 5.651, 
5.645, 5.644, 5.644, 5.645), `NA` = c(NA_real_, NA_real_, NA_real_, 
NA_real_, NA_real_, NA_real_), `NA` = c(NA_real_, NA_real_, NA_real_, 
NA_real_, NA_real_, NA_real_), `NA` = c(NA_real_, NA_real_, NA_real_, 
NA_real_, NA_real_, NA_real_), `NA` = c(NA_real_, NA_real_, NA_real_, 
NA_real_, NA_real_, NA_real_), `NA` = c(NA_real_, NA_real_, NA_real_, 
NA_real_, NA_real_, NA_real_)), .Names = c("Date", "Time", "LEVEL", 
"TEMPERATURE", NA, NA, NA, NA, NA), row.names = c(NA, 6L), class =    
c("tbl_df", "tbl", "data.frame"))
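
For readers puzzled by the large negative numbers in the Time columns: they are seconds relative to 1970-01-01 UTC, so they fall on a placeholder date around 1900 and only the time of day is meaningful. A small sketch decoding the first three values of the first sheet (the variable names are illustrative):

```r
# First three Time values from the first sample data frame.
first_times <- c(-2209121912, -2209121612, -2209121312)

# Interpret them as seconds since the Unix epoch, in UTC.
decoded <- as.POSIXct(first_times, origin = "1970-01-01", tz = "UTC")

# The date part is a placeholder; the time of day is the real information.
placeholder_date <- format(decoded, "%Y-%m-%d")
time_of_day      <- format(decoded, "%H:%M:%S")

# The same time of day as seconds since midnight, without any formatting.
secs_since_midnight <- first_times %% 86400
```

The readings are five minutes apart, which is consistent with rounding "DateTime" to the nearest 5 minutes.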

0 answers:

No answers