读取多个csv数据并一次创建新列

时间:2017-06-01 08:57:32

标签: r function csv data.table fread

我有一个文件,其中有很多csv个数据 我想一次读取它们并创建新列,然后组合成一个数据表。我在这里解释得更多。

  1. 看看这张照片:
  2. enter image description here

    1. 我想根据csv数据标题创建2个新列YEARMONTH
      恩。以201508 Sales Report(London)为例。我想创建YEAR = 2015MONTH = 8

    2. 我不知道怎么办,但我可以一次阅读它们,而无需创建新列。

      my_read_data <- function(path){  
      data <- data.table::fread(path, header = T, strip.white = T, fill = T)  
      data <- data[data[[5]] != 0,]  
      data <- subset(data, select = c(-1,-7,-10,-12,-13,-14,-15,-17))  
      }  
      file.list <- dir(path = "//path/", pattern='\\.csv', full.names = T)  
      df.list <- lapply(file.list, my_read_data)  
      dt <- rbindlist(df.list)    
      
    3. 如何修改我的代码?
      其实我不确定我的代码是否正确。
      升值。

      感谢@Jaap,我的新代码是:

      my_read_data <- function(x){
      data <- data.table::fread(x, header = T, strip.white = T, fill = T)
      data <- data[data[[5]] != 0,]
      data <- subset(data, select = c(-1,-7,-10,-12,-13,-14,-15,-17))
      }
      file.list <- list.files(path = "/path/", pattern = '*.csv')
      dt.list <- sapply(file.list, my_read_data, simplify=FALSE)
      

      但是,我收到了一个错误。

      Error in data.table::fread(x, header = T, strip.white = T, fill = T) :   
      File not found: C:\Users\PECHEN\AppData\Local\Temp\RtmpiihFR4\filea0c4d726488   
      
      In addition: Warning messages:
      1: running command 'C:\Windows\system32\cmd.exe /c (TWM-201508 Sales Report(London).csv) > C:\Users\PECHEN\AppData\Local\Temp\RtmpiihFR4\filea0c4d726488' had status 1 
      2: In shell(paste("(", input, ") > ", tt, sep = "")) :
        '(TWM-201508 Sales Report(London).csv) > C:\Users\PECHEN\AppData\Local\Temp\RtmpiihFR4\filea0c4d726488' execution failed with error code 1  
      

      此外,我编辑了我的代码:

      my_read_data <- function(x){
      data <- data.table::fread(x, header = T, strip.white = T, fill = T)
      data <- data[data[[5]] != 0,]
      data <- subset(data, select = c(-1,-7,-10,-12,-13,-14,-15,-17))
      }
      file.list <- dir(path = "/path/", pattern='\\.csv', full.names = T)  
      df.list <- lapply(file.list, my_read_data)  
      dt <- rbindlist(df.list, idcol = 'id')[, `:=` (YEAR = substr(id,5,8), MONTH = substr(id,9,10))]   
      

      我使用YEAR = substr(id,5,8), MONTH = substr(id,9,10),因为每个数据标题在数字前都有四个字符。恩。 AAA-201508销售报告
      但是,它不起作用 感谢@Peter TW,它有效。

2 个答案:

答案 0 :(得分:3)

扩展我的评论并假设所有文件具有相同的结构,以下内容应该有效:

library(data.table)
# get list of file-names
file.list <- list.files(pattern='*.csv')

# read the files with sapply & fread
# this will create a named list of data.tables
dt.list <- sapply(file.list, fread, simplify=FALSE)

# bind the list together to one data.table
# using the 'idcol'-parameter puts the names of the data.tables in the id-column
# create the YEAR & MONTH variables with 'substr'
DT <- rbindlist(dt.list, idcol = 'id')[, `:=` (YEAR = substr(id,1,4), MONTH = substr(id,5,6))]

这将生成一个包含所有数据的data.table,并添加了YEARMONTH列。

如果您要从文件中排除某些列,可以使用drop - fread的参数,如下所示:

dt.list <- sapply(file.list, fread, drop = c(1,7,10,12:15,17), simplify=FALSE)

答案 1 :(得分:0)

以下是如何使用dplyr

添加列的方法
nam <- c("201508 Sales Report(London)", "201509 Sales Report(London)", "201604 Sales Report(London)-Monthly")

dat <- data.frame(file=nam, var=nam)
dat %>% 
   separate(var, into=c(paste0("parts", 1:5))) %>% 
   mutate(Year=substring(parts1, 1,4), Month=substring(parts1, 5,6)) %>% 
   select(Year, Month, file)

#   Year Month                                file
# 1 2015    08         201508 Sales Report(London)
# 2 2015    09         201509 Sales Report(London)
# 3 2016    04 201604 Sales Report(London)-Monthly