Question

I have an Excel file that contains data in the following format:

Serial          Name          College     Time

Wednesday       24/10/2014
1               StudentA      UA          12:00:00
2               StudentB      UA          13:00:00

Thursday        25/10/2014
3               StudentC      UA          11:00:00
4               StudentA      UA          15:00:00

When converted to CSV, it looks like this:

Wednesday,24/10/2014,,    
1,StudentA,UA,12:00:00
2,StudentB,UA,13:00:00

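So basically, the data is partitioned by day. The records for Wednesday, 24/10/2014 are preceded by a row containing the weekday and the date, and the same holds for every other day. I want to convert this format into the following: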

Serial          Name          College        Date          Time
1               StudentA      UA             24/10/2014    12:00:00
2               StudentB      UA             24/10/2014    13:00:00
3               StudentC      UA             25/10/2014    11:00:00
4               StudentA      UA             25/10/2014    15:00:00

Feel free to ask any questions and to use any tool or technology. I would prefer R, though, because I am familiar with it.

Answer 1

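This is a pretty messy format, but here is one way to deal with it. First, just read in the raw lines, then partition the lines based on the special values: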

rr <- readLines("input.csv")
rr <- rr[nchar(rr) > 0]  # remove empty lines
ghead <- grepl(",,", rr)  # find the day "headers" by looking for two empty columns
# drop the column-header line, then count the detail rows that follow each day header
glines <- rle(cumsum(ghead[-1]))$lengths - 1

# read header and detail lines separately
dd <- read.csv(text = rr[!ghead])
gg <- read.csv(text = rr[ghead], header = FALSE,
    col.names = c("Weekday", "Date", "X", "Y"),
    colClasses = c("character", "character", "NULL", "NULL"))

# merge together, repeating each day row once per detail row
cbind(dd, gg[rep(seq_len(nrow(gg)), glines), ])

This produces

    Serial     Name College     Time   Weekday       Date
1        1 StudentA      UA 12:00:00 Wednesday 24/10/2014
1.1      2 StudentB      UA 13:00:00 Wednesday 24/10/2014
2        3 StudentC      UA 11:00:00  Thursday 25/10/2014
2.1      4 StudentA      UA 15:00:00  Thursday 25/10/2014
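
If you want exactly the column order asked for in the question (Serial, Name, College, Date, Time), a variation on the same idea is to carry each date forward onto the rows below it. This is only a sketch in base R: it assumes the CSV starts with a "Serial,Name,College,Time" header line (not shown in the question's excerpt), that every day row contains ",,", and that no record appears before the first day row. The csv_text variable and the other names are made up for illustration.

```r
# Inline copy of the data from the question; the first line is an assumed header row.
csv_text <- "Serial,Name,College,Time
Wednesday,24/10/2014,,
1,StudentA,UA,12:00:00
2,StudentB,UA,13:00:00

Thursday,25/10/2014,,
3,StudentC,UA,11:00:00
4,StudentA,UA,15:00:00"

lines <- strsplit(csv_text, "\n", fixed = TRUE)[[1]]
lines <- lines[nzchar(trimws(lines))]        # drop blank lines

body   <- lines[-1]                          # everything after the column header
is_day <- grepl(",,", body, fixed = TRUE)    # day rows have empty trailing fields

# The date is the second field of each day row.
day_dates <- vapply(strsplit(body[is_day], ","), `[`, character(1), 2)

# cumsum(is_day) labels every row with the index of the most recent day row,
# so indexing day_dates with it "fills down" the date.
date_for_row <- day_dates[cumsum(is_day)]

records <- read.csv(text = c(lines[1], body[!is_day]), stringsAsFactors = FALSE)
records$Date <- date_for_row[!is_day]

# Reorder the columns to match the layout requested in the question.
result <- records[, c("Serial", "Name", "College", "Date", "Time")]
print(result)
```

With a real file you would replace the strsplit() line with readLines() on your CSV; the rest should stay the same as long as the assumptions above hold.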

Answer 2

Here is an approach using read.mtable from the GitHub-only "SOfun" package.

## Load SOfun (or just copy and paste the required function)
library(SOfun)      ## For `read.mtable`
library(data.table) ## for setnames and rbindlist

## Reads in each chunk as a data.frame in a list
X <- read.mtable("test.csv", chunkId = ",,$", sep = ",")

## Create a vector of new column names
colNames <- c("Serial", "Name", "College", "Time", "Date")

rbindlist(
  lapply(
    ## The next line adds the dates back in
    Map(cbind, X, lapply(strsplit(names(X), ","), `[`, 2)), 
    setnames, colNames))
#    Serial      Name College        Time        Date
# 1:      1  StudentA      UA 12:00:00 PM  24/10/2014
# 2:      2  StudentB      UA 01:00:00 PM  24/10/2014
# 3:      3  StudentC      UA 11:00:00 AM  25/10/2014
# 4:      4  StudentA      UA 03:00:00 PM  25/10/2014
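
If you would rather not install a GitHub-only package, the same chunk-by-chunk idea can be sketched in base R. To be clear, this is not how read.mtable is implemented, just a rough equivalent of splitting the file into per-day chunks and binding them back together. It assumes the first line is the "Serial,Name,College,Time" header, that day rows end in ",,", and that every day has at least one record; csv_lines and read_chunk are hypothetical names.

```r
# Lines of the CSV as they would come back from readLines(); header row is assumed.
csv_lines <- c(
  "Serial,Name,College,Time",
  "Wednesday,24/10/2014,,",
  "1,StudentA,UA,12:00:00",
  "2,StudentB,UA,13:00:00",
  "",
  "Thursday,25/10/2014,,",
  "3,StudentC,UA,11:00:00",
  "4,StudentA,UA,15:00:00"
)

header <- csv_lines[1]
body   <- csv_lines[-1]
body   <- body[nzchar(body)]                 # drop blank lines

is_day <- grepl(",,$", body)                 # day rows end with empty fields
chunks <- split(body, cumsum(is_day))        # one chunk per day: day row, then records

# Read one chunk as a data.frame and attach the date taken from its day row.
read_chunk <- function(chunk) {
  day_fields <- strsplit(chunk[1], ",")[[1]]
  df <- read.csv(text = c(header, chunk[-1]), stringsAsFactors = FALSE)
  df$Date <- day_fields[2]
  df
}

combined <- do.call(rbind, lapply(chunks, read_chunk))
rownames(combined) <- NULL                   # discard the "1.1", "1.2", ... row names
print(combined)
```

The columns come out as Serial, Name, College, Time, Date, the same order as in the answer above.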

Formatting data in R
