Need to reshape from wide to long

Time: 2019-02-22 21:05:38

Tags: python r excel dataframe reshape

[Image 1: the original data in wide format]

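Hi, I have a dataset with a unique ID variable in column A, followed by the renal scans performed for each patient. It is a csv file and, if possible, I would like to reshape it to long format using R. Each participant can have 1-17 renal scans.

Some IDs are also listed as "No" to indicate that no scan was received. I would like to reshape the data into something like this: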

[Image 2: the desired long format]

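I am aware of previous questions in which the data were organised by year. My data consist of scans per participant that occur multiple times, with dates in yyyy-mm-dd format.

Please see the data below: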

structure(list(id = c(1010001, 1010002, 1010004, 1010005, 1010006, 
1010007), `GFR Scans?` = c("Yes", "Yes", "Yes", "Yes", "Yes", 
"No"), `1. Date of renal scan:` = structure(c(1133913600, 1196812800, 
1237334400, 1124150400, 1192060800, NA), class = c("POSIXct", 
"POSIXt"), tzone = "UTC"), `1. Type of renal scan:` = c("DTPA", 
"DTPA", "DTPA", "DTPA", "DTPA", NA), `1. GFR mL/1.73 sq.m` = c(18, 
13, 68, 117, 46, NA), `1. Pre/Post tx?` = c("Pre", "Pre", "Post", 
"Post", "Pre", NA), `2. Date of renal scan:` = structure(c(1146528000, 
1214524800, NA, 1151366400, 1245974400, NA), class = c("POSIXct", 
"POSIXt"), tzone = "UTC"), `2. Type of renal scan:` = c("DTPA", 
"DTPA", NA, "DTPA", "DTPA", NA), `2. GFR mL/1.73 sq.m` = c(86, 
110, NA, 148, 123, NA), `2. Pre/Post tx?` = c("Post", "Post", 
NA, "Post", "Post", NA), `3. Date of renal scan:` = structure(c(NA, 
1219104000, NA, 1184025600, NA, NA), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), `3. Type of renal scan:` = c(NA, "DTPA", NA, 
"DTPA", NA, NA), `3. GFR mL/1.73 sq.m` = c(NA, 92, NA, 166, NA, 
NA), `3. Pre/Post tx?` = c(NA, "Post", NA, "Post", NA, NA), `4. Date of    renal scan:` = structure(c(NA, 
1242691200, NA, 1213660800, NA, NA), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), `4. Type of renal scan:` = c(NA, "DTPA", NA, 
"DTPA", NA, NA), `4. GFR mL/1.73 sq.m` = c(NA, 36, NA, 171, NA, 
NA), `4. Pre/Post tx?` = c(NA, "Post", NA, "Post", NA, NA), `5. Date of    renal scan:` = structure(c(NA, 
NA, NA, 1288656000, NA, NA), class = c("POSIXct", "POSIXt"), tzone =  "UTC"), 
    `5. Type of renal scan:` = c(NA, NA, NA, "DTPA", NA, NA), 
    `5. GFR mL/1.73 sq.m` = c(NA, NA, NA, 105, NA, NA), `5. Pre/Post  tx?` = c(NA, 
    NA, NA, "Post", NA, NA), `6. Date of renal scan:` = structure(c(NA, 
    NA, NA, 1323129600, NA, NA), class = c("POSIXct", "POSIXt"
    ), tzone = "UTC"), `6. Type of renal scan:` = c(NA, NA, NA, 
    "DTPA", NA, NA), `6. GFR mL/1.73 sq.m` = c(NA, NA, NA, 103, 
    NA, NA), `6. Pre/Post tx?` = c(NA, NA, NA, "Post", NA, NA
    ), `7. Date of renal scan:` = structure(c(NA, NA, NA, 1355184000, 
    NA, NA), class = c("POSIXct", "POSIXt"), tzone = "UTC"), 
    `7. Type of renal scan:` = c(NA, NA, NA, "DTPA", NA, NA), 
    `7. GFR mL/1.73 sq.m` = c(NA, NA, NA, 98, NA, NA), `7. Pre/Post tx?` = c(NA, 
    NA, NA, "Post", NA, NA), `8. Date of renal scan:` = c(NA, 
    NA, NA, NA, NA, NA), `8. Type of renal scan:` = c(NA, NA, 
    NA, NA, NA, NA), `8. GFR mL/1.73 sq.m` = c(NA, NA, NA, NA, 
    NA, NA), `8. Pre/Post tx?` = c(NA, NA, NA, NA, NA, NA), `9. Date of renal scan:` = c(NA, 
    NA, NA, NA, NA, NA), `9. Type of renal scan:` = c(NA, NA, 
    NA, NA, NA, NA), `9. GFR mL/1.73 sq.m` = c(NA, NA, NA, NA, 
    NA, NA), `9. Pre/Post tx?` = c(NA, NA, NA, NA, NA, NA), `10. Date   of renal scan:` = c(NA, 
    NA, NA, NA, NA, NA), `10. Type of renal scan:` = c(NA, NA, 
    NA, NA, NA, NA), `10. GFR mL/1.73 sq.m` = c(NA, NA, NA, NA, 
    NA, NA), `10. Pre/Post tx?` = c(NA, NA, NA, NA, NA, NA), 
    `11. Date of renal scan:` = c(NA, NA, NA, NA, NA, NA), `11. Type of  renal scan:` = c(NA, 
    NA, NA, NA, NA, NA), `11. GFR mL/1.73 sq.m` = c(NA, NA, NA, 
    NA, NA, NA), `11. Pre/Post tx?` = c(NA, NA, NA, NA, NA, NA
    ), `12. Date of renal scan:` = c(NA, NA, NA, NA, NA, NA), 
    `12. Type of renal scan:` = c(NA, NA, NA, NA, NA, NA), `12. GFR mL/1.73 sq.m` = c(NA, 
    NA, NA, NA, NA, NA), `12. Pre/Post tx?` = c(NA, NA, NA, NA, 
    NA, NA), `13. Date of renal scan:` = c(NA, NA, NA, NA, NA, 
    NA), `13. Type of renal scan:` = c(NA, NA, NA, NA, NA, NA
    ), `13. GFR mL/1.73 sq.m` = c(NA, NA, NA, NA, NA, NA), `13. Pre/Post tx?` = c(NA, 
    NA, NA, NA, NA, NA), `14. Date of renal scan:` = c(NA, NA, 
    NA, NA, NA, NA), `14. Type of renal scan:` = c(NA, NA, NA, 
    NA, NA, NA), `14. GFR mL/1.73 sq.m` = c(NA, NA, NA, NA, NA, 
    NA), `14. Pre/Post tx?` = c(NA, NA, NA, NA, NA, NA), `15. Date of renal scan:` = c(NA, 
    NA, NA, NA, NA, NA), `15. Type of renal scan:` = c(NA, NA, 
    NA, NA, NA, NA), `15. GFR mL/1.73 sq.m` = c(NA, NA, NA, NA, 
    NA, NA), `15. Pre/Post tx?` = c(NA, NA, NA, NA, NA, NA), 
    `16. Date of renal scan:` = c(NA, NA, NA, NA, NA, NA), `16. Type of  renal scan:` = c(NA, 
    NA, NA, NA, NA, NA), `16. GFR mL/1.73 sq.m` = c(NA, NA, NA, 
    NA, NA, NA), `16. Pre/Post tx?` = c(NA, NA, NA, NA, NA, NA
    ), `17. Date of renal scan:` = c(NA, NA, NA, NA, NA, NA), 
    `17. Type of renal scan:` = c(NA, NA, NA, NA, NA, NA), `17. GFR mL/1.73 sq.m` = c(NA, 
    NA, NA, NA, NA, NA), `17. Pre/Post tx?` = c(NA, NA, NA, NA, 
    NA, NA)), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"))

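The first image shows the original wide format; the second shows the layout I want. Because several columns are involved, none of the other answers have helped me.

For example, id 1010001 had two scans, and I need these scans listed one below the other rather than side by side (see image 2).

Many thanks for your help.

2 Answers:

Answer 0: (score: 3)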

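This question has been asked several times before, e.g. Reshaping multiple sets of measurement columns (wide format) into single columns (long format). One possible approach is to use the melt() function from data.table, which can reshape multiple value columns simultaneously.

However, there is an additional difficulty here which, IMHO, justifies an answer of its own: the column names occasionally contain surplus spaces, which have to be removed beforehand so that the columns follow a consistent naming pattern.

names(df1)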
 [1] "id"                        "GFR Scans?"                "1. Date of renal scan:"    "1. Type of renal scan:"   
 [5] "1. GFR mL/1.73 sq.m"       "1. Pre/Post tx?"           "2. Date of renal scan:"    "2. Type of renal scan:"   
 [9] "2. GFR mL/1.73 sq.m"       "2. Pre/Post tx?"           "3. Date of renal scan:"    "3. Type of renal scan:"   
[13] "3. GFR mL/1.73 sq.m"       "3. Pre/Post tx?"           "4. Date of    renal scan:" "4. Type of renal scan:"   
[17] "4. GFR mL/1.73 sq.m"       "4. Pre/Post tx?"           "5. Date of    renal scan:" "5. Type of renal scan:"   
[21] "5. GFR mL/1.73 sq.m"       "5. Pre/Post  tx?"          "6. Date of renal scan:"    "6. Type of renal scan:"   
[25] "6. GFR mL/1.73 sq.m"       "6. Pre/Post tx?"           "7. Date of renal scan:"    "7. Type of renal scan:"   
[29] "7. GFR mL/1.73 sq.m"       "7. Pre/Post tx?"           "8. Date of renal scan:"    "8. Type of renal scan:"   
[33] "8. GFR mL/1.73 sq.m"       "8. Pre/Post tx?"           "9. Date of renal scan:"    "9. Type of renal scan:"   
[37] "9. GFR mL/1.73 sq.m"       "9. Pre/Post tx?"           "10. Date   of renal scan:" "10. Type of renal scan:"  
[41] "10. GFR mL/1.73 sq.m"      "10. Pre/Post tx?"          "11. Date of renal scan:"   "11. Type of  renal scan:" 
[45] "11. GFR mL/1.73 sq.m"      "11. Pre/Post tx?"          "12. Date of renal scan:"   "12. Type of renal scan:"  
[49] "12. GFR mL/1.73 sq.m"      "12. Pre/Post tx?"          "13. Date of renal scan:"   "13. Type of renal scan:"  
[53] "13. GFR mL/1.73 sq.m"      "13. Pre/Post tx?"          "14. Date of renal scan:"   "14. Type of renal scan:"  
[57] "14. GFR mL/1.73 sq.m"      "14. Pre/Post tx?"          "15. Date of renal scan:"   "15. Type of renal scan:"  
[61] "15. GFR mL/1.73 sq.m"      "15. Pre/Post tx?"          "16. Date of renal scan:"   "16. Type of  renal scan:" 
[65] "16. GFR mL/1.73 sq.m"      "16. Pre/Post tx?"          "17. Date of renal scan:"   "17. Type of renal scan:"
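To make the naming problem concrete, here is a minimal base R sketch of the clean-up applied to three of the names listed above. The variable names nms, clean and stem are hypothetical and exist only for illustration; the answer's own code performs the equivalent step with stringr.

```r
# Three column names from the listing above, with their irregular spacing
nms <- c("4. Date of    renal scan:", "5. Pre/Post  tx?", "10. Date   of renal scan:")

# Collapse every run of whitespace into a single space
clean <- gsub("\\s+", " ", nms)

# Drop the leading scan number ("4. ", "10. ") to expose the shared name stem
stem <- sub("^[0-9]+\\. ", "", clean)

print(clean)
print(stem)
```

After this step, every scan slot uses the same four stems, which is what allows a pattern-based reshape to group the columns.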
library(data.table)
library(magrittr)
# clean up column names: remove surplus whitespace
setDT(df1) %>% setnames(names(.) %>% stringr::str_replace_all("\\s+", " "))
# get name pattern for subsequent melt
cols <- names(df1)[3:6] %>% stringr::str_replace("1. ", "")
# reshape multiple columns from wide to long
long <- melt(df1, measure.vars = patterns(cols), value.name = cols, na.rm = TRUE)[
  # recreate lost POSIXct attribute
  , `Date of renal scan:` := lubridate::as_datetime(`Date of renal scan:`)][]

long

         id GFR Scans? variable Date of renal scan: Type of renal scan: GFR mL/1.73 sq.m Pre/Post tx?
 1: 1010001        Yes        1          2005-12-07                DTPA               18          Pre
 2: 1010002        Yes        1          2007-12-05                DTPA               13          Pre
 3: 1010004        Yes        1          2009-03-18                DTPA               68         Post
 4: 1010005        Yes        1          2005-08-16                DTPA              117         Post
 5: 1010006        Yes        1          2007-10-11                DTPA               46          Pre
 6: 1010001        Yes        2          2006-05-02                DTPA               86         Post
 7: 1010002        Yes        2          2008-06-27                DTPA              110         Post
 8: 1010005        Yes        2          2006-06-27                DTPA              148         Post
 9: 1010006        Yes        2          2009-06-26                DTPA              123         Post
10: 1010002        Yes        3          2008-08-19                DTPA               92         Post
11: 1010005        Yes        3          2007-07-10                DTPA              166         Post
12: 1010002        Yes        4          2009-05-19                DTPA               36         Post
13: 1010005        Yes        4          2008-06-17                DTPA              171         Post
14: 1010005        Yes        5          2010-11-02                DTPA              105         Post
15: 1010005        Yes        6          2011-12-06                DTPA              103         Post
16: 1010005        Yes        7          2012-12-11                DTPA               98         Post

In the call to melt(), we can set the parameter na.rm = FALSE instead to keep all data.
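The effect of na.rm can be illustrated on a small toy table. This is only a sketch under stated assumptions: the table wide, its shortened column names and its values are hypothetical stand-ins for the question's data, and the data.table package is assumed to be installed.

```r
library(data.table)

# Toy wide table with hypothetical values: two scan slots per id.
# id 3 has no scans at all, similar to an id flagged "No" in the question.
wide <- data.table(
  id        = c(1, 2, 3),
  `1. Type` = c("DTPA", "DTPA", NA),
  `1. GFR`  = c(18, 13, NA),
  `2. Type` = c("DTPA", NA, NA),
  `2. GFR`  = c(86, NA, NA)
)

value_cols <- c("Type", "GFR")

# na.rm = TRUE drops the scan slots that hold no data
long_drop <- melt(wide, id.vars = "id", measure.vars = patterns(value_cols),
                  value.name = value_cols, variable.name = "scan", na.rm = TRUE)

# na.rm = FALSE keeps one row per id and scan slot, filled with NA where empty
long_keep <- melt(wide, id.vars = "id", measure.vars = patterns(value_cols),
                  value.name = value_cols, variable.name = "scan", na.rm = FALSE)

print(long_drop)
print(long_keep)
```

With na.rm = FALSE, the id without any scans stays in the result, at the cost of one mostly empty row per unused scan slot.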

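Answer 1: (score: 2)

This is a workable solution; not the best one, but it works. The strategy is to go from wide to long and then to tidy.

When converting from the original wide format to long, all columns are coerced to the lowest common type (character in this case), so the columns need to be converted back at the end.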
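The coercion and the conversion back can be reproduced in isolation. The following base R sketch does not show what the reshaping functions do internally; it only illustrates the general behaviour, using one date and two values taken from the question's data. The variable names are hypothetical.

```r
# Combining a number and a string in one vector forces everything to character
mixed <- c(18, "DTPA")
print(class(mixed))

# A POSIXct date is stored as seconds since 1970-01-01 (UTC)
scan_date <- as.POSIXct("2005-12-07", tz = "UTC")
as_text <- as.character(as.numeric(scan_date))

# Converting back: character -> numeric -> POSIXct
restored <- as.POSIXct(as.numeric(as_text), origin = "1970-01-01", tz = "UTC")
print(restored)

# The GFR value is recovered from text the same way
gfr <- as.numeric(mixed[1])
print(gfr)
```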

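To remove the rows containing NA I use complete.cases, so your last id, 1010007, is not in the final output. If that is a problem, the position of the NA clean-up step should be adjusted.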
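Why an id can disappear entirely is easy to see on a small example. This is a base R sketch with a hypothetical long-format data frame scans; only the ids and the values 18 and "DTPA" come from the question.

```r
# Hypothetical long-format rows: id 1010007 has only missing values
scans <- data.frame(
  id    = c(1010001, 1010001, 1010007),
  key   = c("GFR mL/1.73 sq.m", "Type of renal scan:", "GFR mL/1.73 sq.m"),
  value = c("18", "DTPA", NA),
  stringsAsFactors = FALSE
)

# complete.cases() is TRUE only for rows without any NA
keep <- complete.cases(scans)
filtered <- scans[keep, ]

# ids that were present before filtering but are gone afterwards
dropped_ids <- setdiff(scans$id, filtered$id)
print(dropped_ids)
```

If ids without scans must be retained, one possible adjustment is to record the full set of ids before filtering and join them back onto the reshaped result.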

library(tidyr)
library(dplyr)

#convert from wide to long
new<-gather(df,key = "key", value = "value", -id, -`GFR Scans?`)
#clean up the key column
new$key<-sub("[0-9]+\\. ", "", new$key)
new$key<-gsub("[ ]+", " ", new$key)

# verify column headings (should only be 4)
unique(new$key)
#remove the rows with NA
new<-new[complete.cases(new),]

#now go from long to slightly wide
answer<-new %>% group_by( id, `GFR Scans?`, key) %>% mutate(testnum=row_number()) %>% spread(key, value)  

#convert the columns back to the proper type
answer$`Date of renal scan:`<-as.POSIXct(as.numeric(answer$`Date of renal scan:`), origin="1970-01-01", tz="UTC")
answer$`GFR mL/1.73 sq.m`<-as.numeric(answer$`GFR mL/1.73 sq.m`)
answer

# id `GFR Scans?` testnum `Date of renal scan:` `GFR mL/1.73 sq.m` `Pre/Post tx?` `Type of renal scan:`
#     <dbl> <chr>          <int> <dttm>                             <dbl> <chr>          <chr>                
# 1 1010001 Yes                1 2005-12-07 00:00:00                   18 Pre            DTPA                 
# 2 1010001 Yes                2 2006-05-02 00:00:00                   86 Post           DTPA                 
# 3 1010002 Yes                1 2007-12-05 00:00:00                   13 Pre            DTPA                 
# 4 1010002 Yes                2 2008-06-27 00:00:00                  110 Post           DTPA                 
# 5 1010002 Yes                3 2008-08-19 00:00:00                   92 Post           DTPA                 
# 6 1010002 Yes                4 2009-05-19 00:00:00                   36 Post           DTPA                 
# 7 1010004 Yes                1 2009-03-18 00:00:00                   68 Post           DTPA                 
# 8 1010005 Yes                1 2005-08-16 00:00:00                  117 Post           DTPA