How do I concatenate strings based on two conditions?

Time: 2015-07-26 20:42:44

Tags: r concatenation

My data resembles the following dummy data:

dummy <- structure(list(id = c(1, 1, 2, 2, 2, 3, 4, 5, 6, 7, 8, 9, 10, 
    10, 10), dob = structure(c(1L, 1L, 6L, 6L, 6L, 3L, 9L, 2L, 5L, 
    7L, 4L, 8L, 6L, 6L, 6L), .Label = c("1990-01-01", "1991-11-12", 
    "1998-12-12", "1999-09-09", "2000-07-28", "2001-04-05", "2002-02-02", 
    "2004-12-16", "2012-05-06"), class = "factor"), date = structure(c(4L, 
    4L, 11L, 11L, 12L, 1L, 2L, 10L, 8L, 9L, 7L, 5L, 3L, 3L, 6L), .Label = c("2000-01-01", 
    "2000-01-03", "2002-12-15", "2003-01-06", "2003-04-05", "2003-12-15", 
    "2009-07-28", "2009-09-09", "2011-11-11", "2012-01-03", "2012-12-19", 
    "2012-12-31"), class = "factor"), text = structure(c(6L, 7L, 
    8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 1L, 2L, 3L, 4L, 5L), .Label = c("2aabb", 
    "2ccdd", "2eeff", "2gghh", "2iijj", "aa bb cc", "dd ee ff", "ghi", 
    "jklm", "nop", "qq rr", "sss ttt", "uv", "www xxx", "yy zz"), class = "factor"), 
        gender = structure(c(2L, 2L, 1L, 1L, 1L, 2L, 1L, 3L, 3L, 
        2L, 2L, 1L, 2L, 3L, 3L), .Label = c("f", "m", "mnx"), class = "factor")), .Names = c("id", 
    "dob", "date", "text", "gender"), row.names = c(NA, -15L), class = "data.frame")
> dummy
   id        dob       date     text gender
1   1 1990-01-01 2003-01-06 aa bb cc      m
2   1 1990-01-01 2003-01-06 dd ee ff      m
3   2 2001-04-05 2012-12-19      ghi      f
4   2 2001-04-05 2012-12-19     jklm      f
5   2 2001-04-05 2012-12-31      nop      f
6   3 1998-12-12 2000-01-01    qq rr      m
7   4 2012-05-06 2000-01-03  sss ttt      f
8   5 1991-11-12 2012-01-03       uv    mnx
9   6 2000-07-28 2009-09-09  www xxx    mnx
10  7 2002-02-02 2011-11-11    yy zz      m
11  8 1999-09-09 2009-07-28    2aabb      m
12  9 2004-12-16 2003-04-05    2ccdd      f
13 10 2001-04-05 2002-12-15    2eeff      m
14 10 2001-04-05 2002-12-15    2gghh    mnx
15 10 2001-04-05 2003-12-15    2iijj    mnx

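My goal is to end up with a data frame that keeps all columns. However, if several rows within an id share the same date, I need to concatenate the values in 'text' for those matching dates, separated by a space, so that each date appears only once within each id. Here is my target data: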

dummy2 <- structure(list(id = c(1, 2, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10), 
    dob = structure(c(1L, 6L, 6L, 3L, 9L, 2L, 5L, 7L, 4L, 8L, 
    6L, 6L), .Label = c("1990-01-01", "1991-11-12", "1998-12-12", 
    "1999-09-09", "2000-07-28", "2001-04-05", "2002-02-02", "2004-12-16", 
    "2012-05-06"), class = "factor"), date = structure(c(4L, 
    11L, 12L, 1L, 2L, 10L, 8L, 9L, 7L, 5L, 3L, 6L), .Label = c("2000-01-01", 
    "2000-01-03", "2002-12-15", "2003-01-06", "2003-04-05", "2003-12-15", 
    "2009-07-28", "2009-09-09", "2011-11-11", "2012-01-03", "2012-12-19", 
    "2012-12-31"), class = "factor"), text = structure(c(5L, 
    6L, 7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L), .Label = c("2aabb", 
    "2ccdd", "2eeff 2gghh", "2iijj", "aa bb cc dd ee ff", "ghi jklm", 
    "nop", "qq rr", "sss ttt", "uv", "www xxx", "yy zz"), class = "factor"), 
    gender = structure(c(2L, 1L, 1L, 2L, 1L, 3L, 3L, 2L, 2L, 
    1L, 3L, 3L), .Label = c("f", "m", "mnx"), class = "factor")), .Names = c("id", 
    "dob", "date", "text", "gender"), row.names = c(NA, -12L), class = "data.frame")
> dummy2
   id        dob       date              text gender
1   1 1990-01-01 2003-01-06 aa bb cc dd ee ff      m
2   2 2001-04-05 2012-12-19          ghi jklm      f
3   2 2001-04-05 2012-12-31               nop      f
4   3 1998-12-12 2000-01-01             qq rr      m
5   4 2012-05-06 2000-01-03           sss ttt      f
6   5 1991-11-12 2012-01-03                uv    mnx
7   6 2000-07-28 2009-09-09           www xxx    mnx
8   7 2002-02-02 2011-11-11             yy zz      m
9   8 1999-09-09 2009-07-28             2aabb      m
10  9 2004-12-16 2003-04-05             2ccdd      f
11 10 2001-04-05 2002-12-15       2eeff 2gghh    mnx
12 10 2001-04-05 2003-12-15             2iijj    mnx

I have tried:

dummy$text <- as.character(dummy$text)
test1 <- ddply(dummy, .(id, date), summarise, 
               paste0(unique(unlist(strsplit(text, split=", "))), collapse=", "))

for (i in 1:length(dummy$id)){
  ifelse(dummy$id[i]==dummy$id[i-1],
  (ifelse(dummy$date[i]==dummy$date[i-1],textcon[i]<-     paste(dummy$text[i],dummy$text[i-1]),textcon[i]<-dummy$text[i])),
              (textcon[i]<-dummy$text[i]))
}
test3<-data.frame(dummy,textcon)

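and many other variations, but I still cannot work out how to produce data in which no date is repeated within an id! This is similar to several earlier questions on SO, except that my problem requires two grouping factors at once rather than one.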

Thanks in advance for your help.

1 answer:

Answer 0: (score: 2)

Using dplyr:

library(dplyr)
dummy %>%
   group_by(id, dob, date, gender) %>%
   summarise(text=paste(text, collapse=' ')) %>%
   select(id:date, text, gender)
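
One thing to watch for: with gender among the grouping columns, rows 13 and 14 of dummy (id 10, date 2002-12-15, genders m and mnx) stay separate, whereas dummy2 shows them merged into one row with gender mnx. A minimal sketch of grouping by id and date only is below. It assumes that dob is constant within each id/date group and that keeping the last gender is acceptable. The question does not state that rule; it is merely consistent with the dummy2 row. The data frame `toy` is a hypothetical excerpt of dummy, and the `.groups` argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical excerpt of `dummy`: the three rows for id 10
toy <- data.frame(
  id     = c(10, 10, 10),
  dob    = c("2001-04-05", "2001-04-05", "2001-04-05"),
  date   = c("2002-12-15", "2002-12-15", "2003-12-15"),
  text   = c("2eeff", "2gghh", "2iijj"),
  gender = c("m", "mnx", "mnx"),
  stringsAsFactors = FALSE
)

res <- toy %>%
  group_by(id, date) %>%                   # only the two grouping factors
  summarise(
    dob    = first(dob),                   # assumes one dob per id/date group
    text   = paste(text, collapse = " "),  # join the texts with a space
    gender = last(gender),                 # assumption: keep the last gender
    .groups = "drop"
  ) %>%
  select(id, dob, date, text, gender)      # restore the original column order
```

If differing genders within one id/date pair should instead be treated as a data error, checking `n_distinct(gender)` per group before summarising would surface those rows.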

Or using data.table:

library(data.table)
setDT(dummy)
dummy[, list(text=paste(text, collapse=" ")), list(id, dob, date, gender)]

If the column order matters, you can add setcolorder to the data.table approach.
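
As a rough sketch of that last step: the grouped result puts the `by` columns first, so `text` ends up as the last column, and `setcolorder` moves it back in place by reference. The table `dt` below is a hypothetical excerpt of dummy used only to keep the example self-contained.

```r
library(data.table)

# Hypothetical excerpt of `dummy`: the three rows for id 2
dt <- data.table(
  id     = c(2, 2, 2),
  dob    = c("2001-04-05", "2001-04-05", "2001-04-05"),
  date   = c("2012-12-19", "2012-12-19", "2012-12-31"),
  text   = c("ghi", "jklm", "nop"),
  gender = c("f", "f", "f")
)

# Grouping columns come first in the result, so `text` ends up last
res <- dt[, list(text = paste(text, collapse = " ")),
          by = list(id, dob, date, gender)]

# setcolorder() reorders columns by reference; no reassignment is needed
setcolorder(res, c("id", "dob", "date", "text", "gender"))
```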
