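Splitting one variable into multiple variables in R

Time: 2017-11-29 21:41:50

Tags: r string variables dataframe split

I'm fairly new to R. My problem is not as simple as the title suggests. This is what df looks like: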

id    amenities
1     wireless internet, air conditioning, pool, kitchen
2     pool, kitchen, washer, dryer
3     wireless internet, kitchen, dryer
4     
5     wireless internet

This is what I want df to look like:

id    wireless internet   air conditioning   pool   kitchen   washer   dryer
1     1                   1                  1      1         0        0
2     0                   0                  1      1         1        1
3     1                   0                  0      1         0        1
4     0                   0                  0      0         0        0
5     1                   0                  0      0         0        0

Sample code to reproduce the data:

df <- data.frame(id = c(1, 2, 3, 4, 5),
      amenities = c("wireless internet, air conditioning, pool, kitchen",  
                    "pool, kitchen, washer, dryer", 
                    "wireless internet, kitchen, dryer", 
                    "", 
                    "wireless internet"), 
      stringsAsFactors = FALSE)

4 answers:

Answer 0: (score: 3)

A solution using dplyr and tidyr. Note that I replace "" with None, because that makes the column name easier to handle later.

library(dplyr)
library(tidyr)

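df2 <- df %>%
  # Split on ", " (comma plus space) so items carry no leading whitespace;
  # splitting on "," alone yields both "pool" and " pool" as separate columns
  separate_rows(amenities, sep = ", ") %>%
  # Give the empty entry (id 4) a placeholder name so it can be dropped later
  mutate(amenities = ifelse(amenities %in% "", "None", amenities)) %>%
  mutate(value = 1) %>%
  spread(amenities, value, fill = 0) %>%
  select(-None)
df2
#   id air conditioning dryer kitchen pool washer wireless internet
# 1  1                1     0       1    1      0                 1
# 2  2                0     1       1    1      1                 0
# 3  3                0     1       1    0      0                 1
# 4  4                0     0       0    0      0                 0
# 5  5                0     0       0    0      0                 1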
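
The same reshaping can also be sketched with pivot_wider(), which tidyr's documentation recommends over spread() in newer releases. This is a minimal sketch rather than the answerer's code: it assumes a tidyr version that provides pivot_wider() with a scalar values_fill, it splits on a comma followed by optional whitespace so that item names carry no leading spaces, and the name wide is made up for illustration.

```r
library(dplyr)
library(tidyr)

# Data from the question, rebuilt here so the block runs on its own
df <- data.frame(id = c(1, 2, 3, 4, 5),
      amenities = c("wireless internet, air conditioning, pool, kitchen",
                    "pool, kitchen, washer, dryer",
                    "wireless internet, kitchen, dryer",
                    "",
                    "wireless internet"),
      stringsAsFactors = FALSE)

wide <- df %>%
  # One row per amenity; the regex swallows the space after each comma
  separate_rows(amenities, sep = ",\\s*") %>%
  # Placeholder for the empty entry (id 4), dropped again at the end
  mutate(amenities = ifelse(amenities == "", "None", amenities),
         value = 1) %>%
  # Long to wide; combinations that do not occur are filled with 0
  pivot_wider(names_from = amenities, values_from = value, values_fill = 0) %>%
  select(-None)
wide
```

Unlike spread(), pivot_wider() keeps the columns in order of first appearance, which happens to match the layout requested in the question.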

Answer 1: (score: 3)

FWIW, here is a base R approach (assuming df contains the data shown in the question):

dat <- with(df, strsplit(amenities, ', '))
df2 <- data.frame(id = factor(rep(df$id, times = lengths(dat)),
                              levels = df$id),
                  amenities = unlist(dat))
df3 <- as.data.frame(cbind(id = df$id,
                     table(df2$id, df2$amenities)))

This results in

> df3
  id air conditioning dryer kitchen pool washer wireless internet
1  1                1     0       1    1      0                 1
2  2                0     1       1    1      1                 0
3  3                0     1       1    0      0                 1
4  4                0     0       0    0      0                 0
5  5                0     0       0    0      0                 1

Breaking down what is happening:

  1. dat <- with(df, strsplit(amenities, ', ')) splits the amenities variable on ', ', resulting in

    > dat
    [[1]]
    [1] "wireless internet" "air conditioning"  "pool"             
    [4] "kitchen"          
    
    [[2]]
    [1] "pool"    "kitchen" "washer"  "dryer"  
    
    [[3]]
    [1] "wireless internet" "kitchen"           "dryer"            
    
    [[4]]
    character(0)
    
    [[5]]
    [1] "wireless internet"
    
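  2. The second line takes dat and flattens it into a vector, and adds an id column by repeating each original id value as many times as that id has amenities. The id is stored as a factor whose levels are all the original ids, so id 4, which has no amenities, is still counted by table() in the next step. This results in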

    > df2
       id         amenities
    1   1 wireless internet
    2   1  air conditioning
    3   1              pool
    4   1           kitchen
    5   2              pool
    6   2           kitchen
    7   2            washer
    8   2             dryer
    9   3 wireless internet
    10  3           kitchen
    11  3             dryer
    12  5 wireless internet
    
  3. The table() function creates a contingency table of id against amenities, and then we add the id column.
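
The three steps above can also be condensed into a short base R sketch that builds the indicator matrix directly. It is an alternative formulation, not the answerer's code: the names items, all_items, ind and result are made up for illustration, and the data frame is rebuilt from the question so that the block runs on its own.

```r
# Data from the question
df <- data.frame(id = c(1, 2, 3, 4, 5),
      amenities = c("wireless internet, air conditioning, pool, kitchen",
                    "pool, kitchen, washer, dryer",
                    "wireless internet, kitchen, dryer",
                    "",
                    "wireless internet"),
      stringsAsFactors = FALSE)

# Step 1: split each string on ", "; the empty string gives character(0)
items <- strsplit(df$amenities, ", ", fixed = TRUE)

# Every distinct amenity, in order of first appearance
all_items <- unique(unlist(items))

# One 0/1 vector per id; vapply() stacks them as columns, so transpose
ind <- t(vapply(items,
                function(x) as.integer(all_items %in% x),
                integer(length(all_items))))
colnames(ind) <- all_items

# check.names = FALSE keeps column names that contain spaces
result <- data.frame(id = df$id, ind, check.names = FALSE)
result
```

Because all_items follows the order of first appearance, the columns come out in the order requested in the question rather than alphabetically, and id 4 is kept as a row of zeros.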

Answer 2: (score: 0)

The dummies package is useful here.

Answer 3: (score: 0)

For the sake of completeness, here is also a concise data.table solution:

library(data.table)
setDT(df)[, strsplit(amenities, ", "), by = id][
  , dcast(.SD, id ~ V1, length)]
   id air conditioning dryer kitchen pool washer wireless internet
1:  1                1     0       1    1      0                 1
2:  2                0     1       1    1      1                 0
3:  3                0     1       1    0      0                 1
4:  5                0     0       0    0      0                 1

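After df is coerced to a data.table, amenities is split on ", " into a separate row for each item (long format). The result is then reshaped to wide format with dcast(), using length() as the aggregation function. Note that id 4, which has no amenities, does not appear in the output above.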