Conditional string splitting

Time: 2014-07-27 11:36:50

Tags: r tidyr

My question is similar to "conditional string splitting in R (using tidyr)". However, I need to split into more than 2 columns. If the data set column is

             cost
        reed_cost
   cost of living
        reed cost
 id gene_id locus

how can I split it into four columns, with the values aligned to the right:

col1 col2 col3   col4
                 cost
          reed   cost
     cost   of living
          reed   cost
  id gene   id  locus

I tried the solution from the linked question but could not get it to work.

3 answers:

Answer 0 (score: 1)

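dat <- data.frame(V1 = c("cost", "reed_cost", "cost of living", "reed cost", "id gene_id locus")) # Your data

library(stringr)
# Split on a space or an underscore into as many pieces as the longest row has;
# shorter rows are padded on the right with "".
vars <- str_split_fixed(dat$V1, " |_", max(str_count(dat$V1, " |_") + 1))
# Move the empty strings to the front of each row so the values float right.
dat2 <- data.frame(t(apply(vars, 1, function(x) c(x[x == ""], x[x != ""]))))
names(dat2) <- paste0("col", seq_len(dim(dat2)[2]))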

#   col1 col2 col3   col4
# 1                  cost
# 2           reed   cost
# 3      cost   of living
# 4           reed   cost
# 5   id gene   id  locus
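
To see why the `apply` step right-aligns the values: `str_split_fixed` pads short rows with empty strings on the right, and the anonymous function moves those empty strings in front of the non-empty pieces. A minimal base R sketch on a single row (the vector `one_row` is a hand-written stand-in for the row of `vars` that comes from "reed_cost", not output copied from the answer):

```r
# Hypothetical stand-in for one row of `vars` (the pieces of "reed_cost")
one_row <- c("reed", "cost", "", "")

# Empty strings first, then the non-empty pieces in their original order
shifted <- c(one_row[one_row == ""], one_row[one_row != ""])
print(shifted)
# [1] ""     ""     "reed" "cost"
```

Because the non-empty pieces keep their relative order, the last word of every row ends up in the last column.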

Answer 1 (score: 1)

The following two options should scale well. You need "data.table" and "reshape2" loaded, as well as my cSplit function:

library(data.table)
library(reshape2)
library(devtools)
source_gist(11380733) ## For cSplit

The first assumes that you don't actually need the values floated to the rightmost columns.

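## Sample data (assumed here from the question; the column is named "x")
X <- data.frame(x = c("cost", "reed_cost", "cost of living", "reed cost", "id gene_id locus"))

cSplit(X, "x", sep = " |_", fixed = FALSE)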
#     x_1  x_2    x_3   x_4
# 1: cost   NA     NA    NA
# 2: reed cost     NA    NA
# 3: cost   of living    NA
# 4: reed cost     NA    NA
# 5:   id gene     id locus

The second assumes that you want the data in the form shown in your table:

dcast.data.table(                         # for long to wide
  cSplit(cbind(rn = 1:nrow(X), X),        # start by splitting into a long form
         "x", sep = " |_", "long",
         fixed = FALSE)[,
     n := seq_len(.N) - .N, by = rn][,    # position counted back from each row's last piece (0)
     n := n - min(n) + 1L],               # shift so the last piece lands in the final column
  rn ~ n, value.var = "x", fill = "")     # formula for casting
#    rn  1    2    3      4
# 1:  1                cost
# 2:  2           reed cost
# 3:  3      cost   of living
# 4:  4           reed cost
# 5:  5 id gene   id  locus

Answer 2 (score: 0)

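Here is a base R solution. We split the input and reverse the elements of each row. Then we set the length of each row to the maximum length, which pads the short rows with NA at the end, and reverse each row back so that the NAs end up on the left: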

# test data
x <- c("cost", "reed_cost", "cost of living", "reed cost", "id gene_id locus")

s <- lapply(strsplit(x, "[ _]"), rev)
t(sapply(lapply(s, "length<-", max(sapply(s, length))), rev))

which gives this matrix:

     [,1] [,2]   [,3]   [,4]    
[1,] NA   NA     NA     "cost"  
[2,] NA   NA     "reed" "cost"  
[3,] NA   "cost" "of"   "living"
[4,] NA   NA     "reed" "cost"  
[5,] "id" "gene" "id"   "locus"
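
If a data frame with `col1` to `col4` and empty strings instead of NA is wanted, as in the question, the same idea can be wrapped in a small helper. This is only a sketch in base R: the function name `split_right` and its `pattern` argument are made up for illustration, and it pads on the left directly instead of reversing twice.

```r
# Hypothetical helper: split each string and right-align the pieces
split_right <- function(x, pattern = "[ _]") {
  pieces <- strsplit(x, pattern)            # list of character vectors
  width  <- max(lengths(pieces))            # number of output columns
  # Left-pad every row with "" up to the common width
  padded <- lapply(pieces, function(p) c(rep("", width - length(p)), p))
  out <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
  names(out) <- paste0("col", seq_len(width))
  out
}

x <- c("cost", "reed_cost", "cost of living", "reed cost", "id gene_id locus")
res <- split_right(x)
print(res)
```

Padding with `rep("", ...)` keeps every column as character, so the result can be bound back onto the original data with `cbind` if needed.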