Question

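I have two columns in a larger data frame that I am having trouble splitting. In the past I have used strsplit when I could split on a "space", a "," or some other delimiter. The hard part here is that I don't want to lose any information, and whenever I split on part of the string I end up losing it. I would like to end up with four columns. Here is a sample of a few rows of what I have now.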

age-gen  surv-camp
45M      1LC
9F       0
12M      1AC
67M      1LC

This is what I would like to end up with.

age   gen   surv   camp
45    M     1      LC
9     F     0      
12    M     1      AC
67    M     1      LC

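I have done a lot of searching on here and found a number of answers for Java, C++, HTML and so on, but I have not found anything that explains how to do this in R when some of the data is missing.

I saw this post about adding a space between the values and then simply splitting on the space, but I don't see how that would work 1) with missing data, or 2) when I don't have a consistent number of digits or characters in each row.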

Answer 1

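We loop through the columns of df1 with lapply, use sub to insert a delimiter after the numeric substring, read the resulting vector as a data.frame with read.table, cbind the list of data.frames, and change the column names of the output.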

res <- do.call(cbind, lapply(df1, function(x)
      read.table(text=sub("(\\d+)", "\\1,", x), 
          header=FALSE, sep=",", stringsAsFactors=FALSE)))
colnames(res) <- scan(text=names(df1), sep=".", what="", quiet = TRUE)
res
#  age gen surv camp
#1  45   M    1   LC
#2   9   F    0     
#3  12   M    1   AC
#4  67   M    1   LC
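To make the intermediate step visible, here is a minimal sketch of what sub and read.table do to a single column. The names x, with_sep and parts are only for illustration and are not part of the answer above.

```r
# One column from the sample data, including the row with no letters.
x <- c("1LC", "0", "1AC", "1LC")

# Insert a comma after the first run of digits. The digits are captured
# and written back via \\1, so no character is dropped.
with_sep <- sub("(\\d+)", "\\1,", x)
print(with_sep)

# Read the comma-separated strings as a two-column data.frame.
# The row "0," has an empty second field instead of a missing row.
parts <- read.table(text = with_sep, header = FALSE, sep = ",",
                    stringsAsFactors = FALSE)
print(parts)
```

Because the pattern only adds a comma and never removes anything, the row without a letter code still contributes two fields, which keeps the rows aligned.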

Or use separate from tidyr

library(tidyr)
library(dplyr)
separate(df1, age.gen, into = c("age", "gen"), "(?<=\\d)(?=[A-Za-z])", convert= TRUE) %>% 
       separate(surv.camp, into = c("surv", "camp"), "(?<=\\d)(?=[A-Za-z])", convert = TRUE)
#  age gen surv camp
#1  45   M    1   LC
#2   9   F    0 <NA>
#3  12   M    1   AC
#4  67   M    1   LC
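The pattern "(?<=\\d)(?=[A-Za-z])" matches the empty position between a digit and a letter, so splitting there consumes no characters. As a rough illustration only, the same kind of lookaround can be tried with base R's strsplit and perl = TRUE; the vector x and the name pieces below are assumptions made for this sketch.

```r
# Sample values, including one with digits only.
x <- c("45M", "9F", "12M", "1LC", "0")

# (?<=[0-9]) is a lookbehind: the previous character must be a digit.
# (?=[a-zA-Z]) is a lookahead: the next character must be a letter.
# Both are zero-width, so the split point sits between characters.
pieces <- strsplit(x, "(?<=[0-9])(?=[a-zA-Z])", perl = TRUE)
print(pieces)

# Elements without a digit-letter boundary come back as a single piece.
print(lengths(pieces))
```

An element such as "0" has no digit-to-letter boundary, so it comes back as one piece; this is why separate fills the missing camp value with NA in the output above.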

Or, as @Frank mentioned, we can use tstrsplit from data.table

library(data.table)
setDT(df1)[, unlist(lapply(.SD, function(x) 
    tstrsplit(x, "(?<=[0-9])(?=[a-zA-Z])", perl=TRUE, 
                        type.convert=TRUE)), recursive = FALSE)]

Edit: Added convert = TRUE in separate so that the column types are changed after the split.

Data

df1 <- structure(list(age.gen = c("45M", "9F", "12M", "67M"), surv.camp = c("1LC", 
 "0", "1AC", "1LC")), .Names = c("age.gen", "surv.camp"), 
class = "data.frame", row.names = c(NA, -4L))

Splitting a string without losing characters - R
