Split a string in a data frame into two columns

Time: 2016-06-25 16:46:44

Tags: r dataframe

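In R, I converted a DocumentTermMatrix of 4-grams to a data frame, and now I want to split each n-gram into two columns: one holding the first three words of the string and one holding the last word. I can do this in multiple steps, but given the size of the df, I would like to do it in a single step.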

This is what I want to accomplish:

#             str_name           w123   w4 freq
# 1 One Two Three Four One Two Three  Four   10        

This gives me the first three words, but leaves w4 empty:

library(tidyr)  # provides separate() and re-exports %>%

df <- data.frame(str_name = "One Two Three Four", freq = 10)
df %>% separate(str_name, c("w123","w4"), sep = "\\w+$", remove=FALSE)

#             str_name           w123 w4 freq
# 1 One Two Three Four One Two Three       10

This gives me the last word, but it also includes a leading space:

df <- data.frame(str_name = "One Two Three Four", freq = 10)
df %>% separate(str_name, c("sp","w4"), sep = "\\w+\\s\\w+\\s\\w+", remove=FALSE)

#             str_name sp    w4 freq
# 1 One Two Three Four     Four   10

This is the long way:

df <- data.frame(w4 = "One Two Three Four", freq = 10)
df <- df %>% separate(w4, c('w1', 'w2', 'w3', 'w4'), " ")
df$lookup <- paste(df$w1,df$w2,df$w3)

#      w1    w2    w3       w4 freq        lookup
# 1   One   Two Three     Four   10 One Two Three

2 Answers:

Answer 0 (score: 4)

Try \\s(?=\\w+$), which matches the whitespace character immediately before the last word of the string and splits there:

df %>% separate(str_name, into = c("w123", "w4"), sep = "\\s(?=\\w+$)", remove = FALSE)
#             str_name          w123   w4 freq
# 1 One Two Three Four One Two Three Four   10

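\\s(?=[\\S]+$) is another option. It is more permissive than the pattern above: it splits at the last whitespace character in the string, even when the final token contains non-word characters such as punctuation.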

df %>% separate(str_name, into = c("w123", "w4"), sep = "\\s(?=[\\S]+$)", remove = FALSE)
#             str_name          w123   w4 freq
# 1 One Two Three Four One Two Three Four   10
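
To see where the two patterns differ, here is a minimal sketch using base R's strsplit() with perl = TRUE rather than separate(). The input vector x and the trailing period in its second element are made up for illustration and do not come from the question's data.

```r
# Hypothetical inputs: the second string ends in a non-word character
x <- c("One Two Three Four", "One Two Three Four.")

# Pattern 1: whitespace followed only by word characters up to the end
p1 <- strsplit(x, "\\s(?=\\w+$)", perl = TRUE)

# Pattern 2: whitespace followed only by non-whitespace up to the end
p2 <- strsplit(x, "\\s(?=\\S+$)", perl = TRUE)

p1[[2]]  # not split: the final "." is not matched by \w
p2[[2]]  # split into "One Two Three" and "Four."
```

If the n-grams can contain punctuation or hyphens, the second pattern is probably the safer choice; for purely alphanumeric tokens the two behave the same.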

Answer 1 (score: 0)

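We can use a base R approach to solve this (it assumes str_name itself contains no commas, since a comma is used as the temporary delimiter):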

res <- cbind(df, read.table(text=sub("\\s(\\S+)$", ",\\1", df$str_name), 
  sep=",", header=FALSE, col.names = c("w123", "w4"), stringsAsFactors=FALSE))[c(1,3,4,2)]
res
#            str_name          w123   w4 freq
#1 One Two Three Four One Two Three Four   10
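
As a variation on the base R idea, the sketch below avoids the temporary comma delimiter by calling sub() twice. This is an alternative, not the answer's own method, and it assumes every string contains at least one whitespace character; the column order also differs from the desired output (the new columns are appended at the end).

```r
df <- data.frame(str_name = "One Two Three Four", freq = 10,
                 stringsAsFactors = FALSE)

# Drop the last whitespace run and the final token to keep the leading words
df$w123 <- sub("\\s+\\S+$", "", df$str_name)

# Greedy .* consumes everything up to the last whitespace, leaving the final token
df$w4 <- sub("^.*\\s", "", df$str_name)

df
```

Both calls are vectorized over the column, so no intermediate four-column split is needed.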