Splitting one long character string into multiple columns

Time: 2018-05-31 10:00:36

Tags: r regex

I have this data frame:

df1 <- data.frame(Note = c("Profit before tax 240 tSEK",
                           "Earnings per share 0.240 " ,
                           "Ali de Margin 37 %"),
                  Line = c(6, 2, 2))

I would like to get the following:

Note                 Val    Unit    Line
Profit before tax    240    tSEK    6
Earnings per share   0.240          2
Ali de Margin        37     %       2

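How can I do this?

3 answers:

Answer 0 (score: 3)

You can use the tstrsplit function to split the variable Note on a space that either comes before a number (with or without a dot) or comes after a digit, using a regex with lookarounds: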

library(data.table)
setDT(df1)[, c("Note", "Val", "Unit"):=tstrsplit(Note, "( (?=[0-9.]+))|((?<=\\d) )", perl=TRUE)]
df1
#                 Note Line   Val Unit
#1:  Profit before tax    6   240 tSEK
#2: Earnings per share    2 0.240   NA
#3:      Ali de Margin    2    37    %
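
For readers without data.table, the same lookaround split can be sketched in base R. This is only an illustration under the assumption that every Note has at most three pieces; the names parts, mat and res are made up for this example, and Val is converted to numeric here, which the answer above does not do.

```r
# Same sample data as in the question (kept as character, not factor).
df1 <- data.frame(Note = c("Profit before tax 240 tSEK",
                           "Earnings per share 0.240 ",
                           "Ali de Margin 37 %"),
                  Line = c(6, 2, 2),
                  stringsAsFactors = FALSE)

# Split on a space followed by a digit or dot, or on a space preceded by a digit.
parts <- strsplit(df1$Note, "( (?=[0-9.]+))|((?<=\\d) )", perl = TRUE)

# Pad each vector of pieces to length 3 so rows without a unit get NA.
parts <- lapply(parts, function(p) { length(p) <- 3; p })
mat <- do.call(rbind, parts)

res <- data.frame(Note = mat[, 1],
                  Val  = as.numeric(mat[, 2]),  # assumption: numeric values are wanted
                  Unit = mat[, 3],
                  Line = df1$Line,
                  stringsAsFactors = FALSE)
res
```

As in the data.table version, the row without a unit ends up with NA in Unit rather than an empty string.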

Answer 1 (score: 1)

You can also use the regexpr and regmatches functions:

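# Work on a plain character vector (Note may be a factor in older R versions).
txt <- as.character(df1$Note)

# Locate the first number: digits with an optional decimal part.
pattern <- regexpr("[[:digit:]]+(\\.[[:digit:]]+)?", txt)
note <- substr(txt, 1, pattern - 2)   # text before the number, minus the space
value <- regmatches(txt, pattern)     # the number itself
unit <- substr(txt,
               pattern + attr(pattern, "match.length") + 1,
               nchar(txt))            # text after the number and one space

result <- data.frame(note = note, value = value, unit = unit, line = df1$Line)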

#                note value unit line
#1  Profit before tax   240 tSEK    6
#2 Earnings per share 0.240         2
#3      Ali de Margin    37    %    2
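
To see where the offsets in this approach come from, it may help to look at what regexpr returns for a single string. The following is a minimal sketch on one value from the question; the variable names x, m, start and len are chosen for illustration only, and the pattern is one possible way to write "digits with an optional decimal part".

```r
x <- "Profit before tax 240 tSEK"

# regexpr returns the start position of the first match and stores
# the match length in the "match.length" attribute.
m <- regexpr("[[:digit:]]+(\\.[[:digit:]]+)?", x)
start <- as.integer(m)            # 19
len <- attr(m, "match.length")    # 3

note <- substr(x, 1, start - 2)                  # "Profit before tax"
value <- substr(x, start, start + len - 1)       # "240"
unit <- substr(x, start + len + 1, nchar(x))     # "tSEK"
```

The "- 2" and "+ 1" skip the single space on each side of the number, so this sketch assumes exactly one space separates the pieces.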

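Answer 2 (score: 0)

One solution is to use tidyr::extract. The extract function gives you the flexibility to define a regex with capture groups and to separate one column into multiple columns.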

library(tidyr)

extract(df1, Note, into = c("Note", "Val", "Unit"),
                regex = "^([[:alpha:][:blank:]]+)\\s([[:digit:].]+)(.*)")

#                 Note   Val  Unit Line
# 1  Profit before tax   240  tSEK    6
# 2 Earnings per share 0.240          2
# 3      Ali de Margin    37     %    2
**Regex explanation:**

^([[:alpha:][:blank:]]+)  -- Group 1 => One or more letters/spaces
\\s                       -- A single whitespace character between Group 1 and Group 2
([[:digit:].]+)           -- Group 2 => One or more digits or dots
(.*)                      -- Group 3 => Anything after Group 2 up to the end of the string
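
The three capture groups can also be inspected in base R with regexec and regmatches. This is a hedged sketch, not part of the tidyr answer: it assumes every string matches the pattern, and it trims Group 3 with trimws because (.*) also captures the space that follows the number. The names notes, groups and out are illustrative.

```r
notes <- c("Profit before tax 240 tSEK",
           "Earnings per share 0.240 ",
           "Ali de Margin 37 %")

pattern <- "^([[:alpha:][:blank:]]+)\\s([[:digit:].]+)(.*)"

# Each list element holds the full match followed by the three groups.
m <- regmatches(notes, regexec(pattern, notes, perl = TRUE))
groups <- do.call(rbind, m)   # 3 x 4 character matrix; column 1 is the full match

out <- data.frame(Note = groups[, 2],
                  Val  = as.numeric(groups[, 3]),
                  Unit = trimws(groups[, 4]),   # drop the leading/trailing space
                  stringsAsFactors = FALSE)
out
```

Without the trimws call, Unit would hold " tSEK", " " and " %", which is easy to miss in printed output because the column is right-aligned.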