I have the following df:
df = data.frame(id = c(1,2,3), text = c('Label issues as ISS101 and ISS 201 on label 23 with x203 17','issue as ISS5051 with label 01 as l018','there is nothing here'))
I want to extract from df and create the following data frame:
id iss label ext1 ext2
1 ISS101 23 x203 17
1 ISS201 23 x203 17
2 ISS5051 01 l018 NA
3 NA NA NA NA
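The length of iss can vary, as the example shows. There may or may not be a space between "ISS" and the digits that follow, again as in the example. The lengths of label, ext1 and ext2 are fixed. I have tried various possibilities using stringr and regular expressions with dplyr, but none of them came close to a solution, so they are not worth listing here. I'd appreciate any help; if you need more details, please let me know.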
Answer 0 (score: 2)
You can use dplyr and stringr (plus tidyr for unnest) like this:
library(dplyr)
library(stringr)
library(tidyr)  # provides unnest()
df2 <- df %>%
  mutate(
    # remove the space after ISS, then extract every ISSnnn
    iss = str_extract_all(str_replace_all(text, "ISS\\s+(\\d+)", "ISS\\1"), "ISS\\d+"),
    label = str_match(text, "label\\s+(\\d+)")[, 2],             # digits after "label"
    ext1 = str_match(text, "label\\s+\\d+.*?([a-z]\\d+)")[, 2],  # letter + digits after the label
    ext2 = str_match(text, "\\s(\\d+)$")[, 2]                    # digits at end of string
  ) %>%
  unnest(iss) %>%                 # one row per iss; rows with no iss are dropped
  select(iss, label, ext1, ext2)  # keep the wanted variables
df2
iss label ext1 ext2
1 ISS101 23 x203 17
2 ISS201 23 x203 17
3 ISS5051 01 l018 <NA>
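The output above has no id column and no row for id 3, which contains no ISS code. The following is a minimal sketch of one way to keep both, assuming tidyr 1.0.0 or later, where `unnest()` accepts a `keep_empty` argument. The name `df3` is only for illustration and is not part of the original answer.

```r
library(dplyr)
library(stringr)
library(tidyr)

df <- data.frame(
  id = c(1, 2, 3),
  text = c('Label issues as ISS101 and ISS 201 on label 23 with x203 17',
           'issue as ISS5051 with label 01 as l018',
           'there is nothing here'),
  stringsAsFactors = FALSE
)

df3 <- df %>%
  mutate(
    # drop the optional space after ISS, then pull out every ISSnnn
    iss   = str_extract_all(str_replace_all(text, "ISS\\s+(\\d+)", "ISS\\1"), "ISS\\d+"),
    label = str_match(text, "label\\s+(\\d+)")[, 2],
    ext1  = str_match(text, "label\\s+\\d+.*?([a-z]\\d+)")[, 2],
    ext2  = str_match(text, "\\s(\\d+)$")[, 2]
  ) %>%
  # keep_empty = TRUE (assumes tidyr >= 1.0.0) turns an empty match list into one NA row
  unnest(iss, keep_empty = TRUE) %>%
  select(id, iss, label, ext1, ext2)

df3
```

With `keep_empty = TRUE`, the row for id 3 survives with NA in every extracted column, which is the shape the question asks for.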
Answer 1 (score: 0)
This might be a start:
do.call(plyr::rbind.fill,
  lapply(df$text, function(x) {
    as.data.frame(cbind(
      iss = unlist(stringr::str_extract_all(x, "(ISS\\s?\\d{3,4})")),
      label = unlist(stringr::str_extract_all(x, "(?<=label)\\s?(\\d{1,2})")),
      ext1 = unlist(stringr::str_extract_all(x, "((x|l)\\d{3})")),
      ext2 = unlist(stringr::str_extract_all(x, "(?<=x|l\\d{3})\\s?\\d{1,3}"))
    ))
  })
)
iss label ext1 ext2
1 ISS101 23 x203 203
2 ISS 201 23 x203 203
3 ISS5051 01 l018 <NA>
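In the output above, ext2 comes out as 203 rather than 17, and "ISS 201" keeps its space. One likely cause is that the lookbehind `(?<=x|l\\d{3})` reads as "x" or "l plus three digits", so the digits directly after "x" qualify. The sketch below shows one possible adjustment on the first example string. The revised patterns are my assumption, not part of the original answer, and are only checked against the sample text.

```r
library(stringr)

x <- 'Label issues as ISS101 and ISS 201 on label 23 with x203 17'

# Assumed fix for ext2: require the whole ext1 token ([xl] plus three digits)
# and a whitespace character before the digits, then trim that whitespace.
ext2 <- str_trim(str_extract(x, "(?<=[xl]\\d{3})\\s\\d{1,3}"))

# Assumed fix for iss: extract with the optional space, then remove any whitespace.
iss <- str_remove_all(unlist(str_extract_all(x, "ISS\\s?\\d{3,4}")), "\\s")

ext2
iss
```

The character class `[xl]` keeps the lookbehind at a fixed width of four characters, which avoids the ambiguity of the top-level alternation.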
Answer 2 (score: 0)
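I've done my best based on your description. Without seeing more data I can't guarantee this will generalize, but it produces the desired output for the df you provided, so it should be a good start.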
library(dplyr)
library(stringr)
library(tidyr)  # provides gather()
# create data frame
df = data.frame(id = c(1,2,3), text = c('Label issues as ISS101 and ISS 201 on label 23 with x203 17','issue as ISS5051 with label 01 as l018','there is nothing here'))
# parse text into fields
df <- df %>% mutate(
  iss = str_extract(text, "ISS\\d+\\D"),
  iss_space = str_extract(text, "ISS\\s\\d+\\D"),
  label = str_extract(text, "label.+\\D"),
  label = str_trim(str_extract(label, "\\d+\\D")),     # trim the whitespace matched by \\D
  ext1 = str_trim(str_extract(text, "\\s\\D\\d{3}")),  # trim the leading whitespace
  ext2 = str_extract(text, "\\s\\D\\d{3}\\s\\d{2}"),
  ext2 = str_trim(str_extract(ext2, "\\s\\d{2}")))     # trim the leading whitespace
# clean up into correct format
df <- df %>%
  gather(iss, iss_space, key = "type", value = "iss") %>%
  select(-text, -type) %>%
  distinct() %>%
  filter(!(duplicated(id) & is.na(iss))) %>%
  arrange(id) %>%
  select(id, iss, label, ext1, ext2) %>%
  mutate(iss = str_replace_all(iss, " ", ""))
df
id iss label ext1 ext2
1 1 ISS101 23 x203 17
2 1 ISS201 23 x203 17
3 2 ISS5051 01 l018 <NA>
4 3 <NA> <NA> <NA> <NA>