How to extract text using delimiters when some delimiters are missing

Asked: 2018-12-18 12:40:05

Tags: r

I am trying to extract text from a semi-structured text document based on the headings it contains.

Input

Column<-"Order:1223442 Subject:History Name Bilbo Johnson Grade: Bad Report: Need to complete Conclusion: Dud"

The desired output here is

Order     Subject Name           Grade  Report           Conclusion
1223442   History Bilbo Johnson   Bad   Need to complete  Dud

I can achieve this with the following (messy but working) function:

library(magrittr)  # provides the %>% pipe used below
dataframeIn<-data.frame(Column,stringsAsFactors=FALSE)
delim<-c("Order","Subject","Name","Grade","Report","Conclusion")


Extractor <- function(dataframeIn, Column, delim) {
  dataframeInForLater<-dataframeIn
  ColumnForLater<-Column
  Column <- rlang::sym(Column)
  dataframeIn <- data.frame(dataframeIn)
  dataframeIn<-dataframeIn %>%
    tidyr::separate(!!Column, into = c("added_name",delim),
                                          sep = paste(delim, collapse = "|"),
                    extra = "drop", fill = "right")
  names(dataframeIn) <- gsub(".", "", names(dataframeIn), fixed = TRUE)

  dataframeIn<-data.frame(dataframeIn)
  #Add the original column back in so have the original reference
  dataframeIn<-cbind(dataframeInForLater[,ColumnForLater],dataframeIn)
  dataframeIn<-data.frame(dataframeIn)
  return(dataframeIn)
}

Extractor(dataframeIn, "Column", delim)

However, sometimes a delimiter is missing, for example

Order:1223442 Subject:History Name Bilbo Johnson Grade: Bad Conclusion: Dud

In this case, the desired output is

Order     Subject Name           Grade  Conclusion
1223442   History Bilbo Johnson   Bad    Dud

But the actual output becomes:

 Order   Subject            Name   Grade Report Conclusion
:1223442  :History   Bilbo Johnson  : Bad    : Dud       <NA>

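How can I account for missing delimiters, given that the ones present always appear in the same order? This should cover delimiters missing from the middle of the text, as in the example above, as well as ones missing from the end.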

1 Answer:

Answer 0: (score: 0)

We can do the following (this is only the text extraction; I'll leave constructing the output to you):

library(stringr)
Extractor <- function(x, delim) {
  pattern <- paste0(delim, ":{0,1}(.*?)(", paste(c(delim, "$"), collapse = "|"), ")")
  trimws(str_match(x, pattern)[, 2])
}
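# Column1 and Column2 are the two example strings from the question;
# delim is the vector of headings defined there.
Column1 <- "Order:1223442 Subject:History Name Bilbo Johnson Grade: Bad Report: Need to complete Conclusion: Dud"
Column2 <- "Order:1223442 Subject:History Name Bilbo Johnson Grade: Bad Conclusion: Dud"
Extractor(Column1, delim)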
# [1] "1223442"          "History"          "Bilbo Johnson"    "Bad"              "Need to complete" "Dud"
Extractor(Column2, delim)
# [1] "1223442"       "History"       "Bilbo Johnson" "Bad"           NA              "Dud"
Column3 <- "Subject:History Name Bilbo Johnson"
Extractor(Column3, delim)
# [1] NA              "History"       "Bilbo Johnson" NA              NA              NA

Thanks to the NAs, it is immediately clear which delimiters are missing.
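One possible way to build the tabular output that the answer leaves to the reader is sketched below. It is only an illustration under some assumptions: `extract_fields` is a hypothetical base-R stand-in for the answer's `Extractor` (written so the sketch runs without stringr), and `rows`, `result`, `second`, and `second_trimmed` are names introduced here, not taken from the question.

```r
# Hypothetical helper: a base-R equivalent of the answer's Extractor.
# Returns a character vector named by delimiter, with NA where one is missing.
extract_fields <- function(x, delim) {
  stops <- paste(c(delim, "$"), collapse = "|")
  vapply(delim, function(d) {
    pattern <- paste0(d, ":?(.*?)(?:", stops, ")")
    m <- regmatches(x, regexec(pattern, x, perl = TRUE))[[1]]
    if (length(m) == 0) NA_character_ else trimws(m[2])
  }, character(1))
}

delim <- c("Order", "Subject", "Name", "Grade", "Report", "Conclusion")
rows <- c(
  "Order:1223442 Subject:History Name Bilbo Johnson Grade: Bad Report: Need to complete Conclusion: Dud",
  "Order:1223442 Subject:History Name Bilbo Johnson Grade: Bad Conclusion: Dud"
)

# One row per input string, one column per delimiter.
result <- as.data.frame(
  do.call(rbind, lapply(rows, extract_fields, delim = delim)),
  stringsAsFactors = FALSE
)

# For a single record, drop the columns whose delimiter was absent.
second <- result[2, , drop = FALSE]
second_trimmed <- second[, !is.na(unlist(second)), drop = FALSE]
```

Keeping the NA columns in `result` lets records with different missing headings share one data frame; dropping them, as in `second_trimmed`, only makes sense per record.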

In your case, the way it works is that we have a series of patterns, one per delimiter:

pattern
# [1] "Order:{0,1}(.*?)(Order|Subject|Name|Grade|Report|Conclusion|$)"     
# [2] "Subject:{0,1}(.*?)(Order|Subject|Name|Grade|Report|Conclusion|$)"   
# [3] "Name:{0,1}(.*?)(Order|Subject|Name|Grade|Report|Conclusion|$)"      
# [4] "Grade:{0,1}(.*?)(Order|Subject|Name|Grade|Report|Conclusion|$)"     
# [5] "Report:{0,1}(.*?)(Order|Subject|Name|Grade|Report|Conclusion|$)"    
# [6] "Conclusion:{0,1}(.*?)(Order|Subject|Name|Grade|Report|Conclusion|$)"

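str_match then conveniently extracts the (.*?) part into the second column of its output, and trimws strips any surrounding whitespace. Note that (.*?) uses lazy matching, so the capture stops at the next delimiter (or the end of the string) instead of matching too much.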
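To see why the lazy quantifier matters, here is a small base-R comparison of lazy and greedy capture. It is a sketch rather than part of the answer's code: the shortened input `x`, the reduced stop list `stops`, and the variable names are assumptions made for illustration.

```r
x <- "Order:1223442 Subject:History Name Bilbo Johnson"
stops <- "(?:Subject|Name|$)"  # shortened list of stop words for the demo

# Lazy: (.*?) grows one character at a time until a stop word can match.
lazy <- regmatches(x, regexec(paste0("Order:?(.*?)", stops), x, perl = TRUE))[[1]][2]

# Greedy: (.*) grabs everything first, so only the end-of-string anchor matches.
greedy <- regmatches(x, regexec(paste0("Order:?(.*)", stops), x, perl = TRUE))[[1]][2]

lazy_value <- trimws(lazy)      # "1223442"
greedy_value <- trimws(greedy)  # "1223442 Subject:History Name Bilbo Johnson"
```

With the greedy form, every field would swallow the rest of the line, which is exactly the over-matching the lazy quantifier avoids.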