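How to match Markdown lists in R

Time: 2015-07-12 23:05:03

Tags: regex r stringr

I am trying to match the following ordered and unordered lists and extract the text of each bullet / list item.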

library(stringr)
examples <- c(
"* Bullet 1\n* Bullet 2\n* Bullet 3",
"1. Bullet 1\n2. Bullet 2\n3. Bullet 3",
"* This is a test 1\n* This is a test with some *formatting*\n* This is a test with different _formatting_"
)

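What I want to do is:

  1. Programmatically identify that the string is a list
  2. Parse each string into the text of its list items
  3. The result would be: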

    some_str_fun(example,pattern) # or multiples
    "Bullet 1" "Bullet 2" "Bullet 3"
    "Bullet 1" "Bullet 2" "Bullet 3"
    "This is a test 1" "This is a test with some *formatting*" 
    "This is a test with different _formatting_"
    

I have been using the following patterns with str_extract / str_match, but I can't seem to find one that fully works:

    [*]+\\s(.*?)[\n]* # for * Bullet X\n
    [1-9]+[.]\\s(.*?)[\n]* # for 1. Bullet X\n
    

I have tried many different iterations of these patterns, but I can't seem to get what I'm looking for.

3 Answers:

Answer 0: (score: 3)

You can use strapply from the gsubfn package to match the whole pattern.

Answer 1: (score: 2)

This is a different approach, but if you render the markdown to HTML, you can then use existing extraction methods to do what you want:

library(stringr)

examples <- c(
"* Bullet 1\n* Bullet 2\n* Bullet 3",
"1. Bullet 1\n2. Bullet 2\n3. Bullet 3",
"* This is a test 1\n* This is a test with some *formatting*\n* This is a test with different _formatting_"
)

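extract_md_list <- function(md_text) {

  # rvest parses the HTML; rmarkdown renders the markdown
  require(rvest)
  require(rmarkdown)

  # explicit extensions so render() writes exactly to fil_html
  fil_md <- tempfile(fileext=".md")
  fil_html <- tempfile(fileext=".html")
  writeLines(md_text, con=fil_md)

  render(fil_md, output_format="html_document", output_file=fil_html, quiet=TRUE)

  # read_html() replaces the deprecated html()
  pg <- read_html(fil_html)
  ret <- html_nodes(pg, "li") %>% html_text()

  # cleanup
  unlink(fil_md)
  unlink(fil_html)

  return(ret)

}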

extract_md_list(examples)

## [1] "Bullet 1"                                
## [2] "Bullet 2"                                
## [3] "Bullet 3"                                
## [4] "Bullet 1"                                
## [5] "Bullet 2"                                
## [6] "Bullet 3"                                
## [7] "This is a test 1"                        
## [8] "This is a test with some formatting"     
## [9] "This is a test with different formatting"
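
The same render-then-extract idea can be sketched without temporary files. This is only a sketch: it assumes the commonmark package is installed (it is not used anywhere above), and extract_md_list2 is a hypothetical name. As with the version above, inline emphasis markers are dropped because the text is read back out of the rendered HTML.

```r
library(commonmark)
library(rvest)

examples <- c(
"* Bullet 1\n* Bullet 2\n* Bullet 3",
"1. Bullet 1\n2. Bullet 2\n3. Bullet 3",
"* This is a test 1\n* This is a test with some *formatting*\n* This is a test with different _formatting_"
)

# hypothetical helper: render one markdown string in memory, return its <li> text
extract_md_list2 <- function(md_text) {
  html <- markdown_html(md_text)          # markdown string -> HTML string
  pg <- read_html(html)                   # parse the HTML string
  html_text(html_nodes(pg, "li"))         # text of every list item
}

# one character vector per example, so the grouping is preserved
items_html <- lapply(examples, extract_md_list2)
```

A string that contains no list yields character(0), which also gives a rough "is this a list" check.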

Answer 2: (score: 1)

Here is another option. You can wrap it in unlist if needed.

str_extract_all(examples, "[^*1-9\n ]\\w+( ?[\\w*]+)*")
# or 
#str_extract_all(examples, "[^*1-9\n ]\\w+( ?[a-zA-Z0-9_*]+)*")

#[[1]]
#[1] "Bullet 1" "Bullet 2" "Bullet 3"
#
#[[2]]
#[1] "Bullet 1" "Bullet 2" "Bullet 3"
#
#[[3]]
#[1] "This is a test 1"                          
#[2] "This is a test with some *formatting*"     
#[3] "This is a test with different _formatting_"
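
For comparison, here is a sketch that keeps the capture-group style of the patterns in the question. Those patterns return empty captures because the lazy (.*?) is followed only by [\n]*, which is allowed to match nothing, so the group never has to consume any text. Anchoring each match to a whole line avoids that, and does not depend on excluding characters, so an item that begins with a digit is not truncated. The marker set (*, +, - or digits followed by a dot) is my assumption about which list syntax should be accepted.

```r
library(stringr)

examples <- c(
"* Bullet 1\n* Bullet 2\n* Bullet 3",
"1. Bullet 1\n2. Bullet 2\n3. Bullet 3",
"* This is a test 1\n* This is a test with some *formatting*\n* This is a test with different _formatting_"
)

# (?m) lets ^ and $ match at every line; the only capture group is the item text
pattern <- "(?m)^(?:[*+-]|\\d+\\.)[ \\t]+(.*)$"

# str_match_all returns one matrix per string; column 2 holds the capture group
items <- lapply(str_match_all(examples, pattern), function(m) m[, 2])

# TRUE when a string contains at least one list-item line
has_items <- str_detect(examples, pattern)
```

Each element of items is a character vector for one input string, and the inline *formatting* markers are kept because only the leading marker is outside the capture group.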

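There are several other options, especially if you don't need to do everything in a single regex or a single line of code. Here is another approach. The regex is simpler, but you end up with empty strings (""), which take an extra line to remove: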

splits <- unlist(str_split(examples, "\n|\\d+\\. |\\* "))
splits[splits != ""]
#[1] "Bullet 1"                                  
#[2] "Bullet 2"                                  
#[3] "Bullet 3"                                  
#[4] "Bullet 1"                                  
#[5] "Bullet 2"                                  
#[6] "Bullet 3"                                  
#[7] "This is a test 1"                          
#[8] "This is a test with some *formatting*"     
#[9] "This is a test with different _formatting_"
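
The split approach flattens everything into one vector and does not check whether the input is a list at all. A line-by-line variant in base R can cover both the "identify" and "parse" steps from the question. This is a minimal sketch: is_md_list and md_list_items are hypothetical names, and the marker pattern is an assumption about which list markers to accept.

```r
examples <- c(
"* Bullet 1\n* Bullet 2\n* Bullet 3",
"1. Bullet 1\n2. Bullet 2\n3. Bullet 3",
"* This is a test 1\n* This is a test with some *formatting*\n* This is a test with different _formatting_"
)

# assumed marker syntax: *, + or -, or digits followed by . or ), then whitespace
marker <- "^\\s*(?:[*+-]|\\d+[.)])\\s+"

# hypothetical helper: TRUE when every non-blank line starts with a list marker
is_md_list <- function(x) {
  lines <- strsplit(x, "\n", fixed = TRUE)[[1]]
  lines <- lines[nzchar(trimws(lines))]
  length(lines) > 0 && all(grepl(marker, lines, perl = TRUE))
}

# hypothetical helper: keep the list-item lines and strip the leading marker
md_list_items <- function(x) {
  lines <- strsplit(x, "\n", fixed = TRUE)[[1]]
  lines <- lines[grepl(marker, lines, perl = TRUE)]
  sub(marker, "", lines, perl = TRUE)
}

is_list <- vapply(examples, is_md_list, logical(1), USE.NAMES = FALSE)
parsed <- lapply(examples, md_list_items)
```

Because only the leading marker is removed, inline markup such as *formatting* and _formatting_ stays in the item text, and the result keeps one vector per input string.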