Question

I am trying to match the following ordered and unordered lists and extract the bullets/list items.

library(stringr)
examples <- c(
"* Bullet 1\n* Bullet 2\n* Bullet 3",
"1. Bullet 1\n2. Bullet 2\n3. Bullet 3",
"* This is a test 1\n* This is a test with some *formatting*\n* This is a test with different _formatting_"
)

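What I would like to do is:

Programmatically identify that it is a list
Parse each one into the text of its list items

The result would be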

some_str_fun(example,pattern) # or multiples
"Bullet 1" "Bullet 2" "Bullet 3"
"Bullet 1" "Bullet 2" "Bullet 3"
"This is a test 1" "This is a test with some *formatting*" 
"This is a test with different _formatting_"

I have been experimenting with the following patterns and str_extract / match, but can't seem to find something that works completely

[*]+\\s(.*?)[\n]* # for * Bullet X\n
[1-9]+[.]\\s(.*?)[\n]* # for 1. Bullet X\n

I have tried many different iterations of these patterns, but can't seem to get what I'm looking for.

Answer 1

You can use strapply from the gsubfn package to match the entire pattern.
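
A minimal sketch of how strapply could be applied here, assuming the gsubfn package is installed. The pattern and the two-argument function are illustrative assumptions, not the answerer's original code: the list marker and the item text are captured as separate groups, and only the item text is kept.

```r
library(gsubfn)

examples <- c(
  "* Bullet 1\n* Bullet 2\n* Bullet 3",
  "1. Bullet 1\n2. Bullet 2\n3. Bullet 3",
  "* This is a test 1\n* This is a test with some *formatting*\n* This is a test with different _formatting_"
)

# Assumed pattern: group 1 is the list marker ("*" or "<digits>."),
# group 2 is the rest of the line (the item text).
pattern <- "(\\*|[0-9]+\\.) +([^\n]+)"

# backref = -2 passes only the two capture groups to the function;
# the function returns the item text and discards the marker.
items <- strapply(examples, pattern, function(marker, text) text, backref = -2)
```

Here items should be a list with one character vector per input string; wrapping it in unlist() flattens it into a single vector.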

Answer 2

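This is a different approach, but if you render the Markdown to HTML, you can use existing extraction tools to do what you want. Note that html_text() returns the rendered text, so emphasis markers such as * and _ are not preserved: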

library(stringr)

examples <- c(
"* Bullet 1\n* Bullet 2\n* Bullet 3",
"1. Bullet 1\n2. Bullet 2\n3. Bullet 3",
"* This is a test 1\n* This is a test with some *formatting*\n* This is a test with different _formatting_"
)

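extract_md_list <- function(md_text) {

  # rvest provides read_html(), html_nodes(), html_text() and the %>% pipe
  require(rvest)
  require(rmarkdown)

  # temporary files for the markdown source and the rendered HTML
  fil_md <- tempfile()
  fil_html <- tempfile()
  writeLines(md_text, con=fil_md)

  render(fil_md, output_format="html_document", output_file=fil_html, quiet=TRUE)

  # html() was deprecated in rvest in favour of read_html()
  pg <- read_html(fil_html)
  ret <- html_nodes(pg, "li") %>% html_text()

  # cleanup
  unlink(fil_md)
  unlink(fil_html)

  return(ret)

}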

extract_md_list(examples)

## [1] "Bullet 1"                                
## [2] "Bullet 2"                                
## [3] "Bullet 3"                                
## [4] "Bullet 1"                                
## [5] "Bullet 2"                                
## [6] "Bullet 3"                                
## [7] "This is a test 1"                        
## [8] "This is a test with some formatting"     
## [9] "This is a test with different formatting"

Answer 3

Here is another option. You can wrap it in unlist() if needed:

str_extract_all(examples, "[^*1-9\n ]\\w+( ?[\\w*]+)*")
# or 
#str_extract_all(examples, "[^*1-9\n ]\\w+( ?[a-zA-Z0-9_*]+)*")

#[[1]]
#[1] "Bullet 1" "Bullet 2" "Bullet 3"
#
#[[2]]
#[1] "Bullet 1" "Bullet 2" "Bullet 3"
#
#[[3]]
#[1] "This is a test 1"                          
#[2] "This is a test with some *formatting*"     
#[3] "This is a test with different _formatting_"

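There are several other options, especially if you don't care about getting everything in a single regex or a single line of code. Here is another approach. The regex is simpler, but you end up with "" entries, which takes an extra line to drop: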

splits <- unlist(str_split(examples, "\n|\\d+\\. |\\* "))
splits[splits != ""]
#[1] "Bullet 1"                                  
#[2] "Bullet 2"                                  
#[3] "Bullet 3"                                  
#[4] "Bullet 1"                                  
#[5] "Bullet 2"                                  
#[6] "Bullet 3"                                  
#[7] "This is a test 1"                          
#[8] "This is a test with some *formatting*"     
#[9] "This is a test with different _formatting_"
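
Both patterns above look for markers anywhere in the string. As a complementary sketch in base R, the helpers is_md_list and md_list_items (hypothetical names introduced here for illustration) work line by line with a pattern anchored at the start of each line, which also covers the detection step asked about in the question. It assumes every item is a single line beginning with "* " or "<digits>. ":

```r
examples <- c(
  "* Bullet 1\n* Bullet 2\n* Bullet 3",
  "1. Bullet 1\n2. Bullet 2\n3. Bullet 3",
  "* This is a test 1\n* This is a test with some *formatting*\n* This is a test with different _formatting_"
)

# Assumed marker pattern: "*" or "<digits>." at the start of a line,
# followed by at least one space or tab.
marker <- "^(\\*|[0-9]+\\.)[ \t]+"

# Hypothetical helper: TRUE when every non-empty line starts with a marker.
is_md_list <- function(x) {
  vapply(strsplit(x, "\n", fixed = TRUE), function(lines) {
    lines <- lines[nzchar(lines)]
    length(lines) > 0 && all(grepl(marker, lines))
  }, logical(1))
}

# Hypothetical helper: item text of each marked line, with the marker removed.
md_list_items <- function(x) {
  lapply(strsplit(x, "\n", fixed = TRUE), function(lines) {
    lines <- lines[grepl(marker, lines)]
    sub(marker, "", lines)
  })
}

is_list <- is_md_list(examples)
items <- md_list_items(examples)
```

Because the marker is only removed at the start of a line, emphasis such as *formatting* inside an item is left untouched, and items that begin with a digit are not truncated.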

In R, how to match Markdown lists

3 answers: