I am trying to match the following ordered and unordered lists and extract the bullet / list item text.
library(stringr)
examples <- c(
"* Bullet 1\n* Bullet 2\n* Bullet 3",
"1. Bullet 1\n2. Bullet 2\n3. Bullet 3",
"* This is a test 1\n* This is a test with some *formatting*\n* This is a test with different _formatting_"
)
What I would like is a call of the following form, with this result:
some_str_fun(example,pattern) # or multiples
"Bullet 1" "Bullet 2" "Bullet 3"
"Bullet 1" "Bullet 2" "Bullet 3"
"This is a test 1" "This is a test with some *formatting*"
"This is a test with different _formatting_"
I have been trying the following patterns with str_extract / str_match, but I can't seem to find one that works completely:
[*]+\\s(.*?)[\n]* # for * Bullet X\n
[1-9]+[.]\\s(.*?)[\n]* # for 1. Bullet X\n
I have tried many different variations of these patterns, but I can't seem to get what I am looking for.
Answer 0 (score: 3)
You can use strapply from the gsubfn package to match the entire pattern.
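A minimal sketch of how strapply could be applied here, assuming the gsubfn package is installed. The pattern and the explicit backref = -1 are illustrative assumptions, not necessarily what the answerer used: the marker is matched loosely as a run of *, digits, or dots followed by spaces, and only the captured item text is passed to the function.

```r
library(gsubfn)

examples <- c(
  "* Bullet 1\n* Bullet 2\n* Bullet 3",
  "1. Bullet 1\n2. Bullet 2\n3. Bullet 3",
  "* This is a test 1\n* This is a test with some *formatting*\n* This is a test with different _formatting_"
)

# Assumed pattern: a marker made of "*", digits or ".", then spaces,
# then a capture group holding the rest of the line.
item_pattern <- "[*0-9.]+ +([^\n]+)"

# backref = -1 passes only the capture group (not the full match) to c(),
# so each list element is a character vector of item texts.
items <- strapply(examples, item_pattern, c, backref = -1)
```

Wrapping the result in unlist() would give one flat character vector instead of one vector per input string.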
Answer 1 (score: 2)
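This is a different approach, but if you render the markdown to HTML, you can use existing extraction tools to do what you want. Note that rendering drops the inline emphasis markers, so *formatting* and _formatting_ come back as plain text: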
library(stringr)
examples <- c(
"* Bullet 1\n* Bullet 2\n* Bullet 3",
"1. Bullet 1\n2. Bullet 2\n3. Bullet 3",
"* This is a test 1\n* This is a test with some *formatting*\n* This is a test with different _formatting_"
)
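extract_md_list <- function(md_text) {
  require(rvest)
  require(rmarkdown)
  # explicit extensions so render() does not have to guess file types
  fil_md <- tempfile(fileext = ".md")
  fil_html <- tempfile(fileext = ".html")
  writeLines(md_text, con = fil_md)
  render(fil_md, output_format = "html_document", output_file = fil_html, quiet = TRUE)
  # html() is deprecated in rvest; read_html() is its replacement
  pg <- read_html(fil_html)
  # every list item is rendered as an <li> node
  ret <- html_nodes(pg, "li") %>% html_text()
  # cleanup
  unlink(fil_md)
  unlink(fil_html)
  return(ret)
}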
extract_md_list(examples)
## [1] "Bullet 1"
## [2] "Bullet 2"
## [3] "Bullet 3"
## [4] "Bullet 1"
## [5] "Bullet 2"
## [6] "Bullet 3"
## [7] "This is a test 1"
## [8] "This is a test with some formatting"
## [9] "This is a test with different formatting"
Answer 2 (score: 1)
Here is another option. You can wrap it in unlist if needed:
str_extract_all(examples, "[^*1-9\n ]\\w+( ?[\\w*]+)*")
# or
#str_extract_all(examples, "[^*1-9\n ]\\w+( ?[a-zA-Z0-9_*]+)*")
#[[1]]
#[1] "Bullet 1" "Bullet 2" "Bullet 3"
#
#[[2]]
#[1] "Bullet 1" "Bullet 2" "Bullet 3"
#
#[[3]]
#[1] "This is a test 1"
#[2] "This is a test with some *formatting*"
#[3] "This is a test with different _formatting_"
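As a variation on the pattern above, the marker can be anchored at the start of each line instead of being excluded through a character class. This is only a sketch: it assumes markers are either * or digits followed by a dot and at least one space, and the names item_pattern, matches, and items are made up for illustration.

```r
library(stringr)

examples <- c(
  "* Bullet 1\n* Bullet 2\n* Bullet 3",
  "1. Bullet 1\n2. Bullet 2\n3. Bullet 3",
  "* This is a test 1\n* This is a test with some *formatting*\n* This is a test with different _formatting_"
)

# (?m) lets ^ and $ match at every line boundary; the marker sits in a
# non-capturing group and the single capture group holds the item text.
item_pattern <- "(?m)^(?:\\*|\\d+\\.) +(.*)$"

# str_match_all() returns one matrix per input string:
# column 1 is the full match, column 2 is the capture group.
matches <- str_match_all(examples, item_pattern)
items <- lapply(matches, function(m) m[, 2])
```

Because the marker is only recognised at the start of a line, digits and asterisks inside the item text are left untouched.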
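There are several other options, especially if you don't care about getting everything in a single regex or a single line of code. Here is another approach. The regex is simpler, but you end up with empty strings (""), which take an extra line to remove: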
splits <- unlist(str_split(examples, "\n|\\d+\\. |\\* "))
splits[splits != ""]
#[1] "Bullet 1"
#[2] "Bullet 2"
#[3] "Bullet 3"
#[4] "Bullet 1"
#[5] "Bullet 2"
#[6] "Bullet 3"
#[7] "This is a test 1"
#[8] "This is a test with some *formatting*"
#[9] "This is a test with different _formatting_"
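The same split-then-clean idea can also be sketched in base R without stringr. This is an assumption-laden variant, not part of the answer above: it splits on newlines first, keeps only lines that start with a marker, and then strips that marker, so no empty strings are produced. The names lines, marker, and items are illustrative.

```r
examples <- c(
  "* Bullet 1\n* Bullet 2\n* Bullet 3",
  "1. Bullet 1\n2. Bullet 2\n3. Bullet 3",
  "* This is a test 1\n* This is a test with some *formatting*\n* This is a test with different _formatting_"
)

# One element per input line.
lines <- unlist(strsplit(examples, "\n", fixed = TRUE))

# Assumed marker: "*" or digits followed by ".", then one or more spaces,
# anchored at the start of the line.
marker <- "^(\\*|[0-9]+\\.) +"

# Keep only list lines, then remove the leading marker from each.
items <- sub(marker, "", lines[grepl(marker, lines)])
```

Since sub() only replaces the first match and the pattern is anchored, markers that appear later in a line, such as the asterisks around *formatting*, are preserved.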