Question

In text that contains formatting tags (for example, <h1>), such as

data.frame(id = c(1, 2), text = c("something here <h1>my text</h1> also <h1>Keep it</h1>", "<h1>title</h1> another here"))

how can one keep only the text found inside <h1> </h1>, with multiple matches separated by commas:

data.frame(text = c("my text, Keep it", "title"), id = c(1, 2))

Answer 1

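We can use str_extract_all. A regex lookbehind captures the characters that follow each opening tag; we then loop over the resulting list and paste the extracted strings together.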

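library(stringr)

# df1 is the data frame from the question
df1 <- data.frame(id = c(1, 2), text = c("something here <h1>my text</h1> also <h1>Keep it</h1>", "<h1>title</h1> another here"))

# (?<=<h1>) matches just after each opening tag; [^<]+ runs up to the next "<"
data.frame(text = sapply(str_extract_all(df1$text, "(?<=<h1>)[^<]+"),
      paste, collapse=", "), id = df1$id)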
#               text id
#1 my text, Keep it  1
#2            title  2
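
For readers who do not have stringr installed, the same lookbehind idea can be sketched in base R with gregexpr() (using perl = TRUE) and regmatches(). This is only an illustration: the helper name extract_h1 is hypothetical, and the pattern assumes the <h1> elements contain plain text with no nested tags.

```r
# Sample data copied from the question.
df1 <- data.frame(
  id = c(1, 2),
  text = c("something here <h1>my text</h1> also <h1>Keep it</h1>",
           "<h1>title</h1> another here"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: collect the text after each <h1> up to the next "<",
# then join the matches found in each input string with ", ".
extract_h1 <- function(x) {
  matches <- regmatches(x, gregexpr("(?<=<h1>)[^<]+", x, perl = TRUE))
  vapply(matches, paste, character(1), collapse = ", ")
}

result <- data.frame(text = extract_h1(df1$text), id = df1$id,
                     stringsAsFactors = FALSE)
print(result)
```

A string with no <h1> element yields an empty string here, because paste() on a zero-length vector with collapse returns "".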

Answer 2

You can use web-scraping techniques.
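
One way this might look, as a sketch only: the answer does not say which package it had in mind, so the use of rvest here is an assumption, and the helper name h1_text is hypothetical. Each string is parsed as an HTML fragment and the text of its h1 nodes is joined with commas.

```r
library(rvest)  # assumption: rvest >= 1.0.0, which provides html_elements()

# Sample data copied from the question.
df1 <- data.frame(
  id = c(1, 2),
  text = c("something here <h1>my text</h1> also <h1>Keep it</h1>",
           "<h1>title</h1> another here"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: parse one string as HTML, select its <h1> nodes,
# and join their text content with ", ".
h1_text <- function(x) {
  nodes <- html_elements(read_html(x), "h1")
  paste(html_text(nodes), collapse = ", ")
}

result <- data.frame(
  text = vapply(df1$text, h1_text, character(1), USE.NAMES = FALSE),
  id = df1$id,
  stringsAsFactors = FALSE
)
print(result)
```

Unlike a regular expression, an HTML parser would also cope with attributes on the tag or nested markup inside it, at the cost of parsing every row separately.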

Answer 3

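If you want to use quanteda for this, you can convert the data to a corpus and then process it with two corpus_segment() calls: one to take the text that precedes each closing tag, and one to select the text that follows each opening tag. You can then regroup the texts using texts(x, groups = docid()), specifying spacer = ", ".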

Here it is, with your desired output:

library("quanteda")
## Package version: 2.1.1

df <- data.frame(
  id = c(1, 2),
  text = c("something here <h1>my text</h1> also <h1>Keep it</h1>", "<h1>title</h1> another here")
)

charvec <- corpus(df, docid_field = "id") %>%
  corpus_segment("</h1>", pattern_position = "after") %>%
  corpus_segment("<h1>", pattern_position = "before") %>%
  texts(groups = docid(.), spacer = ", ")

Then convert it into the desired data.frame:

data.frame(text = charvec, id = names(charvec))
##               text id
## 1 my text, Keep it  1
## 2            title  2

Keep only the text inside tags
