Question

I want to count how many numbers in a piece of parsed text are less than x.

Here is the code that gives me the list:

library(rvest)
library(reshape2)


# load the page
td <- read_html(x = "http://www.imdb.com/name/nm1287124/?ref_=tt_ov_dr")
list <- as.list(td %>% # feed the page to the next step
    html_nodes(".filmo-row") %>% # isolate the filmography rows
    html_text()) # extract their text

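Does anyone know how to count, for example, how many of the numbers are less than 2017?

(For completeness: the end goal is to count the number of director credits before a certain year.)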

Answer 1

Let's say:

text <- "asdasd8927askdmasjdo89jans1982736djnaos987anksdjnj2008da"

Assuming the numbers are always surrounded by something other than [0-9], you can extract them like this:

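# start position of every run of digits in the text
idx <- gregexpr("[0-9]+", text)[[1]]
# length of each match
lens <- attr(idx, "match.length")
# cut each run of digits out of the text
nums <- lapply(seq_along(idx), function(i) {
  substr(text, idx[i], idx[i] + lens[i] - 1)
})
nums <- as.numeric(nums)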

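(See ?grep and ?substr for an explanation.) Finally, you can count the numbers greater than 2017; swap `>` for `<` to count the numbers less than 2017, as in the question.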

sum(nums > 2017)
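
As a more compact alternative to the gregexpr/substr loop above, base R's regmatches() returns the matched substrings directly. This is only a sketch on the same sample string; it is not part of the original answer.

```r
text <- "asdasd8927askdmasjdo89jans1982736djnaos987anksdjnj2008da"

# regmatches() pairs with gregexpr() and returns the matched substrings,
# so no manual substr() indexing is needed
nums <- as.numeric(regmatches(text, gregexpr("[0-9]+", text))[[1]])
nums
# 8927 89 1982736 987 2008

sum(nums > 2017) # numbers greater than 2017: 2
sum(nums < 2017) # numbers less than 2017, as asked in the question: 3
```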

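Edit (from the comments)

Suppose we only want to look at 4-digit numbers; then we can adjust the regular expression (and the substr indices). Now we search for "not a digit", then 4 times "digit", then "not a digit". So, to extract only the "digit" part, we start substr one position later and stop one position earlier.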

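# match: one non-digit, exactly four digits, one non-digit
idx <- gregexpr("[^0-9][0-9]{4}[^0-9]", text)[[1]]
lens <- attr(idx, "match.length")
nums <- lapply(seq_along(idx), function(i) {
  # skip the leading and trailing non-digit of each match
  substr(text, idx[i] + 1, idx[i] + lens[i] - 2)
})
nums <- as.numeric(nums)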

Now nums contains only the 2 four-digit numbers.

nums
sum(nums > 2017)
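
One caveat with the pattern above: it needs a non-digit character on both sides, so a 4-digit number at the very start or end of a string is missed, as is the second of two numbers that share a single separator (for example "1999/2008"). A possible workaround, sketched below, is to use lookarounds with perl = TRUE, which check the neighbouring characters without consuming them. The helper name `count_years_before` and the sample `rows` vector are made up for illustration and do not come from the scraped page.

```r
# Hypothetical helper (not from the original answer): count the 4-digit
# numbers in a character vector that are smaller than `year`.
count_years_before <- function(x, year) {
  # Lookbehind/lookahead assert "no digit before/after" without
  # consuming the neighbouring character; they require perl = TRUE.
  pattern <- "(?<![0-9])[0-9]{4}(?![0-9])"
  years <- as.numeric(unlist(regmatches(x, gregexpr(pattern, x, perl = TRUE))))
  sum(years < year)
}

text <- "asdasd8927askdmasjdo89jans1982736djnaos987anksdjnj2008da"
count_years_before(text, 2017)
# 1 (only 2008; 8927 is above the cutoff)

# Made-up rows standing in for the scraped `.filmo-row` text
rows <- c("2016 Some Film", "Another Film 2019", "1999/2008 Third Film")
count_years_before(rows, 2017)
# 3 (2016, 1999 and 2008)
```

Because gregexpr() and regmatches() are vectorised over character vectors, the same helper should work on the output of html_text() after unlist(), assuming each row contains its year as a standalone 4-digit number.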
