Question

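How can I use a regex to extract only the date from title="11:53 AM - 27 May 2018"?

For reference, this comes from an HTML page. I want to extract all such matches into a list using R.

My output should be 27 May 2018.

Thanks in advance for your time :)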

Answer 1

Figured it out:

library(stringr)  # provides str_match_all()

rawHTML <- paste(readLines("D:\\practicum\\CSK.html"), collapse="\n")

b <- unlist(str_match_all(rawHTML, '\\d{2} \\w+ 2018'))
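The pattern above hard-codes the year 2018 and requires a two-digit day. As a rough sketch of a more general variant, the pattern below anchors on the title attribute and captures the date in a group. The sample string `raw_html` and the assumption that every title looks like "time AM/PM - day month year" are mine, not from the question.

```r
library(stringr)

# Hypothetical sample standing in for the contents of the HTML file.
raw_html <- paste(
  '<a title="11:53 AM - 27 May 2018">tweet 1</a>',
  '<a title="9:05 PM - 3 June 2017">tweet 2</a>',
  sep = "\n"
)

# Capture group 1 holds the date that follows "<time> AM/PM - " inside title="...".
pattern <- 'title="\\d{1,2}:\\d{2} [AP]M - (\\d{1,2} [A-Za-z]+ \\d{4})"'

# str_match_all() returns a list with one matrix per input string;
# column 1 is the full match, column 2 is the first capture group.
dates <- str_match_all(raw_html, pattern)[[1]][, 2]
print(dates)
```

Anchoring on the attribute avoids picking up unrelated "number word year" text elsewhere in the page, at the cost of depending on the exact title format.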

Answer 2

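Given the HTML of the page you want to find the dates in, the simplest approach is to use a regular expression to find every part of the code that looks like title="11:53 AM - 27 May 2018", and then use a second regular expression to extract the date from the matched string. I have written some basic code; you can modify it to suit your needs.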

first_match <- regexpr(pattern='title\\s*=\\s*"\\d\\d:\\d\\d\\s*(AM|PM)\\s*-\\s*\\d\\d\\s[a-zA-Z]{3}\\s\\d{4}"', str)
match_str <- regmatches(str, first_match)
date_exp <- regexpr(pattern='\\d\\d\\s[a-zA-Z]{3}\\s\\d{4}', match_str)
date <- regmatches(match_str, date_exp)

date is the output you need, and str is the string containing the HTML code.
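Note that regexpr() reports only the first match in each string, while the question asks for all matches. A minimal sketch of the same two-step idea using gregexpr() might look like the following. The sample string `html` is hypothetical and stands in for the page source; the patterns are the ones from this answer.

```r
# Hypothetical HTML snippet; in practice html would hold the page source.
html <- '<a title="11:53 AM - 27 May 2018">a</a> <a title="08:10 PM - 02 Jun 2018">b</a>'

title_pattern <- 'title\\s*=\\s*"\\d\\d:\\d\\d\\s*(AM|PM)\\s*-\\s*\\d\\d\\s[a-zA-Z]{3}\\s\\d{4}"'
date_pattern <- '\\d\\d\\s[a-zA-Z]{3}\\s\\d{4}'

# gregexpr() finds every match; regmatches() then returns a list
# with one character vector per input string.
titles <- regmatches(html, gregexpr(title_pattern, html))[[1]]

# Each title attribute contains exactly one date, so regexpr() is enough here.
dates <- regmatches(titles, regexpr(date_pattern, titles))

# The question asks for a list, so wrap the character vector.
date_list <- as.list(dates)
print(dates)
```

If a plain character vector is acceptable, `dates` can be used directly and the as.list() step dropped.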

Extracting dates from an HTML page using regex in R