Question

I need to parse a lot of HTML files to find out which ones contain specific text inside the title tag.

Let's assume the titles are:

file1.htm
<title>100 text other text</title>
file2.htm
<title>text 100 text other text</title>
file3.htm
<title>text 1000 text other text</title>
file4.htm
<title>text one hundred text other text</title>

Following my example, I need to find the names of the files whose title contains 100 or one hundred, i.e. files 1, 2 and 4.

My problem is that I don't know how to write the regular expression:

gci "c:\my_folder" | ? {$_.extension -eq ".htm"} | 
select-string -pattern '<title>*100*</title>' |
Select-Object -Unique Path

Note, in case it matters for the regexp, that the title tag is not at the start of the line but somewhere in the middle. Thanks in advance.

Answer 1

This should do it:

^.*<title>(.*(100|one\shundred)[^0].*)?</title>.*$
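As a quick sanity check, the pattern can be run against the four sample titles from the question. This is only a minimal sketch using PowerShell's `-match` operator; the `$samples` array and its `Name`/`Line` keys are made up for illustration and simply mirror the question's examples.

```powershell
# Hypothetical test data mirroring the four sample files in the question.
$samples = @(
    @{ Name = 'file1.htm'; Line = '<title>100 text other text</title>' }
    @{ Name = 'file2.htm'; Line = '<title>text 100 text other text</title>' }
    @{ Name = 'file3.htm'; Line = '<title>text 1000 text other text</title>' }
    @{ Name = 'file4.htm'; Line = '<title>text one hundred text other text</title>' }
)

# The pattern from this answer, unchanged.
$pattern = '^.*<title>(.*(100|one\shundred)[^0].*)?</title>.*$'

# Keep only the names whose title line matches the pattern.
$hits = $samples |
    Where-Object { $_.Line -match $pattern } |
    ForEach-Object { $_.Name }
$hits
```

With these inputs the check should report file1.htm, file2.htm and file4.htm. Two edge cases are worth knowing about: because `[^0]` has to consume a character, a title that ends in `100` (for example `<title>text 100</title>`) would not match, and because the whole group is optional, an empty `<title></title>` would.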

Answer 2

Try

<title>(.*[^[:alnum:]])?(100|one hundred)([^[:alnum:]].*)?</title>

as the pattern to match. The pattern syntax is PCRE (as in Perl) and can be rewritten for another flavor if needed.
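Since the question uses PowerShell, whose regex engine is .NET and, as far as I know, does not support POSIX classes such as `[[:alnum:]]`, one possible rewrite swaps those guards for `\b` word boundaries. This is a sketch under that assumption, not the answerer's own pattern; `$titlePattern` and `Find-TitleMatch` are hypothetical names introduced here.

```powershell
# .NET-flavored rewrite of the pattern above (an assumption, not the
# answerer's own wording): \b replaces the [^[:alnum:]] guards.
$titlePattern = '<title>.*\b(100|one hundred)\b.*</title>'

# Hypothetical helper: lists the .htm files directly under $Folder
# that have a line whose title matches $titlePattern.
function Find-TitleMatch {
    param([string]$Folder)

    Get-ChildItem $Folder |
        Where-Object { $_.Extension -eq '.htm' } |
        Select-String -Pattern $titlePattern |
        Select-Object -Unique Path
}

# Example call, using the folder from the question:
# Find-TitleMatch -Folder 'c:\my_folder'
```

One difference from the PCRE version: `\b` treats the underscore as a word character, so a title like `_100_` would match the original pattern but not this one. Like the original, this only works when the opening and closing title tags sit on the same line, because `Select-String` examines files line by line.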

Best regards,

Carsten

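PS: Beware of the pitfalls. All the suggestions and warnings in the comments are valid; still, in your case the regex approach seems feasible (mainly because you are inspecting the content of the 'title' tag, of which there should be only one per file, and spreading it over several lines would be very silly, imho).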

Regex to parse the HTML title tag
