Extract and replace a named group using regex

Time: 2010-12-29 22:52:09

Tags: .net regex vb.net

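I am able to extract the href values of the anchors in an HTML string. What I want to achieve now is to extract each href value and replace it with a new GUID. I need to return both the modified HTML string and a list of the extracted href values together with their corresponding GUIDs.

Thanks in advance.

My existing code is as follows: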

Dim sPattern As String = "<a[^>]*href\s*=\s*((\""(?<URL>[^\""]*)\"")|(\'(?<URL>[^\']*)\')|(?<URL>[^\s]* ))"

Dim matches As MatchCollection = Regex.Matches(html, sPattern, RegexOptions.IgnoreCase Or RegexOptions.IgnorePatternWhitespace)

If Not IsNothing(matches) AndAlso matches.Count > 0 Then
    Dim urls As List(Of String) = New List(Of String)

    For Each m As Match In matches
      urls.Add(m.Groups("URL").Value)
    Next
End If

Sample HTML string:

<html><body><a title="http://www.google.com" href="http://www.google.com">http://www.google.com</a><br /><a href="http://www.yahoo.com">http://www.yahoo.com</a><br /><a title="http://www.apple.com" href="http://www.apple.com">Apple</a></body></html>

1 answer:

Answer 0 (score: 1)

You can do something like this:

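Dim pattern As String = "<a[^>]*href\s*=\s*((\""(?<URL>[^\""]*)\"")|(\'(?<URL>[^\']*)\')|(?<URL>[^\s]* ))"
' Maps each generated GUID to the original href value it replaced.
Dim urls As New Dictionary(Of Guid, String)
Dim evaluator As MatchEvaluator = Function(m)
    Dim g As Guid = Guid.NewGuid()
    Dim url = m.Groups("URL").Value
    urls.Add(g, url)
    ' Replaces every occurrence of the URL inside the matched text,
    ' so a title attribute holding the same URL is replaced as well.
    Return m.Value.Replace(url, g.ToString())
End Function

' The evaluator is called once per match; its return value replaces that match.
Dim newHtml = Regex.Replace(html, pattern, evaluator)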

In the end, newHtml has the following value:

<html><body><a title="329eb2c4-ee51-49fa-a8cd-2de319c3dbad" href="329eb2c4-ee51-49fa-a8cd-2de319c3dbad">http://www.google.com</a><br /><a href="77268e2d-87c4-443c-980c-9188e22f8496">http://www.yahoo.com</a><br /><a title="2941f77a-a143-4990-8ad7-3ef56972a8d4" href="2941f77a-a143-4990-8ad7-3ef56972a8d4">Apple</a></body></html>

The urls dictionary contains the following entries:

329eb2c4-ee51-49fa-a8cd-2de319c3dbad: http://www.google.com
77268e2d-87c4-443c-980c-9188e22f8496: http://www.yahoo.com
2941f77a-a143-4990-8ad7-3ef56972a8d4: http://www.apple.com
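
Note that in the output above the title attributes were rewritten too, because m.Value.Replace swaps every occurrence of the URL inside the matched text. String.Replace also throws an ArgumentException when the captured URL is empty (for example href=""). If only the href value should change, one option is to splice the GUID in at the position of the captured group. The following is only a minimal sketch under that assumption: the module name HrefReplacer and the helper ReplaceHrefsOnly are hypothetical names chosen for illustration, and RegexOptions.IgnoreCase is added here so that uppercase tags also match.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module HrefReplacer
    ' Same pattern as in the answer above.
    Private Const Pattern As String = "<a[^>]*href\s*=\s*((\""(?<URL>[^\""]*)\"")|(\'(?<URL>[^\']*)\')|(?<URL>[^\s]* ))"

    ' Hypothetical helper: replaces only the captured href value with a new GUID
    ' and records the GUID -> original URL pair in the supplied dictionary.
    Function ReplaceHrefsOnly(html As String, urls As Dictionary(Of Guid, String)) As String
        Dim evaluator As MatchEvaluator = Function(m As Match)
            Dim g As Guid = Guid.NewGuid()
            Dim urlGroup As Group = m.Groups("URL")
            urls.Add(g, urlGroup.Value)
            ' Position of the captured URL relative to the start of this match.
            Dim offset As Integer = urlGroup.Index - m.Index
            ' Cut out exactly the captured span and insert the GUID there,
            ' leaving the rest of the tag (e.g. title) untouched.
            Return m.Value.Remove(offset, urlGroup.Length).Insert(offset, g.ToString())
        End Function

        Return Regex.Replace(html, Pattern, evaluator, RegexOptions.IgnoreCase)
    End Function

    Sub Main()
        Dim html As String = "<a title=""http://www.google.com"" href=""http://www.google.com"">Google</a>"
        Dim urls As New Dictionary(Of Guid, String)
        Dim newHtml As String = ReplaceHrefsOnly(html, urls)

        ' The title keeps its URL; only the href now holds a GUID.
        Console.WriteLine(newHtml)
        For Each pair As KeyValuePair(Of Guid, String) In urls
            Console.WriteLine(pair.Key.ToString() & ": " & pair.Value)
        Next
    End Sub
End Module
```

Because Remove and Insert work on positions rather than on string content, this variant also copes with an empty href value.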

As a side note, keep in mind that regular expressions are not the best option for parsing HTML ... a tool like HTML Agility Pack would be more appropriate.
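
For comparison, here is a rough sketch of how the same task might look with HTML Agility Pack. It assumes the HtmlAgilityPack NuGet package is referenced in the project; the module name HrefReplacerHap and the helper ReplaceHrefsWithHap are hypothetical names for illustration, not part of the library.

```vbnet
Imports System
Imports System.Collections.Generic
Imports HtmlAgilityPack ' assumption: the HtmlAgilityPack package is referenced

Module HrefReplacerHap
    ' Hypothetical helper: replaces the href of every anchor with a new GUID
    ' and records the GUID -> original URL pair in the supplied dictionary.
    Function ReplaceHrefsWithHap(html As String, urls As Dictionary(Of Guid, String)) As String
        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        ' SelectNodes returns Nothing when no node matches the XPath expression.
        Dim anchors As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//a[@href]")
        If anchors IsNot Nothing Then
            For Each a As HtmlNode In anchors
                Dim g As Guid = Guid.NewGuid()
                urls.Add(g, a.GetAttributeValue("href", String.Empty))
                a.SetAttributeValue("href", g.ToString())
            Next
        End If

        ' Serialize the modified document back to a string.
        Return doc.DocumentNode.OuterHtml
    End Function
End Module
```

Since the parser works on attributes rather than raw text, other attributes such as title are left alone and the quoting style of href no longer matters. The serialized output may differ from the input in minor formatting details.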