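I have tried a lot of things, but nothing seems to work properly. I have an Access DB and I am writing code in VBA. I have a string of HTML source code, and I want to strip out all the HTML code and tags so that I am left with a plain text string, with no HTML or tags. What is the best way to do this?
Thanks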
Answer 0 (score: 8)
One approach that is as resilient as possible to bad markup:
with createobject("htmlfile")
.open
.write "<p>foo <i>bar</i> <u class='farp'>argle </zzzz> hello </p>"
.close
msgbox "text=" & .body.outerText
end with
Answer 1 (score: 5)
Function StripHTML(html As String) As String
' html: the HTML source to strip tags from
Dim RegEx As Object
Set RegEx = CreateObject("vbscript.regexp")
Dim sInput As String
Dim sOut As String
sInput = html
With RegEx
.Global = True
.IgnoreCase = True
.MultiLine = True
.Pattern = "<[^>]+>" 'Regular Expression for HTML Tags.
End With
sOut = RegEx.Replace(sInput, "")
StripHTML = sOut
Set RegEx = Nothing
End Function
This may help you. Good luck.
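Answer 2 (score: 3)
It depends on how complex the HTML structure is and how much data you want to get out of it.
Depending on that complexity you may get away with regular expressions, but for complex markup, trying to parse data out of HTML with a regex is like trying to eat soup with a fork.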
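To make the fork-and-soup point concrete, here is a minimal sketch of the tag-stripping pattern misfiring on a `>` that sits inside an attribute value. The names `NaiveStrip` and `RegexPitfallDemo` are hypothetical and exist only for this illustration.

```vba
' Illustrative sketch only; NaiveStrip is a hypothetical helper name.
Function NaiveStrip(html As String) As String
    Dim re As Object
    Set re = CreateObject("vbscript.regexp")
    re.Global = True
    re.Pattern = "<[^>]+>"   ' same tag pattern used elsewhere in this thread
    NaiveStrip = re.Replace(html, "")
End Function

Sub RegexPitfallDemo()
    ' The first match ends at the ">" inside the title attribute,
    ' so the rest of the attribute leaks into the result.
    Debug.Print NaiveStrip("<a title=""1 > 0"">link</a>")
End Sub
```

For well-behaved markup the pattern is fine. Input like this is where a real HTML parser earns its keep.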
You can use the htmlfile object to turn the flat text into an object you can interact with, for example:
Function ParseATable(url As String) As Variant
Dim htm As Object, table As Object
Dim data() As String, x As Long, y As Long
Set htm = CreateObject("HTMLfile")
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", url, False
.send
htm.body.innerhtml = .responsetext
End With
With htm
Set table = .getelementsbytagname("table")(0)
Redim data(1 To table.Rows.Length, 1 To 10)
For x = 0 To table.Rows.Length - 1
For y = 0 To table.Rows(x).Cells.Length - 1
data(x + 1, y + 1) = table.Rows(x).Cells(y).InnerText
Next y
Next x
ParseATable = data
End With
End Function
Answer 3 (score: 0)
Using early binding (requires a reference to the Microsoft HTML Object Library):
Public Function GetText(inputHtml As String) As String
With New HTMLDocument
.Open
.write inputHtml
.Close
GetText = .body.outerText
End With
End Function
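Answer 4 (score: 0)
An improvement on one of the above... It finds quotes and line breaks and replaces them with their non-HTML equivalents. In addition, the original function had a problem with embedded UNC references (i.e. <\\server\share\folder\file.ext>): it would remove the entire UNC string because of the < at the start and the > at the end. This function fixes that, so the UNC is inserted into the string correctly: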
Function StripHTML(strString As String) As String
Dim RegEx As Object
Set RegEx = CreateObject("vbscript.regexp")
Dim sInput As String
Dim sOut As String
sInput = Replace(strString, "<\\", "\\")
With RegEx
.Global = True
.IgnoreCase = True
.MultiLine = True
.Pattern = "<[^>]+>" 'Regular Expression for HTML Tags.
End With
sOut = RegEx.Replace(sInput, "")
StripHTML = Replace(Replace(Replace(sOut, "&nbsp;", vbCrLf, 1, -1), "&quot;", "'", 1, -1), "\\", "<\\", 1, -1)
Set RegEx = Nothing
End Function
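Answer 5 (score: 0)
I found a very simple solution. I currently run an Access database and use Excel forms to update the system, because of system restrictions and shared-drive permissions. When I pull data from Access I use: PlainText(YourStringHere). This strips out all the HTML parts and leaves only the text.
Hope this works.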
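As a rough sketch of the PlainText call mentioned above, assuming the code runs inside Access itself (2007 or later), where `Application.PlainText` is available. `HtmlToPlain` and `PlainTextDemo` are hypothetical names used only for this example.

```vba
' Sketch under the assumption that this module lives in an Access database.
' HtmlToPlain is a hypothetical wrapper name, not part of Access.
Function HtmlToPlain(html As String) As String
    HtmlToPlain = Application.PlainText(html)
End Function

Sub PlainTextDemo()
    ' Prints the text content with the formatting tags removed.
    Debug.Print HtmlToPlain("<div>foo <b>bar</b></div>")
End Sub
```

PlainText is documented for the contents of Access rich-text fields, so it is worth checking the output on arbitrary web-page HTML before relying on it. From Excel, the call would need to go through an Access.Application automation object rather than Excel's own Application.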