将Rich Text转换为HTML格式标签

时间:2015-11-09 23:16:46

标签: excel excel-vba vba

我正在使用Excel列表并希望转向:

Quercus agrifolia var。 oxyadenia (Torr。)J.T。豪威尔

进入:

<i>Quercus agrifolia</i> var. <i>oxyadenia</i> (Torr.) J.T. Howell

我有Rich Text格式列表,其中应用了格式但我想将其发送到Access,并在相关文本周围明确包含格式化标记。

3 个答案:

答案 0 :(得分:9)

我当时想做同样的事情,并在MSDN上找到答案: Convert contents of a formatted excel cell to HTML format

我希望这对你有所帮助,它使用了一个excel宏。

编辑: 使用时我需要修改嵌套标签的代码,请在下面找到我对宏的更新:

Function fnConvert2HTML(myCell As Range) As String
    Dim bldTagOn, itlTagOn, ulnTagOn, colTagOn As Boolean
    Dim i, chrCount As Integer
    Dim chrCol, chrLastCol, htmlTxt, htmlEnd As String

    bldTagOn = False
    itlTagOn = False
    ulnTagOn = False
    colTagOn = False
    chrCol = "NONE"
    'htmlTxt = "<html>"
    htmlTxt = ""
    chrCount = myCell.Characters.Count

    For i = 1 To chrCount
    htmlEnd = ""
        With myCell.Characters(i, 1)
            If (.Font.Color) Then
                chrCol = fnGetCol(.Font.Color)
                If Not colTagOn Then
                    htmlTxt = htmlTxt & "<font color=#" & chrCol & ">"
                    colTagOn = True
                Else
                    If chrCol <> chrLastCol Then htmlTxt = htmlTxt & "</font><font color=#" & chrCol & ">"
                End If
            Else
                chrCol = "NONE"
                If colTagOn Then
                    htmlEnd = "</font>" & htmlEnd
                    'htmlTxt = htmlTxt & "</font>"
                    colTagOn = False
                End If
            End If
            chrLastCol = chrCol

            If .Font.Bold = True Then
                If Not bldTagOn Then
                    htmlTxt = htmlTxt & "<b>"
                    bldTagOn = True
                End If
            Else
                If bldTagOn Then
                    'htmlTxt = htmlTxt & "</b>"
                    htmlEnd = "</b>" & htmlEnd
                    bldTagOn = False
                End If
            End If

            If .Font.Italic = True Then
                If Not itlTagOn Then
                    htmlTxt = htmlTxt & "<i>"
                    itlTagOn = True
                End If
            Else
                If itlTagOn Then
                    'htmlTxt = htmlTxt & "</i>"
                    htmlEnd = "</i>" & htmlEnd
                    itlTagOn = False
                End If
            End If

            If .Font.Underline > 0 Then
                If Not ulnTagOn Then
                    htmlTxt = htmlTxt & "<u>"
                    ulnTagOn = True
                End If
            Else
                If ulnTagOn Then
                    'htmlTxt = htmlTxt & "</u>"
                    htmlEnd = "</u>" & htmlEnd
                    ulnTagOn = False
                End If
            End If

            If (Asc(.Text) = 10) Then
                htmlTxt = htmlTxt & htmlEnd & "<br>"
            Else
                htmlTxt = htmlTxt & htmlEnd & .Text
            End If

        End With
    Next

    If colTagOn Then
        htmlTxt = htmlTxt & "</font>"
        colTagOn = False
    End If
    If bldTagOn Then
        htmlTxt = htmlTxt & "</b>"
        bldTagOn = False
    End If
    If itlTagOn Then
        htmlTxt = htmlTxt & "</i>"
        itlTagOn = False
    End If
    If ulnTagOn Then
        htmlTxt = htmlTxt & "</u>"
        ulnTagOn = False
    End If
    'htmlTxt = htmlTxt & "</html>"
    fnConvert2HTML = htmlTxt
End Function

Function fnGetCol(strCol As String) As String
    Dim rVal, gVal, bVal As String
    strCol = Right("000000" & Hex(strCol), 6)
    bVal = Left(strCol, 2)
    gVal = Mid(strCol, 3, 2)
    rVal = Right(strCol, 2)
    fnGetCol = rVal & gVal & bVal
End Function

答案 1 :(得分:1)

这是一个更快的替代解决方案,但是产生的输出更混乱(因为它使用Word的HTML引擎)。您需要为VBA项目添加以下引用:

  • Microsoft HTML对象库
  • Microsoft脚本运行时
  • Microsoft Word 16.0对象库

然后,通过运行eg调用以下代码。 convertToHtml(Range("A1:A100"))在立即窗口中:

' Given a temporary file path, load the HTML in that file
' and return the first paragraph's inner HTML.
Function extractFirstParagraph(filePath As String) As String
    Dim fs As New FileSystemObject, _
        html As New MSHTML.HTMLDocument, _
        par As MSHTML.HTMLGenericElement
    html.body.innerHTML = fs.OpenTextFile(filePath).ReadAll
    Set par = html.getElementsByTagName("P")(0)
    extractFirstParagraph = par.innerHTML
End Function

Sub convertToHtml(rng As Range)
    ' Open a single Word instance.
    Dim w As New Word.Application, d As Word.Document
    Set d = w.Documents.Add

    Dim cell As Range
    Const tempFile As String = "c:\temp\msword.html"
    ' For each cell in the range ...
    For Each cell In rng
        If cell.Value <> "" Then
            ' ... copy it into the Word document ...
            cell.Copy
            d.Range.PasteSpecial xlPasteFormats
            ' ... save the Word document as HTML
            ' in a temporary file ...
            d.SaveAs2 tempFile, wdFormatHTML
            ' ... and extract the first paragraph.
            cell.Value = extractFirstParagraph(tempFile)
            Debug.Print "Cell " & cell.Address & " done."
        End If
    Next cell

    ' Close Word once you're done. Note that if a bug
    ' is encountered, this cleanup won't occur and the 
    ' Word process will need to be killed to release
    ' file locks, otherwise you get an unhelpful error.
    w.Quit False
End Sub

您可以通过添加对Microsoft VBScript正则表达式5.5的引用并运行如下函数来使用正则表达式清理输出:

' Used to avoid duplication in cleanWordHtml.
Function eraseInPlace(ByRef r As RegExp, _
    ByRef s As String, p As String) As String
    r.Pattern = p
    s = r.Replace(s, "")
End Function

' Eliminate junk tags from HTML generated by Word.
Function cleanWordHtml(inputString As String)
    Dim r As New RegExp
    r.Global = True
    eraseInPlace r, inputString, "mso-[^;""]*(; )?"
    eraseInPlace r, inputString, " style="""""
    eraseInPlace r, inputString, "<\?xml[^>]*>"
    eraseInPlace r, inputString, "<\/?o:[^>]*>"
    eraseInPlace r, inputString, "<SPAN><\/SPAN>"
    cleanWordHtml = inputString
End Function

如果您需要将<span>标签转换为<font>标签(我也需要这样做,因为我正在导入不支持CSS的Access RT文本字段),请尝试调用此方法函数并传入在extractFirstParagraph函数中构造的MSHTML对象:

' Given a <p> DOM node, replace any children of the
' form <span style="color: foo"> with <font color="foo">.
Function convertSpanToFont(ByRef par As MSHTML.HTMLGenericElement, _
    ByRef doc As MSHTML.HTMLDocument)
    Dim span As MSHTML.HTMLSpanElement, _
        font As MSHTML.HTMLFontElement
    For Each span In par.getElementsByTagName("span")
        Set font = doc.createElement("font")
        If IsNull(span.Style.Color) _
            Or span.Style.Color <> "" Then
            font.Color = span.Style.Color
            font.innerHTML = span.innerHTML
            span.insertAdjacentElement "afterEnd", font
            span.removeNode True
        End If
    Next span
End Function

我还考虑过将整个电子表格从Excel中保存为HTML,然后使用另一种工具将其保存为Access可以处理的格式,但是Excel的HTML导出会生成CSS类,而不是内联样式。如果仅需要将电子表格的一部分转换为HTML,则此方法也很有用。

答案 2 :(得分:0)

实际上有一个使用 Excel 的内部 XML 表示并使用 XSL 样式表对其进行转换的更简洁、更快速的解决方案。

您可以在 https://github.com/HeimMatthias/Excel-HTML-Tools-Public

上找到 VBA 代码和所需的 XSLT

免责声明: 在使用速度缓慢的 VBA 脚本多年后,我自己编写了这个工具。示例输出可以在此 fiddle 中看到。

有关技术细节:

Range.Value-Property 可以接受一个参数 xlRangeValueXMLSpreadsheet,它返回一个完全格式化的 xml-String,其中包括其内容的 html-object。 在包含

的单元格上运行ActiveSheet.Range("A1").Value(xlRangeValueXMLSpreadsheet) <块引用>

Quercus agrifolia 变种。 oxyadenia (Torr.) J.T.豪尔

返回以下字符串:

<?xml version="1.0"?>
<?mso-application progid="Excel.Sheet"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
 xmlns:o="urn:schemas-microsoft-com:office:office"
 xmlns:x="urn:schemas-microsoft-com:office:excel"
 xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet"
 xmlns:html="http://www.w3.org/TR/REC-html40">
 <Styles>
  <Style ss:ID="Default" ss:Name="Normal">
   <Alignment ss:Vertical="Bottom"/>
   <Borders/>
   <Font ss:FontName="Calibri" ss:Size="11" ss:Color="#000000"/>
   <Interior/>
   <NumberFormat/>
   <Protection/>
  </Style>
  <Style ss:ID="s62">
   <Font ss:FontName="Calibri" ss:Size="11" ss:Color="#000000"
    ss:Italic="1"/>
  </Style>
 </Styles>
 <Worksheet ss:Name="Tabelle1">
  <Table ss:ExpandedColumnCount="1" ss:ExpandedRowCount="1"
   ss:DefaultColumnWidth="61.714285714285708"
   ss:DefaultRowHeight="14.571428571428571">
   <Row>
    <Cell ss:StyleID="s62"><ss:Data ss:Type="String"
      xmlns="http://www.w3.org/TR/REC-html40"><I><Font html:Color="#000000">Quercus agrifolia</Font></I><Font
       html:Color="#000000"> var. </Font><I><Font html:Color="#000000">oxyadenia</Font></I><Font
       html:Color="#000000"> (Torr.) J.T. Howell</Font></ss:Data></Cell>
   </Row>
  </Table>
 </Worksheet>
</Workbook>

Cell-Tag 围绕着一个 ss:Data-Object,其中包含 - 或多或少 - 干净的 html-Code。在任何情况下,清理这些数据以获得干净的 html 比解析每个字母并访问其样式要简单得多,速度也快得多。 (快得多,我的意思是快 100 倍)。

注意事项: 单元格样式(来自模板和单个单元格)不表示为 html。这并不总是特别明显。在上面的示例中,单元格已自动接收斜体样式,因为第一个单词已被斜体化。由于字符串中还有罗马(直立)词,斜体段落的代码在 html 中表示。但是如果整个单元格都用斜体表示,那么 <i> 标签就会丢失。这尤其令人困惑,因为如果相应的样式具有 <i> 作为属性,您不能只用 ss:Italic="1"-Tag 包围整个 html,您还需要检查它是否在 html 中包含 -Tags -部分。这当然适用于所有单元格样式。