Excel中的格式文本格式(带有格式化标签),用于未格式化的文本

时间:2009-11-04 10:48:39

标签: excel vba parsing rtf

我有约。 excel中包含RTF的12000个单元格(包括格式化标签)。我需要解析它们才能找到未格式化的文本。

这是其中一个带文字的单元格的示例:

{\rtf1\ansi\deflang1060\ftnbj\uc1
{\fonttbl{\f0 \froman \fcharset0 Times New Roman;}{\f1 \fswiss \fcharset238
Arial;}}
{\colortbl ;\red255\green255\blue255 ;\red0\green0\blue0 ;}
{\stylesheet{\fs24\cf2\cb1 Normal;}{\cs1\cf2\cb1 Default Paragraph Font;}}
\paperw11908\paperh16833\margl1800\margr1800\margt1440\margb1440\headery720\footery720
\deftab720\formshade\aendnotes\aftnnrlc\pgbrdrhead\pgbrdrfoot
\sectd\pgwsxn11908\pghsxn16833\marglsxn1800\margrsxn1800\margtsxn1440\margbsxn1440
\headery720\footery720\sbkpage\pgncont\pgndec
\plain\plain\f1\fs24\pard TPR 0160 000\par IPR 0160 000\par OB-R-02-28\par}

我真正需要的就是:

TPR 0160 000
IPR 0160 000
OB-R-02-28

简单地循环遍历单元格并删除不必要的格式化的问题是,并非这些12000单元格中的所有内容都像这样简单。所以我需要手动检查许多不同的版本并编写几个变体;并且最后还会有很多手工工作要做。

但是,如果我将一个单元格的内容复制到空文本文档并将其保存为RTF,然后使用MS Word打开它,它会立即解析文本并得到我想要的内容。不幸的是,对12000个电池来说非常不方便。

所以我在考虑VBA宏,将单元格内容移动到Word,强制解析然后将结果复制回原始单元格。不幸的是,我不确定该怎么做。

有人有任何想法吗?还是一种不同的方法?我将非常感谢解决方案或推动正确的方向。

TNX!

4 个答案:

答案 0 :(得分:7)

如果您确实想要沿着使用Word解析文本的路线,此功能应该可以帮助您解决问题。正如评论所示,您需要引用MS Word对象库。

Function ParseRTF(strRTF As String) As String
Dim wdDoc As Word.Document 'Ref: Microsoft Word 11.0 Object Library'
Dim f     As Integer       'Variable to store the file I/O number'

'File path for a temporary .rtf file'
Const strFileTemp = "C:\TempFile_ParseRTF.rtf"

'Obtain the next valid file I/O number'
f = FreeFile

'Open the temp file and save the RTF string in it'
Open strFileTemp For Output As #f
    Print #f, strRTF
Close #f

'Open the .rtf file as a Word.Document'
Set wdDoc = GetObject(strFileTemp)

'Read the now parsed text from the Word.Document'
ParseRTF = wdDoc.Range.Text

'Delete the temporary .rtf file'
Kill strFileTemp

'Close the Word connection'
wdDoc.Close False
Set wdDoc = Nothing
End Function

您可以使用类似的内容为您的12,000个单元格中的每一个调用它:

Sub ParseAllRange()
Dim rngCell As Range
Dim strRTF  As String

For Each rngCell In Range("A1:A12000")

    'Parse the cell contents'
    strRTF = ParseRTF(CStr(rngCell))

    'Output to the cell one column over'
    rngCell.Offset(0, 1) = strRTF
Next
End Sub

ParseRTF函数需要大约一秒的时间才能运行(至少在我的机器上),因此对于12,000个单元格,这将在大约三个半小时内完成。


在周末考虑过这个问题之后,我确信有一个更好(更快)的解决方案。

我记得剪贴板的RTF功能,并意识到可以创建一个类,将RTF数据复制到剪贴板,粘贴到word文档,然后输出生成的纯文本。这个解决方案的好处是不必为每个rtf字符串打开和关闭单词doc对象;它可以在循环之前打开并在之后关闭。

以下是实现此目的的代码。它是一个名为clsRTFParser的Class模块。

Private Declare Function GlobalAlloc Lib "kernel32" _
                (ByVal wFlags&, ByVal dwBytes As Long) As Long
Private Declare Function GlobalLock Lib "kernel32" _
                (ByVal hMem As Long) As Long
Private Declare Function GlobalUnlock Lib "kernel32" _
                (ByVal hMem As Long) As Long
Private Declare Function lstrcpy Lib "kernel32" _
                (ByVal lpString1 As Any, ByVal lpString2 As Any) As Long

Private Declare Function OpenClipboard Lib "user32" _
                (ByVal Hwnd As Long) As Long
Private Declare Function EmptyClipboard Lib "user32" () As Long
Private Declare Function RegisterClipboardFormat Lib "user32" Alias _
                "RegisterClipboardFormatA" (ByVal lpString As String) As Long
Private Declare Function SetClipboardData Lib "user32" _
                (ByVal wFormat As Long, ByVal hMem As Long) As Long
Private Declare Function CloseClipboard Lib "user32" () As Long

'---'

Dim wdDoc As Word.Document 'Ref: Microsoft Word 11.0 Object Library'

Private Sub Class_Initialize()
Set wdDoc = New Word.Document
End Sub

Private Sub Class_Terminate()
wdDoc.Close False
Set wdDoc = Nothing
End Sub

'---'

Private Function CopyRTF(strCopyString As String) As Boolean
Dim hGlobalMemory  As Long
Dim lpGlobalMemory As Long
Dim hClipMemory    As Long
Dim lngFormatRTF   As Long

'Allocate and copy string to memory'
hGlobalMemory = GlobalAlloc(&H42, Len(strCopyString) + 1)
lpGlobalMemory = GlobalLock(hGlobalMemory)
lpGlobalMemory = lstrcpy(lpGlobalMemory, strCopyString)

'Unlock the memory and then copy to the clipboard'
If GlobalUnlock(hGlobalMemory) = 0 Then
    If OpenClipboard(0&) <> 0 Then
        Call EmptyClipboard

        'Save the data as Rich Text Format'
        lngFormatRTF = RegisterClipboardFormat("Rich Text Format")
        hClipMemory = SetClipboardData(lngFormatRTF, hGlobalMemory)

        CopyRTF = CBool(CloseClipboard)
    End If
End If
End Function

'---'

Private Function PasteRTF() As String
Dim strOutput As String

'Paste the clipboard data to the wdDoc and read the plain text result'
wdDoc.Range.Paste
strOutput = wdDoc.Range.Text

'Get rid of the new lines at the beginning and end of the document'
strOutput = Left(strOutput, Len(strOutput) - 2)
strOutput = Right(strOutput, Len(strOutput) - 2)

PasteRTF = strOutput
End Function

'---'

Public Function ParseRTF(strRTF As String) As String
If CopyRTF(strRTF) Then
    ParseRTF = PasteRTF
Else
    ParseRTF = "Error in copying to clipboard"
End If
End Function

您可以使用类似的内容为您的12,000个单元格中的每一个调用它:

Sub CopyParseAllRange()
Dim rngCell As Range
Dim strRTF  As String

'Create new instance of clsRTFParser'
Dim RTFParser As clsRTFParser
Set RTFParser = New clsRTFParser

For Each rngCell In Range("A1:A12000")

    'Parse the cell contents'
    strRTF = RTFParser.ParseRTF(CStr(rngCell))

    'Output to the cell one column over'
    rngCell.Offset(0, 1) = strRTF
Next
End Sub

我在我的机器上使用示例RTF字符串模拟了这个。对于12,000个细胞,花了两分半钟,这是一个更合理的时间框架!

答案 1 :(得分:2)

您可以尝试使用正则表达式解析每个单元格,只留下您需要的内容。

每个RTF控制代码以“\”开头,以空格结尾,两者之间没有任何额外空格。 “{}”用于分组。如果您的文本不包含任何内容,则可以删除它们(“;”相同)。所以现在你继续使用你的文本和一些不必要的单词作为“Arial”,“Normal”等。你也可以构建字典来删除它们。经过一些调整后,您将只使用所需的文本。

请查看http://www.regular-expressions.info/以获取更多信息以及编写RegExp的绝佳工具(RegexBuddy - 遗憾的是它不是免费的,但它值得花钱.AFAIR还有试用版)。

更新:当然,我不鼓励你为每个细胞手动完成。只需迭代活动范围: 参考这个帖子: SO: About iterating through cells in VBA

就个人而言,我会尝试这个想法:

Sub Iterate()
   For Each Cell in ActiveSheet.UsedRange.Cells
      'Do something
   Next
End Sub

如何在VBA(Excel)中使用RegExp?

参见: Regex functions in ExcelRegex in VBA

基本上你要通过COM使用VBScript.RegExp对象。

答案 2 :(得分:1)

此处的一些解决方案需要引用MS Word对象库。玩我所处理的卡片,我找到了一个不依赖它的解决方案。它在VBA中剥离了RTF标签以及其他类似绒毛的字体表和样式表。它可能对你有所帮助。我在你的数据中运行它,除了空白之外,我得到的输出与你预期的相同。

这是代码。

首先,检查字符串是否为字母数字。给它一个长度为一个字符的字符串。这个函数用于在这里和那里计算出界限。

Public Function Alphanumeric(Character As String) As Boolean
   If InStr("ABCDEFGHIJKKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-", Character) Then
       Alphanumeric = True
   Else
       Alphanumeric = False
   End If
End Function

接下来是删除整个组。我用它来删除字体表和其他垃圾。

Public Function RemoveGroup(RTFString As String, GroupName As String) As String
    Dim I As Integer
    Dim J As Integer
    Dim Count As Integer

    I = InStr(RTFString, "{\" & GroupName)

    ' If the group was not found in the RTF string, then just return that string unchanged.
    If I = 0 Then
        RemoveGroup = RTFString
        Exit Function
    End If

    ' Otherwise, we will need to scan along, from the start of the group, until we find the end of the group.
    ' The group is delimited by { and }. Groups may be nested, so we need to count up if we encounter { and
    ' down if we encounter }. When that count reaches zero, then the end of the group has been found.
    J = I
    Do
        If Mid(RTFString, J, 1) = "{" Then Count = Count + 1
        If Mid(RTFString, J, 1) = "}" Then Count = Count - 1
        J = J + 1
    Loop While Count > 0

    RemoveGroup = Replace(RTFString, Mid(RTFString, I, J - I), "")

End Function

好的,此函数会删除所有标签。

Public Function RemoveTags(RTFString As String) As String
    Dim L As Long
    Dim R As Long
    L = 1
    ' Search to the end of the string.
    While L < Len(RTFString)
        ' Append anything that's not a tag to the return value.
        While Mid(RTFString, L, 1) <> "\" And L < Len(RTFString)
            RemoveTags = RemoveTags & Mid(RTFString, L, 1)
            L = L + 1
        Wend

        'Search to the end of the tag.
        R = L + 1
        While Alphanumeric(Mid(RTFString, R, 1)) And R < Len(RTFString)
            R = R + 1
        Wend
        L = R
    Wend
End Function

我们可以用明显的方式删除花括号:

Public Function RemoveBraces(RTFString As String) As String
    RemoveBraces = Replace(RTFString, "{", "")
    RemoveBraces = Replace(RemoveBraces, "}", "")
End Function

将上述功能复制粘贴到模块中后,您可以创建一个功能,使用它们去掉任何您不需要或不需要的东西。以下在我的案例中完美无缺。

Public Function RemoveTheFluff(RTFString As String) As String
    RemoveTheFluff = Replace(RTFString, vbCrLf, "")
    RemoveTheFluff = RemoveGroup(RemoveTheFluff, "fonttbl")
    RemoveTheFluff = RemoveGroup(RemoveTheFluff, "colortbl")
    RemoveTheFluff = RemoveGroup(RemoveTheFluff, "stylesheet")
    RemoveTheFluff = RemoveTags(RemoveBraces(RemoveTheFluff))
End Function

我希望这会有所帮助。我不会在文字处理器或任何东西中使用它,但它可能会用于抓取数据,如果这就是你正在做的事情。

答案 3 :(得分:0)

您的帖子听起来好像每个RTF文档都存储在一个Excell单元格中。如果是,那么

Solution using .Net Framework RichTextBox control

将每个单元格中的RTF转换为2行代码中的纯文本(在稍微系统配置之后获取正确的.tlb文件以允许引用.Net Framework)。将单元格值放在 rtfsample

Set miracle = New System_Windows_Forms.RichTextBox
With miracle
    .RTF = rtfText
    PlainText = .TEXT
End With