How to read the content of a Word document through its XML in VBA

Date: 2016-08-24 14:07:01

Tags: xml vba excel-vba word-vba excel

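Context

I am trying to build a Word document browser in Excel to sift through a large number of documents (around 1000).

Opening the Word documents turns out to be rather slow (about 4 seconds per document, so going through all of them takes roughly 2 hours, which is far too slow for a single query), even with everything that could slow down the opening disabled. I therefore open the documents: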

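  • in read-only mode
  • without the open-and-repair mode (which can be triggered on some documents)
  • with the display of the document disabled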

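What I have tried so far

These documents are hard to go through because some keywords do appear in every document, but not in the same context (this is not the core of the problem, since I can handle that once the text is loaded into an array). As a result, the usual Windows Explorer solutions (as in this link) cannot be used in my case.

For now, I have managed to get a working macro that analyzes the content of the Word documents by opening them.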

Code

Here is a sample of the code. Note that I used the Microsoft Word 14.0 Object Library reference.

' Analyze every Word document located in the same folder as this workbook.
' Requires a reference to the Microsoft Word Object Library.
Sub extractFile()

Dim i As Long
Dim sAnalyzedDoc As String, sLibName As String, sKeyword As String
Dim aOut() As Variant
Dim oWordApp As Word.Application
Dim oDoc As Word.Document

Set oWordApp = CreateObject("Word.Application")

sLibName = ThisWorkbook.Path & "\"
sAnalyzedDoc = Dir(sLibName)
sKeyword = "example of a word"

With Application
    .DisplayAlerts = False
    .ScreenUpdating = False
End With

' Explicit lower bounds so the array does not depend on Option Base
ReDim aOut(1 To 2, 1 To 2)
aOut(1, 1) = "Document name"
aOut(2, 1) = "Text"

While (sAnalyzedDoc <> "")
    ' Analyze only documents with the .doc or .docx extension
    If Not InStr(LCase(sAnalyzedDoc), ".doc") = 0 Then
        ' Open the document as mentioned above: read-only, without repair and invisible,
        ' inside the Word instance created at the top of the procedure
        Set oDoc = oWordApp.Documents.Open(sLibName & sAnalyzedDoc, ReadOnly:=True, OpenAndRepair:=False, Visible:=False)
        With oDoc
            For i = 1 To .Sentences.Count
                ' Search for the keyword within the document
                If Not InStr(LCase(.Sentences.Item(i).Text), LCase(sKeyword)) = 0 Then
                    If Not IsEmpty(aOut(1, 2)) Then
                        ReDim Preserve aOut(1 To 2, 1 To UBound(aOut, 2) + 1)
                    End If
                    aOut(1, UBound(aOut, 2)) = sAnalyzedDoc
                    aOut(2, UBound(aOut, 2)) = .Sentences.Item(i).Text
                    Exit For ' Stop at the first match in this document
                End If
            Next i
            ' Close without saving to keep the closing fast
            .Close SaveChanges:=False
        End With
    End If
    ' Move on to the next document
    sAnalyzedDoc = Dir
Wend

' "Output" is assumed to be the code name of the destination worksheet
With Output
    .Range(.Cells(1, 1), .Cells(UBound(aOut, 1), UBound(aOut, 2))).Value = aOut
End With

' Shut down the hidden Word instance
oWordApp.Quit SaveChanges:=False
Set oWordApp = Nothing

With Application
    .DisplayAlerts = True
    .ScreenUpdating = True
End With

End Sub

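My question

An idea that came to me is to reach the text directly through the XML content inside the document (which you can access by renaming any document from a recent version of Word with a .zip extension: nameOfDocument.zip\word\document.xml).

This would be much faster than loading all the images, charts and tables of the Word document, which are of no use for a text search.

So I would like to ask whether there is a way in VBA to open a Word document like a zip file and access the XML document, and then process it like a normal string in VBA, since I already have the file path and name from the code above.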
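
To make the package layout described above concrete, here is a minimal sketch that treats a .docx file as a zip archive and returns word\document.xml as a plain string. It is written in C# rather than VBA, purely for illustration. It assumes .NET 4.5 or later with references to System.IO.Compression and System.IO.Compression.FileSystem, and it assumes the main part sits at word/document.xml, which is where Word writes it. The names DocxZipReader and ReadDocumentXml are hypothetical.

```csharp
using System;
using System.IO;
using System.IO.Compression;

public static class DocxZipReader
{
    // Returns the raw XML of the main document part, or null when the
    // archive has no word/document.xml entry.
    public static string ReadDocumentXml(string docxPath)
    {
        // ZipFile.OpenRead opens the archive read-only; nothing is extracted to disk.
        using (ZipArchive archive = ZipFile.OpenRead(docxPath))
        {
            ZipArchiveEntry entry = archive.GetEntry("word/document.xml");
            if (entry == null)
            {
                return null;
            }

            // Only this single entry is decompressed; images and other parts are skipped.
            using (var reader = new StreamReader(entry.Open()))
            {
                return reader.ReadToEnd();
            }
        }
    }
}
```

The returned string still contains all the XML markup, so a keyword search on it can also match tag or attribute names.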

1 Answer:

Answer 0 (score: 2)

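Note that this is not a straightforward answer to the question above. As long as you do not have a huge number of documents to browse, the VBA-only code in my initial question does the job perfectly; otherwise, move on to another tool (there is a Python Dynamic Link Library (DLL) that is very good).

OK, I will try to make my answer as explanatory as possible.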

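First of all, this question led me on an endless journey into XML in C# and XPath, which at some point I chose not to pursue.

The solution below reduced the time needed to analyze the files from about 2 hours to 10 seconds.

Context

The backbone for reading XML documents, and therefore the XML inside Word documents, is Microsoft's OpenXML library. Keep in mind what I said above: the approach I was trying to implement could not be done in VBA alone, so it had to be done another way. This is probably because VBA is implemented for Office and is therefore restricted from accessing the core structure of Office documents, but I have no information about this limitation (any information is welcome).

The answer I give here is to write a C# DLL for VBA. For writing the DLL in C# and referencing it from VBA, I redirect you to the following link, which sums up this particular process better than I could: Tutorial for creating dll in C#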
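
As a rough illustration of what exposing a C# class to VBA involves, here is a minimal sketch of a COM-visible class. The namespace ComSketch, the class KeywordCounter and its method are hypothetical and unrelated to the code further down. The sketch also assumes the assembly is registered for COM separately (for example with the "Register for COM interop" project option or with regasm), which is the part the linked tutorial covers.

```csharp
using System;
using System.Runtime.InteropServices;

namespace ComSketch
{
    // ComVisible exposes the class to COM clients such as VBA.
    // AutoDual generates a class interface so the public methods can be early-bound from VBA.
    [ComVisible(true)]
    [ClassInterface(ClassInterfaceType.AutoDual)]
    public class KeywordCounter
    {
        // COM clients need a public parameterless constructor to create the object.
        public KeywordCounter() { }

        // Counts case-insensitive, non-overlapping occurrences of keyword in text.
        public int CountOccurrences(string text, string keyword)
        {
            if (string.IsNullOrEmpty(text) || string.IsNullOrEmpty(keyword))
            {
                return 0;
            }

            int count = 0;
            int pos = 0;
            while ((pos = text.IndexOf(keyword, pos, StringComparison.OrdinalIgnoreCase)) >= 0)
            {
                count++;
                pos += keyword.Length;
            }
            return count;
        }
    }
}
```

Once the assembly is registered and referenced in the VBA project, such a class would be created with New ComSketch.KeywordCounter, in the same way as the BrowserClass.SpecificDirectory object used later in this answer.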

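Let's begin

First, you need to reference the WindowsBase library and DocumentFormat.OpenXML in your project to make the solution work, as explained in these MSDN articles: "Manipulate Office Open XML Formats Documents" and "Open and add text to a word processing document (Open XML SDK)". These articles explain at length how the OpenXML library works with Word documents.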
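
For comparison with the System.IO.Packaging code further down, here is a minimal sketch of reading a document's text with the Open XML SDK classes themselves. It assumes the DocumentFormat.OpenXml library is referenced in the project. The names SdkTextReader and ReadParagraphs are hypothetical, and this is not the code used for the timings quoted above.

```csharp
using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

public static class SdkTextReader
{
    // Returns the text of each non-empty paragraph of the main document body.
    public static string[] ReadParagraphs(string docxPath)
    {
        // The second argument (isEditable = false) opens the package read-only.
        using (WordprocessingDocument doc = WordprocessingDocument.Open(docxPath, false))
        {
            Body body = doc.MainDocumentPart.Document.Body;

            // Text elements correspond to <w:t>; concatenating them per paragraph
            // rebuilds the visible text without field instructions.
            return body.Descendants<Paragraph>()
                       .Select(p => string.Concat(p.Descendants<Text>().Select(t => t.Text)))
                       .Where(s => s.Length > 0)
                       .ToArray();
        }
    }
}
```

Going through the typed classes avoids writing namespace URIs by hand, at the price of an extra dependency compared with the WindowsBase-only approach.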

C# code

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
using System.Xml;
using System.IO.Packaging;

namespace BrowserClass
{

    public class SpecificDirectory
    {

        public string[,] LookUpWord(string nameKeyword, string nameStopword, string nameDirectory)
        {
            string sKeyWord = nameKeyword;
            string sStopWord = nameStopword;
            string sDirectory = nameDirectory;

            sStopWord = sStopWord.ToLower();
            sKeyWord = sKeyWord.ToLower();

            string sDocPath = Path.GetDirectoryName(sDirectory);
            // Looking for all the documents with the .docx extension
            string[] sDocName = Directory.GetFiles(sDocPath, "*.docx", SearchOption.AllDirectories);
            string[] sDocumentList = new string[1];
            string[] sDocumentText = new string[1];

            // Cycling the documents retrieved in the folder
            for (int i = 0; i < sDocName.Count(); i++)
            {
                string docWord = sDocName[i];

                // Opening the documents as read only, no need to edit them
                Package officePackage = Package.Open(docWord, FileMode.Open, FileAccess.Read);

                const String officeDocRelType = @"http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument";

                PackagePart corePart = null;
                Uri documentUri = null;

                // We are extracting the part with the document content within the files
                foreach (PackageRelationship relationship in officePackage.GetRelationshipsByType(officeDocRelType))
                {
                    documentUri = PackUriHelper.ResolvePartUri(new Uri("/", UriKind.Relative), relationship.TargetUri);
                    corePart = officePackage.GetPart(documentUri);
                    break;
                }

                // Here enter the proper code
                if (corePart != null)
                {
                    string cpPropertiesSchema = "http://schemas.openxmlformats.org/package/2006/metadata/core-properties";
                    string dcPropertiesSchema = "http://purl.org/dc/elements/1.1/";
                    string dcTermsPropertiesSchema = "http://purl.org/dc/terms/";

                    // Construction of a namespace manager to handle the different parts of the xml files
                    NameTable nt = new NameTable();
                    XmlNamespaceManager nsmgr = new XmlNamespaceManager(nt);
                    nsmgr.AddNamespace("dc", dcPropertiesSchema);
                    nsmgr.AddNamespace("cp", cpPropertiesSchema);
                    nsmgr.AddNamespace("dcterms", dcTermsPropertiesSchema);

                    // Loading the xml document's text
                    XmlDocument doc = new XmlDocument(nt);
                    doc.Load(corePart.GetStream());

                    // I chose to directly load the inner text because I could not parse the way I wanted the document, but it works so far
                    string docInnerText = doc.DocumentElement.InnerText;
                    docInnerText = docInnerText.Replace("\\* MERGEFORMAT", ".");
                    docInnerText = docInnerText.Replace("DOCPROPERTY ", "");
                    docInnerText = docInnerText.Replace("Glossary.", "");

                    try
                    {
                        Int32 iPosKeyword = docInnerText.ToLower().IndexOf(sKeyWord);
                        Int32 iPosStopWord = docInnerText.ToLower().IndexOf(sStopWord);

                        if (iPosStopWord == -1)
                        {
                            iPosStopWord = docInnerText.Length;
                        }

                        if (iPosKeyword != -1 && iPosKeyword <= iPosStopWord)
                        {
                            // Redimensions the array if there was already a document loaded
                            if (sDocumentList[0] != null)
                            {
                                Array.Resize(ref sDocumentList, sDocumentList.Length + 1);
                                Array.Resize(ref sDocumentText, sDocumentText.Length + 1);
                            }
                            sDocumentList[sDocumentList.Length - 1] = docWord.Substring(sDocPath.Length, docWord.Length - sDocPath.Length);
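                            // Taking a short context after the keyword, clamped to the end of the text
                            int iContextLength = Math.Min(sKeyWord.Length + 60, docInnerText.Length - iPosKeyword);
                            sDocumentText[sDocumentText.Length - 1] = ("(...) " + docInnerText.Substring(iPosKeyword, iContextLength) + " (...)");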
                        }

                    }
                    catch (ArgumentOutOfRangeException)
                    {
                        Console.WriteLine("Error reading inner text.");
                    }
                }
                // Closing the package to enable opening a document right after
                officePackage.Close();
            }

            if (sDocumentList[0] != null)
            {
                // Preparing the array for output
                string[,] sFinalArray = new string[sDocumentList.Length, 2];

                for (int i = 0; i < sDocumentList.Length; i++)
                {
                    sFinalArray[i, 0] = sDocumentList[i].Replace("\\", "");
                    sFinalArray[i, 1] = sDocumentText[i];
                }
                return sFinalArray;
            }
            else 
            {
                // Preparing the array for output
                string[,] sFinalArray = new string[1, 1];
                sFinalArray[0, 0] = "NO MATCH";
                return sFinalArray;
            }
        }
    }

}

Associated VBA code

Option Explicit

Const sLibname As String = "C:\pathToYourDocuments\"

Sub tester()

Dim aFiles As Variant
Dim LookUpDir As BrowserClass.SpecificDirectory
Set LookUpDir = New BrowserClass.SpecificDirectory

' The array will contain all the files which contain the "searchedPhrase" '
aFiles = LookUpDir.LookUpWord("searchedPhrase", "stopWord", sLibname)

' Add here any necessary processing if needed '

End Sub

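So in the end you get a tool that scans .docx documents much faster than the classic open-read-close approach in VBA, at the cost of writing more code.

Above all, you get a simple solution for users who only want to perform simple searches, especially when there is a large number of Word documents.

Note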

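Parsing the Word .XML files can be a nightmare in VBA, as @Mikegrann pointed out. Thankfully, OpenXML has an XML parser (see "C# , xml parsing. get data between tags") that will do the work for you in C# and fetch the <w:t></w:t> tags, which hold the text of the document. I found the following answers, but so far I could not get them to work: "Parsing a MS Word generated XML file in C#" and "Reading specific XML elements from XML file".

So I went with the .InnerText solution shown in the code above to access the inner text, at the cost of pulling in some formatting text (such as \\* MERGEFORMAT).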
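
One possible way to avoid the stray field codes is to collect only the <w:t> elements instead of the whole inner text, since field instructions such as MERGEFORMAT are stored in <w:instrText> elements. Below is a minimal sketch using LINQ to XML. It assumes the stream passed in is the main document part (for example corePart.GetStream() from the code above); the names RunTextExtractor and ExtractParagraphs are hypothetical.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Xml.Linq;

public static class RunTextExtractor
{
    // WordprocessingML main namespace, bound to the "w" prefix in document.xml.
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns one string per non-empty paragraph, built only from <w:t> elements.
    public static List<string> ExtractParagraphs(Stream documentXml)
    {
        XDocument xml = XDocument.Load(documentXml);

        return xml.Descendants(W + "p")
                  // <w:instrText> and other non-text elements are simply never selected.
                  .Select(p => string.Concat(p.Descendants(W + "t").Select(t => t.Value)))
                  .Where(text => text.Length > 0)
                  .ToList();
    }
}
```

Paragraphs nested inside other paragraphs (text boxes, for instance) would be returned twice by this sketch, once on their own and once as part of the outer paragraph, so it is only a starting point.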