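Context
I am trying to build a Word document browser in Excel to screen a large number of documents (around 1000).
Opening a Word document turns out to be rather slow (about 4 seconds per document, so going through all of them takes around 2 hours in this case, which is far too slow for a single query), even after disabling everything that could slow the opening down. I therefore open each document in read-only mode, without open-and-repair, and invisible.
What I have tried so far
These documents are hard to search because some keywords appear in every document, but not in the same context (this is not the core of the problem, since I can handle it once the text is loaded into an array). So in my case the usual Windows Explorer solution (like the one in this link) cannot be used.
So far, I have a working macro that analyzes the content of the Word documents by opening them.
Code
Here is a code sample.
Note that I added a reference to the Microsoft Word 14.0 Object Library.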
' Analyze all the Word documents located in the same folder as this workbook
Sub extractFile()
    Dim i As Long
    Dim sAnalyzedDoc As String, sLibName As String, sKeyword As String
    Dim aOut()
    Dim oWordApp As Word.Application
    Dim oDoc As Word.Document

    Set oWordApp = CreateObject("Word.Application")
    sLibName = ThisWorkbook.Path & "\"
    sAnalyzedDoc = Dir(sLibName)
    sKeyword = "example of a word"

    With Application
        .DisplayAlerts = False
        .ScreenUpdating = False
    End With

    ' Explicit 1-based bounds so the array matches the worksheet range written at the end
    ReDim aOut(1 To 2, 1 To 2)
    aOut(1, 1) = "Document name"
    aOut(2, 1) = "Text"

    While (sAnalyzedDoc <> "")
        ' Analyze only documents with the .doc or .docx extension
        If Not InStr(sAnalyzedDoc, ".doc") = 0 Then
            ' Open the document as mentioned above: read-only, without repair, invisible
            ' (sLibName already ends with a backslash, so none is added here)
            Set oDoc = oWordApp.Documents.Open(sLibName & sAnalyzedDoc, ReadOnly:=True, OpenAndRepair:=False, Visible:=False)
            With oDoc
                For i = 1 To .Sentences.Count
                    ' Search for the keyword within the document
                    If Not InStr(LCase(.Sentences.Item(i)), LCase(sKeyword)) = 0 Then
                        If Not IsEmpty(aOut(1, 2)) Then
                            ReDim Preserve aOut(1 To 2, 1 To UBound(aOut, 2) + 1)
                        End If
                        aOut(1, UBound(aOut, 2)) = sAnalyzedDoc
                        aOut(2, UBound(aOut, 2)) = .Sentences.Item(i)
                        GoTo closingDoc ' A dubious programming choice, but it works for the moment
                    End If
                Next i
closingDoc:
                ' Close without saving, intended to make closing faster
                .Close SaveChanges:=False
            End With
        End If
        ' Move on to the next document
        sAnalyzedDoc = Dir
    Wend

exitSub:
    ' Release the hidden Word instance created above
    oWordApp.Quit SaveChanges:=False
    ' Output is assumed to be the code name of the destination worksheet
    With Output
        .Range(.Cells(1, 1), .Cells(UBound(aOut, 1), UBound(aOut, 2))) = aOut
    End With
    With Application
        .DisplayAlerts = True
        .ScreenUpdating = True
    End With
End Sub
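My question
My idea is to access a document's content directly through the XML it contains (for recent versions of Word, you can reach it by renaming the document with a .zip extension and opening nameOfDocument.zip\word\document.xml).
This would be much faster than loading all the images, charts and tables of the Word document, which are useless for a text search.
So I would like to ask whether there is a way in VBA to open a Word document like a zip file and access the XML document, then process it like a normal string in VBA, since I already have the file path and name from the code above.
Answer 0 (score: 2)
Note that this is not a simple answer to the question above. As long as you do not have a large number of documents to browse, the VBA-only code in my initial question does the job perfectly well; otherwise, move to another tool (there is a Python Dynamic Link Library (DLL) that is very good).
OK, I will try to make my answer as explicit as possible.
First, this question led me on an endless journey through C# and XML in XPath, which I eventually chose not to pursue.
The solution below reduced the time needed to analyze the files from about 2 hours to 10 seconds.
Context
The backbone for reading XML documents, and therefore the XML inside Word documents, is Microsoft's OpenXML library. Keep in mind what I said above: the approach I was trying to implement cannot be done in VBA alone, so it has to be done another way. This is probably because VBA is implemented for Office and is therefore restricted from accessing the core structure of Office documents, but I have no information about this limitation (any information is welcome).
The answer I give here is to write a C# DLL for VBA. For writing a DLL in C# and referencing it from VBA, I redirect you to the following link, which summarizes that procedure better than I could: Tutorial for creating dll in C#
Let's start
First, you need to reference the WindowsBase library and DocumentFormat.OpenXML in your project for the solution to work, as explained in the MSDN articles Manipulate Office Open XML Formats Documents and Open and add text to a word processing document (Open XML SDK). These articles explain broadly how the OpenXML library handles Word documents.
C# code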
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
using System.Xml;
using System.IO.Packaging;

namespace BrowserClass
{
    public class SpecificDirectory
    {
        public string[,] LookUpWord(string nameKeyword, string nameStopword, string nameDirectory)
        {
            string sKeyWord = nameKeyword;
            string sStopWord = nameStopword;
            string sDirectory = nameDirectory;

            sStopWord = sStopWord.ToLower();
            sKeyWord = sKeyWord.ToLower();

            string sDocPath = Path.GetDirectoryName(sDirectory);
            // Looking for all the documents with the .docx extension
            string[] sDocName = Directory.GetFiles(sDocPath, "*.docx", SearchOption.AllDirectories);
            string[] sDocumentList = new string[1];
            string[] sDocumentText = new string[1];

            // Cycling the documents retrieved in the folder
            for (int i = 0; i < sDocName.Count(); i++)
            {
                string docWord = sDocName[i];

                // Opening the documents as read only, no need to edit them
                Package officePackage = Package.Open(docWord, FileMode.Open, FileAccess.Read);

                const String officeDocRelType = @"http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument";

                PackagePart corePart = null;
                Uri documentUri = null;

                // We are extracting the part with the document content within the files
                foreach (PackageRelationship relationship in officePackage.GetRelationshipsByType(officeDocRelType))
                {
                    documentUri = PackUriHelper.ResolvePartUri(new Uri("/", UriKind.Relative), relationship.TargetUri);
                    corePart = officePackage.GetPart(documentUri);
                    break;
                }

                // Here enter the proper code
                if (corePart != null)
                {
                    string cpPropertiesSchema = "http://schemas.openxmlformats.org/package/2006/metadata/core-properties";
                    string dcPropertiesSchema = "http://purl.org/dc/elements/1.1/";
                    string dcTermsPropertiesSchema = "http://purl.org/dc/terms/";

                    // Construction of a namespace manager to handle the different parts of the xml files
                    NameTable nt = new NameTable();
                    XmlNamespaceManager nsmgr = new XmlNamespaceManager(nt);
                    nsmgr.AddNamespace("dc", dcPropertiesSchema);
                    nsmgr.AddNamespace("cp", cpPropertiesSchema);
                    nsmgr.AddNamespace("dcterms", dcTermsPropertiesSchema);

                    // Loading the xml document's text
                    XmlDocument doc = new XmlDocument(nt);
                    doc.Load(corePart.GetStream());

                    // I chose to directly load the inner text because I could not parse the document the way I wanted, but it works so far
                    string docInnerText = doc.DocumentElement.InnerText;
                    docInnerText = docInnerText.Replace("\\* MERGEFORMAT", ".");
                    docInnerText = docInnerText.Replace("DOCPROPERTY ", "");
                    docInnerText = docInnerText.Replace("Glossary.", "");

                    try
                    {
                        Int32 iPosKeyword = docInnerText.ToLower().IndexOf(sKeyWord);
                        Int32 iPosStopWord = docInnerText.ToLower().IndexOf(sStopWord);

                        if (iPosStopWord == -1)
                        {
                            iPosStopWord = docInnerText.Length;
                        }

                        if (iPosKeyword != -1 && iPosKeyword <= iPosStopWord)
                        {
                            // Redimensions the array if there was already a document loaded
                            if (sDocumentList[0] != null)
                            {
                                Array.Resize(ref sDocumentList, sDocumentList.Length + 1);
                                Array.Resize(ref sDocumentText, sDocumentText.Length + 1);
                            }
                            sDocumentList[sDocumentList.Length - 1] = docWord.Substring(sDocPath.Length, docWord.Length - sDocPath.Length);
                            // Taking the small context around the keyword, clamped so a keyword near the end of the text does not overrun it
                            int iContextLength = Math.Min(sKeyWord.Length + 60, docInnerText.Length - iPosKeyword);
                            sDocumentText[sDocumentText.Length - 1] = ("(...) " + docInnerText.Substring(iPosKeyword, iContextLength) + " (...)");
                        }
                    }
                    catch (ArgumentOutOfRangeException)
                    {
                        Console.WriteLine("Error reading inner text.");
                    }
                }
                // Closing the package to enable opening a document right after
                officePackage.Close();
            }

            if (sDocumentList[0] != null)
            {
                // Preparing the array for output
                string[,] sFinalArray = new string[sDocumentList.Length, 2];

                for (int i = 0; i < sDocumentList.Length; i++)
                {
                    sFinalArray[i, 0] = sDocumentList[i].Replace("\\", "");
                    sFinalArray[i, 1] = sDocumentText[i];
                }
                return sFinalArray;
            }
            else
            {
                // Preparing the array for output
                string[,] sFinalArray = new string[1, 1];
                sFinalArray[0, 0] = "NO MATCH";
                return sFinalArray;
            }
        }
    }
}
Associated VBA code
Option Explicit
Const sLibname As String = "C:\pathToYourDocuments\"

Sub tester()
    Dim aFiles As Variant
    Dim LookUpDir As BrowserClass.SpecificDirectory
    Set LookUpDir = New BrowserClass.SpecificDirectory

    ' The array will contain all the files which contain the "searchedPhrase"
    aFiles = LookUpDir.LookUpWord("searchedPhrase", "stopWord", sLibname)
    ' Add here any necessary processing if needed
End Sub
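So in the end you get a tool that scans .docx documents much faster than the classic open-read-close approach in VBA, at the cost of writing more code.
Above all, you can offer a simple solution to users who just want to run simple searches, especially when there are a lot of Word documents.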
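For readers who want to check the matching rule in isolation, here is a minimal sketch of it in Python, kept separate from the DLL. It reflects my reading of the C# code above: a document matches only when the keyword occurs at or before the first occurrence of the stop word, and the excerpt is the keyword plus a fixed number of following characters. The function name `find_context` and the `extra` parameter are hypothetical names introduced for illustration only.

```python
def find_context(text, keyword, stopword, extra=60):
    """Return a short excerpt starting at `keyword`, or None when there is no match.

    A match only counts when the keyword occurs at or before the first
    occurrence of `stopword`; if the stop word is absent, the whole text counts.
    """
    lowered = text.lower()
    pos_keyword = lowered.find(keyword.lower())
    pos_stop = lowered.find(stopword.lower())
    if pos_stop == -1:
        # No stop word: treat the end of the text as the limit
        pos_stop = len(text)
    if pos_keyword == -1 or pos_keyword > pos_stop:
        return None
    # Slicing clamps at the end of the string, so a keyword near the end is safe
    excerpt = text[pos_keyword:pos_keyword + len(keyword) + extra]
    return "(...) " + excerpt + " (...)"


if __name__ == "__main__":
    sample = "Intro. The budget is fixed. Glossary: budget terms"
    print(find_context(sample, "budget", "glossary", extra=10))
```

As in the C# version, an empty stop word is found at position 0, so it effectively restricts matches to the very start of the text; pass a string that cannot occur if no stop word is wanted.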
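Note
Parsing Word .XML files can be a nightmare in VBA, as @Mikegrann pointed out.
Thankfully, OpenXML has an XML parser (C# , xml parsing. get data between tags) that does the job for you in C# and retrieves the <w:t></w:t> tags that hold the document's text. I found these answers so far but could not get them to work: Parsing a MS Word generated XML file in C#, Reading specific XML elements from XML file
So I went with the .InnerText solution shown in the code above to access the inner text, at the cost of picking up some formatting text (such as \* MERGEFORMAT).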