Question

I am trying to extract images from a Word document using ActiveXObject in JavaScript (IE only).

I have been unable to find any API reference for the Word object, only a few hints from around the internet:

var filename = 'path/to/word/doc.docx'
var word = new ActiveXObject('Word.Application')
var doc = word.Documents.Open(filename)
// Displays the text
var docText = doc.Content

How can I access the images in the Word document using something like doc.Content?

Also, if anyone has an authoritative source for the API (preferably from Microsoft), that would be very helpful.

Answer 1

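So, after a few weeks of research, I found that the easiest way to extract the images is to use the SaveAs function that is part of the Word ActiveXObject. If the file is saved as an HTML document, Word creates a folder containing the images.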

From there, you can use XMLHttp to fetch the HTML file and create new IMG tags that the browser can display (I am using IE 9, since ActiveXObject only works in Internet Explorer).

Let's start with the SaveAs part:

// Define the path to the file
var filepath = 'path/to/the/word/doc.docx'
// Make a new ActiveXWord application
var word = new ActiveXObject('Word.Application')
// Open the document
var doc = word.Documents.Open(filepath)
// Save the DOCX as an HTML file (the 8 specifies you want to save it as an HTML document)
doc.SaveAs(filepath + '.htm', 8)
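As a minimal sketch, the open/save steps can be wrapped in one function that also closes the document afterwards. The helper name `convertToHtml` and the constant names are hypothetical and not part of the original answer. The value 8 corresponds to wdFormatHTML in Word's WdSaveFormat enumeration, and 10 (wdFormatFilteredHTML) is listed only as an alternative that drops most Word-specific markup. The `word` argument is assumed to be a Word.Application object.

```javascript
// Values from Word's WdSaveFormat enumeration.
var WD_FORMAT_HTML = 8;           // full Word HTML (the format used in this answer)
var WD_FORMAT_FILTERED_HTML = 10; // filtered HTML, an alternative not used here

// Hypothetical helper: `word` is assumed to be a Word.Application object,
// e.g. new ActiveXObject('Word.Application') in Internet Explorer.
function convertToHtml(word, filepath, format) {
  var htmlPath = filepath + '.htm';
  var doc = word.Documents.Open(filepath);
  try {
    // Default to the full HTML format when no format is given.
    doc.SaveAs(htmlPath, format === undefined ? WD_FORMAT_HTML : format);
  } finally {
    // Close the document without saving further changes.
    doc.Close(false);
  }
  return htmlPath;
}
```

When many documents are processed in a row, closing each one after saving keeps them from accumulating inside the single Word instance.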

We should now have a folder in the same directory that contains the image files.
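The folder name can be derived from the saved HTML path. The sketch below assumes the naming seen in this answer's own example, where `doc.docx.htm` is accompanied by `doc.docx_files`. The helper name `imageFolderFor` is hypothetical, and the `_files` suffix may differ in other Word configurations.

```javascript
// Hypothetical helper: derive the companion image folder from the saved HTML path.
// Assumes Word names the folder "<HTML file name without extension>_files".
function imageFolderFor(htmlPath) {
  return htmlPath.replace(/\.html?$/i, '') + '_files';
}
```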

Note: in Word HTML, images use <v:imagedata> tags, which are stored inside <v:shape> tags; for example:

<v:shape style="width: 241.5pt; height: 71.25pt;">
     <v:imagedata src="path/to/the/word/doc.docx_files/image001.png">
         ...
     </v:imagedata>
</v:shape>

I have removed the extraneous attributes and tags that Word saves.
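Given markup of that shape, the image sources can also be read straight from the HTML text, without loading it into the DOM. This is a different technique from the one the answer uses, shown only as a rough sketch: the helper name `findImageSources` is hypothetical, and a regular expression is not a full HTML parser, so it assumes the `src` value contains no `>` character.

```javascript
// Hypothetical helper: collect the src of every <v:imagedata> tag in Word HTML text.
function findImageSources(htmlText) {
  var tagPattern = /<v:imagedata\b[^>]*>/gi;
  // Matches src="...", src='...' or an unquoted src value.
  var srcPattern = /\ssrc\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))/i;
  var sources = [];
  var tags = htmlText.match(tagPattern) || [];
  for (var i = 0; i < tags.length; i++) {
    var m = srcPattern.exec(tags[i]);
    if (m) {
      // Exactly one of the three capture groups holds the value.
      sources.push(m[1] || m[2] || m[3] || '');
    }
  }
  return sources;
}
```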

To access the HTML from JavaScript, use the XMLHttpRequest object.

 var xmlhttp = new XMLHttpRequest()
 var html_text = ""

Because I am accessing several hundred Word documents, I found it best to define XMLHttp's onreadystatechange callback before making the send call.

// Define the onreadystatechange callback function
xmlhttp.onreadystatechange = function() {
    // Check to make sure the response has fully loaded
    if (xmlhttp.readyState==4 && xmlhttp.status==200) {
        // Grab the response text
        var html_text=xmlhttp.responseText
        // Load the HTML into the innerHTML of a DIV to add the HTML to the DOM
        document.getElementById('doc_html').innerHTML=html_text.replace("<html>", "").replace("</html>","")
        // Define a new array of all HTML elements with the "v:imagedata" tag
        var images =document.getElementById('doc_html').getElementsByTagName("v:imagedata")
        // Loop through each image
        for(j=0;j<images.length;j++) {
            // Grab the source attribute to get the image name
            var src = images[j].getAttribute('src')
            // Check to make sure the image has a 'src' attribute
            if(src!=undefined) {
...

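I had a lot of trouble getting the correct src attribute, because IE escapes them as HTML attributes when they are loaded into the innerHTML of the doc_html div. In the example below I therefore use a pseudo-path and src.split('/')[1] to get the image name (this approach will not work if the src contains more than one forward slash!):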

... images[j].setAttribute('src', '/path/to/the/folder/containing/the/images/'+src.split('/')[1]) ...
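The single-slash limitation can be avoided by taking the last path segment instead of the second one. This is a minimal sketch rather than the answer's own code; the helper name `imageFileName` is hypothetical. It splits on both forward slashes and backslashes and ignores any query string or fragment.

```javascript
// Hypothetical helper: return the file name part of an image src,
// regardless of how many path separators precede it.
function imageFileName(src) {
  // Drop any query string or fragment first.
  var clean = src.split(/[?#]/)[0];
  // Split on forward slashes and backslashes, then keep the last segment.
  var parts = clean.split(/[\/\\]/);
  return parts[parts.length - 1];
}
```

With this helper, the line above could use imageFileName(src) in place of src.split('/')[1].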

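This is where we add a new img tag to the HTML div, using the parent of the image's parent: the image's parent is the v:shape object, and its parent happens to be a p object. We append the new img tag to that element's innerHTML, taking the src attribute from the image and the style information from the v:shape element: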

...
                images[j].parentElement.parentElement.innerHTML+="<img src='"+images[j].getAttribute('src')+"' style='"+images[j].parentElement.getAttribute('style')+"'>"
            }
        }
    }
}
// Read the HTML Document using XMLHttpRequest
xmlhttp.open("POST", filepath + '.htm', false)
xmlhttp.send()
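The string concatenation that builds the img tag can be pulled out into a small function that escapes attribute values and leaves out the style when the shape has none. This is a sketch under assumptions: the helper names `escapeAttribute` and `buildImgTag` are hypothetical, and the escaping is an addition that the original code does not perform.

```javascript
// Hypothetical helper: escape a value for use inside a double-quoted HTML attribute.
function escapeAttribute(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/"/g, '&quot;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Hypothetical helper: build the <img> markup from an image src and an optional style.
function buildImgTag(src, style) {
  var tag = '<img src="' + escapeAttribute(src) + '"';
  if (style) {
    tag += ' style="' + escapeAttribute(style) + '"';
  }
  return tag + '>';
}
```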

Although it is a bit specific, the method above successfully adds the img tags to the HTML at the positions they had in the original document.
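For reading the saved file, a non-blocking variant is sketched below. It departs from the answer's code in two labeled ways: it uses GET instead of POST, and an asynchronous request instead of a synchronous one. The helper name `loadHtml` and the optional `createRequest` parameter are hypothetical; the parameter only exists so the function can be tried without a browser.

```javascript
// Hypothetical helper: fetch the saved HTML file and pass its text to onLoaded.
// createRequest is optional; by default the global XMLHttpRequest is used.
function loadHtml(url, onLoaded, createRequest) {
  var request = createRequest ? createRequest() : new XMLHttpRequest();
  request.open('GET', url, true);
  // Attach the callback before send() so no state change is missed.
  request.onreadystatechange = function () {
    // Only act once the response has fully loaded successfully.
    if (request.readyState === 4 && request.status === 200) {
      onLoaded(request.responseText);
    }
  };
  request.send();
}
```

A call such as loadHtml(filepath + '.htm', function (text) { ... }) would then receive the HTML text once it has loaded.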

How do I extract images from a Word document using JavaScript?