Question

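I'm working with a batch of Word documents in which some of the text (words) is highlighted using color codes such as yellow, blue, and gray. I now want to extract the highlighted words associated with each color. I'm programming in Python. Here is what I have done so far: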

I open the Word document with [python-docx][1] and then go to the <w:r> tags, which contain the tokens (words) in the document. I used the following code:

#!/usr/bin/env python2.6
# -*- coding: ascii -*-
from docx import *
document = opendocx('test.docx')
words = document.xpath('//w:r', namespaces=document.nsmap)
for word in words:
  print word

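Now I'm stuck at the part where I check each word for a <w:highlight> tag, extract the color code from it, and, if it matches yellow, print the text inside the <w:t> tag. I would really appreciate it if someone could point me to how to extract the words from the parsed file.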

Answer 1

I had never worked with python-docx before, but what helped was a snippet I found online showing what the XML structure of highlighted text looks like:

  <w:r>
    <w:rPr>
      <w:highlight w:val="yellow"/>
    </w:rPr>
    <w:t>text that is highlighted</w:t>
  </w:r>
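
A quick way to check that structure is to parse the fragment directly. The following is only a minimal sketch, assuming Python 3 and the standard library's xml.etree.ElementTree instead of python-docx; the helper names qn and run_highlight are made up for illustration, and the fragment is wrapped in a <w:p> element solely so that the w: prefix is declared.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace (the one bound to the w: prefix).
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# The fragment above, wrapped in a paragraph that declares the w: prefix.
FRAGMENT = (
    '<w:p xmlns:w="%s">'
    '<w:r>'
    '<w:rPr><w:highlight w:val="yellow"/></w:rPr>'
    '<w:t>text that is highlighted</w:t>'
    '</w:r>'
    '</w:p>' % W_NS
)


def qn(local_name):
    """Hypothetical helper: build a '{namespace}name' tag for ElementTree."""
    return "{%s}%s" % (W_NS, local_name)


def run_highlight(run):
    """Return (color, text) for a <w:r> element; color is None if absent."""
    highlight = run.find("%s/%s" % (qn("rPr"), qn("highlight")))
    color = highlight.get(qn("val")) if highlight is not None else None
    # A run may hold several <w:t> elements, so join them all.
    text = "".join(t.text or "" for t in run.findall(qn("t")))
    return color, text


root = ET.fromstring(FRAGMENT)
result = [run_highlight(run) for run in root.iter(qn("r"))]
print(result)  # [('yellow', 'text that is highlighted')]
```

Note that the color lives in the w:val attribute, which is itself namespaced, so it has to be looked up with the full namespace rather than as plain "val".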

From there, it was relatively straightforward to come up with this:

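from docx import *
document = opendocx(r'test.docx')
words = document.xpath('//w:r', namespaces=document.nsmap)

# lxml addresses namespaced tags and attributes as '{namespace}name'
WPML_URI = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
tag_rPr = WPML_URI + 'rPr'
tag_highlight = WPML_URI + 'highlight'
tag_val = WPML_URI + 'val'
tag_t = WPML_URI + 't'  # <w:t> holds the run's text

for word in words:
    for rPr in word.findall(tag_rPr):
        highlight = rPr.find(tag_highlight)
        # Skip runs whose properties contain no <w:highlight>
        if highlight is not None and highlight.attrib[tag_val] == 'yellow':
            t = word.find(tag_t)
            # Some runs (tabs, breaks) carry no <w:t> at all
            if t is not None:
                print t.text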
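
Since the question asks for the words belonging to each color rather than only yellow, here is a rough sketch of how the same idea could be generalized. It is an assumption-laden alternative, not the approach above: it targets Python 3, skips python-docx entirely, and reads word/document.xml from the .docx archive with the standard library. The function name highlighted_text_by_color is hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET
from collections import defaultdict

# WordprocessingML main namespace (the one bound to the w: prefix).
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}
VAL = "{%s}val" % W_NS


def highlighted_text_by_color(docx_file):
    """Map each highlight color name to the list of run texts that use it.

    docx_file may be a path or a binary file-like object. Only the main
    body (word/document.xml) is read; headers, footers and footnotes
    live in other parts of the archive and are ignored here.
    """
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    by_color = defaultdict(list)
    for run in root.iter("{%s}r" % W_NS):
        highlight = run.find("w:rPr/w:highlight", NS)
        if highlight is None:
            continue  # run is not highlighted
        color = highlight.get(VAL)
        # A run may hold several <w:t> elements, so join them all.
        text = "".join(t.text or "" for t in run.findall("w:t", NS))
        # "none" is the value Word writes for explicitly removed highlighting.
        if color and color != "none" and text:
            by_color[color].append(text)
    return dict(by_color)
```

Calling highlighted_text_by_color('test.docx') should give a dictionary keyed by the w:val color names, with each value being the highlighted run texts in document order. Two caveats: Word may split a single highlighted word across several runs, so adjacent entries sometimes need to be joined, and gray highlighting is stored as lightGray or darkGray rather than gray.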

Extracting highlighted words from a Word document (.docx) in Python
