以此为基础Extract DOCX Comments:
from lxml import etree
import zipfile
ooXMLns = {'w':'http://schemas.openxmlformats.org/wordprocessingml/2006/main'}
def get_comments(docxFileName):
docxZip = zipfile.ZipFile(docxFileName)
commentsXML = docxZip.read('word/comments.xml')
et = etree.XML(commentsXML)
comments = et.xpath('//w:comment',namespaces=ooXMLns)
for c in comments:
# attributes:
print(c.xpath('@w:author',namespaces=ooXMLns))
print(c.xpath('@w:date',namespaces=ooXMLns))
# string value of the comment:
print(c.xpath('string(.)',namespaces=ooXMLns))
如何获取注释所引用的文本?为了清楚起见,我试图提取已注释的文本,而不是注释本身正文中的文本