Extract text from pptx, ppt, docx, doc and msg files with Python on Windows

Time: 2018-08-21 21:04:57

Tags: python powerpoint docx

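Is there a way to extract text from pptx, ppt, docx, doc and msg files on a Windows machine? I have a few hundred of these files and need a programmatic approach. I would prefer Python, but I am open to other suggestions.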

I searched online and found some discussions, but they apply to Linux machines.

1 answer:

Answer 0: (score: 1)

Word

I tried this for Word with python-docx; to install it, run pip install python-docx. I have a Word document named example that contains 4 lines of text, and they were extracted correctly, as the output below shows.

from docx import Document

d = Document("example.docx")

for par in d.paragraphs:
    print(par.text)

Output (contents of example.docx):

Titolo
Paragrafo 1 a titolo di esempio
This is an example of text
This is the final part, just 4 rows
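
If python-docx cannot be installed, the same paragraphs can be read with the standard library alone, because a .docx file is a zip archive whose main body lives in word/document.xml. The following is only a rough sketch of that idea, not what python-docx does internally: the function name docx_paragraphs is hypothetical, it ignores tabs, line breaks, headers and footers, and unlike d.paragraphs it also returns paragraphs that sit inside tables.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml inside a .docx archive
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(path):
    """Return the text of every <w:p> paragraph in the main document part.

    Hypothetical helper for illustration; assumes a well-formed .docx file.
    """
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for par in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more <w:t> elements
        paragraphs.append("".join(t.text or "" for t in par.iter(W_NS + "t")))
    return paragraphs


if __name__ == "__main__":
    for line in docx_paragraphs("example.docx"):
        print(line)
```

This approach only works for the zip-based .docx format; the older binary .doc format has a different structure and is not covered by it.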

Join the text of all docx files in a folder

import os
from docx import Document

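# Keep only names ending in .docx; skip the "~$" lock files Word creates for open documents
files = [f for f in os.listdir() if f.lower().endswith(".docx") and not f.startswith("~$")]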
text_collector = []
whole_text = ''
for f in files:
    doc = Document(f)
    for par in doc.paragraphs:
        text_collector.append(par.text)

for text in text_collector:
    whole_text += text + "\n"

print(whole_text)

Same as above, but with a selection

This code lists the docx files found in the folder and asks you to choose which of them to join.

import os
from docx import Document

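# Keep only names ending in .docx; skip the "~$" lock files Word creates for open documents
files = [f for f in os.listdir() if f.lower().endswith(".docx") and not f.startswith("~$")]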

for n,f in enumerate(files):
    print(n+1,f)
print()
print("Write the numbers of files you need separated by space")
inp = input("Which files do you want to join?")

# Convert the answer (e.g. "1 3") into a list of 1-based file numbers
desired = [int(x) for x in inp.split()]
list_to_join = []
for n in desired:
    list_to_join.append(files[n-1])


text_collector = []
whole_text = ''
for f in list_to_join:
    doc = Document(f)
    for par in doc.paragraphs:
        text_collector.append(par.text)

for text in text_collector:
    whole_text += text + "\n"

print(whole_text)
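
The selection step above trusts whatever is typed: a non-numeric entry raises an error inside int(), and 0 silently picks the last file because files[-1] is valid Python. A small validating helper might look like the sketch below; the name parse_selection and its error messages are assumptions for illustration, not part of the original answer.

```python
def parse_selection(answer, files):
    """Map a string such as "1 3" to the matching entries of files (1-based).

    Hypothetical helper: raises ValueError for non-numeric or out-of-range input.
    """
    chosen = []
    for token in answer.split():
        try:
            index = int(token)
        except ValueError:
            raise ValueError(f"not a number: {token!r}") from None
        # Reject 0 and negatives, which would otherwise index from the end
        if not 1 <= index <= len(files):
            raise ValueError(f"no file numbered {index}")
        chosen.append(files[index - 1])
    return chosen


# Usage with a made-up file list
print(parse_selection("1 3", ["a.docx", "b.docx", "c.docx"]))  # ['a.docx', 'c.docx']
```

In the script above, list_to_join could then be built with parse_selection(inp, files) in place of the manual loop.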