Question

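I am iterating over a list of URLs that link to docx, doc and pdf files. I wrote a function that extracts the text from the docx files and appends it to a new list. I am not interested in the pdf files, but I would also like to extract the text from the doc files in the same function.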

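After some research, it seems most people recommend textract for reading both docx and doc files. However, I cannot get it to run properly on my machine and would like to find another solution.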

I tried converting each doc file to docx, but that became too cumbersome (for me) to include in the function.

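This is what the function looks like right now. It downloads all the files and appends the text of each docx file to the list; for anything else it appends 'empty'.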

import os
import urllib.parse
import urllib.request

import docx2txt

l = []
for link in urls:
    link = link.strip()
    name = link.rsplit('/', 1)[-1]
    filename = name  # os.path.join() with a single argument returns it unchanged
    quoted_url = urllib.parse.quote(link, safe=":/")

    if not os.path.isfile(filename):
        print('Downloading: ' + filename)
        try:
            urllib.request.urlretrieve(quoted_url, filename)
            try:
                # docx2txt raises on files that are not valid .docx archives
                file = docx2txt.process(filename)
                file = file.replace('\n', ' ')
                file = file.replace('\t', ' ')
                l.append(file)
            except Exception:
                print('  no docx file')
                l.append('empty')
        except Exception as inst:
            print(inst)
            print('  Encountered error. Continuing.')
            l.append('empty')

The expected output is a list containing the text extracted from the doc and docx files, and 'empty' otherwise (for pdf files or broken links).

Answer 1

The code below reads a .doc file:

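import os
import win32com.client

# Requires Windows with Microsoft Word installed (win32com comes from pywin32).
word = win32com.client.Dispatch("Word.Application")
word.Visible = False
# Pass an absolute path; Word may not resolve relative paths against the script's directory.
doc = word.Documents.Open(os.path.abspath("myfile.doc"))
l.append(doc.Range().Text)  # l is the result list from the question
doc.Close(False)  # close without saving changes
word.Quit()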

In your code, check the file extension first: if it is .docx, run your existing code; elif it is .doc, run the code above; and if it is .pdf, just pass.
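The extension check described above could be sketched roughly as follows. This is only one possible arrangement, not the asker's or the answerer's exact code: the names `read_docx`, `read_doc`, `READERS` and `extract_text` are made up for illustration, the .doc branch assumes Windows with Word and pywin32 installed, and returning 'empty' for pdf files (rather than a bare pass) is an assumption taken from the expected output in the question.

```python
import os


def read_docx(path):
    """Extract text from a .docx file with docx2txt, as in the question."""
    import docx2txt  # imported lazily so the dispatcher itself has no hard dependency
    return docx2txt.process(path)


def read_doc(path):
    """Extract text from a .doc file via Word automation (Windows + Word only)."""
    import win32com.client  # provided by the pywin32 package
    word = win32com.client.Dispatch("Word.Application")
    word.Visible = False
    try:
        doc = word.Documents.Open(os.path.abspath(path))
        try:
            return doc.Range().Text
        finally:
            doc.Close(False)  # close without saving
    finally:
        word.Quit()


# Hypothetical lookup table: lowercase extension -> reader function.
READERS = {".docx": read_docx, ".doc": read_doc}


def extract_text(path, readers=READERS):
    """Return the file's text on one line, or 'empty' for other types and failures."""
    ext = os.path.splitext(path)[1].lower()
    reader = readers.get(ext)
    if reader is None:  # e.g. .pdf
        return "empty"
    try:
        text = reader(path)
    except Exception as exc:
        print("  could not read " + path + ": " + str(exc))
        return "empty"
    # Word marks paragraph ends with '\r', docx2txt with '\n'.
    return text.replace("\n", " ").replace("\t", " ").replace("\r", " ")
```

With something like this in place, the inner try block of the download loop could shrink to `l.append(extract_text(filename))`. Passing the readers as a dictionary also makes it easy to add another format later or to substitute stand-in functions when testing on a machine without Word.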
