Question

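I have a few hundred newspapers in PDF format and a list of keywords. My ultimate goal is to get the number of articles that mention a specific keyword, bearing in mind that a single PDF may contain several articles mentioning the same keyword.

My problem is that when I convert the PDF files to plain text, I lose the formatting, which makes it impossible to tell where an article starts and where it ends.

What is the best way to approach this problem? Right now I am thinking it is impossible.

I am currently using Python for this project, with the PDF library pdfminer. Here is one of the PDFs: http://www.gulf-times.com/PDFLinks/streams/2011/2/27/2_418617_1_255.02.11.pdf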

Answer 1

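Depending on the format of the text, you might be able to come up with some sort of heuristic that identifies a headline: say, a line on its own with fewer than 15 words that does not contain a full stop (period) character. This will be confused by things like the name of the newspaper, but hopefully those will not be followed by a significant amount of "non-headline" text, so they should not distort the results too badly.

This relies on the conversion to text having left each article contiguous (as opposed to just ripping the raw columns and mixing the articles together). If they are mixed up, I would say you have very little chance: even if you can find a PDF library that preserves formatting, it is not necessarily easy to tell what constitutes the "bounding box" of an article. For example, many papers include callouts and other features that could confuse even quite an advanced heuristic.

Actually doing the counting is simple enough. If the assumptions I have mentioned hold, you would likely end up with something like: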

import re
import string

# Split on anything that is not a word character, hyphen or apostrophe.
non_word_re = re.compile(r"[^-\w']+")

article = ""
counts = {}
# list_of_text_files: paths of the text files converted from the PDFs.
for filename in list_of_text_files:
    with open(filename, "r") as fd:
        for line in fd:
            # Split line on non-word characters and lowercase them for matching.
            words = [i.lower() for i in non_word_re.split(line)
                     if i and i[0] in string.ascii_letters]
            if not words:
                continue
            # Check for headline as the start of a new article.
            if len(words) < 15 and "." not in line:
                if article:
                    # Process previous article
                    handle_article_word_counts(article, counts)
                article = line.strip()
                counts = {}
                continue
            # Only process body text within an article.
            if article:
                for word in words:
                    counts[word] = counts.get(word, 0) + 1
    # Flush the last article of the file.
    if article:
        handle_article_word_counts(article, counts)
    article = ""

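You will need to define handle_article_word_counts() to do whatever indexing of the data you want, but every key in counts is a potential keyword (including very common words such as and, so you may want to drop the most frequent words or something like that).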
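As a rough illustration, handle_article_word_counts() could simply record, for each keyword, how many articles mention it at least once. This is only a sketch: the KEYWORDS set and the articles_per_keyword counter are hypothetical names introduced here, and it assumes single-word keywords that are matched in lowercase.

```python
from collections import Counter

# Hypothetical keyword list; replace with your own (lowercase, single words).
KEYWORDS = {"election", "oil"}

# Running total: keyword -> number of articles mentioning it at least once.
articles_per_keyword = Counter()


def handle_article_word_counts(article, counts):
    """Record which keywords appear in one article.

    `article` is the headline text and `counts` maps each lowercase word
    of the article body to its number of occurrences.
    """
    for keyword in KEYWORDS:
        if counts.get(keyword, 0) > 0:
            # Count the article once, however often the keyword occurs.
            articles_per_keyword[keyword] += 1
```

Multi-word keywords would not work with this approach, because counts only holds individual words; for phrases you would have to keep the article text and search it directly.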

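Basically, it depends on how accurate you need the results to be. I think the above has some chance of giving you a fair approximation, but it carries all the assumptions and caveats I have already mentioned. For example, if it turns out that headlines can span multiple lines, you would need to modify the heuristic above. Hopefully it at least gives you something to build on.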
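If headlines do span several lines, one possible adjustment is to join consecutive headline-like lines before treating them as the start of an article. The sketch below assumes that blank lines separate text blocks in the converted output; looks_like_headline() and merge_headline_lines() are hypothetical helper names, not part of pdfminer.

```python
import re


def looks_like_headline(line, max_words=15):
    """Same heuristic as above: a short line with no full stop."""
    words = re.findall(r"[A-Za-z][-\w']*", line)
    return 0 < len(words) < max_words and "." not in line


def merge_headline_lines(lines):
    """Yield (is_headline, text) pairs, joining consecutive headline lines."""
    pending = []
    for line in lines:
        stripped = line.strip()
        if stripped and looks_like_headline(stripped):
            # Keep collecting until a body line or a blank line shows up.
            pending.append(stripped)
            continue
        if pending:
            yield True, " ".join(pending)
            pending = []
        if stripped:
            yield False, stripped
    if pending:
        yield True, " ".join(pending)
```

The caveat is that a short body line without a full stop, such as the last line of a paragraph ending in a question mark, would still be mistaken for a headline.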
