Question

So, the file has about 57,000 book titles, each with an author name and an ETEXT number. I'm trying to parse the file to get only the ETEXT numbers.

The file looks like this:

TITLE and AUTHOR                                                     ETEXT NO.

Aspects of plant life; with special reference to the British flora,      56900
 by Robert Lloyd Praeger

The Vicar of Morwenstow, by Sabine Baring-Gould                          56899
 [Subtitle: Being a Life of Robert Stephen Hawker, M.A.]

Raamatun tutkisteluja IV, mennessä Charles T. Russell                    56898
 [Subtitle: Harmagedonin taistelu]
 [Language: Finnish]

Raamatun tutkisteluja III, mennessä Charles T. Russell                   56897
 [Subtitle: Tulkoon valtakuntasi]
 [Language: Finnish]

Tom Thatcher's Fortune, by Horatio Alger, Jr.                            56896

A Yankee Flier in the Far East, by Al Avery                              56895
 and George Rutherford Montgomery
 [Illustrator: Paul Laune]

Nancy Brandon's Mystery, by Lillian Garis                                56894

Nervous Ills, by Boris Sidis                                             56893
 [Subtitle: Their Cause and Cure]

Pensées sans langage, par Francis Picabia                                56892
 [Language: French]

Helon's Pilgrimage to Jerusalem, Volume 2 of 2, by Frederick Strauss     56891
 [Subtitle: A picture of Judaism, in the century
  which preceded the advent of our Savior]

Fra Tommaso Campanella, Vol. 1, di Luigi Amabile                         56890
 [Subtitle: la sua congiura, i suoi processi e la sua pazzia]
 [Language: Italian]

The Blue Star, by Fletcher Pratt                                         56889

Importanza e risultati degli incrociamenti in avicoltura,                56888
 di Teodoro Pascal
 [Language: Italian]

This is my attempt:

def search_by_etext():

    fhand = open('GUTINDEX.ALL')
    print("Search by ETEXT:")

    for line in fhand:
        if not line.startswith(" [") and not line.startswith("~"):
            if not line.startswith(" ") and not line.startswith("TITLE"):
                    words = line.rstrip()
                    words = line.lstrip()
                    words = words[-7:]
                    print (words)


search_by_etext()

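The code mostly works. However, for some lines it gives me part of the title or something else instead. For example, the output contains 'decott', which is part of an author's name and should not be there.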

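For this input:

The Bashful Earthquake, by Oliver Herford                                56765
 [Subtitle: and Other Fables and Verses]

The House of Orchids and Other Poems, by George Sterling                 56764

North Italian Folk, by Alice Vansittart Strettel Carr                    56763
 and Randolph Caldecott
 [Subtitle: Sketches of Town and Country Life]

Wild Life in New Zealand. Part 1, Mammalia, by George M. Thomson         56762
 [Subtitle: New Zealand Board of Science and Art, Manual No. 2]

Universal Brotherhood, Volume 13, No. 10, January 1899, by Various       56761

De drie steden: Lourdes, door Émile Zola                                 56760
 [Language: Dutch]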

Another example:

Rhandensche Jongens, door Jan Lens                                       56702
 [Illustrator: Tjeerd Bottema]
 [Language: Dutch]

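The Story of the Woman's Party, by Inez Haynes Irwin                     56701

Mormon Doctrine Plain and Simple, by Charles W. Penrose                  56700
 [Subtitle: or Leaves from the Tree of Life]

The Stone Axe of Burkamukk, by Mary Grant Bruce                          56699
 [Illustrator: J. Macfarlane]

The Latter-Day Prophet, by George Q. Cannon                              56698
 [Subtitle: History of Joseph Smith Written for Young People]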

Here, 'Life]' should not be there. Lines that start with a space should already be filtered out by:

if not line.startswith(" [") and not line.startswith("~"):

But I still get these values in the output.

Answer 1

Simple solution: regexps to the rescue!

import re
with open("etext.txt") as f:
    for line in f:
        match = re.search(r" (\d+)$", line.strip())
        if match:
            print(match.group(1))

The regexp r" (\d+)$" matches "a space followed by one or more digits at the end of the string", and captures only the "one or more digits" group.
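As a small illustration of how that pattern behaves, here is a sketch that applies it to a few lines copied from the sample in the question. The helper name extract_etext is made up for this example and is not part of any library.

```python
import re

# Same pattern as above: a space, then digits anchored at the end of the string.
ETEXT_RE = re.compile(r" (\d+)$")

def extract_etext(line):
    """Return the trailing number as a string, or None if the line has none."""
    match = ETEXT_RE.search(line.strip())
    return match.group(1) if match else None

sample = [
    "Nervous Ills, by Boris Sidis                                             56893\n",
    " [Subtitle: Their Cause and Cure]\n",
    " and George Rutherford Montgomery\n",
    "\n",
]

for line in sample:
    print(repr(extract_etext(line)))
```

Only the first line ends in a number, so it is the only one that yields a value; the continuation lines and the blank line give None regardless of how they are indented.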

You can eventually refine the regexp, i.e. if you know that all etext codes are exactly 5 digits long, you can change it to r" (\d{5})$".
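To see what the stricter pattern buys you, here is a sketch comparing the two on one real line from the question and one hypothetical continuation line that happens to end in a year. The names LOOSE, STRICT and last_number are invented for this example, and the "wrapped" line is made up rather than taken from the file.

```python
import re

LOOSE = re.compile(r" (\d+)$")     # any run of digits at the end
STRICT = re.compile(r" (\d{5})$")  # exactly five digits at the end

def last_number(pattern, line):
    """Return the captured trailing number, or None if the pattern does not match."""
    match = pattern.search(line.strip())
    return match.group(1) if match else None

entry = "The Blue Star, by Fletcher Pratt                                         56889\n"
# Hypothetical continuation line ending in a four-digit year.
wrapped = " a story first printed in 1899\n"

print(last_number(LOOSE, entry), last_number(STRICT, entry))
print(last_number(LOOSE, wrapped), last_number(STRICT, wrapped))
```

The loose pattern reports 1899 for the made-up line, while the strict one rejects it. The trade-off is that the strict pattern would also skip any genuine etext numbers in the file that are not exactly five digits long.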

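This works on the sample text you posted. If it doesn't work properly on your own file, then we'd need enough real data to figure out what you actually have.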

Answer 2

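It may be that the extra lines that aren't being filtered out start with a whitespace character other than " ", e.g. a tab. As the smallest change that might work, try filtering out lines that start with any whitespace, rather than specifically a space character?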

To check for whitespace in general rather than just the space character, you can use regular expressions. Try if not re.match(r'^\s', line) and ...
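A minimal sketch of that filter follows, assuming the stray lines really are indented with a tab (that is a guess, not something confirmed from the actual file). The helper name is_title_line is made up for this example, and the tab-indented line is a hypothetical variant of a line from the question's sample.

```python
import re

def is_title_line(line):
    """True for lines that begin an entry, i.e. do not start with whitespace."""
    # r"\s" covers spaces, tabs, newlines and other whitespace characters.
    return (bool(line.strip())
            and not re.match(r"\s", line)
            and not line.startswith(("TITLE", "~")))

lines = [
    "TITLE and AUTHOR                                                     ETEXT NO.\n",
    "\n",
    "A Yankee Flier in the Far East, by Al Avery                              56895\n",
    "\tand George Rutherford Montgomery\n",  # hypothetical: indented with a tab
    " [Illustrator: Paul Laune]\n",
]

# Keep the last whitespace-separated token of each entry line.
found = [line.split()[-1] for line in lines if is_title_line(line)]
print(found)
```

If you would rather avoid the re module, line[:1].isspace() performs the same check on the first character.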

Parsing a very large text file with Python?
