Remove non-English subheadings and paragraphs

Time: 2016-06-09 06:13:09

Tags: python python-2.7 wikipedia wikipedia-api non-english

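Hello, I have a script that removes subheadings and paragraphs, but I can't get it to remove paragraphs that have non-English subheadings and words.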

For example (original text):

=== Personal finance ===
Protection against unforeseen personal events, as well as events in the wider economies
Transference of family wealth across generations (bequests and inheritance)

=== Corporate finance ===
Corporate finance deals with the sources of funding and the capital structure of corporations and the actions that managers take to increase the value of the firm to the shareholders.

== External links ==
Business acronyms and abbreviations
Business acronyms

== Kūrybinės Industrijos ==
Kūrybinės industrijos apima sritį ekonominių veiksnių, susitelkusių ties žinių ir informacijos generavimu arba tyrimu. 

What I get from my code (result) is:

Protection against unforeseen personal events, as well as events in the wider economies
Transference of family wealth across generations (bequests and inheritance)

Corporate finance deals with the sources of funding and the capital structure of corporations and the actions that managers take to increase the value of the firm to the shareholders.

Kūrybinės industrijos apima sritį ekonominių veiksnių, susitelkusių ties žinių ir informacijos generavimu arba tyrimu.

This is what I want to achieve (desired result):

Protection against unforeseen personal events, as well as events in the wider economies
Transference of family wealth across generations (bequests and inheritance)

Corporate finance deals with the sources of funding and the capital structure of corporations and the actions that managers take to increase the value of the firm to the shareholders.

The script is as follows:

import re
from subprocess import call

f1 = open('asd.text', 'r') # read the file that contains the original text
f2 = open('NoRef.text', 'w') # write to new file

section_title_re = re.compile(r"^=+\s+.*\s+=+$")  # raw string so \s is passed to the regex engine unchanged

content = []
skip = False
for l in f1.read().splitlines():
    line = l.strip()

    if "== external links ==" in line.lower():
        skip = True  
        continue

    if section_title_re.match(line):
        skip = False
        continue
    if skip:
        continue
    content.append(line)

content = '\n'.join(content) + '\n'
f2.write(content+"\n")
f2.close()

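Question: so far, my code can remove paragraphs under subheadings with a known name, such as "External links".

But how can I remove the non-English subheadings and their paragraphs?

Thanks.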

1 answer:

Answer 0 (score: 1)

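If you just want to detect whether a string contains non-English characters, that is simple: just try to decode it as ascii. If that fails, the string contains characters with a code greater than 127:

try:
    utxt = txt.decode('ascii')
except UnicodeError:
    # txt contains non "english" characters
    ...

If you want to detect whether it contains non-English words, that is a much more complex problem, and you should ask yourself whether you want to accept badly written English words. Good luck if you want to go that way...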
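As a minimal sketch, the ASCII check from this answer could be combined with the loop from the question so that a whole section is dropped when its heading contains a non-ASCII character. The helper names `is_ascii` and `strip_sections` and the `skipped_titles` parameter are hypothetical, introduced only for illustration. The sketch compares code points instead of calling `decode`, so it behaves the same on Python 2.7 byte strings and on Python 3 strings. It assumes that "non-English" can be approximated by "non-ASCII", which would also drop English headings containing characters such as "é" and would keep non-English headings written purely in ASCII.

```python
import re

# Same heading pattern as in the question, with a group capturing the title.
SECTION_TITLE_RE = re.compile(r"^=+\s+(.*?)\s+=+$")


def is_ascii(text):
    """Return True if every character has a code below 128."""
    return all(ord(ch) < 128 for ch in text)


def strip_sections(text, skipped_titles=("external links",)):
    """Drop all headings, plus the body of unwanted sections.

    A section body is dropped when its title is listed in skipped_titles
    (compared in lower case) or when the title contains non-ASCII characters.
    """
    content = []
    skip = False
    for raw_line in text.splitlines():
        line = raw_line.strip()
        match = SECTION_TITLE_RE.match(line)
        if match:
            title = match.group(1)
            # Decide once per heading whether the following lines are kept.
            skip = title.lower() in skipped_titles or not is_ascii(title)
            continue  # headings themselves are never written out
        if skip:
            continue
        content.append(line)
    return "\n".join(content) + "\n"
```

Applied to the sample text in the question, this should keep only the paragraphs under "Personal finance" and "Corporate finance". Because the decision is made on the heading alone, a paragraph with stray non-ASCII characters under an English heading is still kept; checking each body line with `is_ascii` as well would be a stricter variant.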
