Stripping a text file with Python

Time: 2015-10-02 00:37:40

Tags: python

I'm trying to strip all punctuation out of a text file. Is there a more efficient way to do this?

Here is my code:

fname = open("text.txt","r")

stripped = ""
for line in fname:
    for c in line:
        if c in '!,.?-':
            c = ""
        stripped = stripped + c
print(stripped)

3 answers:

Answer 0 (score: 0)

import re
with open("text.txt","r") as r:
    text = r.read()


with open("text.txt","w") as w:
    w.write(re.sub(r'[!,.?-]', '', text))

How about this?

Or a way to do it without regular expressions:

with open("text.txt","r") as r:
    text = r.read()

with open("text.txt","w") as w:
    for i in '!,.?-':
        text = text.replace(i, '')

    w.write(text)

Answer 1 (score: 0)

You can try a regular expression that replaces any punctuation character with an empty string:

import re
with open('text.txt', 'r') as f:
    for line in f:
        print(re.sub(r'[.!,?-]', '', line))
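
The character class above works because the hyphen comes last, where it is read literally rather than as a range. As a minimal sketch of a safer way to build such a class from an arbitrary set of characters, the helper below (the name `strip_chars` is hypothetical, not from the answers) runs the characters through `re.escape` first:

```python
import re

def strip_chars(text, chars='!,.?-'):
    """Remove every character in `chars` from `text` with one compiled regex."""
    # re.escape protects characters such as '-', ']' and '^' that would
    # otherwise have a special meaning inside a [...] character class.
    pattern = re.compile('[' + re.escape(chars) + ']')
    return pattern.sub('', text)

sample = "Hello, world! Is this well-formed? Yes."
print(strip_chars(sample))  # prints: Hello world Is this wellformed Yes
```

The default for `chars` mirrors the set used in the question; any other string of characters can be passed instead.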

Answer 2 (score: 0)

Usually faster than a regular expression, or than per-character string operations and concatenation, is to use str.translate:

# Python 2 solution
with open("text.txt","r") as fname:
    stripped = fname.read().translate(None, '!,.?-')

Note that this is not all punctuation. The best way to get all ASCII punctuation is to import string and use string.punctuation.

In Python 3, you can do this:

# Read as text and translate with str.translate
delete_punc_table = str.maketrans('', '', '!,.?-') # If you're using the table more than once, always define once, use many times
with open("text.txt","r") as fname:
    stripped = fname.read().translate(delete_punc_table)

# Read as bytes to use Py2-like ultra-efficient translate then decode
with open("text.txt", "rb") as fname:
    stripped = fname.read().translate(None, b'!,.?-').decode('ascii')  # Or some other ASCII superset encoding
    # If you use string.punctuation for the bytes approach
    # you'd need to encode it, e.g. translate(None, string.punctuation.encode('ascii'))

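Before Python 3.4, the "read as bytes, translate, then decode" approach was ridiculously better; in 3.4+ it is probably still slightly faster, but not by enough to make a huge difference.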
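
As a rough sketch of the string.punctuation suggestion combined with the Python 3 str.translate approach, the snippet below builds the deletion table once and reuses it. The names `DELETE_PUNCT` and `strip_punctuation` are made up for illustration, and the sample string stands in for the contents of text.txt:

```python
import string

# Build the table once; the third argument to str.maketrans lists the
# characters to delete. string.punctuation covers ASCII punctuation only.
DELETE_PUNCT = str.maketrans('', '', string.punctuation)

def strip_punctuation(text):
    """Return `text` with every ASCII punctuation character removed."""
    return text.translate(DELETE_PUNCT)

print(strip_punctuation("Wait... what?! (really) #yes"))  # prints: Wait what really yes
```

Non-ASCII punctuation, such as typographic quotes, passes through unchanged because it is not part of string.punctuation.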

Timings of the various approaches on my machine (using Python 3.5 x64 for Windows):

# Make random ~100KB input
>>> import random, re, string, timeit
>>> data = ''.join(random.choice(string.printable) for i in range(100000))

# Using re.sub (with a compiled regex to minimize overhead)
>>> min(timeit.repeat('trans.sub("", data)', 'from __main__ import re, string, data; trans = re.compile(r"[" + re.escape(string.punctuation) + r"]")', number=1000))
17.47419076158849

# Using iterative str.replace
>>> min(timeit.repeat('d2 = data\nfor l in punc: d2 = d2.replace(l, "")', 'from __main__ import string, data; punc = string.punctuation', number=1000))
13.51673370949311

# Using str.translate
>>> min(timeit.repeat('data.translate(trans)', 'from __main__ import string, data; trans = str.maketrans("", "", string.punctuation)', number=1000))
1.5299288690396224

# Using bytes.translate then decoding as ASCII (without the decode, this is close to how Py2 would behave)
>>> bdata = data.encode("ascii")
>>> min(timeit.repeat('bdata.translate(None, trans).decode("ascii")', 'from __main__ import string, bdata; trans = string.punctuation.encode("ascii")', number=1000))
1.294337291624089

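Timings are the best time, in seconds, to run 1000 translate loops across three test runs (taking the minimum is considered the best way to keep timing jitter from affecting the results), on an input of 100,000 random printable characters. re.sub (even precompiled) isn't even close. Either translate approach is fine (bytes.translate is probably faster, but the code is also more complex). str.replace becomes more competitive when there are fewer things to replace (using only '!,.?-' instead of all punctuation brings it down to ~3 seconds), but for any reasonable number of characters it will be slower, and it doesn't scale the way translate does.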
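
To check that the four approaches really produce the same output before comparing their speed, a small harness along the following lines may help. It is only a sketch: the function names and the `compare` helper are hypothetical, the input is much smaller than in the timings above so that it runs quickly, and `via_bytes_translate` includes the encode step, which the original timing did not.

```python
import re
import string
import timeit

PUNCT = string.punctuation
PUNCT_RE = re.compile('[' + re.escape(PUNCT) + ']')
PUNCT_TABLE = str.maketrans('', '', PUNCT)
PUNCT_BYTES = PUNCT.encode('ascii')

def via_regex(text):
    return PUNCT_RE.sub('', text)

def via_replace(text):
    # One full pass over the text per punctuation character.
    for ch in PUNCT:
        text = text.replace(ch, '')
    return text

def via_str_translate(text):
    return text.translate(PUNCT_TABLE)

def via_bytes_translate(text):
    # Assumes pure-ASCII input; other text raises UnicodeEncodeError.
    return text.encode('ascii').translate(None, PUNCT_BYTES).decode('ascii')

METHODS = [via_regex, via_replace, via_str_translate, via_bytes_translate]

def compare(sample, number=20, repeat=3):
    """Check that all methods agree, then return the best time for each."""
    expected = via_str_translate(sample)
    results = {}
    for func in METHODS:
        assert func(sample) == expected, func.__name__
        times = timeit.repeat(lambda: func(sample), number=number, repeat=repeat)
        results[func.__name__] = min(times)
    return results

for name, best in compare(string.printable * 100).items():
    print(f'{name}: {best:.4f}s')
```

Absolute numbers will differ by machine and Python version, so only the relative ordering is worth reading from the output.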