Question

I'm trying to strip all punctuation out of a text file. Is there a more efficient way to do this?

This is my code:

fname = open("text.txt","r")

stripped = ""
for line in fname:
    for c in line:
        if c in '!,.?-':
            c = ""
        stripped = stripped + c
print(stripped)

Answer 1

import re
with open("text.txt","r") as r:
    text = r.read()


with open("text.txt","w") as w:
    w.write(re.sub(r'[!,.?-]', '', text))

How about this?

Or a way to do it without regular expressions:

with open("text.txt","r") as r:
    text = r.read()

with open("text.txt","w") as w:
    for i in '!,.?-':
        text = text.replace(i, '')

    w.write(text)

Answer 2

You can try using a regular expression to replace any punctuation with an empty string:

import re
with open('text.txt', 'r') as f:
    for line in f:
        # end='' because each line already carries its own newline
        print(re.sub(r'[.!,?-]', '', line), end='')
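
If the set of characters to remove is likely to change, writing the character class by hand gets error-prone: a `-` placed between two other characters is read as a range, and `]` closes the class early. A minimal sketch, assuming a hypothetical helper named `strip_chars` (not part of the answer above), that builds the class with `re.escape` instead:

```python
import re

def strip_chars(text, chars='!,.?-'):
    """Remove every character listed in `chars` from `text`."""
    # re.escape makes characters such as '-' and ']' literal,
    # so they are safe anywhere inside the [...] class.
    pattern = re.compile('[' + re.escape(chars) + ']')
    return pattern.sub('', text)

print(strip_chars("Hello, world! Is this - fine?"))
```

Compiling once inside a helper like this is only worthwhile if the pattern is reused; for a one-off call, `re.sub` as written in the answer does the same job.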

Answer 3

Using str.translate is usually faster than a regex or a series of individual string operations:

# Python 2 solution
with open("text.txt","r") as fname:
    stripped = fname.read().translate(None, '!,.?-')

Note that this is not all punctuation. The best way to get all ASCII punctuation is to import string and use string.punctuation.

In Python 3, you could do:

# Read as text and translate with str.translate
delete_punc_table = str.maketrans('', '', '!,.?-') # If you're using the table more than once, always define once, use many times
with open("text.txt","r") as fname:
    stripped = fname.read().translate(delete_punc_table)

# Read as bytes to use Py2-like ultra-efficient translate then decode
with open("text.txt", "rb") as fname:
    stripped = fname.read().translate(None, b'!,.?-').decode('ascii')  # Or some other ASCII superset encoding
    # If you use string.punctuation for the bytes approach
    # you'd need to encode it, e.g. translate(None, string.punctuation.encode('ascii'))

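Prior to Python 3.4, the "read as bytes, translate, then decode" approach was ridiculously better; in 3.4+ it is probably still slightly faster, but not by enough to make a huge difference.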
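
To tie the string.punctuation note to the Python 3 str.translate approach, here is a minimal sketch. The names `DELETE_PUNCT` and `strip_punctuation` are hypothetical, chosen only for illustration, and the sample string stands in for the contents of text.txt:

```python
import string

# Translation table that deletes every ASCII punctuation character.
# Built once and reused, as recommended above.
DELETE_PUNCT = str.maketrans('', '', string.punctuation)

def strip_punctuation(text):
    """Return `text` with all ASCII punctuation removed."""
    return text.translate(DELETE_PUNCT)

sample = "Hello, world! (It's 100% fine.)"
print(strip_punctuation(sample))
```

Note that string.punctuation covers ASCII only, so characters such as curly quotes or CJK punctuation would pass through unchanged.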

Timings of the various approaches on my machine (using Python 3.5 x64 for Windows):

# Imports needed by the snippets below, then a random ~100KB input
import random, re, string, timeit
data = ''.join(random.choice(string.printable) for i in range(100000))

# Using re.sub (with a compiled regex to minimize overhead)
>>> min(timeit.repeat('trans.sub("", data)', 'from __main__ import re, string, data; trans = re.compile(r"[" + re.escape(string.punctuation) + r"]")', number=1000))
17.47419076158849

# Using iterative str.replace
>>> min(timeit.repeat('d2 = data\nfor l in punc: d2 = d2.replace(l, "")', 'from __main__ import string, data; punc = string.punctuation', number=1000))
13.51673370949311

# Using str.translate
>>> min(timeit.repeat('data.translate(trans)', 'from __main__ import string, data; trans = str.maketrans("", "", string.punctuation)', number=1000))
1.5299288690396224

# Using bytes.translate then decoding as ASCII (without the decode, this is close to how Py2 would behave)
>>> bdata = data.encode("ascii")
>>> min(timeit.repeat('bdata.translate(None, trans).decode("ascii")', 'from __main__ import string, bdata; trans = string.punctuation.encode("ascii")', number=1000))
1.294337291624089

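Timings are the best time to run 1000 translate loops across three test runs (taking the minimum is considered the best way to keep timing jitter from skewing the results), in seconds, on an input of 100,000 random printable characters. The translate approaches are the clear winners; re.sub (even precompiled) isn't even close. Either translate approach is fine (bytes.translate may be slightly faster, but the code is also more complex). str.replace would be more competitive if fewer characters were being replaced (using only '!,.?-' instead of all punctuation drops it to ~3 seconds), but for any reasonable number of characters it will be slower, and it doesn't scale as well as translate.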
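
To repeat the comparison on another machine, a small harness along these lines may help. It is a sketch, not the original benchmark: the function names are hypothetical, the input is 10 KB rather than 100 KB, and each run uses 100 loops instead of 1000 so that it finishes quickly. Absolute numbers will differ from those above.

```python
import random
import re
import string
import timeit

random.seed(0)  # fixed seed so repeated runs use the same input
data = ''.join(random.choice(string.printable) for _ in range(10000))

punct = string.punctuation
regex = re.compile('[' + re.escape(punct) + ']')
table = str.maketrans('', '', punct)
punct_bytes = punct.encode('ascii')

def with_regex(s):
    return regex.sub('', s)

def with_replace(s):
    # One full pass over the string per punctuation character
    for ch in punct:
        s = s.replace(ch, '')
    return s

def with_translate(s):
    return s.translate(table)

def with_bytes_translate(s):
    # Encode, delete bytes, decode; only valid for ASCII-compatible text
    return s.encode('ascii').translate(None, punct_bytes).decode('ascii')

for fn in (with_regex, with_replace, with_translate, with_bytes_translate):
    # Best of three runs of 100 loops each
    best = min(timeit.repeat(lambda: fn(data), number=100, repeat=3))
    print(f'{fn.__name__}: {best:.4f}s')
```

All four functions should return the same string for ASCII input, which is worth checking before trusting any timing difference.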
