Removing non-ASCII characters from text in a file

Date: 2015-11-03 23:49:49

Tags: python unicode

Python experts:

I have a sentence: "this time air\u00e6\u00e3o was filled\u00e3o". I want to remove the non-ASCII Unicode characters from it. I can do that with the following code and function:

def removeNonAscii(s): 
    return "".join(filter(lambda x: ord(x)<128, s))          

sentence = "this time air\u00e6\u00e3o was filled\u00e3o"   
sentence = removeNonAscii(sentence)
print(sentence)

This prints "this time airo was filledo", which removes the "\u00.." characters nicely. But when I write the sentence to a file and then read it back in a loop:

def removeNonAscii(s):
    return "".join(filter(lambda x: ord(x)<128, s))

# Use a context manager so the file is closed after reading.
with open('test.txt') as hand:
    for sentence in hand:
        sentence = removeNonAscii(sentence)
        print(sentence)

It prints "this time air\u00e6\u00e3o was filled\u00a3o". It does not work at all. What is happening here? If the function works, it should not behave like this...

1 Answer:

Answer 0 (score: 2):

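I have a feeling that the text in your file does not contain actual non-ASCII characters. Instead, it contains the literal escape sequences: a backslash, the letter u, and four more characters (\u00e6 and so on). When your code runs, it reads those characters one at a time, finds that each of them is plain ASCII, and so the filter leaves them all in place.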
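The difference can be reproduced without a file. The following is only a sketch: it assumes the file holds the escape sequences as literal text, and it uses a raw string to stand in for a line read from that file. The names `remove_non_ascii`, `from_literal` and `from_file` are made up for this illustration.

```python
def remove_non_ascii(s):
    # Keep only characters whose code point is below 128.
    return "".join(ch for ch in s if ord(ch) < 128)

# In source code, Python converts each \uXXXX escape into a single character.
from_literal = "air\u00e6\u00e3o"

# A raw string mimics the assumed file content: backslash, 'u', four hex digits.
from_file = r"air\u00e6\u00e3o"

print(len(from_literal), len(from_file))  # 6 16
print(remove_non_ascii(from_literal))     # airo
print(remove_non_ascii(from_file))        # air\u00e6\u00e3o (unchanged)
```

Every character in the second string is ASCII, so the filter has nothing to remove.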

If that is the case, use:

import re
def removeNonAscii(s):
    return re.sub(r'\\u\w{4}','',s)

It will strip out every instance of '\u----'.
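One caveat: `\w` matches any letter, digit or underscore, so a pattern built on `\w{4}` can also remove text that merely looks like an escape. A stricter variant, offered here as a hypothetical alternative rather than part of the original answer, accepts only hex digits. The names `ESCAPE_RE` and `remove_unicode_escapes` are invented for this sketch.

```python
import re

# Hypothetical stricter pattern: backslash, 'u', then exactly four hex digits.
ESCAPE_RE = re.compile(r"\\u[0-9a-fA-F]{4}")

def remove_unicode_escapes(s):
    # Delete every literal \uXXXX sequence from the string.
    return ESCAPE_RE.sub("", s)

line = r"this time air\u00e6\u00e3o was filled\u00a3o"
print(remove_unicode_escapes(line))      # this time airo was filledo

# A Windows-style path shows the difference between the two patterns.
path = r"C:\users\me"
print(re.sub(r"\\u\w{4}", "", path))     # C:\me
print(remove_unicode_escapes(path))      # C:\users\me
```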

Example:

>>> with open(r'C:\Users\...\file.txt','r') as f:
    for line in f:
        print(re.sub(r'\\u\w{4}','',line))
this time airo was filledo

where file.txt contains:

  

this time air\u00e6\u00e3o was filled\u00a3o
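If you would rather reuse the original character filter, another option is to turn the literal escapes into real characters first and filter afterwards. This is a sketch under the same assumption about the file content, not something the answer above proposes; `decode_escapes` and `remove_non_ascii` are hypothetical helper names.

```python
import re

# Capture the four hex digits that follow a literal backslash-u.
ESCAPE_RE = re.compile(r"\\u([0-9a-fA-F]{4})")

def decode_escapes(s):
    # Replace each literal \uXXXX with the character it names.
    return ESCAPE_RE.sub(lambda m: chr(int(m.group(1), 16)), s)

def remove_non_ascii(s):
    # Keep only characters whose code point is below 128.
    return "".join(ch for ch in s if ord(ch) < 128)

line = r"this time air\u00e6\u00e3o was filled\u00a3o"
decoded = decode_escapes(line)
print(decoded)                    # this time airæão was filled£o
print(remove_non_ascii(decoded))  # this time airo was filledo
```

Unlike deleting the escapes outright, this keeps an escape that names an ASCII character: `\u0041` becomes `A` and survives the filter.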