Question

Python experts:

I have a sentence, "this time air\u00e6\u00e3o was filled\u00e3o", and I want to remove the non-ASCII Unicode characters from it. I can do that with the following code and function:

def removeNonAscii(s): 
    return "".join(filter(lambda x: ord(x)<128, s))          

sentence = "this time air\u00e6\u00e3o was filled\u00e3o"   
sentence = removeNonAscii(sentence)
print(sentence)

This prints "this time airo was filledo", which removes the "\u00.." characters perfectly. But when I write the sentence to a file and then read it back in a loop:

def removeNonAscii(s):
    return "".join(filter(lambda x: ord(x)<128, s))

hand = open('test.txt')
for sentence in hand:
    sentence = removeNonAscii(sentence)
    print(sentence)

it prints "this time air\u00e6\u00e3o was filled\u00a3o", so it does not work at all. What is happening here? If the function works, it should not behave like that...

Answer 1

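I have a feeling that the text in your file does not contain actual non-ASCII characters but the literal escape sequences that represent them: the characters \, u, 0, 0 and so on, written out as plain ASCII text. When your code runs, it reads each of those characters, sees that every one of them is perfectly valid ASCII, and so the filter leaves them in place.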
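The difference can be seen in a small sketch. It assumes the file holds the escape text literally; the raw string below only stands in for what reading such a file would return, and the variable names are illustrative.

```python
# In source code, Python converts \u00e6 into a single character.
in_source = "air\u00e6"

# A raw string keeps the backslash, 'u' and hex digits as six separate
# ASCII characters, which is what a line read from such a file looks like.
from_file = r"air\u00e6"

print(len(in_source))                        # 4
print(len(from_file))                        # 9
print(all(ord(c) < 128 for c in in_source))  # False
print(all(ord(c) < 128 for c in from_file))  # True
```

Because every character in the second string is below 128, a filter on ord(x) < 128 has nothing to remove.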

If that is the case, use:

import re
def removeNonAscii(s):
    return re.sub(r'\\u\w{4}','',s)

This removes all instances of '\u----'.

Example:

>>> with open(r'C:\Users\...\file.txt','r') as f:
    for line in f:
        print(re.sub(r'\\u\w{4}','',line))
this time airo was filledo

where file.txt contains:

this time air\u00e6\u00e3o was filled\u00a3o
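One possible refinement, offered as a sketch rather than as part of the original answer: \w{4} also matches letters that are not hex digits, so text such as \username would lose its first five characters after the backslash. The version below restricts the match to four hex digits and also drops any real non-ASCII characters. The names ESCAPE_RE and remove_non_ascii_text are hypothetical.

```python
import re

# Hypothetical name: a backslash, 'u', then exactly four hex digits.
ESCAPE_RE = re.compile(r'\\u[0-9a-fA-F]{4}')

def remove_non_ascii_text(s):
    # First drop literal \uXXXX escape text, then any real non-ASCII characters.
    s = ESCAPE_RE.sub('', s)
    return ''.join(ch for ch in s if ord(ch) < 128)

# Raw string standing in for a line read from the file.
line = r"this time air\u00e6\u00e3o was filled\u00a3o"
print(remove_non_ascii_text(line))  # this time airo was filledo
```

The same function can be applied to each line inside the file-reading loop from the question.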

Remove non-ASCII characters from file text
