Question

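Hello, I'm having a problem replacing all the text in an HTML file. I want to censor certain words using BeautifulSoup, but the content is not replaced, and when I print it I get an error (not all of the HTML text is printed).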

words = ['Shop','Car','Home','Generic','Elements']
page = urllib.urlopen("html1/index.html").read()
soup = BeautifulSoup(page, 'html.parser')
texts = soup.findAll(text=True)
for i in texts :
    if i == words :
       i = '***'
    print i

Does anyone know how to fix this?

The error:

Traceback (most recent call last):
  File "replacing.py", line 28, in <module>
    print i
  File "F:\Python\Python27\lib\encodings\cp852.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u2019' in position 25: character maps to <undefined>

Answer 1

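There are two main problems here. The first is an encoding problem: you are trying to print a character that your console's encoding cannot represent. For that, you can use the answers found here: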

UnicodeEncodeError: 'charmap' codec can't encode - character maps to <undefined>, print function

Or, for a more in-depth explanation:

Python, Unicode, and the Windows console (now that I look at it more closely it may be outdated, but it is still an interesting read).
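As a rough illustration of the encoding problem, here is a minimal sketch in Python 3 syntax (the question uses Python 2, so treat this as an adaptation). The helper name console_safe is hypothetical, not something from the linked answers; it simply replaces characters that the target encoding cannot represent before printing.

```python
import sys

def console_safe(text, encoding=None):
    """Return text with characters the target encoding lacks replaced by '?'."""
    # Fall back to the console's encoding, then to ASCII if it is unknown.
    encoding = encoding or getattr(sys.stdout, 'encoding', None) or 'ascii'
    return text.encode(encoding, errors='replace').decode(encoding)

# U+2019 (right single quotation mark) is the character from the traceback;
# cp852 is the code page named there.
print(console_safe('It\u2019s here', 'cp852'))
```

This loses the original character, so it is only suitable when readable console output matters more than fidelity.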

However, your code also has a logic problem.

if i == words:

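This line does not check whether i is found in words. Instead, it compares i to the entire words list, which is not what you want. I suggest the following changes: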

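words = {'Shop', 'Car', 'Home', 'Generic', 'Elements'}

for i in texts:
    if i in words:
        # Assigning i = '***' would only rebind the loop variable;
        # replace_with() swaps the text node inside the soup itself.
        i.replace_with('***')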

Converting words to a set gives average O(1) lookups, and if i in words checks whether i is found in words.
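To make the difference concrete, here is a small sketch in Python 3 syntax using plain strings. The texts list is a hypothetical stand-in for what soup.findAll(text=True) returns. Note that a membership test only matches a text that is exactly one of the words; the word-boundary regex at the end is one possible option, not part of the answer above, for censoring words that appear inside longer text.

```python
import re

words = {'Shop', 'Car', 'Home', 'Generic', 'Elements'}

# Hypothetical stand-in for the strings returned by soup.findAll(text=True).
texts = ['Shop', 'Visit our Shop today', 'Contact']

# A string never equals a whole collection of strings,
# so the original test (i == words) is always False.
assert not any(t == words for t in texts)

# Membership matches only a text that is exactly one of the words.
exact = ['***' if t in words else t for t in texts]

# A word-boundary regex also catches the words inside longer text.
pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, sorted(words))) + r')\b')
partial = [pattern.sub('***', t) for t in texts]

print(exact)
print(partial)
```

Both versions build a new list rather than assigning to the loop variable, because rebinding the loop variable leaves the original data untouched.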

Answer 2

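One of the characters you are trying to print cannot be found in the codec Python uses to print messages. In other words, you have the data for a character, but the codec does not know which symbol it should be, so it cannot be printed. Simply converting the HTML to Unicode should solve your problem.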

A good question about how to do that:

Convert HTML entities to Unicode and vice versa
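For illustration, a minimal sketch of converting in both directions with the standard library, written in Python 3 syntax (the html module used here is not available in the asker's Python 2.7). The sample string raw is hypothetical.

```python
import html

# Hypothetical snippet containing an HTML entity for U+2019.
raw = 'It&rsquo;s a shop'

# Entities -> Unicode characters.
decoded = html.unescape(raw)

# Unicode -> numeric character references, so the result is pure ASCII
# and can be printed on a console with a limited code page.
ascii_safe = decoded.encode('ascii', 'xmlcharrefreplace').decode('ascii')

print(ascii_safe)
```

Converting back to character references keeps the information about the original character, unlike replacing it with a placeholder.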
