Question

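I am reading the content of a web page and checking it for a word that contains an umlaut. The word is present in the page content, but Python's find('ü') does not find it.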

import urllib2
opener = urllib2.build_opener()
page_content = opener.open(url).read() 
page_content.find('ü')

I tried converting the search string to unicode with u'ü'. Then I get this error:

'SyntaxError: (unicode error) 'utf8' codec can't decode byte 0xfc in position 0'

I am using # -*- coding: utf-8 -*- in the .py file.

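I printed page_content. In the output, the umlaut ü has been turned into a different sequence, shown as 'ü'. If I search for that printed form with page_content.find('ü'), it works fine. Please let me know if there is a better solution.

I would really appreciate any suggestions.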

Answer 1

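Your Python is trying to parse the source file (or console input) as UTF-8, but it is actually encoded in Latin-1: the byte 0xfc in the error message is 'ü' in Latin-1 and is not valid UTF-8. You can try putting a

# coding: iso-8859-1

comment at the top of the source file, or, better, use an editor or terminal emulator that supports UTF-8 and save the script in that encoding.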
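To illustrate the mismatch described above, here is a minimal sketch (written for Python 3, not the asker's Python 2 setup) showing that 'ü' is the single byte 0xfc in Latin-1 but two bytes in UTF-8, and that the Latin-1 byte cannot be decoded as UTF-8. The helper name decodes_as_utf8 is made up for this illustration.

```python
# Sketch: the same character produces different bytes under different encodings.

umlaut = u"\u00fc"  # 'ü', written as an escape so this file's own encoding does not matter

latin1_bytes = umlaut.encode("iso-8859-1")  # a single byte, 0xfc
utf8_bytes = umlaut.encode("utf-8")         # two bytes, 0xc3 0xbc


def decodes_as_utf8(data):
    """Hypothetical helper: return True if the byte string is valid UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The Latin-1 byte is rejected by the UTF-8 decoder; the UTF-8 bytes are accepted.
print(repr(latin1_bytes), decodes_as_utf8(latin1_bytes))
print(repr(utf8_bytes), decodes_as_utf8(utf8_bytes))
```

This is the same failure the SyntaxError reports: the interpreter was told the file is UTF-8 but met a lone 0xfc byte.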

Answer 2

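If you declare the UTF-8 encoding at the top of the file, it should work. Note that the coding line must be the first line, or the second line if it follows a hashbang (#!) line.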

#!/usr/bin/python
# coding: utf-8

import urllib2

url = 'http://en.wikipedia.org/wiki/Germanic_umlaut'
opener = urllib2.build_opener()
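# read() returns raw bytes; decode them so the unicode search compares text with text
# (assumes the page is served as UTF-8)
page_content = opener.open(url).read().decode('utf-8')
page_content.find(u'ü')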
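As a rough sketch of why decoding matters here, the following Python 3 snippet decodes the raw bytes before searching, and also shows what happens when the wrong codec is used. It avoids the network: sample_page is a made-up stand-in for the bytes returned by opener.open(url).read(), and find_text is a hypothetical helper, not part of urllib2.

```python
def find_text(raw_bytes, needle, encoding="utf-8"):
    """Hypothetical helper: decode raw_bytes and return the index of needle, or -1."""
    text = raw_bytes.decode(encoding)
    return text.find(needle)


# Made-up sample standing in for the downloaded page bytes (assumed to be UTF-8).
sample_page = u"<p>Germanic umlaut: \u00fc</p>".encode("utf-8")

# Searching the decoded text for the umlaut finds it.
index = find_text(sample_page, u"\u00fc")
print(index)

# Decoding the same bytes with the wrong codec turns the two UTF-8 bytes into
# two separate Latin-1 characters, so the umlaut itself is no longer present.
garbled = sample_page.decode("iso-8859-1")
print(u"\u00fc" in garbled)
```

If the server declares a different charset in its Content-Type header, that encoding would need to be passed instead of the UTF-8 default assumed here.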
