Question

I wrote a script that searches a web page for an expression:

import sre, urllib2, sys, BaseHTTPServer
# -*- coding: utf-8 -*-
address = sys.argv[1]
website_handle = urllib2.urlopen(address)
website_text = website_handle.read()
matches = sre.findall(u"עברית", website_text)
for item in matches:
    print item

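The script works when I use "regular" regular expressions (without Hebrew characters), but it finds no matches when the pattern contains Hebrew characters. What am I doing wrong?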

Edit: for example, url = https://en.wikipedia.org/wiki/Category:Countries

Answer 1

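You need to make sure the input string is unicode as well. read() returns a byte string, so it has to be decoded from UTF-8 before it is matched against a unicode pattern.

Use the unicode function with "utf-8" as the second argument: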

website_text = unicode(website_text, "utf-8")

Everything should use one consistent encoding if you want to work with unicode in Python 2.
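To illustrate the decode-then-match idea without a network call, here is a minimal sketch. The sample byte string is made up and only stands in for what website_handle.read() would return; it is written with `re` rather than `sre`, and `raw_bytes.decode("utf-8")` is the same operation as `unicode(raw_bytes, "utf-8")` in Python 2.

```python
# -*- coding: utf-8 -*-
import re

# Hypothetical stand-in for the bytes returned by website_handle.read().
raw_bytes = u"<p>עברית and again עברית</p>".encode("utf-8")

# Decode the bytes to unicode first, so the text and the pattern are the same type.
website_text = raw_bytes.decode("utf-8")

# Both the pattern and the text are now unicode, so the Hebrew word can match.
matches = re.findall(u"עברית", website_text)
for item in matches:
    print(item)
```

If the page declares a different encoding, the "utf-8" argument would need to change accordingly; that is an assumption of this sketch, not something the question states.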

Using unicode (Hebrew characters) with regular expressions