Question

I'm trying to extract data from an HTML file with a regular expression, using the following code:

import re
import urllib.request
def extract_words(wdict, urlname):
  uf = urllib.request.urlopen(urlname)
  text = uf.read()
  print(text)
  match = re.findall("<tr>\s*<td>([\w\s.;'(),-/]+)</td>\s+<td>([\w\s.,;'()-/]+)</td>\s*</tr>", text)

It returns this error:

  File "extract.py", line 33, in extract_words
    match = re.findall("<tr>\s*<td>([\w\s.;'(),-/]+)</td>\s+<td>([\w\s.,;'()-/]+)</td>\s*</tr>", text)
  File "/usr/lib/python3.1/re.py", line 192, in findall
    return _compile(pattern, flags).findall(string)
TypeError: can't use a string pattern on a bytes-like object

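After experimenting further in IDLE, I noticed that uf.read() does return the HTML source the first time I call it. On every later call, though, it returns an empty bytes object, b''. Is there any way around this?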

Answer 1

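uf.read() reads the content only once. To read it again you have to close the stream and reopen it. That is true of any kind of stream. It is not the problem here, however.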
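To illustrate the read-once behavior without touching the network, here is a small sketch. It assumes io.BytesIO as a stand-in for the HTTP response (both are binary streams); the sample content and the variable names are made up for the example.

```python
import io

# io.BytesIO stands in for the HTTP response here; both are binary streams.
stream = io.BytesIO(b"<html><body>hello</body></html>")

first = stream.read()   # consumes everything up to the end of the stream
second = stream.read()  # the stream is already at its end, so this is empty

print(first)
print(second)

# An in-memory stream can be rewound with seek(); an HTTP response cannot,
# so keep the first result in a variable instead of calling read() again.
stream.seek(0)
again = stream.read()
```

In the question's code this means storing the result of uf.read() in text once and reusing text, which the function already does.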

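The problem is that reading from any kind of binary source, such as a file or a web page, returns the data as a bytes object unless you specify an encoding. Your regular expression, however, is not a bytes pattern; it is written as a unicode str.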

The re module quite reasonably refuses to apply a unicode pattern to bytes data, and the other way around.
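A quick way to see the refusal in both directions is the sketch below. The helper try_findall and the sample data are hypothetical, introduced only for this illustration.

```python
import re

data = b"<td>spam</td>"

def try_findall(pattern, subject):
    """Return the matches, or the TypeError message if the types do not agree."""
    try:
        return re.findall(pattern, subject)
    except TypeError as exc:
        return str(exc)

# str pattern on bytes data: rejected
str_on_bytes = try_findall(r"<td>(\w+)</td>", data)

# bytes pattern on str data: rejected as well
bytes_on_str = try_findall(rb"<td>(\w+)</td>", data.decode("ascii"))

print(str_on_bytes)
print(bytes_on_str)
```

The pattern and the subject have to be the same kind of object: both bytes or both str.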

The solution is to make the regular expression pattern a bytes string, which you do by putting a b in front of it. So:

match = re.findall(b"<tr>\s*<td>([\w\s.;'(),-/]+)</td>\s+<td>([\w\s.,;'()-/]+)</td>\s*</tr>", text)
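Here is a self-contained sketch of the bytes-pattern approach. The sample HTML is invented and stands in for whatever uf.read() returns. I also wrote the pattern as rb"..." (raw bytes), an addition of mine: it is the same pattern, but the backslashes reach re untouched, which avoids warnings about invalid escape sequences on recent Python versions.

```python
import re

# Hypothetical sample standing in for the bytes returned by uf.read().
text = b"<tr>\n<td>apple</td>\n<td>a fruit</td>\n</tr>"

# Raw bytes literal: same pattern as above, with backslashes left for re to interpret.
pattern = rb"<tr>\s*<td>([\w\s.;'(),-/]+)</td>\s+<td>([\w\s.,;'()-/]+)</td>\s*</tr>"

match = re.findall(pattern, text)
print(match)  # the captured groups are bytes objects, not str
```

Note that with a bytes pattern \w only matches ASCII word characters, so this route fits best when the page content is plain ASCII.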

should work. Another option is to decode the text, so that it is a unicode str as well:

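# get_content_charset() returns None if the server declares no charset; utf-8 is an assumed fallback
encoding = uf.headers.get_content_charset() or 'utf-8'
text = text.decode(encoding)
match = re.findall("<tr>\s*<td>([\w\s.;'(),-/]+)</td>\s+<td>([\w\s.,;'()-/]+)</td>\s*</tr>", text)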
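The decoding route can be tried offline too. In this sketch an email.message.Message object stands in for uf.headers (the headers object of an HTTP response is a subclass of it), and the header value, the sample row, and the utf-8 fallback are all assumptions made for the example.

```python
import re
from email.message import Message

# Hypothetical stand-ins for uf.headers and uf.read(); no network access needed.
headers = Message()
headers["Content-Type"] = "text/html; charset=ISO-8859-1"
raw = "<tr><td>caf\u00e9</td>\n<td>a place</td></tr>".encode("iso-8859-1")

# None is returned when no charset is declared; utf-8 is only an assumed fallback.
encoding = headers.get_content_charset() or "utf-8"
text = raw.decode(encoding)

pattern = r"<tr>\s*<td>([\w\s.;'(),-/]+)</td>\s+<td>([\w\s.,;'()-/]+)</td>\s*</tr>"
match = re.findall(pattern, text)
print(encoding, match)
```

With a str pattern, \w also matches non-ASCII letters, so accented words are captured where a bytes pattern would stop.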

(Also, for extracting data from HTML, I would say lxml is a better option.)
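To give an idea of what parser-based extraction looks like, here is a rough sketch. It deliberately uses the standard library's html.parser instead of lxml, so that it runs without installing anything; the CellCollector class and the sample table are hypothetical, and lxml would get the same result with less code.

```python
from html.parser import HTMLParser

class CellCollector(HTMLParser):
    """Collect the text of every <td>, grouped by the <tr> it belongs to."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being parsed, None outside <tr>
        self._cell = None  # text fragments of the current cell, None outside <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a <td> is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

parser = CellCollector()
parser.feed("<table><tr><td>apple</td> <td>a fruit</td></tr></table>")
parser.close()
rows = parser.rows
print(rows)
```

Unlike the regular expression, this does not care about whitespace between tags or about which characters appear inside a cell.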

Extracting source code from an HTML file using Python 3.1 urllib.request
