Question

I'm trying to fetch a page's HTML and find a specific string that I know is in it:

import urllib.request
import re

response = urllib.request.urlopen('http://ipchicken.com/')

data = response.read()

portregex = re.compile('Remote[\s]+Port: [\d]+')

port = portregex.findall(str(data))

print(data)
print(port)

In my case the site contains Remote Port: 50880, but I just can't come up with a regular expression that matches it! Can anyone spot my mistake?

I'm using Python 3.4 on Windows.

Answer 1

You used square brackets where you meant parentheses:

portregex = re.compile(r'Remote\s+Port: (\d+)')

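This ensures that the result of re.findall() contains only the matched digits (when the pattern has a capture group, re.findall() returns what the group captured rather than the whole match):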

>>> s = "Foo Remote Port: 12345 Bar Remote    Port: 54321"
>>> portregex.findall(s)
['12345', '54321']
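
Here is a minimal sketch of the capture-group behaviour, run against a made-up response body. The byte string and the UTF-8 charset are assumptions, not the real page. It decodes the bytes rather than calling str() on them, because str() on a bytes object returns its repr, in which line breaks show up as literal backslash sequences that \s cannot match:

```python
import re

# Hypothetical response body; the real page's markup and charset may differ.
data = b"<p>Remote\r\nPort: 50880</p>"

with_group = re.compile(r'Remote\s+Port: (\d+)')
without_group = re.compile(r'Remote\s+Port: \d+')

# str() on bytes yields the repr "b'...'", where the line break becomes the
# literal characters backslash-r backslash-n, so \s+ has nothing to match.
from_repr = with_group.findall(str(data))

# Decoding produces real text containing real whitespace.
text = data.decode('utf-8')
ports = with_group.findall(text)             # capture group: digits only
full_matches = without_group.findall(text)   # no group: the whole match

print(from_repr)
print(ports)
print(full_matches)
```

With a capture group you get just the port number back; without one you get the entire matched substring.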

Answer 2

Use a raw string:

portregex = re.compile(r'Remote[\s]+Port: [\d]+')

Or double the backslashes:

portregex = re.compile('Remote[\\s]+Port: [\\d]+')

Note that the square brackets are not needed.
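
A quick check of both points, as a sketch: the raw string and the doubled-backslash string are the same pattern, and dropping the brackets does not change what is matched. The variable names here are only for illustration, and the test string is borrowed from Answer 1:

```python
import re

raw = r'Remote[\s]+Port: [\d]+'
doubled = 'Remote[\\s]+Port: [\\d]+'
no_brackets = r'Remote\s+Port: \d+'

# Both spellings produce exactly the same pattern text.
same_pattern = (raw == doubled)

s = "Foo Remote Port: 12345 Bar Remote    Port: 54321"

# A character class holding a single shorthand such as [\d] matches the
# same characters as the bare \d, so the results are identical.
with_brackets_result = re.findall(raw, s)
no_brackets_result = re.findall(no_brackets, s)

print(same_pattern)
print(with_brackets_result)
print(no_brackets_result)
```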

Answer 3

For cases like this I use an HTML parser. An example with BeautifulSoup:

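import urllib.request
from bs4 import BeautifulSoup

response = urllib.request.urlopen('http://ipchicken.com/')
# Name the parser explicitly; bs4 warns when it has to pick one itself.
soup = BeautifulSoup(response, 'html.parser')

# find() with a text filter returns the first matching text node, or None
# if nothing matches; a text node prints as plain text.
print(soup.find(text=lambda x: x.startswith('Remote')))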
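
If installing BeautifulSoup is not an option, roughly the same idea can be sketched with the standard library's html.parser module. This is a substitute technique, not the code from this answer: the TextCollector class and the sample markup are made up for illustration, and the real page may be laid out differently.

```python
import re
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: gathers text nodes and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


# Assumed markup: a tag sits between the two words, so a regex run on the
# raw HTML would not see "Remote" followed directly by whitespace.
sample_html = "<html><body><p>Remote<br>\nPort: 50880</p></body></html>"

collector = TextCollector()
collector.feed(sample_html)
collector.close()  # flush any text still buffered by the parser

# Join the text nodes with spaces, then search the tag-free text.
page_text = " ".join(collector.chunks)
parsed_ports = re.findall(r'Remote\s+Port:\s*(\d+)', page_text)

# The same pattern applied to the raw markup finds nothing here.
raw_ports = re.findall(r'Remote\s+Port:\s*(\d+)', sample_html)

print(parsed_ports)
print(raw_ports)
```

Stripping the tags first means the pattern only has to deal with whitespace between the words, not with whatever markup the page happens to put there.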

Can't find the right regular expression
