Question

Please help me fix this script.

import urllib
import re
import os
import pprint

import requests
import bs4

stringHtml = urllib.request.urlopen('http://forum.saransk.ru/user/2018-sergey-kalinin/').read().decode('utf-8')
#print(stringHtml)
stringPattern = 'url\suid"\shref="http://vkontakte.ru/id10550933"'
result = re.search(stringPattern, stringHtml)
if result:
    print(result.group())
else:
    print('no result')

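The problem is that the script prints 'no result'. The regular expression compiles without errors. Please help me find the mistake.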

Answer 1

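Why not use bs4, which you already import?

If you want to print the href attribute of the a elements that have both the uid and url classes, you can use the select method (which accepts a CSS selector).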

import urllib.request

import bs4

stringHtml = urllib.request.urlopen('http://forum.saransk.ru/user/2018-sergey-kalinin/').read()#.decode('utf-8')
soup = bs4.BeautifulSoup(stringHtml, 'html.parser')  # name the parser explicitly to avoid a warning
for a in soup.select('a.url.uid'):
    print(a.get('href'))

# If you want to check whether an a tag with `href="http://vkontakte..."` exists,
#   use the following lines instead.
# (At the time of this answer, the CSS selector `a.url.uid[href="..."]` did not
#  work with bs4, which supported only the most commonly used CSS selectors.)
#print(any(a.get('href') == 'http://vkontakte.ru/id10550933'
#      for a in soup.select('a.url.uid')))

Output:

http://vkontakte.ru/id10550933

Answer 2

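I'm pretty sure there is a mistake in your regular expression. You are searching for the text:

url uid" href="http://vkontakte.ru/id10550933"

Looks like a whitespace problem?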
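
To see why the pattern finds nothing, here is a minimal offline sketch that runs it against the anchor tag quoted from the page source elsewhere in this thread. The `loose` pattern is my own assumption about one way to relax the match (it allows other attributes between class and href); it is not from the original script, and a real HTML parser is usually the more robust choice.

```python
import re

# The anchor tag as quoted from the page source in this thread.
html = ('<a class="url uid" rel="external me" '
        'href="http://vkontakte.ru/id10550933">http://vkontakte.ru/id10550933</a>')

# Original pattern: expects href to follow the class attribute directly.
strict = r'url\suid"\shref="http://vkontakte.ru/id10550933"'

# Hypothetical looser pattern: tolerate any other attributes between
# class and href, and capture the href value.
loose = r'class="url\suid"[^>]*?\shref="([^"]+)"'

strict_match = re.search(strict, html)
loose_match = re.search(loose, html)

print(strict_match)          # None: rel="external me" sits between class and href
print(loose_match.group(1))  # http://vkontakte.ru/id10550933
```

The strict pattern fails because `rel="external me"` sits between the class and href attributes, so `uid"` is never followed directly by ` href=`.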

Answer 3

The page source shows

<a class="url uid" rel="external me" href="http://vkontakte.ru/id10550933">http://vkontakte.ru/id10550933</a>

so what you want is

import bs4
import requests

url = 'http://forum.saransk.ru/user/2018-sergey-kalinin/'
html = requests.get(url).content
page = bs4.BeautifulSoup(html, "html.parser")  # name the parser explicitly to avoid a warning
link = page.find("a", {"class": "url uid"})
print(link["href"])

which gives

http://vkontakte.ru/id10550933
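
If you want to try the same lookup without a network connection or third-party packages, here is a rough sketch using only the standard library's `html.parser`. The `LinkFinder` class and its `wanted_classes` parameter are hypothetical names made up for this illustration, and the input is just the anchor tag quoted above rather than the live page.

```python
from html.parser import HTMLParser


class LinkFinder(HTMLParser):
    """Collect href values of <a> tags that carry all the wanted classes."""

    def __init__(self, wanted_classes):
        super().__init__()
        self.wanted = set(wanted_classes)
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        # The class attribute is a space-separated list; it may be missing.
        classes = set((attr_map.get("class") or "").split())
        if self.wanted <= classes and attr_map.get("href"):
            self.hrefs.append(attr_map["href"])


# The anchor tag as quoted from the page source above.
html = ('<a class="url uid" rel="external me" '
        'href="http://vkontakte.ru/id10550933">http://vkontakte.ru/id10550933</a>')

finder = LinkFinder(["url", "uid"])
finder.feed(html)
print(finder.hrefs)  # ['http://vkontakte.ru/id10550933']
```

Unlike matching the literal string "url uid", this compares sets of class names, so it would still find the link if the classes appeared in a different order.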

How do I search for a phrase using a regular expression?
