BeautifulSoup cannot scrape <g> inside <form method="GET" ...>

Asked: 2019-10-08 08:24:42

Tags: python-3.x web-scraping beautifulsoup python-3.6

I am trying to scrape the text y6 inside <g class="jbfraglines">, but both select and find_all return an empty list.

HTML (simplified)

<form method="GET" enctype="application/x-www-form-urlencoded" action="peptide_view.pl">
  <div id="xi:container">
    <svg id="xi:svg-container" xmlns="http://www.w3.org/2000/svg" version="1.1" baseProfile="full" width="800" height="400" style="width: 800px; height: 400px; background: white; border: 1px solid black; overflow: hidden; cursor: default;">
      <defs><filter x="0" y="0" width="100%" height="100%" id="opaqueBackground"><feFlood flood-color="#ffffff" flood-opacity="1" result="bg"></feFlood><feMerge><feMergeNode in="bg"></feMergeNode><feMergeNode in="SourceGraphic"></feMergeNode></feMerge></filter></defs>
      <g class="view-label"><text text-anchor="start" x="742" y="352" transform="rotate(0)" id="id1" style="undefined" class="label">observed</text></g>
      <g class="jbresidue"></g><g class="jbresidue"><text text-anchor="middle" x="35" y="60" dx="0" dy="0" transform="rotate(0 35,60)" id="id19" class="aa">A</text></g>
      <g class="jbresidue"><text text-anchor="middle" x="55" y="60" dx="0" dy="0" transform="rotate(0 55,60)" id="id20" class="aa">A</text></g>
      <g class="jbfraglines"><line x1="45" y1="67" x2="45" y2="37" id="id21" stroke="#000000" stroke-width="1"></line><line x1="45" y1="37" x2="55" y2="27" id="id22" stroke="#000000" stroke-width="1"></line><line x1="45" y1="67" x2="35" y2="77" id="id23" stroke="#999999" stroke-width="1" style="visibility: hidden;"></line>
        <text text-anchor="start" x="45" y="30" dx="0" dy="0" transform="rotate(-45 45,30)" id="id24" fill="#000000">y6</text>
        <text text-anchor="start" x="38" y="90" dx="0" dy="0" transform="rotate(-45 38,90)" id="id25" style="visibility: hidden;">b1</text></g>
…

My code

import requests
import urllib.request
import time
from bs4 import BeautifulSoup
...
print(soup.select('g[class="jbfraglines"]'))
>> []
print(soup.find_all('g[class="jbfraglines"]'))
>> []
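
One thing worth checking in the two calls above: find_all takes a tag name plus attribute filters, not a CSS selector, so find_all('g[class="jbfraglines"]') looks for a tag literally named g[class="jbfraglines"] and returns [] on any page. The sketch below runs against a trimmed, hard-coded copy of the HTML from this question (SAMPLE is an assumption, not the live response) and uses the built-in html.parser. It only illustrates the calling conventions; if select still returns [] on the real page, the <g> elements are probably not in the downloaded HTML at all.

```python
from bs4 import BeautifulSoup

# Assumption: a trimmed, hard-coded copy of the markup shown in the question,
# not the actual response from the server.
SAMPLE = """
<form method="GET" action="peptide_view.pl">
  <div id="xi:container">
    <svg id="xi:svg-container" xmlns="http://www.w3.org/2000/svg">
      <g class="jbfraglines">
        <line x1="45" y1="67" x2="45" y2="37" id="id21"></line>
        <text id="id24" fill="#000000">y6</text>
        <text id="id25" style="visibility: hidden;">b1</text>
      </g>
    </svg>
  </div>
</form>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")

# find_all treats its first argument as a tag NAME, so a selector string matches nothing.
as_tag_name = soup.find_all('g[class="jbfraglines"]')

# Equivalent lookups: keyword filters for find_all, CSS selector for select.
by_find_all = soup.find_all("g", class_="jbfraglines")
by_select = soup.select("g.jbfraglines")

# An id containing a colon is easiest to match with a keyword filter.
by_id = soup.find_all("div", id="xi:container")

# Collect the text of every <text> element inside the matched groups.
labels = [t.get_text() for g in by_find_all for t in g.find_all("text")]
print(labels)
```

Here both the visible label (y6) and the hidden one (b1) are collected; filtering on the style attribute would be an extra step.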

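Since the g is inside <div id="xi:container">, which is itself inside <form method="GET" ...>, I also tried select and find_all on those elements, but they too returned either an empty list or an error.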

print(soup.find_all('div[id="xi:container"]'))
>> []
print(soup.select('div[id="xi:container"]'))
>> UnicodeEncodeError: 'cp932' codec can't encode character '\xa0' in position 114: illegal multibyte sequence
print(soup.select('form'))
>> UnicodeEncodeError: 'cp932' codec can't encode character '\xa0' in position 114: illegal multibyte sequence
print(soup.find_all('form'))
>> UnicodeEncodeError: 'cp932' codec can't encode character '\xa0' in position 114: illegal multibyte sequence
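
The UnicodeEncodeError here is raised by print while writing to a cp932 console, not by BeautifulSoup, which suggests these three calls did find something containing a non-breaking space (\xa0). A minimal sketch of one workaround, assuming a cp932 terminal; console_safe is a hypothetical helper name and the sample string is made up:

```python
def console_safe(text, encoding="cp932"):
    """Replace characters the console encoding cannot represent with '?'."""
    return text.encode(encoding, errors="replace").decode(encoding)


# Made-up sample containing the same character the traceback complains about.
sample = "peptide\xa0view"
printable = console_safe(sample)
print(printable)
```

In the original code this would be applied to the string form of the result, for example console_safe(str(soup.select('form'))). Printing only len(...) of the result is another way to confirm a match without hitting the console encoding.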

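response = requests.get(link) returned <Response [200]>, so I am sure I am reaching the correct page. What is going wrong? Do I need to do something special to scrape text inside a form or an svg?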

Another thing I noticed is that this HTML includes <script type="text/javascript" src="../templates/peptide_view.js?2.006001"></script>. I am not sure whether the problem is related to JavaScript.
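
One way to test the JavaScript suspicion is to check what the downloaded HTML actually contains, since requests does not execute scripts. The sketch below is only an illustration under stated assumptions: diagnose is a hypothetical helper, and STATIC_HTML is an invented stand-in for response.text in which the container is empty, which is how the page might look if peptide_view.js draws the SVG in the browser.

```python
from bs4 import BeautifulSoup


def diagnose(html):
    """Summarise whether the SVG content is present in the raw HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return {
        "has_container": soup.find("div", id="xi:container") is not None,
        "fragline_groups": len(soup.find_all("g", class_="jbfraglines")),
        "script_srcs": [s["src"] for s in soup.find_all("script", src=True)],
    }


# Assumption: invented stand-in for response.text, NOT the real server output.
STATIC_HTML = """
<script type="text/javascript" src="../templates/peptide_view.js?2.006001"></script>
<form method="GET" action="peptide_view.pl">
  <div id="xi:container"></div>
</form>
"""

report = diagnose(STATIC_HTML)
print(report)
```

If the real response.text produces a similar report (container present, zero jbfraglines groups), the labels would have to be obtained from a browser-driven tool or from whatever data the script itself loads.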

0 answers