How to parse and find the next 'td' after specific text in HTML

Time: 2016-10-16 03:33:44

Tags: beautifulsoup html-parsing

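I searched a database for a subset of roughly 1,850 HTML articles and am trying to parse them to find four (4) qualifiers: Equipment ID, Location, Inspector, and Comments. I have two solutions that get me part of the way there, but the piece I am stuck on is looping over the data and returning the unique information (the 4 qualifiers) for each article. Please scroll to the bottom for the subset.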

This code gives the unique information for one qualifier:

import urllib
from bs4 import BeautifulSoup

fname = raw_input("Enter file name: ")
if len(fname) < 1 : fname = "badhtmlsubset.txt"
hand = open(fname).read()

soup = BeautifulSoup(hand, "html.parser")

i = 1
for stuff in soup.findAll(text="Equipment ID:"):
    print i
    print "Equipment ID:", stuff.findNext('td').text,
    #print "Location", stuff.find(text="Location:").findNext('td')   <--Traceback TypeError: find() takes no keyword arguments

    i = i + 1

This code returns the following, but I cannot get the Location, Inspector, or Comments. Equipment ID: V-2 3 Equipment ID: Well 79
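The commented-out line fails, as far as I can tell, because each `stuff` yielded by `findAll(text=...)` is a text node (a `NavigableString`), which subclasses the built-in string type, so `stuff.find(...)` resolves to the ordinary string `find` method rather than a BeautifulSoup search. A minimal sketch of the same error, with a plain string standing in for the text node (the stand-in is an assumption for illustration, not something taken from the file):

```python
# A plain str stands in for the text node matched by findAll(text="Equipment ID:").
# Assumption: the real node is a str subclass that does not override find().
label = "Equipment ID:"

raised = False
try:
    # str.find() accepts only positional arguments, so a keyword raises TypeError.
    label.find(text="Location:")
except TypeError as exc:
    raised = True
    print("TypeError:", exc)

# Positional use is a plain substring search and returns an index, not a tag.
print(label.find("ID"))
```

Navigation helpers such as `findNext` are still available on the text node, which is why `stuff.findNext('td')` works in the loop above.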

This code prints the correct format, but it just repeats the same information over and over instead of printing unique values.

import urllib
from bs4 import BeautifulSoup

fname = raw_input("Enter file name: ")
if len(fname) < 1 : fname = "badhtmlsubset.txt"
hand = open(fname).read()

soup = BeautifulSoup(hand, "html.parser")

#This code prints out the correct format, but does not print unique information for each loop. Just repeats the same information over and over.
i=1
for stuff in soup.findAll(text="Equipment ID:"):
    print "Count=", i
    equipid = soup.find(text="Equipment ID:").findNext('td')
    location = soup.find(text="Location:").findNext('td')
    inspector = soup.find(text="Inspector:").findNext('td')
    body = soup.find(text="Comments:").findNext('td')
    print "Equipment ID:", equipid.text,"Location:", location.text,"Inspector:", inspector.text
    print "Comments:", body.text
    i = i + 1
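The repetition most likely comes from calling `soup.find(...)` inside the loop: it searches from the top of the document every time, so it always returns the first article's cells. One possible sketch is to search within each article instead. It assumes every article sits in its own `<table>` and that each label cell is followed directly by its value cell, as in the subset below. The `SAMPLE` string, the `LABELS` tuple, and the `extract_records` helper are hypothetical names introduced here, and the code is written for Python 3 rather than the Python 2 used above.

```python
from bs4 import BeautifulSoup

# Hypothetical sample, trimmed from the two articles in the subset.
SAMPLE = """
<table><tbody>
<tr><td>Location:</td><td>NMWSS</td></tr>
<tr><td>Equipment ID:</td><td>V-2</td></tr>
<tr><td>Inspector:</td><td>Ray Rankin</td></tr>
<tr><td>Comments:</td><td><p>OUT OF SERVICE</p></td></tr>
</tbody></table>
<table><tbody>
<tr><td>Location:</td><td>SMWSS</td></tr>
<tr><td>Equipment ID:</td><td>Well 79</td></tr>
<tr><td>Inspector:</td><td>Ronnie Harleston</td></tr>
<tr><td>Comments:</td><td><p>Repair insulation as needed.</p></td></tr>
</tbody></table>
"""

LABELS = ("Equipment ID:", "Location:", "Inspector:", "Comments:")


def extract_records(html):
    """Return one dict per <table>, mapping each label to its value text."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for table in soup.find_all("table"):
        record = {}
        for label in LABELS:
            # Search inside this table only, not the whole document.
            node = table.find(string=label)
            cell = node.find_next("td") if node is not None else None
            # Missing labels are recorded as None rather than raising.
            record[label] = cell.get_text(" ", strip=True) if cell else None
        records.append(record)
    return records


for count, record in enumerate(extract_records(SAMPLE), start=1):
    print("Count=", count)
    for label in LABELS:
        print(label, record[label])
```

Scoping each lookup to `table` is what keeps the four values tied to the same article; if an article were not wrapped in its own table, a different grouping element would be needed.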

I would like the following output for each iteration over the data:

Equipment ID: Well 53
Location: NMWSS
Inspector: Bob Bobberson
Comments: THE SHELL AND BOTTOM HEAD HAVE PITTING AND GENERAL CORROSION THAT IS BELOW
          THE T MIN FOR THE DESIGN PRESSURE OF THIS VESSEL AS AN ALTERNATIVE TO KEEP THE
          VESSEL IN SERVICE A NEW T MIN FOR THE SHELL AND HEADS CAN BE ASSUMED. THE
          DEEPEST PITS COULD BE REPAIRED AND THE SHELL T MIN SET AT
          0.400&Eacute;?&ugrave; AND THE BOTTOM HEAD T MIN SET AT 0.640&Eacute;?&ugrave;
          WHICH WOULD GIVE A SLIGHT AMOUNT OF CORROSION ALLOWANCE. AT THESE NEW VALUES
          THE VESSEL COULD BE OPERATED AT 92 PSI MAWP. THIS WOULD BE AT THE OWNERS
          DISCRETION. IT APPEARS THAT THE PRV ON THIS VESSEL IS SET AT 50 PSI.

The subset here represents 2 articles:

I named this badhtmlsubset.txt

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <meta name="generator" content=
  "HTML Tidy for Linux (vers 25 March 2009), see www.w3.org" />

  <title></title>
</head>

<body>
  "

  <p text-align:=""><img src="" alt="" panfee="" softenever="" width="" height="" /></p>

  <table width:="" border="1">
    <tbody>
      <tr>
        <td width:="">Field:</td>

        <td>Pan Fee</td>
      </tr>

      <tr>
        <td>Location:</td>

        <td>NMWSS</td>
      </tr>

      <tr>
        <td>Equipment ID:</td>

        <td>V-2</td>
      </tr>

      <tr>
        <td>Date:</td>

        <td>07/17/2009</td>
      </tr>

      <tr>
        <td>Inspector:</td>

        <td>Ray Rankin</td>
      </tr>

      <tr>
        <td rowspan="">Report(s):</td>

        <td>
          <p>{rsfiles
          path=""data/pan_fee/field/api510/softener_v_2/2009/v2_summary_071709.pdf""}</p>

          <p>{rsfiles
          path=""data/pan_fee/field/api510/softener_v_2/2009/v2_data_071709.pdf""}</p>

          <p>{rsfiles path=""data/pan_fee/field/api510/softener_v_2/2009/v1_calcs.pdf""}
          (same calcs as V-1)</p>

          <p>{rsfiles
          path=""data/pan_fee/field/api510/softener_v_2/2009/v2_acad_071709.pdf""}</p>
        </td>
      </tr>

      <tr>
        <td>{rsfiles
        path=""data/pan_fee/field/api510/softener_v_2/2009/v2_u1a.pdf""}</td>
      </tr>

      <tr>
        <td>{rsfiles
        path=""data/pan_fee/field/api510/softener_v_2/2009/panfee_v2_ticketclosed_012010.pdf""}</td>
      </tr>

      <tr>
        <td>Comments:</td>

        <td>
          <p>THE SHELL AND BOTTOM HEAD HAVE PITTING AND GENERAL CORROSION THAT IS BELOW
          THE T MIN FOR THE DESIGN PRESSURE OF THIS VESSEL AS AN ALTERNATIVE TO KEEP THE
          VESSEL IN SERVICE A NEW T MIN FOR THE SHELL AND HEADS CAN BE ASSUMED. THE
          DEEPEST PITS COULD BE REPAIRED AND THE SHELL T MIN SET AT
          0.400&Eacute;?&ugrave; AND THE BOTTOM HEAD T MIN SET AT 0.640&Eacute;?&ugrave;
          WHICH WOULD GIVE A SLIGHT AMOUNT OF CORROSION ALLOWANCE. AT THESE NEW VALUES
          THE VESSEL COULD BE OPERATED AT 92 PSI MAWP. THIS WOULD BE AT THE OWNERS
          DISCRETION. IT APPEARS THAT THE PRV ON THIS VESSEL IS SET AT 50 PSI.<br />
          <br />
          FOR FULL CODE COMPLIANCE THE VESSEL SHALL BE DE-RATED IN ACCORDANCE WITH THE
          CALIFORNIA OCCUPOATIONAL SAFETY - PRESSURE VESSEL UNIT <a href="" target=
          "">CIRCULAR LETTER PV-2006-2</a> AND <a href="" target="">CIRCULAR LETTER
          PV-2001-1</a>.</p>

          <p>&nbsp;</p>

          <p><strong color:="">OUT OF SERVICE, TYE HAMMOND, 10/05/2009</strong></p>
        </td>
      </tr>

      <tr>
        <td>UltraPipe Unit ID</td>

        <td>PAN FEE</td>
      </tr>

      <tr>
        <td>UltraPipe Circuit ID</td>

        <td>7888</td>
      </tr>
    </tbody>
  </table>

  <p>&nbsp;</p>

  <p>&nbsp;</p>

  <p>&nbsp;</p>

  <p>&nbsp;</p>" "

  <p text-align:=""><img src="" border="border" /></p>

  <p text-align:="">?&#711;</p>

  <p text-align:=""><strong>Please select the desired piping, at the Ethel D location,
  from the submenu.</strong><br />
  <span class="">(future location of data - added for presentation)</span></p>

  <p text-align:="">?&#711;</p>

  <h2 text-align:=""><span class="">There are <strong class="">134</strong> active wells
  listed with DOGGR.</span></h2>" "

  <div text-align:="">
    {vsig}/etheld/api570/flowlines/well_79_fl/2014{/vsig}
  </div>

  <p text-align:="">&nbsp;</p>

  <table width:="" border="1">
    <tbody text-align:="">
      <tr text-align:="">
        <td width:="" text-align:="">Field:</td>

        <td text-align:="">Ethel D</td>
      </tr>

      <tr text-align:="">
        <td text-align:="">Location:</td>

        <td text-align:="">SMWSS</td>
      </tr>

      <tr text-align:="">
        <td text-align:="">Equipment ID:</td>

        <td text-align:="">Well 79</td>
      </tr>

      <tr text-align:="">
        <td text-align:="">Inspection Dates:</td>

        <td text-align:="">Last: 07/30/2014 - Next: 07/30/2019</td>
      </tr>

      <tr text-align:="">
        <td text-align:="">Inspector:</td>

        <td text-align:="">Ronnie Harleston</td>
      </tr>

      <tr text-align:="">
        <td text-align:="" rowspan="">Report(s):</td>

        <td text-align:="">
          <p><strong><span text-decoration:="">2014 INSPECTION
          DATA:</span></strong><br />
          {rsfiles
          path=""data/etheld/api570/flowlines/well79/2014/etheld_well79_flowline_report_073014.pdf""}</p>

          <p>{rsfiles
          path=""data/etheld/api570/flowlines/well79/2014/etheld_well79_flowline_ultrapipe_073014.pdf""}</p>

          <p>{rsfiles
          path=""data/etheld/api570/flowlines/well79/2014/etheld_well79_flowline_field_drawing_073014.pdf""}<br />

          <br />
          <strong><span text-decoration:="">2009 INSPECTION DATA:</span></strong><br />
          {rsfiles
          path=""data/etheld/api570/flowlines/well79/2009/well79_summary_073009.pdf""}<br title=""
          inspection="" /></p>

          <p>{rsfiles
          path=""data/etheld/api570/flowlines/well79/2009/well79_data_073009.pdf""}</p>
        </td>
      </tr>

      <tr text-align:="">
        <td text-align:="">Find in <a href="" target="">Virtual Tour</a></td>
      </tr>

      <tr text-align:="">
        <td text-align:="">Comments:</td>

        <td text-align:="">
          <p><span text-decoration:=""><strong>Ultrasonic A-Scan Thickness
          Inspection:</strong></span><br />
          Ultrasonic A-Scan thickness measurements were completed in accordance with
          Applus RTD established procedures. 3 thickness measurement locations (TMLs)
          were established and thickness measurements were taken at these locations. All
          thickness readings have been entered into Ultra Pipe and after review of the
          data next inspection will be 07/30/2019. The predicted retirement date of the
          circuit is 07/30/2033 based on calculated corrosion rates and a 2 mil per year
          default corrosion rate. There are 0 caution TMLs per the current thickness
          survey taken on 07/30/2014.</p>

          <p><br />
          <span text-decoration:=""><strong>API 570 Visual
          Inspection:</strong></span></p>

          <p>Visual inspection found this piping circuit to be in fair condition. This
          piping circuit externally is insulated. The insulation is found to be in poor
          condition, with missing or damage insulation, and in the areas the insulation
          is missing the surface condition of the piping is covered with light to
          moderate surface rust. The piping supports surface was covered with light to
          moderate surface rust and no corrosion was present. Support hangers were found
          to be in fair condition. The piping was inspected for code compliance issues
          and to identify possible leaks, stresses and any condition that might reduce
          the life of the piping circuit. All piping will be put on a maximum 5 year
          thickness inspection interval per API (class 2 piping) requirements. Some
          piping TML locations may require re-inspection prior to the maximum interval.
          See UltraPIPE data reports for all required inspection dates.</p>

          <p><br />
          Recommendations: Repair insulation as needed.</p>

          <p>All piping will require an API 570 Visual inspection in 5 years.</p>
        </td>
      </tr>

      <tr text-align:="">
        <td text-align:="">UltraPipe Unit ID</td>

        <td text-align:="">ETHEL_D</td>
      </tr>

      <tr text-align:="">
        <td text-align:="">UltraPipe Circut ID</td>

        <td text-align:="">WELL 79</td>
      </tr>
    </tbody>
  </table>

  <p>&nbsp;</p>

  <p text-align:="">&nbsp;</p>

  <p>&nbsp;</p>" beautifulsoup html-parsing
</body>
</html>

0 answers:

No answers