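Extracting tag information from HTML text

Time: 2016-12-09 12:07:44

Tags: python web-scraping beautifulsoup mechanize

I am trying to scrape a web page and got the following text. How can I extract the src value from the string below? Can anyone explain how to extract arbitrary key-value data from text like this?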

<img id="imgsglx2" onerror="this.alt=not select the picture or pictures cannot be displayed" src="http://114.255.167.200:8092/cidasEN/extend/sglx_images/UTYP/221.jpg" style=" border: 0; padding: 0; margin: 0;height:110px;width:110px; "/>

I also need the text inside the textarea tag.

  <textarea id="sgmsbck" name="sgms" style="width:98%;height:120px">On August. 19, 2014\uff0c08:30, Mr. Xiao who drove lu K9**** MPV from south to north along the TaiShang south Road, when Mr. Xiao drove lu K9**** MPV turn west at the crossing of Chengshan road and TaiShang south road, RongCheng City. Due to wrong behavior towards pedestrians at pedestrian crossings, the left part of the lu K9**** MPV impacted with Mr. Song(Pedestrian) from south to north across ChengShan Road of the pedestrian crossings. Causing the lu K9**** MPV damaged, Mr. Song injured.</textarea>

2 answers:

Answer 0: (score: 0)

Since you tagged the question with beautifulsoup, I assume you want to use it to parse your HTML content.

import bs4

content = """<img id="imgsglx2" onerror="this.alt=not select the picture or pictures cannot be displayed" src="http://114.255.167.200:8092/cidasEN/extend/sglx_images/UTYP/221.jpg" style=" border: 0; padding: 0; margin: 0;height:110px;width:110px; "/>
<textarea id="sgmsbck" name="sgms" style="width:98%;height:120px">On August. 19, 2014\uff0c08:30, Mr. Xiao who drove lu K9**** MPV from south to north along the TaiShang south Road, when Mr. Xiao drove lu K9**** MPV turn west at the crossing of Chengshan road and TaiShang south road, RongCheng City. Due to wrong behavior towards pedestrians at pedestrian crossings, the left part of the lu K9**** MPV impacted with Mr. Song(Pedestrian) from south to north across ChengShan Road of the pedestrian crossings. Causing the lu K9**** MPV damaged, Mr. Song injured.</textarea>
"""

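soup = bs4.BeautifulSoup(content, 'lxml')  # needs the lxml package; 'html.parser' works without it

img = soup.find('img')  # locate the first img tag
text_area = soup.find('textarea')  # locate the first textarea tag

# print() with parentheses runs on both Python 2 and Python 3
print(img['id'])  # value of the 'id' attribute of the img tag
print(img['src'])  # value of the 'src' attribute
print(text_area.text)  # text content of the textarea tag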
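The question also asks how to pull out any key-value data, not just src. One way to generalize the snippet above is to read every attribute of a tag from its .attrs dictionary. The following is a minimal sketch rather than a definitive implementation: the helper name tag_attributes is hypothetical, the HTML is shortened from the question, and the standard-library 'html.parser' backend is used so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Shortened version of the HTML from the question (attributes trimmed for brevity).
html = (
    '<img id="imgsglx2" '
    'src="http://114.255.167.200:8092/cidasEN/extend/sglx_images/UTYP/221.jpg"/>'
    '<textarea id="sgmsbck" name="sgms">Mr. Song injured.</textarea>'
)

# 'html.parser' ships with Python, so no extra parser package is needed.
soup = BeautifulSoup(html, "html.parser")


def tag_attributes(soup, name):
    """Hypothetical helper: attributes of the first <name> tag as a dict, or {} if absent."""
    tag = soup.find(name)
    return dict(tag.attrs) if tag is not None else {}


# Every attribute of the img tag as key-value pairs.
img_attrs = tag_attributes(soup, "img")
for key, value in img_attrs.items():
    print(key, "=", value)

# Text between <textarea> and </textarea>; guard against a missing tag.
textarea = soup.find("textarea")
textarea_text = textarea.get_text() if textarea is not None else ""
print(textarea_text)
```

Checking for None before touching the result matters on real pages, because soup.find returns None when no matching tag exists.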

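Answer 1: (score: 0)

beautifulsoup can help:

A tag can have any number of attributes. Suppose a tag has an attribute "class" whose value is "boldest". You can access the tag's attributes by treating the tag like a dictionary: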

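tag['class']
# u'boldest'
# (Beautiful Soup 4 returns a list here, ['boldest'], because class can hold several values)

You can also access that dictionary directly as .attrs:

tag.attrs
# {u'class': u'boldest'}
# (in Beautiful Soup 4: {'class': ['boldest']})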

You can get the text inside the tag with .text:

tag.text
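The three access patterns from this answer can be tried together on a small tag. This is only a sketch under stated assumptions: the sample markup, the id value "x1", and the variable names are made up for illustration, and Beautiful Soup 4 with the built-in 'html.parser' backend is assumed.

```python
from bs4 import BeautifulSoup

# Illustrative markup (not from the question): one tag with two attributes.
soup = BeautifulSoup('<b class="boldest" id="x1">Extremely bold</b>', "html.parser")
tag = soup.find("b")

# Dictionary-style access. class is multi-valued, so Beautiful Soup 4 returns a list.
class_value = tag["class"]
# A single-valued attribute such as id comes back as a plain string.
id_value = tag["id"]
# tag.get() returns None for a missing attribute instead of raising KeyError.
missing = tag.get("src")
# The complete attribute dictionary.
all_attrs = tag.attrs
# The text inside the tag; get_text() is the method behind the .text property.
text = tag.get_text()

print(class_value, id_value, missing, all_attrs, text)
```

Using tag.get('src') is the safer choice when scraping pages where some img tags may lack the attribute.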