Extracting attribute values with Beautiful Soup

Time: 2013-08-11 13:53:48

Tags: python beautifulsoup

Below is the part of the site I am trying to extract the video title from:

</div>
<div class="yt-lockup-content">
        <h3 class="yt-lockup-title">
<a class="yt-uix-sessionlink yt-uix-tile-link yt-uix-contextlink 
      yt-ui-ellipsis yt-ui-ellipsis-2"
    dir="ltr"
      title="Harder Polynomials"
    data-sessionlink="ei=fYsHUvSLA8uzigLq74CABQ&amp;ved=CB8Qvxs&amp;feature=c4-videos-u"
    href="/watch?v=LHvQeBRLFn8"
  >
    Harder Polynomials
</a>

I want to extract the video title (Harder Polynomials) from it. I tried the following code:

import requests
from bs4 import BeautifulSoup

resp = requests.get('http://www.youtube.com/user/sachinabey/videos')
soup = BeautifulSoup(resp.text)

a = soup.findAll('a', attrs={'class': 'yt-uix-sessionlink yt-uix-tile-link yt-uix-  contextlink yt-ui-ellipsis yt-ui-ellipsis-2'})

a is empty. What am I doing wrong, and how can I extract the title from here?

2 answers:

Answer 0: (score: 0)

Here is a working solution that prints all the video titles from the page:

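import requests
from bs4 import BeautifulSoup

resp = requests.get('http://www.youtube.com/user/sachinabey/videos')

# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(resp.text, 'html.parser')
# Each video title is the text of the link inside an <h3 class="yt-lockup-title">
for title in soup.findAll('h3', attrs={'class': 'yt-lockup-title'}):
    print(title.find('a').text.strip())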

Prints:

Harder Polynomials
Summing tan inverse
iGraph tutorial
Integrate e^(-x^2)
Chord of Contact to Ellipse
Equation of Tangents of an Ellipse or Hyperbola
Motion and Air Resistance
Projectile Motion
Regression in R
EM Algorithm Derivation
Cosine Rule proof
R writing functions
Proof of square root 2 being irrational
R for loops and while loops
Chi Squared Hypothesis Testing
Integration of Trignometric Functions
Sequences and Series Examples
ARCH GARCH Model Motivation
Integration by Parts
Differentiate Inverse Trigonometry
Simple Harmonic Motion Examples Part II
Simple Harmonic Motion Examples Part I
Simple Harmonic Motion -  Introduction
HSC Solutions 2009 3 Unit Q4
HSC 3 Unit Solutions 2009 Q2
HSC 3 Unit Maths 2009 Solutions
Parallel For Loops
Change of Base for Logarithms - Examples
Divisibility by 3 or 9
Multiplying by 11
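
The loop above reads the link text. Since the question is about attribute values, here is a minimal sketch that reads the title attribute instead. It assumes the markup shown in the question (trimmed to the relevant tags, with the h3 and div closed so it can run offline), and the helper name extract_titles is hypothetical.

```python
from bs4 import BeautifulSoup

# HTML fragment based on the one in the question, trimmed to the relevant tags.
HTML = """
<div class="yt-lockup-content">
  <h3 class="yt-lockup-title">
    <a class="yt-uix-sessionlink yt-uix-tile-link yt-uix-contextlink
              yt-ui-ellipsis yt-ui-ellipsis-2"
       dir="ltr"
       title="Harder Polynomials"
       href="/watch?v=LHvQeBRLFn8">
      Harder Polynomials
    </a>
  </h3>
</div>
"""


def extract_titles(html):
    """Return the title attribute of the first link in each h3.yt-lockup-title."""
    soup = BeautifulSoup(html, "html.parser")
    titles = []
    for heading in soup.find_all("h3", class_="yt-lockup-title"):
        link = heading.find("a")
        # .get() returns None instead of raising KeyError if the attribute is absent.
        if link is not None and link.get("title"):
            titles.append(link["title"])
    return titles


titles = extract_titles(HTML)
print(titles)
```

Reading the attribute avoids the surrounding whitespace that the link text carries, so no strip() call is needed.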

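Answer 1: (score: 0)

I think the error is in yt-uix- contextlink: the stray whitespace inside the class name looks like a typo. If you correct it to yt-uix-contextlink, the lookup works.

Demo: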

>>> s
'<div class="yt-lockup-content">\n        <h3 class="yt-lockup-title">\n<a class="yt-uix-sessionlink yt-uix-tile-link yt-uix-contextlink \n      yt-ui-ellipsis yt-ui-ellipsis-2"\n    dir="ltr"\n      title="Harder Polynomials"\n    data-sessionlink="ei=fYsHUvSLA8uzigLq74CABQ&amp;ved=CB8Qvxs&amp;feature=c4-videos-u"\n    href="/watch?v=LHvQeBRLFn8"\n  >\n    Harder Polynomials\n</a>'
>>> soup=BeautifulSoup(s)
>>> soup.findAll('a', attrs={'class': 'yt-uix-sessionlink yt-uix-tile-link yt-uix-contextlink yt-ui-ellipsis yt-ui-ellipsis-2'})
[<a class="yt-uix-sessionlink yt-uix-tile-link yt-uix-contextlink yt-ui-ellipsis yt-ui-ellipsis-2" data-sessionlink="ei=fYsHUvSLA8uzigLq74CABQ&amp;ved=CB8Qvxs&amp;feature=c4-videos-u" dir="ltr" href="/watch?v=LHvQeBRLFn8" title="Harder Polynomials">
    Harder Polynomials
</a>]

Alternatively, you can pass a list of classes; a tag matches if it has any of them.

>>> soup.findAll('a', attrs={'class': ['yt-uix-sessionlink', 'yt-uix-tile-link', 'yt-uix-contextlink', 'yt-ui-ellipsis', 'yt-ui-ellipsis-2']})
[<a class="yt-uix-sessionlink yt-uix-tile-link yt-uix-contextlink yt-ui-ellipsis yt-ui-ellipsis-2" data-sessionlink="ei=fYsHUvSLA8uzigLq74CABQ&amp;ved=CB8Qvxs&amp;feature=c4-videos-u" dir="ltr" href="/watch?v=LHvQeBRLFn8" title="Harder Polynomials">
    Harder Polynomials
</a>]
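
Matching the full class string is fragile, because any difference in the string breaks the lookup. As a hedged alternative, the sketch below matches on a single class name with class_, or on a combination of classes with a CSS selector. It runs against a shortened copy of the question's markup, and the choice of yt-uix-tile-link as the identifying class is an assumption.

```python
from bs4 import BeautifulSoup

# Shortened copy of the link from the question, wrapped in its <h3>.
HTML = '''<h3 class="yt-lockup-title">
<a class="yt-uix-sessionlink yt-uix-tile-link yt-uix-contextlink
      yt-ui-ellipsis yt-ui-ellipsis-2"
    title="Harder Polynomials"
    href="/watch?v=LHvQeBRLFn8">
    Harder Polynomials
</a></h3>'''

soup = BeautifulSoup(HTML, "html.parser")

# class_ with a single name matches any tag whose class list contains that name.
by_class = soup.find_all("a", class_="yt-uix-tile-link")

# A CSS selector can require several classes at once, in any order.
by_css = soup.select("a.yt-uix-tile-link.yt-uix-sessionlink")

# get_text(strip=True) drops the whitespace around the link text.
link_texts = [a.get_text(strip=True) for a in by_class]
print(link_texts)
```

Both lookups return the same tag here, and neither depends on the order or spacing of the names in the class attribute.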