Question

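I am trying to scrape articles from a website and want to get the src of each image. I have made several attempts, but my code can't seem to get all of the src values.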

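I am using Selenium 3.141.0 with Python 3.7. I want to get 4 things: the image src, the link to the full article, the headline, and the article snippet. I can scrape the other three successfully, but not the src. I want to dump all of these details into a pandas DataFrame.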

Here is the HTML of the site I am trying to scrape.

<article class="w29" data-minarticles="1.00">
    <a href="something.html">
        <figure class="left ">
            <span class="img-a is-loaded">
                <img alt="stock image" title="stock image" width="245" height="135" src="pic.JPG" class="">
                <noscript>
                  "<img src="pic.JPG" alt="stock image" title="stock image" width="245" height="135" />"
                </noscript>
             </span>
          </figure>
        <h2>
            <span>
            Article Title
            </span>
        </h2>
        <p>
          "Article snippet"
        </p>
      </a>
      ::after
</article>
<article class="w29" data-minarticles="1.00">
    <a href="something2.html">
        <figure class="left ">
            <span class="img-a is-loaded">
                <img alt="stock image2" title="stock image2" width="245" height="135" src="pic2.JPG" class="">
                <noscript>
                  "<img src="pic2.JPG" alt="stock image2" title="stock image2" width="245" height="135" />"
                </noscript>
             </span>
          </figure>
        <h2>
            <span>
            Article Title 2
            </span>
        </h2>
        <p>
          "Article snippet 2"
        </p>
      </a>
</article>
<article class="w29" data-minarticles="1.00">
    <a href="something3.html">
        <figure class="left ">
            <span class="img-a is-loaded">
                <img alt="stock image3" title="stock image3" width="245" height="135" src="pic3.JPG" class="">
                <noscript>
                  "<img src="pic3.JPG" alt="stock image3" title="stock image3" width="245" height="135" />"
                </noscript>
             </span>
          </figure>
        <h2>
            <span>
            Article Title 3
            </span>
        </h2>
        <p>
          "Article snippet 3"
        </p>
      </a>
</article>

Here is my code:

driver.get(url)

# get sub posts
sub_posts = driver.find_elements_by_class_name("w29")

# get details
sub_list = []
for post in sub_posts:
    # Get the link to the full article
    sub_source = post.find_element_by_tag_name('a').get_attribute('href')
    # Get the src of the post 
    sub_photo = post.find_element_by_tag_name('img').get_attribute('src')
    # Get headline
    sub_headline = post.find_element_by_tag_name('h2').text
    # Get article snippet
    sub_snippet = post.find_element_by_tag_name('p').text
    sub_list.append([sub_photo, sub_source, sub_headline, sub_snippet])

post_df = pd.DataFrame(sub_list, columns=["photo", "source", "headline", "snippet"])

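Here is what I have tried and the result I get in the DataFrame for each attempt, focusing on the line of code that gets the post's src:

Attempt 1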

sub_photo = post.find_element_by_tag_name('img').get_attribute('src')

Result of attempt 1

For whatever reason, it scrapes the first src and returns None for the rest of the articles.

photo      source           headline         snippet
pic.JPG    something.html   Article Title    Article Snippet
None       something2.html  Article Title 2  Article Snippet 2
None       something3.html  Article Title 3  Article Snippet 3

Attempt 2

sub_photo = post.find_element_by_xpath('//*[@id="content"]/div[6]/div[1]/div[2]/article/a/figure/span/img').get_attribute('src')

Result of attempt 2

It scrapes the first src and returns that same first src for the rest of the articles.

photo      source           headline         snippet
pic.JPG    something.html   Article Title    Article Snippet
pic.JPG    something2.html  Article Title 2  Article Snippet 2
pic.JPG    something3.html  Article Title 3  Article Snippet 3

Attempt 3

sub_photo = post.find_element_by_css_selector('a>figure>span>img').get_attribute('innerHTML')

Result of attempt 3

It scrapes the first innerHTML and returns that same first innerHTML for the rest of the articles.

photo       source           headline         snippet
\n<img...   something.html   Article Title    Article Snippet
\n<img...   something2.html  Article Title 2  Article Snippet 2
\n<img...   something3.html  Article Title 3  Article Snippet 3

This is what I am looking for:

photo      source           headline         snippet
pic.JPG    something.html   Article Title    Article Snippet
pic2.JPG   something2.html  Article Title 2  Article Snippet 2
pic3.JPG   something3.html  Article Title 3  Article Snippet 3

If anyone could point me in the right direction, I would appreciate it. Thanks.

Answer 1

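Only a few images are rendered initially, so you can either scroll the page to the bottom to extract all @src values, or extract @src (for visible images) or @data-src (for hidden images):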

sub_photo = post.find_element_by_tag_name('img').get_attribute('src') or post.find_element_by_tag_name('img').get_attribute('data-src')

This returns the value of @src if it is not None, and the value of @data-src if @src is None.
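The other option mentioned in this answer, scrolling to the bottom so that the lazily loaded images receive a real src, could look roughly like the sketch below. The helper name `scroll_to_bottom` and its `pause` and `max_rounds` parameters are hypothetical; the only Selenium call used is `driver.execute_script()`. Whether a jump to the bottom is enough depends on how the site's lazy loader works, since some loaders may only fetch images that actually pass through the viewport.

```python
import time


def scroll_to_bottom(driver, pause=0.5, max_rounds=20):
    """Scroll down until the page height stops growing or max_rounds is reached.

    `driver` is assumed to be a Selenium WebDriver; only its
    execute_script() method is used. Returns the last measured page height.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        # Jump to the current bottom of the page.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazily loaded content time to arrive
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # the page did not grow, so assume everything is loaded
        last_height = new_height
    return last_height
```

Calling `scroll_to_bottom(driver)` right after `driver.get(url)` and before `find_elements_by_class_name("w29")` would be the natural place for it in the question's script.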

Answer 2

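For the first post the URL is in the src attribute; for the posts after it, the URL is in data-src instead (with your code). See the example below: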

for post in sub_posts:   
    ele = post.find_element_by_tag_name('img')
    val = ele.get_attribute('data-src') if ele.get_attribute('data-src') is not None else ele.get_attribute('src')
    print(val)
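Both answers rely on the same fallback idea and differ only in which attribute they check first. A minimal sketch of that idea as a reusable helper follows; the name `image_url` and its `attributes` parameter are hypothetical, and the helper assumes only that the element has a `get_attribute(name)` method, as a Selenium WebElement does. It looks the element up once and treats an empty string the same as a missing attribute.

```python
def image_url(element, attributes=("src", "data-src")):
    """Return the first non-empty value among `attributes`, or None.

    `element` is assumed to expose get_attribute(name), like a Selenium
    WebElement. Attributes are tried in the order given.
    """
    for name in attributes:
        value = element.get_attribute(name)
        if value:  # skips both None and empty strings
            return value
    return None


# Hypothetical usage inside the question's loop:
#     img = post.find_element_by_tag_name('img')
#     sub_photo = image_url(img)
```

Passing `attributes=("data-src", "src")` reproduces the order used in this answer, while the default order matches Answer 1.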

How to successfully get all the src values of images nested under a span tag using Selenium in Python
