How do I scrape content behind JavaScript hash links?

Asked: 2018-10-18 20:22:32

Tags: javascript node.js web-scraping puppeteer

Hello, I'm new to web scraping with Puppeteer, and I'm currently facing the following problem:

The site I'm trying to extract information from has a Bootstrap table with typical JS pagination, like this example: https://getbootstrap.com/docs/4.1/components/pagination/

When I inspect the page's HTML with the Chrome inspector, all I can see is 2, and when I check the link's location I see

https://webpage.com/works#

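How can I find out how many pages there are in total, and how do I click through them? I don't understand how to reach each page of this kind of pagination.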

Thanks!

2 Answers:

Answer 0: (score: 0)

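There is no foolproof method, but I handle pagination in this order:

  • Wait for the target element to appear
  • Collect the data from the target
  • Remove the target element
  • Click the Next button
  • ...keep looping until there is no Next button, or the content does not load even after waiting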

Proof of concept:

Target HTML code:

<!-- Copied from: https://jsfiddle.net/solodev/yw7y4wez -->
<!DOCTYPE html>
<html>

<head>
    <meta http-equiv="content-type" content="text/html; charset=UTF-8">
    <title>Pagination Example</title>
    <meta name="robots" content="noindex, nofollow">
    <meta name="googlebot" content="noindex, nofollow">
    <meta name="viewport" content="width=device-width, initial-scale=1">

    <script type="text/javascript" src="https://ajax.googleapis.com/ajax/libs/jquery/1.12.4/jquery.min.js"></script>
    <link rel="stylesheet" type="text/css" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css">
    <script type="text/javascript" src="https://www.solodev.com/assets/pagination/jquery.twbsPagination.js"></script>
    <style type="text/css">
        .container {
            margin-top: 20px;
        }
        
        .page {
            display: none;
        }
        
        .page-active {
            display: block;
        }
    </style>

    <script type="text/javascript">
        window.onload = function() {

            $('#pagination-demo').twbsPagination({
                totalPages: 5,
                // the page shown on start
                startPage: 1,

                // maximum visible pages
                visiblePages: 5,

                initiateStartPageClick: true,

                // template for pagination links
                href: false,

                // variable name in href template for page number
                hrefVariable: '{{number}}',

                // Text labels
                first: 'First',
                prev: 'Previous',
                next: 'Next',
                last: 'Last',

                // carousel-style pagination
                loop: false,

                // callback function
                onPageClick: function(event, page) {
                    $('.page-active').removeClass('page-active');
                    $('#page' + page).addClass('page-active');
                },

                // pagination Classes
                paginationClass: 'pagination',
                nextClass: 'next',
                prevClass: 'prev',
                lastClass: 'last',
                firstClass: 'first',
                pageClass: 'page',
                activeClass: 'active',
                disabledClass: 'disabled'

            });

        }
    </script>

</head>

<body>
    <div class="container">
        <div class="jumbotron page" id="page1">
            <div class="container">
                <h1 class="display-3">Adding Pagination to your Website</h1>
                <p class="lead">In this article we teach you how to add pagination, an excellent way to navigate large amounts of content, to your website using a jQuery Bootstrap Plugin.</p>
                <p><a class="btn btn-lg btn-success" href="https://www.solodev.com/blog/web-design/adding-pagination-to-your-website.stml" role="button">Learn More</a></p>
            </div>
        </div>
        <div class="jumbotron page" id="page2">
            <h1 class="display-3">Not Another Jumbotron</h1>
            <p class="lead">Cras justo odio, dapibus ac facilisis in, egestas eget quam. Fusce dapibus, tellus ac cursus commodo, tortor mauris condimentum nibh, ut fermentum massa justo sit amet risus.</p>
            <p><a class="btn btn-lg btn-success" href="#" role="button">Sign up today</a></p>
        </div>
        <div class="jumbotron page" id="page3">
            <h1 class="display-3">Data. Data. Data.</h1>
            <p>This example is a quick exercise to illustrate how the default responsive navbar works. It's placed within a <code>.container</code> to limit its width and will scroll with the rest of the page's content.
            </p>
            <p>
                <a class="btn btn-lg btn-primary" href="../../components/navbar" role="button">View navbar docs »</a>
            </p>
        </div>
        <div class="jumbotron page" id="page4">
            <h1 style="-webkit-user-select: auto;">Buy Now!</h1>
            <p class="lead" style="-webkit-user-select: auto;">Cras justo odio, dapibus ac facilisis in, egestas eget quam. Fusce dapibus, tellus ac cursus commodo, tortor mauris condimentum nibh, ut fermentum massa justo sit amet.</p>
            <p style="-webkit-user-select: auto;"><a class="btn btn-lg btn-success" href="#" role="button" style="-webkit-user-select: auto;">Get
                    started today</a></p>
        </div>
        <div class="jumbotron page" id="page5">
            <h1 class="cover-heading">Cover your page.</h1>
            <p class="lead">Cover is a one-page template for building simple and beautiful home pages. Download, edit the text, and add your own fullscreen background photo to make it your own.</p>
            <p class="lead">
                <a href="#" class="btn btn-lg btn-primary">Learn more</a>
            </p>
        </div>
        <ul id="pagination-demo" class="pagination-lg pull-right"></ul>
    </div>

</body>

</html>

Here is a working version of the sample code:

const puppeteer = require('puppeteer');

async function runScraper() {
  let browser = {};
  let page = {};
  const url = 'http://localhost:8080';

  // open the page and wait
  async function navigate() {
    browser = await puppeteer.launch({ headless: false });
    page = await browser.newPage();
    await page.goto(url);
  }

  async function scrapeData() {
    const headerSel = 'h1';
    // wait for the element (page.waitFor was removed in newer Puppeteer releases)
    await page.waitForSelector(headerSel);
    return page.evaluate((selector) => {
      const target = document.querySelector(selector);

      // get the data
      const text = target.innerText;

      // remove element so the waiting function works
      target.remove();
      return text;
    }, headerSel);
  }

  // this is a sample pagination concept
  // it will vary from site to site because not all sites use the same type of pagination

  async function paginate() {
    // manually check if the next button is available or not
    const nextBtnDisabled = !!(await page.$('.next.disabled'));
    if (!nextBtnDisabled) {
      // the button is not disabled, so click it
      await page.evaluate(() => document.querySelector('.next').click());

      // brief fixed pause so the page can update after the click
      await new Promise((resolve) => setTimeout(resolve, 100));
      return true;
    }
    console.log({ nextBtnDisabled });
    return false;
  }

  /**
   * Scraping Logic
   */
  await navigate();

  // Scrape 5 pages
  for (const pageNum of [...Array(5).keys()]) {
    const title = await scrapeData();
    console.log(pageNum + 1, title);
    await paginate();
  }

  // close the browser so the Node process can exit
  await browser.close();
}

runScraper().catch(console.error);

Result:

Server running at 8080
1 'Adding Pagination to your Website'
2 'Not Another Jumbotron'
3 'Data. Data. Data.'
4 'Buy Now!'
5 'Cover your page.'
{ nextBtnDisabled: true }

I haven't shared the server code; it basically just serves the HTML snippet above.
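
The proof of concept hard-codes five pages. If the total is not known up front, one option is to keep clicking Next until it is disabled and count the pages visited. The following is only a sketch under that assumption: `collectAllPages` and `puppeteerHandlers` are hypothetical names, the selectors `h1`, `.next` and `.next.disabled` come from the example above, and `maxPages` is an arbitrary safety cap.

```javascript
// Generic loop: scrape, then move on while a usable Next button exists.
// The three callbacks are supplied by the caller (hypothetical interface).
async function collectAllPages({ scrape, hasNext, goNext, maxPages = 1000 }) {
  const results = [];
  for (let i = 0; i < maxPages; i++) {
    results.push(await scrape());
    if (!(await hasNext())) break; // Next is missing or disabled: last page
    await goNext();
  }
  return results;
}

// One possible wiring for Puppeteer, assuming `page` is an already opened Page
// and the markup matches the example HTML above.
function puppeteerHandlers(page) {
  return {
    scrape: async () => {
      await page.waitForSelector('h1');
      return page.evaluate(() => {
        const target = document.querySelector('h1');
        const text = target.innerText;
        target.remove(); // so the next wait picks up the following heading
        return text;
      });
    },
    hasNext: async () => !(await page.$('.next.disabled')),
    goNext: async () => {
      await page.evaluate(() => document.querySelector('.next').click());
      // brief fixed pause; a real site may need a more specific wait
      await new Promise((resolve) => setTimeout(resolve, 100));
    },
  };
}
```

With an open page, `const titles = await collectAllPages(puppeteerHandlers(page));` would return one entry per page, so `titles.length` is the total page count.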

Answer 1: (score: 0)

Use the footerTemplate option together with displayHeaderFooter to show page numbers natively through the Puppeteer API:

await page.pdf({
  path: 'hacks.pdf',
  format: 'A4',
  displayHeaderFooter: true,
  footerTemplate: '<div><div class="pageNumber"></div> <div>/</div><div class="totalPages"></div></div>'
});

https://github.com/puppeteer/puppeteer/blob/master/docs/api.md#pagepdfoptions

footerTemplate: HTML template for the print footer.

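// Should be valid HTML markup with the following CSS classes used to inject printing values into them:

// - date: formatted print date

// - title: document title

// - url: document location

// - pageNumber: current page number

// - totalPages: total pages in the document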
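
A small sketch of how the footer options might be assembled before calling `page.pdf()`. The helper names `buildFooterTemplate` and `buildPdfOptions` are hypothetical. The explicit font size and bottom margin are assumptions: footer text is often reported as too small to read, or clipped, without them, so verify both against your Puppeteer version.

```javascript
// Hypothetical helper: builds a "current / total" footer using the
// pageNumber and totalPages classes that Puppeteer fills in when printing.
function buildFooterTemplate(fontSizePx = 10) {
  return (
    `<div style="font-size:${fontSizePx}px; width:100%; text-align:center;">` +
    '<span class="pageNumber"></span> / <span class="totalPages"></span>' +
    '</div>'
  );
}

// Hypothetical helper: options object intended for page.pdf().
function buildPdfOptions(path) {
  return {
    path,
    format: 'A4',
    displayHeaderFooter: true,
    footerTemplate: buildFooterTemplate(),
    // Assumption: leave room at the bottom so the footer is not clipped.
    margin: { bottom: '20mm' },
  };
}
```

It would be used as `await page.pdf(buildPdfOptions('hacks.pdf'));` on an open page.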