What is the best way to scrape specific content from multiple HTML files?

Time: 2017-11-05 18:05:13

Tags: html web-scraping

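I have HTML files for many web pages, each containing a lot of information. I am trying to extract some of the content and put it into an XML file or an Excel spreadsheet. All of the pages are very similar in design, and the information sits in the same place on every page. Does anyone know a way to do this?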

2 answers:

Answer 0: (score: 2)

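There are many scraping libraries that can help you extract data from HTML pages.

Web scraping and crawling are not always straightforward, so the right choice depends on what you are trying to achieve. Different products, SDKs, and libraries focus on different aspects of scraping or crawling. Here are a few you can look at: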

Apify - (formerly Apifier) a cloud-based web scraping tool that can extract structured data from any website using a few simple lines of JavaScript.

Diffbot - automatically extracts data from web pages and returns structured JSON.

Espion - a headless browser that lets you inject JavaScript code directly into the target web page.

Also, if you know Node.js, node-osmosis is a really nice and easy-to-use library.

Answer 1: (score: 1)

I highly recommend this library:

http://sourceforge.net/projects/simplehtmldom/

/**
 * Website: http://sourceforge.net/projects/simplehtmldom/
 * Acknowledge: Jose Solorzano (https://sourceforge.net/projects/php-html/)
 * Contributions by:
 *     Yousuke Kumakura (Attribute filters)
 *     Vadim Voituk (Negative indexes supports of "find" method)
 *     Antcs (Constructor with automatically load contents either text or file/url)
 *
 * all affected sections have comments starting with "PaperG"
 *
 * Paperg - Added case insensitive testing of the value of the selector.
 * Paperg - Added tag_start for the starting index of tags - NOTE: This works but not accurately.
 *  This tag_start gets counted AFTER \r\n have been crushed out, and after the remove_noice calls so it will not reflect the REAL position of the tag in the source,
 *  it will almost always be smaller by some amount.
 *  We use this to determine how far into the file the tag in question is.  This "percentage will never be accurate as the $dom->size is the "real" number of bytes the dom was created from.
 *  but for most purposes, it's a really good estimation.
 * Paperg - Added the forceTagsClosed to the dom constructor.  Forcing tags closed is great for malformed html, but it CAN lead to parsing errors.
 * Allow the user to tell us how much they trust the html.
 * Paperg add the text and plaintext to the selectors for the find syntax.  plaintext implies text in the innertext of a node.  text implies that the tag is a text node.
 * This allows for us to find tags based on the text they contain.
 * Create find_ancestor_tag to see if a tag is - at any level - inside of another specific tag.
 * Paperg: added parse_charset so that we know about the character set of the source document.
 *  NOTE:  If the user's system has a routine called get_last_retrieve_url_contents_content_type availalbe, we will assume it's returning the content-type header from the
 *  last transfer or curl_exec, and we will parse that and use it in preference to any other method of charset detection.
 *
 * Found infinite loop in the case of broken html in restore_noise.  Rewrote to protect from that.
 * PaperG (John Schlick) Added get_display_size for "IMG" tags.
 *
 * Licensed under The MIT License
 * Redistributions of files must retain the above copyright notice.
 *
 * @author S.C. Chen <me578022@gmail.com>
 * @author John Schlick
 * @author Rus Carroll
 * @version 1.5 ($Rev: 196 $)
 * @package PlaceLocalInclude
 * @subpackage simple_html_dom
 */
/**
 * All of the Defines for the classes below.
 * @author S.C. Chen <me578022@gmail.com>
 */

Here is an example:

$html = file_get_html($ad_bachecubano_url);

// Capture the ad's heading and body text
$anuncio['header'] = $html->find('.headingText', 0)->plaintext;
$anuncio['body'] = $html->find('.showAdText', 0)->plaintext;

// Each #lineBlock holds a label (.headingText2) and a value (.normalText)
foreach ($html->find('#lineBlock') as $possibleprice) {
    $item = $possibleprice->find('.headingText2', 0)->plaintext;
    if ($item == "Precio:  ") {
        $precio = $possibleprice->find('.normalText', 0)->plaintext;
        // getFinalPrice() is a method of the surrounding class (not shown here)
        $anuncio['price'] = $this->getFinalPrice($precio);
    }
}

// Contact details use the same label/value layout inside #contact
foreach ($html->find('#contact') as $contact) {
    foreach ($contact->find('#lineBlock') as $box) {
        $key = $box->find('.headingText2', 0)->plaintext;
        $value = $box->find('.normalText', 0)->plaintext;
        if ($key == "Nombre:  ") {
            $anuncio['nombre'] = $value;
        }
        if ($key == "Teléfono:  ") {
            $anuncio['phone'] = $value;
        }
    }
}

// scrapeemail() is a helper defined elsewhere (not shown here);
// fall back to an empty string when no address is found in the body
$emails = scrapeemail($anuncio['body']);
$anuncio['email'] = isset($emails[0][0]) ? $emails[0][0] : "";
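
For the case described in the question (many local HTML files that share one layout, with the results going into a spreadsheet), the same idea can be applied in a loop over the files. Below is a minimal sketch in Python using only the standard library. It is an illustration under assumptions, not a definitive implementation: the `FIELDS` mapping, the function names, and the `*.html` file pattern are all hypothetical, the class names are borrowed from the example above, and the depth tracking assumes reasonably well-nested markup. The output is a CSV file, which Excel can open directly.

```python
import csv
from html.parser import HTMLParser
from pathlib import Path

# Hypothetical mapping: output column -> CSS class of the element holding the value.
FIELDS = {"title": "headingText", "body": "showAdText"}

# Void elements have no closing tag, so they must not change the nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ClassTextExtractor(HTMLParser):
    """Collects the text of the first element carrying each wanted class."""

    def __init__(self, fields):
        super().__init__(convert_charrefs=True)
        self.fields = fields
        self.found = {}
        self._current = None  # column currently being captured, if any
        self._depth = 0       # nesting depth inside the captured element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            if tag == "br" and self._current is not None:
                self._chunks.append(" ")  # treat a line break as whitespace
            return
        if self._current is not None:
            self._depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        for column, css_class in self.fields.items():
            if css_class in classes and column not in self.found:
                self._current, self._depth, self._chunks = column, 1, []
                return

    def handle_endtag(self, tag):
        if self._current is None or tag in VOID_TAGS:
            return
        self._depth -= 1
        if self._depth == 0:
            # Collapse runs of whitespace into single spaces.
            self.found[self._current] = " ".join("".join(self._chunks).split())
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self._chunks.append(data)


def extract_fields(html_text, fields=FIELDS):
    """Return one row (column -> text) for a single HTML document."""
    parser = ClassTextExtractor(fields)
    parser.feed(html_text)
    parser.close()
    return {column: parser.found.get(column, "") for column in fields}


def scrape_directory(html_dir, csv_path, fields=FIELDS):
    """Extract the same fields from every *.html file and write them to a CSV."""
    rows = []
    for path in sorted(Path(html_dir).glob("*.html")):
        row = {"file": path.name}
        row.update(extract_fields(path.read_text(encoding="utf-8", errors="replace"), fields))
        rows.append(row)
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["file", *fields])
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

Calling `scrape_directory("pages", "ads.csv")` would produce one row per file, with an empty cell wherever a class was not found. For badly malformed pages, a dedicated parsing library such as the ones named in the answers is likely to be more robust than this hand-rolled depth counter.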