Get the HTML between two tags

Time: 2014-06-23 07:40:34

Tags: javascript jquery node.js scrape

We are trying to pull some HTML source from an internal forum. To stay independent, we are using Node.js, Express and similar tools.

When I open the page directly, I get the following HTML:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
    <meta http-equiv="content-type" content="text/html; charset=us-ascii" />
    <meta name="description" content="myForum" />
    <meta name="viewport" content="width=320; user-scalable=no" />
    <title>myForum</title>
</head>

<body>
        <table>
            <tr>
                <td align="left" valign="top" width="100%">
                    <center>
                        <h1><img class="banner" src=
                        "./img/myForum.jpg" width="730"
                        height="117" border="0" alt="myForum" /></h1>
                    </center>
                    <hr />

                    <center>
                        [ <a href="answer.php?id=975710">Antworten</a> ]&nbsp;&nbsp;[
                        <a href="index.php">Forum</a> ]&nbsp;&nbsp;[ <a href=
                        "newEntries.php">Neue Beitr&auml;ge</a> ]
                    </center>
                    <hr />

                    <h1>sCHween</h1>geschrieben von&nbsp;<font color=
                    "#FFFFFF">User1</font>&nbsp;&nbsp;am&nbsp;18.06.2014&nbsp;um&nbsp;21:26:15
                    <hr />
                    This is my text! It could contain images and links!
                    <img src="http://images.google.ch/intl/en_ALL/images/srpr/logo11w.png" /><br />
                    <a href="http://www.google.com/">Google</a>
                    <br />
                    <hr />
                    <b>Antworten:</b><br />
                    <a href="thread.php?id=9752">Re:
                    sCHween</a>&nbsp;-&nbsp;<b><font color=
                    "#FFFFFF">User2</font></b>&nbsp;-&nbsp;18.06.2014&nbsp;22:56:27<br />
                    &nbsp;&nbsp;&nbsp;&nbsp;<a href="showentry.php?id=9756">Re:
                    sCHween</a>&nbsp;-&nbsp;<b><font color=
                    "#FFFFFF">User2</font></b>&nbsp;-&nbsp;18.06.2014&nbsp;23:14:44<br />
                    &nbsp;&nbsp;&nbsp;&nbsp;<a href="showentry.php?id=9753">Re:
                    sCHween</a>&nbsp;-&nbsp;<b><font color=
                    "#FFFFFF">User1</font></b>&nbsp;-&nbsp;18.06.2014&nbsp;23:02:21<br />
                    <a href="showentry.php?id=975713">Re:
                    sCHween</a>&nbsp;-&nbsp;<b><font color=
                    "#FFFFFF">User1</font></b>&nbsp;-&nbsp;18.06.2014&nbsp;21:46:13<br />
                    &nbsp;&nbsp;&nbsp;&nbsp;<a href="showentry.php?id=9720">Re:
                    sCHween</a>&nbsp;-&nbsp;<b><font color=
                    "#FFFFFF">User3</font></b>&nbsp;-&nbsp;18.06.2014&nbsp;22:22:25<br />
                    &nbsp;&nbsp;&nbsp;&nbsp;<a href="showentry.php?id=9755">Re:
                    sCHween</a>&nbsp;-&nbsp;<b><font color=
                    "#FFFFFF">User4</font></b>&nbsp;-&nbsp;18.06.2014&nbsp;21:52:51<br />
                    <hr />
                    <span>
                        <a href="answer.php?id=975">Antworten</a><br />
                        <a href="recent.php">Neue Beitr&auml;ge</a><br />
                    </span>
                    <hr />
                </td>
            </tr>
        </table>
</body>
</html>

What we want is the HTML source of whatever sits between two of the hr tags:

This is my text! It could contain images and links!
<img src="http://images.google.ch/intl/en_ALL/images/srpr/logo11w.png" /><br />
<a href="http://www.google.com/">Google</a>

Is there a simple way to get the source between two hr tags, or what is the cleanest way to extract this content?

2 answers:

Answer 0 (score: 0)

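jsdom is an excellent tool for DOM parsing in Node. Because you want both text nodes and regular elements converted to strings, we have to distinguish between the two: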

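var jsdom = require("jsdom");

// jsdom.env is the callback-style API of older jsdom releases.
jsdom.env(
  'http://example.com',
  ['http://code.jquery.com/jquery.js'],
  function (errors, window) {
    if (errors) {
        console.error(errors);
        return;
    }

    // The post body sits between the third and fourth <hr>
    // (zero-based indexes 2 and 3).
    var $hr = window.$('hr'),
        node = $hr.get(2).nextSibling,
        endNode = $hr.get(3),
        html = '';

    while (node && node !== endNode) {
        if (node.nodeType === 3) {
            // Text nodes have no outerHTML, so take their text instead.
            html += node.textContent;
        } else {
            html += node.outerHTML;
        }
        node = node.nextSibling;
    }

    // html now holds the markup between the two <hr> elements.
  }
);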

Now html has the following value:

This is my text! It could contain images and links!
<img src="http://images.google.ch/intl/en_ALL/images/srpr/logo11w.png"><br>
<a href="http://www.google.com/">Google</a>
<br>
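
The sibling walk above can be pulled out into a small reusable function. The following is only a sketch: `htmlBetween` is a hypothetical helper name, and the plain objects below merely stand in for DOM nodes so the snippet runs without jsdom.

```javascript
// Hypothetical helper: serialise every sibling strictly between two nodes.
function htmlBetween(startNode, endNode) {
  var parts = [];
  var node = startNode.nextSibling;
  while (node && node !== endNode) {
    if (node.nodeType === 3) {
      // Text node: it has no outerHTML, so use its text.
      parts.push(node.textContent);
    } else if (node.nodeType === 1) {
      // Element node: outerHTML includes the tag itself.
      parts.push(node.outerHTML);
    }
    // Any other node type (for example a comment) is skipped.
    node = node.nextSibling;
  }
  return parts.join('');
}

// Stand-in nodes (assumption: only the properties the helper reads).
var hrStart = { nodeType: 1, outerHTML: '<hr>' };
var text = { nodeType: 3, textContent: 'This is my text! ' };
var img = { nodeType: 1, outerHTML: '<img src="logo.png">' };
var hrEnd = { nodeType: 1, outerHTML: '<hr>' };
hrStart.nextSibling = text;
text.nextSibling = img;
img.nextSibling = hrEnd;
hrEnd.nextSibling = null;

var demo = htmlBetween(hrStart, hrEnd);
console.log(demo);
```

Inside the jsdom callback this would be called as `htmlBetween($hr.get(2), $hr.get(3))`. Note that `textContent` returns decoded text, so an entity such as `&nbsp;` in the source comes back as the character itself rather than as the original markup.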

Answer 1 (score: 0)

Not sure if this is what you want:

jQuery:

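var allContent = $("td").contents(); // child nodes of the cell, text nodes included
var hrCount = 0;
var addContent = false;
var result = "";
allContent.each(function () {
    if ($(this).prop('tagName') === "HR") {
        hrCount++;
        // Start collecting after the third <hr> and stop at the fourth.
        if (hrCount === 3) {
            addContent = true;
        }
        if (hrCount === 4) {
            addContent = false;
        }
    } else if (addContent) {
        if (this.nodeType === 1) {
            // Element node: keep its full markup.
            result += this.outerHTML;
        } else {
            // Text node: it has no outerHTML, so keep its text.
            result += $(this).text();
        }
    }
});

alert(result);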

There must be a cleaner solution...
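
If the page is already available as a string (for example the body of an HTTP response), one shorter route is to skip the DOM and split the raw markup on the hr tags. This is a different technique from the node walk, shown only as a sketch: `sectionBetweenHr` and the sample string are made up for illustration, and it assumes `<hr` never appears inside comments, scripts or attribute values, where a regular expression would be fooled.

```javascript
// Hypothetical helper: return the raw markup that follows the n-th <hr>
// (1-based) and ends at the next <hr>, or null if there is no such section.
function sectionBetweenHr(source, n) {
  // Matches <hr>, <hr/>, <hr /> and <hr ...attributes>, case-insensitively.
  var sections = source.split(/<hr\b[^>]*>/i);
  // sections[0] is everything before the first <hr>,
  // so sections[n] is what follows the n-th one.
  if (n < 1 || n >= sections.length) {
    return null;
  }
  return sections[n].trim();
}

// Shortened stand-in for the forum page (assumption: same <hr /> layout).
var sample =
  'banner<hr />links<hr />heading<hr />\n' +
  '  This is my text!\n' +
  '  <img src="logo.png" /><br />\n' +
  '<hr />replies';

var section = sectionBetweenHr(sample, 3);
console.log(section);
```

Unlike the DOM-based versions, this keeps the markup exactly as the server sent it, including `/>` on void elements and any entities, which may be closer to the "HTML source" asked for.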