Cleaning up HTML

Time: 2011-12-06 12:56:40

Tags: php html codeigniter

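I need to clean up a large amount of HTML (containing <table>, <div>, <p>, <span id='something'>) into more standardized HTML, without the styling of the original and without any <table>, <div>, or <span>. One way to clean it would be to remove all of the original HTML tags except <ul>, <li>, <ol>, <br>, and <strong>, while <td> and <tr> would be replaced with <p>. The remaining tags would be stripped of classes, IDs, and other attributes.

How can I do this? Currently I am using strip_tags() to remove all tags, but that squeezes everything that is left into a single line, which makes it hard to read.

Sample HTML to clean up: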

<table cellpadding="0" cellspacing="0" width="100%">
<tr><td align="center">
<table cellpadding="0" cellspacing="0" width="850">
<tr valign="top">
<td width="50%" style="padding-left:30px;">
<img border="0" src="http://www.newpads.info/l/AC-000-078.gif"><br>2000 Massachusetts Ave., Cambridge, MA 02140<br>Phone: (617) 498-0011 - Fax: (617) 498-0044<br><a href="http://www.windsorrealty.net" rel="nofollow">http://www.windsorrealty.net</a></td>
<td width="35%" style="border-left: 1px solid gray; padding-left: 15px">
<div><span style="font-weight:bold;">Sugandha Singh</span></div>
<div style="padding:10px;">
<div style="padding:2px;"><img src="http://www.newpads.info/img/phone.gif"> 781 985 4489</div>
<div style="padding:2px;"><img src="http://www.newpads.info/img/email2.gif"> ssinghrealty@gmail.com</div>
<div><img src="http://www.newpads.info/img/question.gif"> <a href="http://ag006436.speedhatch.com/rentals/CAM-058-197/inquiry" rel="nofollow"><font size="3">Ask Me A Question</font></a></div><div><img src="http://www.newpads.info/img/magnet.png">  <a href="http://ag006436.speedhatch.com" rel="nofollow"><font size="3">Search My Apartments</font></a></div></div>
</td>
</tr>
</table>
<br><table width="850">
<tr><td colspan="2" height="2" bgcolor="#275c7d"></td></tr><tr><tr><td colspan="2"><div style="font-weight:bold;"><font size="3">HARVARD LAW / SQUARE. HEAT+HOTWATER INCL. JAN 1. 1/2 FEE</font></div></td></tr></tr><tr valign="top"><td><img src="http://maps.google.com/maps/api/staticmap?center=42.38047,-71.121008&amp;path=weight:4|42.37847,-71.118008|42.37847,-71.124008|42.38247,-71.124008|42.38247,-71.118008|42.37847,-71.118008&amp;zoom=15&amp;size=335x225&amp;sensor=false" style="width:275px;"></td><td><font size="2"><table style="width:100%;height:100%;"><tr valign="top"><td width="50%"><table cellpadding="3" style="width:100%;"><tr><td colspan="2" style="font-weight:bold;">Basic Info</td></tr><tr><td style="width:45%;">Referral ID:</td><td>CAM-058-197</td></tr><tr><td>Beds: 1</td><td>Baths: 1</td></tr><tr><td>Rent:</td><td>$1800</td></tr><tr><td>Broker Fee:</td><td>Half Month</td></tr><tr><td>Date Avail:</td><td>January 1st</td></tr><tr><td>Rent Includes:</td><td>Heat, Hot Water</td></tr><tr><td>Pet Policy:</td><td>Cat Ok</td></tr><tr><td colspan="2">on Langdon St., Cambridge - Harvard Square</td></tr></table></td><td width="50%"><table cellpadding="5" style="width:100%;"><tr><td colspan="2" style="font-weight:bold;">Apartment Features</td></tr><tr><td width="50%">- Gas Range</td><td width="50%">- HT&HW</td></tr><tr><td width="50%">- Modern Bath</td><td width="50%">- Modern Kitchen</td></tr><tr><td width="50%">- Storage - Basement</td><td width="50%"></td></tr></table></td></tr><tr><td colspan="2"></td></tr></table></font></td></tr><tr><td colspan="3"><table width="100%" border="0" cellspacing="0" cellpadding="3"><tr><td colspan="2" align="center"><b>Transportation options</b></td></tr><tr><td width="50%"><div><div><div style="text-align:center;text-decoration:underline;">Subway Lines and Stops</div><ul><li>RED - Harvard Square (11 min)</li></ul></td><td width="50%"><div style="text-align:center;text-decoration:underline;">Bus Routes and Stops</div><ul><li>74 - 
Waterhouse St & Massachusetts Ave (5 min)</li><li>72 - Waterhouse St & Massachusetts Ave (5 min)</li><li>77 - Massachusetts Ave & Waterhouse St (5 min)</li><li>75 - Waterhouse St & Massachusetts Ave (5 min)</li><li>71 - Waterhouse St & Massachusetts Ave (5 min)</li><li>And More...</li></ul></div></div></td></tr></table></td></tr><tr><td colspan="3"><div><b><font size="2">Apartment Description:</font></b></div><div style="padding:5px;"><font size="2">Recent Renovations. Great Location. Easy Walk to Harvard Law School or Harvard Square.<br>All Hardwood Floors, Kitchen w/Dining Area, Good Closet Space, Laundry Facilities.<br>(pics. of similar unit in the bldg)<br>HEAT and HOT WATER is INCLUDED in the RENT!<br>Available January 1.</font></div><br></td></tr><tr><td colspan="3"><div><strong>Similar Properties</strong></div><div>1 Bd on Huron Ave., $1835, NO FEE, Include Util., Avail Now</div><div>1 Bd on Huron Ave., $1810, Include Util., NO FEE, Avail Now</div></td></tr><tr><td colspan="2" height="2" bgcolor="#275c7d"></td></tr></table><br><table width="850" cellpadding="0" cellspacing="0" border="0"><tr><td style="text-align:center;width:50%;"><img src="http://www.newpads.info/p/2373415.jpg" width="400" border="0"></td><td style="text-align:center;width:50%;"><img src="http://www.newpads.info/p/2373416.jpg" width="400" border="0"></td></tr><tr><tr><td height="10"></td></tr><td style="text-align:center;width:50%;"><img src="http://www.newpads.info/p/2373417.jpg" width="400" border="0"></td><td style="text-align:center;width:50%;"><img src="http://www.newpads.info/p/2373418.jpg" width="400" border="0"></td></tr><tr><tr><td height="10"></td></tr><td style="text-align:center;width:50%;"><img src="http://www.newpads.info/p/2373419.jpg" width="400" border="0"></td><td style="text-align:center;width:50%;"><img src="http://www.newpads.info/p/2373420.jpg" width="400" border="0"></td></tr><tr><tr><td height="10"></td></tr></tr></table><table width="100%"><tr><td 
height="20"></td></tr><tr><td align="center"><font size="4">Contact <strong>Sugandha Singh</strong> at 781 985 4489 or ssinghrealty@gmail.com.</font></td></tr></table><table width="100%" cellspacing="0" cellpadding="0"><tr><td height="25"></td></tr><tr><td align="center"><div style="font-family: Verdana, sans-serif;"><font size="0.6">Equal Housing Opportunity - Windsor Realty is not responsible for any errors or omissions. Terms, conditions and rent are subject to change without prior notice. The information gathered is from third party sources including the owner and public records and is not guaranteed.</font></div></td></tr></table></td></tr></table><img src="http://www.newpads.info/CLAD/904329.gif">

After strip_tags():

2000 Massachusetts Ave., Cambridge, MA 02140Phone: (617) 498-0011 - Fax: (617) 498-0044http://www.windsorrealty.netSugandha SinghPhone: 781 985 4489Email: ssinghrealty@gmail.com Ask Me A Question  Search My Apartments1 Bd on Concord Ave., HT/HW, Avail 01/01$1500 / MonthApartment DetailsApartment FeaturesReferral ID:WES-008-561Available:January 1stRent:$1500Bed(s):1Bath(s):1Rent Includes:Heat, Hot WaterFee:One MonthSubway Lines and StopsRED - Harvard Square (13 min)Bus Routes and Stops75 - Garden St Opp Mason St (7 min)74 - Garden St Opp Mason St (7 min)72 - Garden St Opp Mason St (7 min)78 - Concord Ave & Huron Ave (7 min)77 - Massachusetts Ave & Waterhouse St (8 min)And More...Contact Sugandha Singh at 781 985 4489 or ssinghrealty@gmail.com.Equal Housing Opportunity - Windsor Realty is not responsible for any errors or omissions. Terms, conditions and rent are subject to change without prior notice. The information gathered is from third party sources including the owner and public records and is not guaranteed.

Note: I am using CodeIgniter, in case it has any parsing features that would help.

4 answers:

Answer 0 (score: 1)

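// Parse the HTML and select every element in document order.
$doc = new DOMDocument();
$doc->loadHTML(...); // placeholder: pass the HTML string to clean
$xpath = new DOMXPath($doc);
$nodes = $xpath->query("//*");

// Collect the text of the tags that should be kept.
$rtn = array();
foreach ($nodes as $node)
{
    switch ($node->nodeName)
    {
        case "ul":
        case "li":
        case "ol":
        case "br":
        case "strong":
            // nodeValue is the element's text content, without markup
            $rtn[] = $node->nodeValue;
            break;
    }
}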
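
To see what the loop above collects, here is a small run on a made-up snippet (the input string and the $wanted list are assumptions for illustration, not part of the answer). It is only a sketch of the same idea, using in_array() in place of the switch.

```php
<?php
$doc = new DOMDocument();
// Hypothetical input, small enough to check by hand.
$doc->loadHTML('<div><ul><li>A</li><li>B</li></ul><p>skipped</p></div>');
$xpath = new DOMXPath($doc);

// Same tag list as the switch statement in the answer.
$wanted = array('ul', 'li', 'ol', 'br', 'strong');

$rtn = array();
foreach ($xpath->query('//*') as $node) {
    if (in_array($node->nodeName, $wanted, true)) {
        // Text content only: the tags themselves are not part of nodeValue.
        $rtn[] = $node->nodeValue;
    }
}

print_r($rtn);
```

The result holds three strings: "AB" for the <ul>, then "A" and "B" for its <li> children. Nested matches therefore repeat text, and the markup is lost, so this approach returns plain text fragments rather than cleaned HTML.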

Answer 1 (score: 1)

You can tell strip_tags() which tags to keep. That leaves only the problem of replacing the <td> and <tr> elements with <p> elements.
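
As a minimal sketch of the first point, the second argument of strip_tags() takes the list of tags to keep. The input string here is invented for illustration.

```php
<?php
// Hypothetical input; in practice this would be the full listing HTML.
$html = '<div class="x"><ul><li>RED - Harvard Square</li></ul>'
      . '<strong id="s">Similar</strong><br>Text</div>';

// Keep only the tags named in the question. Other tags are removed,
// but the text inside them stays.
$kept = strip_tags($html, '<ul><li><ol><br><strong>');

echo $kept, "\n";
```

Note that strip_tags() leaves the attributes of the kept tags untouched, so the id on <strong> survives here. Removing classes, IDs, and other attributes needs a separate step.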

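Replacing the <td> and <tr> elements can be done with the DOMDocument class: run an XPath query to select the elements, then replace each one with a <p> element that takes over the original element's children.

Some related code can be found in a previous answer (Question: Extract all the text and img tags from HTML in PHP.), and for moving the children, another answer (Question: How do you remove duplicate, nested DOM elements in PHP?) shows how.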
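
The following is one possible sketch of that replacement, not the code from the linked answers. The helper name replaceWithTag() and the sample input are hypothetical. It replaces only <td> to avoid nesting <p> inside <p>, though the same helper could be applied to <tr>. It then drops all attributes and lets strip_tags() remove the remaining table markup.

```php
<?php
// Hypothetical helper: swap $old for a new <$name> element and carry its children over.
function replaceWithTag(DOMElement $old, string $name): DOMElement
{
    $new = $old->ownerDocument->createElement($name);
    // Appending a node that is already in the tree moves it, so this loop drains $old.
    while ($old->firstChild !== null) {
        $new->appendChild($old->firstChild);
    }
    $old->parentNode->replaceChild($new, $old);
    return $new;
}

// Made-up input modelled on the "Basic Info" table in the question.
$html = '<table><tr><td width="50%">Rent:</td><td>$1800</td></tr></table>';

$doc = new DOMDocument();
// libxml warns on real-world markup; collect the warnings instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Copy the matches into an array before changing the tree.
foreach (iterator_to_array($xpath->query('//td')) as $cell) {
    replaceWithTag($cell, 'p');
}

// Drop every attribute (class, id, style, ...) from the elements that remain.
foreach ($xpath->query('//*') as $el) {
    while ($el->attributes->length > 0) {
        $el->removeAttributeNode($el->attributes->item(0));
    }
}

// Serialize only the children of <body>, then strip whatever is not on the keep list.
$body = $doc->getElementsByTagName('body')->item(0);
$out = '';
foreach ($body->childNodes as $child) {
    $out .= $doc->saveHTML($child);
}
$clean = strip_tags($out, '<p><ul><li><ol><br><strong>');

echo $clean, "\n";
```

Because each cell becomes its own <p>, the text no longer collapses into a single line after strip_tags() runs.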

Answer 2 (score: 0)

In this case, phpQuery is a good choice. Highly recommended.

Answer 3 (score: 0)