Scraping and divs

Time: 2012-03-24 19:47:58

Tags: html screen-scraping

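I'm new to PHP and I'm trying to scrape data from a website using regular expressions, but I'm having trouble extracting the rental and details content inside the divs. Here is my code. Can someone help me?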

    <?php
    header('content-type: text/plain');
    $contents = file_get_contents('http://www.hassconsult.co.ke/index.php?option=com_content&view=article&id=22&Itemid=29');
    // Collapse whitespace runs to one space; keeping a space preserves the '<span class="style8"' delimiter used below.
    $contents = preg_replace('/\s{1,}/', ' ', $contents);
    $contents = preg_replace('/&nbsp;/', '', $contents);

    //print $contents."\n";
    $records = preg_split('/<span class="style8"/', $contents);

    for ($ix = 1; $ix < count($records); $ix++) {
        $tmp = $records[$ix];

        preg_match('/href="(.*?)"/', $tmp, $match_url);
        preg_match('/>(.*?)<\/span>/', $tmp, $match_name);
        preg_match('/<div[^>]+class ?= ?"style10"[^>]*>(\s*(<div.*(?2).*<\/div>\s*)*)<\/div>/Us', $tmp, $match_rental); //error is here
        print_r($match_url);
        print_r($match_name);
        print_r($match_rental);
        print $tmp."\n";
        exit();
    }
    //print count($records)."\n";
    //print_r($records);
    //if ($contents===false)
    //print 'FALSE';
    //print_r(htmlentities($contents));

    ?>

Here is a sample of the content:

    >HILLVIEW CROSSROADS4 BED HOUSE</span></div></td>
                </tr>
                <tr>
                  <td width="57%" style="padding-left:20px;"><div align="left" class="style10" style="color:#007AC7;">
                      <div align="left">
                                            Rental; 
                        USD                     4,500 
                        </div>
                  </div></td>
                  <td width="43%" align="right"><div align="right" class="style10" style="color:#007AC7;">
                      <div align="right">

                      No.             
                      834 

                       </div>
                  </div></td>
                </tr>
                <tr>
                  <td colspan="2" style="padding-left:20px;color:#000000;">
                  <div align="justify" style="font-family:Arial, Helvetica, sans-serif;font-size:11px;color:333300;">Artistically designed 4 bed (all
    ensuite) house on half acre of well-tended gardens. Lounge with fireplace opening to terrace, opulent master suite, family room, study. Good finishes, SQ, carport, extra water storage
    and generator.                                <a href="/index.php?option=com_content&amp;view=article&amp;id=27&amp;Itemid=74&amp;send=5&amp;ref_no=834/II&amp;t=2">....Details</a>               </div></td>
                </tr>
            </table></td>
          </tr>
    </table>
    <br>

1 answer:

Answer 0 (score: 2)

That site doesn't have good CSS selectors, but it's still not hard to get the data with XPath:

    $dom = new DOMDocument();
    @$dom->loadHTMLFile('http://www.hassconsult.co.ke/index.php?option=com_content&view=article&id=22&Itemid=29');
    $xpath = new DOMXPath($dom);

    foreach ($xpath->query("//div[@id='ad']/table") as $table) {
        // title
        echo $xpath->query(".//span[@class='style8']", $table)->item(0)->nodeValue . "\n";
        // price
        echo $xpath->query(".//div[@class='style10']/div", $table)->item(0)->nodeValue . "\n";
        // description
        echo $xpath->query(".//div[@align='justify']", $table)->item(0)->nodeValue . "\n";
    }
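
To try the same idea without hitting the live site, here is a minimal sketch that runs against an inline sample. The sample markup is a trimmed-down version of the snippet in the question, and the wrapping `<div id="ad">` is an assumption taken from the XPath above rather than something visible in the question's excerpt. `extractListings` is a hypothetical helper name, and the type declarations assume PHP 7 or later. It uses `DOMXPath::evaluate` with the XPath `normalize-space()` function, so the runs of spaces and newlines inside the price div come back collapsed, and it also pulls the reference number and the "Details" link that the question asked about.

```php
<?php
// Hypothetical sample, trimmed down from the markup shown in the question.
// The <div id="ad"> wrapper is assumed from the XPath used in the answer.
$html = <<<'HTML'
<div id="ad">
<table>
  <tr>
    <td><div><span class="style8">HILLVIEW CROSSROADS4 BED HOUSE</span></div></td>
  </tr>
  <tr>
    <td><div align="left" class="style10">
        <div align="left">
            Rental;
            USD     4,500
        </div>
    </div></td>
    <td><div align="right" class="style10">
        <div align="right">
            No.
            834
        </div>
    </div></td>
  </tr>
  <tr>
    <td colspan="2">
      <div align="justify">Artistically designed 4 bed house.
        <a href="/index.php?option=com_content&amp;view=article&amp;id=27&amp;ref_no=834/II">....Details</a>
      </div>
    </td>
  </tr>
</table>
</div>
HTML;

// Hypothetical helper: returns one associative array per listing table.
function extractListings(string $html): array
{
    $dom = new DOMDocument();
    // Collect libxml parse warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $listings = [];
    foreach ($xpath->query("//div[@id='ad']/table") as $table) {
        $listings[] = [
            // normalize-space() trims and collapses whitespace in the first matching node
            'title'   => $xpath->evaluate("normalize-space(.//span[@class='style8'])", $table),
            // first style10 div in the table holds the rental, the second the reference number
            'rental'  => $xpath->evaluate("normalize-space((.//div[@class='style10'])[1]/div)", $table),
            'ref'     => $xpath->evaluate("normalize-space((.//div[@class='style10'])[2]/div)", $table),
            // href of the "Details" link inside the description div
            'details' => $xpath->evaluate("string(.//div[@align='justify']/a/@href)", $table),
        ];
    }
    return $listings;
}

print_r(extractListings($html));
```

If one of the nodes is missing, `evaluate` with a string function returns an empty string. The `item(0)->nodeValue` form fails in that case, which may be more forgiving when some listings lack a field.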