Unable to extract the href value from an anchor tag

Posted: 2013-09-12 14:37:21

Tags: php regex

I'm trying to get the href from this HTML:

<a class="list-item clearfix" href="/en/rolex/submariner-date--id2334149.htm" id="watch-2334149" style="background-color: rgb(255, 255, 255);">

      <span onclick="_gaq.push(['first._trackEvent','Click','search','watch-image-click']);_gaq.push(['second._trackEvent','Click','search','watch-image-click']);" class="pic ">
        <span style="position:absolute">

          <img width="100" height="100" alt="Rolex Submariner Date" src="" class="photo">
        </span>
      </span>

  <span class="disc">
    <span onclick="_gaq.push(['first._trackEvent','Click','search','watch-headline-click']);_gaq.push(['second._trackEvent','Click','search','watch-headline-click']);" class="watch-headline"><span class="underline">Rolex Submariner Date</span></span>

        <span class="spec">


          <span onmouseover="$('#infobox-title').text('Germany');$('#infobox-text').text('This dealer is from Augsburg, Germany.')" style="width: 21px;" class="flag">

          <img width="16" height="16" alt="" src="http://cdn.chrono24.com/images/flags-icons/DE.png">&nbsp;
            </span>
            <span class="icon i-hasnostore"></span>
                    <span onmouseover="$('#infobox-title').text('Trusted Seller since 2004');$('#infobox-text').text('We have no knowledge about pending/unsolved disputes or complaints about this seller.')" class="icon i-trusted"></span>

                        <span onmouseover="$('#infobox-title').text('Retailer recommendations');$('#infobox-text').text('This watch retailer is recommended on Chrono24 by 1 other watch retailers.')" class="i-buddies">
                          <span class="icon buddie-count">1</span>
                          <span class="icon i-star-blue"></span>
                        </span>


              <span onmouseover="$('#infobox-title').text('Trusted Seller since 2004');$('#infobox-text').text('We have no knowledge about pending/unsolved disputes or complaints about this seller.')" class="trustedseller">
                    <script type="text/javascript">
                        // &lt;![CDATA[
                        document.write('Trusted Seller since 2004');
                        // ]]&gt;
                    </script>Trusted Seller since 2004
                  </span>    


                  <span style="width: 2px;" class="icon"></span>
                  <span onmouseover="$('#infobox-title').text('Premium Seller');$('#infobox-text').text('The Chrono24 Premium Seller Package is only available for Trusted Sellers who frequently use Chrono24.')" class="icon i-premium"></span>
                <span onmouseover="$('#infobox-title').text('Premium Seller');$('#infobox-text').text('The Chrono24 Premium Seller Package is only available for Trusted Sellers who frequently use Chrono24.')" class="premiumseller">Premium</span>

            </span>
            <span onclick="_gaq.push(['first._trackEvent','Click','search','watch-desc-click']);_gaq.push(['second._trackEvent','Click','search','watch-desc-click']);" class="description">
              Ref. No. 116610 LN; Steel; Automatic; Condition 0 (unworn); Year 2013; With Box; With Papers; Location: Germany, Augsburg; The current, the manufacturer's recommended retail price is 6800 Euro
            </span>


              <span class="availability">Availability: Available immediately</span>



  </span>
  <span class="pricebox">
    <span onclick="_gaq.push(['first._trackEvent','Click','search','watch-price-click']);_gaq.push(['second._trackEvent','Click','search','watch-price-click']);" class="amount price"><span class="large">$&nbsp;7,961</span>
    </span>

    <span class="buttonbox">
      <span onclick="_gaq.push(['first._trackEvent','Click','search','watch-button-click']);_gaq.push(['second._trackEvent','Click','search','watch-button-click']);" class="button-blue">
         <span>
          Watch details
         </span>
      </span>
    </span>


  </span>             

</a>
preg_match_all('#<a href="(.+)">#',$html,$urlarr);

This does not return the href value at all, and I can't see what is wrong with it.

4 answers:

Answer 0 (score: 2)

Don't use Regular Expressions on HTML; HTML is not regular

You should look at SimpleXML and XPath, which are a much better fit for this job: http://php.net/manual/en/simplexmlelement.xpath.php

For example:

$xml   = new SimpleXMLElement($html);

// Select all "a" tags with href attributes
$links = $xml->xpath("//a[@href]");
// You probably want the first one
$href = (string) $links[0]["href"];
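
One caveat: SimpleXMLElement expects well-formed XML, and the markup in the question (an unclosed <img> tag, the &nbsp; entity) is likely to be rejected by it. A minimal sketch of one way around that, assuming the DOM extension is available, is to parse with DOMDocument first and hand the tree to SimpleXML. The sample string below is a hypothetical, shortened version of the question's anchor, not the full page.

```php
<?php
// Hypothetical sample input: a trimmed-down version of the anchor from the question.
$html = '<a class="list-item clearfix" href="/en/rolex/submariner-date--id2334149.htm" id="watch-2334149">'
      . '<img width="100" height="100" alt="Rolex Submariner Date" src="">&nbsp;Rolex Submariner Date</a>';

// Collect libxml parse warnings internally instead of emitting them.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);            // tolerant HTML parser; wraps the fragment in <html><body>
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Hand the parsed tree over to SimpleXML so the XPath call from the answer still applies.
$xml = simplexml_import_dom($dom);

// Select all "a" elements that carry an href attribute and take the first one, if any.
$links = $xml->xpath('//a[@href]');
$href  = $links ? (string) $links[0]['href'] : null;

echo $href, PHP_EOL;
```

The cast to string matters because the array access on a SimpleXMLElement returns another SimpleXMLElement, not a plain string.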

Answer 1 (score: 1)

Rather than a regexp, you should use DOMDocument:

$dom = new DOMDocument();
$dom->loadHTML($html);

// Collect the href attribute of every <a> element in the document
$links = array();
foreach ($dom->getElementsByTagName("a") as $anchor) {
    $links[] = $anchor->getAttribute("href");
}

Answer 2 (score: 1)

Any of the DOM-based approaches should work. If you still want to use a regular expression, you can try this:

preg_match_all('~<a (?>[^>h]++|\Bh|h(?!ref\b))*href\s*=\s*["\']?\K[^"\'>\s]++~i', $html, $matches);
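
For context, the pattern in the question only matches when href comes immediately after "<a ", whereas in the posted markup the class attribute comes first. The sketch below runs both patterns against a hypothetical, shortened copy of the question's opening tag to show the difference; it is an illustration on one sample, not a general guarantee for arbitrary HTML.

```php
<?php
// Hypothetical sample: the opening tag from the question, shortened.
$html = '<a class="list-item clearfix" href="/en/rolex/submariner-date--id2334149.htm" id="watch-2334149">Rolex</a>';

// Pattern from the question: requires href to directly follow "<a ".
$original = preg_match_all('#<a href="(.+)">#', $html, $urlarr);

// Pattern from this answer: skips over other attributes before href,
// then uses \K so that only the attribute value is reported as the match.
$revised = preg_match_all(
    '~<a (?>[^>h]++|\Bh|h(?!ref\b))*href\s*=\s*["\']?\K[^"\'>\s]++~i',
    $html,
    $matches
);

echo "original pattern: $original match(es)", PHP_EOL;
echo "revised pattern:  $revised match(es)", PHP_EOL;
echo $matches[0][0] ?? '(none)', PHP_EOL;
```

Because of \K, the value ends up in $matches[0] rather than in a capture group.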

If you only want to match the href of tags whose class attribute value is "list-item clearfix", you can do the following:

$pattern = <<<'LOD'
~
(?(DEFINE)
    (?<class> \b class \s* = \s* (["']) list-item \s+ clearfix \g{-1} )
    (?<href_value> [^"'\s>]++ )
    (?<href_start> \b href \s*=\s* ["']? )
    (?<href_end> ['"\s] )
    (?<content> (?> [^>hc]++ | \B[hc] | h(?!ref\b) | c(?!lass\b) )* )

)
    <a \s+
    \g<content>
    (?J)
    (?>
        \g<class> \g<content> \g<href_start> (?<href> \g<href_value> )
      |
        \g<href_start> (?<href> \g<href_value> ) \g<href_end> \g<content> \g<class>
    )
~xi
LOD;

preg_match_all($pattern, $html, $matches, PREG_SET_ORDER); 

foreach($matches as $match) {
    echo '<br>' . $match['href'];
}

Keep in mind that this is much easier with XPath:

$doc = new DOMDocument();
@$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$hrefs = $xpath->query("//a[@class='list-item clearfix']/@href");
foreach($hrefs as $href) {
    print_r($href->nodeValue);
}
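
The query above compares the class attribute as an exact string, so it would miss an anchor whose classes appear in a different order or alongside extra classes. A hedged variant, assuming that matching on individual class tokens is what is wanted, uses the common XPath 1.0 idiom with concat and normalize-space. The sample markup and its second and third URLs are hypothetical, and libxml_use_internal_errors is used here in place of the @ operator.

```php
<?php
// Hypothetical sample: three anchors, two of which carry both target classes.
$html = '<div>'
      . '<a class="list-item clearfix" href="/en/rolex/submariner-date--id2334149.htm">Rolex</a>'
      . '<a class="clearfix list-item featured" href="/en/other.htm">Other</a>'
      . '<a class="nav" href="/en/about.htm">About</a>'
      . '</div>';

// Keep libxml warnings out of the output without silencing unrelated errors.
$previous = libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Build a predicate that tests for one whitespace-separated class token.
$hasClass = function (string $name): string {
    return "contains(concat(' ', normalize-space(@class), ' '), ' $name ')";
};

$xpath = new DOMXPath($doc);
$query = '//a[' . $hasClass('list-item') . '][' . $hasClass('clearfix') . ']/@href';

$hrefs = array();
foreach ($xpath->query($query) as $attr) {
    $hrefs[] = $attr->nodeValue;
}

print_r($hrefs);
```

Padding the attribute with spaces on both sides keeps a token such as "list-item" from matching inside a longer class name like "list-item-wide".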

Answer 3 (score: 0)

Parsing HTML with regular expressions is a bad idea (at least in this case). Use a DOM parser such as SimpleHTMLDOM for this purpose.

It's quite easy:

$html = str_get_html('...');
foreach($html->find('a') as $element) 
    echo $element->href;

Alternatively, you can load the HTML from a file:

$html = file_get_html('...');
foreach($html->find('a') as $element) 
    echo $element->href;

The same can be done with the built-in DOM extension:

$dom = new DOMDocument();
$dom->loadHTML($html);

// grab all the links on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a"); //all <a> tags
$urlArray = array();

for ($i = 0; $i < $hrefs->length; $i++) {
       $href = $hrefs->item($i);
       $urlArray[] = $href->getAttribute('href');
}
