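PHP - Check whether a URL is valid

Time: 2021-07-03 05:14:16

Tags: php url curl status

I am checking URLs and want to return "valid" if the URL's status code is 200 and "invalid" if it is 404.

Each URL is a link that redirects to some other page (URL), and I need to check the status of that final page to decide, based on its status code, whether the link is valid.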

<?php

// From URL to get redirected URL
$url = 'https://www.shareasale.com/m-pr.cfm?merchantID=83483&userID=1860618&productID=916465625';
  
// Initialize a CURL session.
$ch = curl_init();
  
// Set the URL to request.
curl_setopt($ch, CURLOPT_URL, $url);

// Return the response as a string instead of printing it.
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);

// Follow HTTP redirects (Location headers).
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
$html = curl_exec($ch);

// Get the effective (final) URL after following redirects.
$redirectedUrl = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);

// Close handle
curl_close($ch);
echo "Original URL:   " . $url . "<br/><br/>";
echo "Redirected URL: " . $redirectedUrl . "<br/>";

 function is_url_valid($url) {
  $handle = curl_init($url);
  curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
  curl_setopt($handle, CURLOPT_NOBODY, true);
  curl_exec($handle);
 
  $httpCode = intval(curl_getinfo($handle, CURLINFO_HTTP_CODE));
  curl_close($handle);
 
  if ($httpCode == 200) {
    return 'valid link';
  }
  else {
    return 'invalid link';
  }
}

// Check the status of the redirected URL.
echo "<br/>".is_url_valid($redirectedUrl)."<br/>";

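As you can see, the link above has status 400 but is still reported as "valid". I am using the code above. Any ideas or corrections to make it work as expected? It seems the site goes through more than one redirect URL and the script only checks one of them, which is why it shows valid. Any ideas on how to fix it?

Here is the link I am checking

Problem -

For example - if I open this link https://www.shareasale.com/m-pr.cfm?merchantID=66802&userID=1860618&productID=1186005518 in the browser, it ends up at a "404", but in the script output it is "200".

5 answers:

Answer 0: (score: 2)

I use the get_headers() function for this. If I find a 2xx status in the array, the URL is fine.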

function urlExists($url){
  $headers = @get_headers($url);
  if($headers === false) return false;
  return preg_grep('~^HTTP/\d+\.\d+\s+2\d{2}~',$headers) ? true : false;
}
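
get_headers() follows redirects by default, so the returned array can contain one status line per hop. If only the final response should count, the last status line can be picked out instead of grepping for any 2xx. The following is a minimal sketch of that idea, not part of the answer above: the helper name final_status_from_headers and the sample header array are made up for illustration, and the sample stands in for real get_headers() output so the snippet runs offline.

```php
<?php
// Hypothetical helper: return the status code of the LAST HTTP status line
// in a get_headers()-style array, or null if no status line is present.
function final_status_from_headers(array $headers): ?int
{
    $status = null;
    foreach ($headers as $line) {
        if (is_string($line) && preg_match('~^HTTP/\d+(?:\.\d+)?\s+(\d{3})~', $line, $m)) {
            $status = (int) $m[1]; // status lines of later hops overwrite earlier ones
        }
    }
    return $status;
}

// Made-up sample shaped like get_headers() output for a redirect ending in a 404.
$sample = [
    'HTTP/1.1 302 Found',
    'Location: https://example.org/missing',
    'HTTP/1.1 404 Not Found',
    'Content-Type: text/html',
];

echo final_status_from_headers($sample), PHP_EOL; // prints 404
```

With a live URL the call would look like final_status_from_headers(@get_headers($url) ?: []), which yields null when the request fails.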

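Answer 1: (score: 2)

Here is my take on the problem. Basically, the main points are:

  1. You do not need to make multiple requests. CURLOPT_FOLLOWLOCATION does all the work for you, and if one or more redirects occur, the HTTP response code you get at the end is the one from the final request.
  2. Since you are using CURLOPT_NOBODY, the request uses the HEAD method and returns no body, so CURLOPT_RETURNTRANSFER is useless.
  3. I took the liberty of using my own coding style (no offense intended).
  4. Since I ran the code from a PhpStorm scratch file, I added some PHP_EOL line breaks to format the output. Feel free to remove them.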

...

<?php

$linksToCheck = [
    'https://click.linksynergy.com/link?id=GsILx6E5APM&offerid=547531.5112&type=15&murl=https%3A%2F%2Fwww.peopletree.co.uk%2Fwomen%2Fdresses%2Fanna-checked-dress',
    'https://click.linksynergy.com/link?id=GsILx6E5APM&offerid=330522.2335&type=15&murl=https%3A%2F%2Fwww.wearethought.com%2Fagnetha-black-floral-print-bamboo-dress-midnight-navy%2F%2392%3D1390%26142%3D198',
    'https://click.linksynergy.com/link?id=GsILx6E5APM&offerid=330522.752&type=15&murl=https%3A%2F%2Fwww.wearethought.com%2Fbernice-floral-tunic-dress%2F%2392%3D1273%26142%3D198',
    'https://click.linksynergy.com/link?id=GsILx6E5APM&offerid=330522.6863&type=15&murl=https%3A%2F%2Fwww.wearethought.com%2Fjosefa-smock-shift-dress-in-midnight-navy-hemp%2F%2392%3D1390%26142%3D208',
    'https://www.shareasale.com/m-pr.cfm?merchantID=16570&userID=1860618&productID=546729471',
    'https://www.shareasale.com/m-pr.cfm?merchantID=53661&userID=1860618&productID=680698793',
    'https://www.shareasale.com/m-pr.cfm?merchantID=66802&userID=1860618&productID=1186005518',
    'https://www.shareasale.com/m-pr.cfm?merchantID=83483&userID=1860618&productID=916465625',
];

function isValidUrl($url) {
    echo "Original URL:   " . $url . "<br/>\n";

    $handle = curl_init($url);

    // Follow any redirection.
    curl_setopt($handle, CURLOPT_FOLLOWLOCATION, TRUE);

    // Use a HEAD request and do not return a body.
    curl_setopt($handle, CURLOPT_NOBODY, true);

    // Execute the request.
    curl_exec($handle);

    // Get the effective URL.
    $effectiveUrl = curl_getinfo($handle, CURLINFO_EFFECTIVE_URL);
    echo "Effective URL:   " . $effectiveUrl . "<br/><br/>";

    $httpResponseCode = (int) curl_getinfo($handle, CURLINFO_HTTP_CODE);

    // Close this request.
    curl_close($handle);

    if ($httpResponseCode == 200) {
        return '✅';
    }
    else {
        return '❌';
    }
}

foreach ($linksToCheck as $linkToCheck) {
    echo PHP_EOL . "Result: " . isValidUrl($linkToCheck) . PHP_EOL . PHP_EOL;
}
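
A variation on the same approach, offered as a sketch rather than as the answer's own code: it keeps the HEAD request and redirect following, but adds a redirect cap and timeouts and returns the details as an array. The names check_url and is_success_status, the limits (10 redirects, 5 and 15 seconds), and the choice to accept any 2xx code instead of only 200 are all assumptions made for illustration.

```php
<?php
// Assumption: any 2xx code counts as success (the answer above accepts only 200).
function is_success_status(int $code): bool
{
    return $code >= 200 && $code < 300;
}

// Hypothetical wrapper: one request that follows redirects and reports what happened.
function check_url(string $url): array
{
    $handle = curl_init($url);
    curl_setopt_array($handle, [
        CURLOPT_FOLLOWLOCATION => true, // follow Location: redirects
        CURLOPT_MAXREDIRS      => 10,   // assumed cap on redirect hops
        CURLOPT_NOBODY         => true, // HEAD request, no body is downloaded
        CURLOPT_CONNECTTIMEOUT => 5,    // assumed limits, in seconds
        CURLOPT_TIMEOUT        => 15,
    ]);
    $ok = curl_exec($handle);
    $status = (int) curl_getinfo($handle, CURLINFO_HTTP_CODE);

    return [
        'valid'         => $ok !== false && is_success_status($status),
        'status'        => $status, // 0 when no HTTP response was received
        'effective_url' => curl_getinfo($handle, CURLINFO_EFFECTIVE_URL),
        'redirects'     => curl_getinfo($handle, CURLINFO_REDIRECT_COUNT),
        'error'         => $ok === false ? curl_error($handle) : null,
    ];
}
```

Calling print_r(check_url($linkToCheck)) inside the loop above would show the final status next to the number of HTTP redirects that were followed. Like the original, it only sees HTTP-level redirects.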

Answer 2: (score: 2)

Note: we use CURLOPT_NOBODY so that the request only checks the connection instead of fetching the whole body.

  $url = "Your URL";
  $curl = curl_init($url);
  // HEAD request: check the response without downloading the body.
  curl_setopt($curl, CURLOPT_NOBODY, true);
  $result = curl_exec($curl);
  if ($result !== false) {
    $statusCode = curl_getinfo($curl, CURLINFO_HTTP_CODE);
    if ($statusCode == 404) {
      echo "URL does not exist";
    } else {
      echo "URL exists";
    }
  } else {
    // The request itself failed (DNS error, refused connection, timeout, ...).
    echo "URL does not exist";
  }

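Answer 3: (score: 1)

The code below works fine, but when I put the URLs in an array and test the same functions, it does not give correct results. Any idea why? Also, anybody is welcome to update this answer to make it dynamic (that is, given an array of URLs, it should check several URLs in one go).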

  <?php
    
    // URL to check
    $url = 'https://www.shareasale.com/m-pr.cfm?merchantID=66802&userID=1860618&productID=1186005518';
      
    $ch = curl_init(); // Initialize a cURL session.
    curl_setopt($ch, CURLOPT_URL, $url); // Set the URL to request.
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE); // Return the response instead of printing it.
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE); // Follow HTTP redirects.
    $html = curl_exec($ch);
    $redirectedUrl = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL); // Effective URL after redirects.
    curl_close($ch); // Close handle
    
    $get_final_url = get_final_url($redirectedUrl);
    if($get_final_url){
        echo is_url_valid($get_final_url);
    }else{
        echo $redirectedUrl ? is_url_valid($redirectedUrl) : is_url_valid($url);
    }
    
    function is_url_valid($url) {
      $handle = curl_init($url);
      curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
      curl_setopt($handle, CURLOPT_NOBODY, true);
      curl_exec($handle);
     
      $httpCode = intval(curl_getinfo($handle, CURLINFO_HTTP_CODE));
      curl_close($handle);
      echo $httpCode;
      if ($httpCode == 200) {
        return '<b> Valid link </b>';
      }
      else {
        return '<b> Invalid link </b>';
      }
    }
    
    function get_final_url($url) {
        $ch = curl_init();
        if (!$ch) {
            return false;
        }
        curl_setopt($ch, CURLOPT_URL,            $url);
        curl_setopt($ch, CURLOPT_HEADER,         1);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_TIMEOUT,        30);
        $ret = curl_exec($ch);
        $info = curl_getinfo($ch);
        curl_close($ch);

        // Give up if the request failed or no HTTP status was received.
        if (empty($ret) || empty($info['http_code'])) {
            return false;
        }
        // Look for a JavaScript redirect target of the form ('https:...') in the response.
        if (!preg_match('#(https:.*?)\'\)#', $ret, $match)) {
            return false;
        }
        // The URL is escaped inside the script (https:\/\/...), so strip the backslashes.
        return stripslashes($match[1]);
    }
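
For the "array of URLs" part of the question, one possible shape is a small loop that applies a single-URL checker to each entry and collects the results. This is only a sketch under stated assumptions: check_urls and $fakeChecker are hypothetical names, the stand-in checker exists so the snippet runs offline, and the URLs are processed one after another rather than in parallel (parallel requests would need the curl_multi_* functions).

```php
<?php
// Hypothetical helper: run a single-URL checker over a list of URLs
// and collect the results keyed by URL.
function check_urls(array $urls, callable $checker): array
{
    $results = [];
    foreach ($urls as $url) {
        $results[$url] = $checker($url);
    }
    return $results;
}

// Stand-in checker so the example needs no network access. In practice a
// cURL-based function name such as 'is_url_valid' would be passed instead.
$fakeChecker = function (string $url): string {
    return strpos($url, 'missing') === false ? 'Valid link' : 'Invalid link';
};

$urls = ['https://example.org/', 'https://example.org/missing'];
foreach (check_urls($urls, $fakeChecker) as $url => $verdict) {
    echo $url, ' => ', $verdict, PHP_EOL;
}
```

Because the checker is passed in as a callable, the same loop works with any of the functions in this thread that take a URL and return a verdict.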

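Answer 4: (score: 1)

Look, the problem here is that you want to follow a JavaScript redirect. The URL you are complaining about, https://www.shareasale.com/m-pr.cfm?merchantID=66802&userID=1860618&productID=1186005518, really does redirect to a URL that responds with HTTP 200 OK, and that page contains this JavaScript: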

<script LANGUAGE="JavaScript1.2">
                window.location.replace('https:\/\/www.tenthousandvillages.com\/bicycle-statue?sscid=71k5_4yt9r ')
                </script>

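So your browser, which understands JavaScript, follows the JavaScript redirect, and that JS redirect leads to a 404 page. Unfortunately there is no good way to do this from PHP. Your best option is probably a headless web browser, such as PhantomJS, puppeteer, Selenium or something similar.

Still, you can search for JavaScript redirects with a regular expression and hope for the best, for example: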

<?php
function is_url_valid(string $url):bool{
    if(0!==strncasecmp($url,"http",strlen("http"))){
        // file:///etc/passwd and stuff like that aren't considered valid urls right?
        return false;
    }
    $ch=curl_init();
    if(!curl_setopt_array($ch,array(
        CURLOPT_URL=>$url,
        CURLOPT_FOLLOWLOCATION=>1,
        CURLOPT_RETURNTRANSFER=>1
    ))){
        // best guess: the url is so malformed that even CURLOPT_URL didn't accept it.
        return false;
    }
    $resp= curl_exec($ch);
    if(false===$resp){
        return false;
    }
    if(curl_getinfo($ch,CURLINFO_RESPONSE_CODE) != 200){
        // only HTTP 200 OK is accepted
        return false;
    }
    // attempt to detect javascript redirects... sigh
    // window.location.replace('https:\/\/www.tenthousandvillages.com\/bicycle-statue?sscid=71k5_4yt9r ')
    $rex = '/location\.replace\s*\(\s*(?<redirect>(?:\'|\")[\s\S]*?(?:\'|\"))/';
    if(!preg_match($rex, $resp, $matches)){
        // no javascript redirects detected..
        return true;
    }else{
        // javascript redirect detected..
        $url = trim($matches["redirect"]);
        // javascript allows both ' and " for strings, but json only allows " for strings
        $url = str_replace("'",'"',$url);
        $url = json_decode($url, true,512,JSON_THROW_ON_ERROR); // we extracted it from javascript, need json decoding.. (well, strictly speaking, it needs javascript decoding, but json decoding is probably sufficient, and we only have a json decoder nearby)
        curl_close($ch);
        return is_url_valid($url);
    }
}
var_dump(

    is_url_valid('https://www.shareasale.com/m-pr.cfm?merchantID=66802&userID=1860618&productID=1186005518'),
    is_url_valid('http://example.org'),
    is_url_valid('http://example12k34jr43r5ehjegeesfmwefdc.org'),
    
);

But to put it mildly, this is a dodgy, hacky solution..
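
The extraction step can be tried on its own, without any network access, by running it against the script snippet quoted in this answer. The following is a simplified sketch and not the answer's exact logic: extract_js_redirect is a hypothetical name, it only recognises location.replace('...') calls, and instead of JSON decoding it assumes the only escaping in the string is \/ for slashes.

```php
<?php
// Hypothetical helper: pull the target of a location.replace('...') call
// out of an HTML string, or return null if none is found.
function extract_js_redirect(string $html): ?string
{
    // Capture the quoted argument; the back-reference \1 demands the same closing quote.
    $pattern = '/location\.replace\s*\(\s*(["\'])(.*?)\1/s';
    if (!preg_match($pattern, $html, $matches)) {
        return null;
    }
    // Assumption: slashes are the only escaped characters (\/ becomes /).
    return trim(str_replace('\\/', '/', $matches[2]));
}

// The script block quoted earlier in this answer.
$html = <<<'HTML'
<script LANGUAGE="JavaScript1.2">
window.location.replace('https:\/\/www.tenthousandvillages.com\/bicycle-statue?sscid=71k5_4yt9r ')
</script>
HTML;

echo extract_js_redirect($html), PHP_EOL;
// prints https://www.tenthousandvillages.com/bicycle-statue?sscid=71k5_4yt9r
```

The same caveat applies: redirects written any other way (assignments to location.href, URLs built at runtime, meta refresh tags) are not detected by this pattern.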
