Question

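I'm trying to create a program that opens a text file containing URLs separated by |. It should then take the first line of the text file, scrape that URL, and remove it from the text file. Each URL is to be scraped by a basic scraper. I know the scraper part works, because if I put one of the URLs in quotes in place of the variable read from the text file, it works. I'm now at the point where nothing is returned, because the URL just won't be accepted.

Here is a basic version of my code, since I had to break it down to isolate the problem.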

$urlarray = explode("|", $contents = file_get_contents('urls.txt'));

$url = $urlarray[0];
$dom = new DOMDocument('1.0');
@$dom->loadHTMLFile($url);

$anchors = $dom->getElementsByTagName('a');
foreach($anchors as $element)
{
    $title = $element->getAttribute('title');
    $class = $element->getAttribute('class');
    if($class == 'result_link')
    {
        $title = str_replace('Synonyms of ', '', $title);
        echo $title . "<br />";
    }
}

Answer 1

The code below works like a champ, tested with your sample data:

<?php
$urlarray = explode("|", $contents = file_get_contents('urls.txt'));

$url = $urlarray[0];

$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';

$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL, trim($url)); // trim() strips any whitespace or newline read from the file
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
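$html = curl_exec($ch);
if ($html === false || $html === '') {
    // Stop here: loadHTML() cannot parse an empty response.
    die('Request failed: ' . curl_error($ch));
}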

$dom = new DOMDocument();
@$dom->loadHTML($html);

$anchors = $dom->getElementsByTagName('a');
foreach($anchors as $element)
{
    $title = $element->getAttribute('title');
    $class = $element->getAttribute('class');
    if($class == 'result_link')
    {
        $title = str_replace('Synonyms of ', '', $title);
        echo $title . "<br />";
    }
}
?>

ALMOST FORGOT: now let's loop through all the URLs:

<?php
    $urlarray = explode("|", $contents = file_get_contents('urls.txt'));

    foreach($urlarray as $url) {
        // Trim first so entries that contain only whitespace or a newline are skipped.
        $url = trim($url);
        if(!empty($url)) {
            $userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';

            $ch = curl_init();
            curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
            curl_setopt($ch, CURLOPT_URL,trim($url));
            curl_setopt($ch, CURLOPT_FAILONERROR, true);
            curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
            curl_setopt($ch, CURLOPT_AUTOREFERER, true);
            curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
            curl_setopt($ch, CURLOPT_TIMEOUT, 10);
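            $html = curl_exec($ch);
            if ($html === false || $html === '') {
                // Skip URLs that failed to load; loadHTML() cannot parse an empty response.
                continue;
            }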

            $dom = new DOMDocument();
            @$dom->loadHTML($html);

            $anchors = $dom->getElementsByTagName('a');
            foreach($anchors as $element)
            {
                $title = $element->getAttribute('title');
                $class = $element->getAttribute('class');
                if($class == 'result_link')
                {
                    $title = str_replace('Synonyms of ', '', $title);
                    echo $title . "<br />";
                }
            }
            echo '<hr />';
        }
    }
?>
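
The question also asks for each URL to be removed from the text file once it has been scraped, which the loop above does not do. A minimal sketch of that step follows, assuming the file is small enough to rewrite in full each time. The helper name popFirstUrl is hypothetical and not part of the original code.

```php
<?php
// Hypothetical helper (not from the original post): take the first URL off a
// "|"-separated list file and write the remaining URLs back to the same file.
// Returns null when the file cannot be read or holds no URLs.
function popFirstUrl(string $path): ?string
{
    $contents = file_get_contents($path);
    if ($contents === false) {
        return null;
    }

    // Split on "|", strip whitespace and newlines, and drop empty entries.
    $urls = array_map('trim', explode('|', $contents));
    $urls = array_values(array_filter($urls, function ($u) {
        return $u !== '';
    }));

    if (count($urls) === 0) {
        return null;
    }

    $first = array_shift($urls);

    // Rewrite the file without the URL that was just taken.
    file_put_contents($path, implode('|', $urls), LOCK_EX);

    return $first;
}

// Usage, assuming urls.txt sits next to the script:
// while (($url = popFirstUrl('urls.txt')) !== null) { /* scrape $url */ }
```

Rewriting the whole file on every call keeps the sketch simple; for a very long list it may be better to read the list once and write it back at the end.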

Answer 2

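So if you enter a URL manually, $url = 'http://www.mywebsite.com';, everything works as expected?

If so, the problem is here: $urlarray = explode("|", $contents = file_get_contents('urls.txt'));

Are you sure urls.txt is being loaded? Are you sure it contains http://a.com|http://b.com and so on?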

I would var_dump $contents = file_get_contents('urls.txt') before the explode statement to see whether it is loading.

If it is, then I would explode it into $urlarray and var_dump $urlarray[0].
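
The two checks above can be sketched as follows. This is only an illustration: the helper name inspectUrlList is hypothetical, and an in-memory string stands in for file_get_contents('urls.txt'). Printing each entry through json_encode makes hidden characters such as a trailing newline show up as \n.

```php
<?php
// Hypothetical helper (not from the original post): list every entry produced
// by explode() together with its length, so stray whitespace becomes visible.
function inspectUrlList(string $contents): array
{
    $report = [];
    foreach (explode('|', $contents) as $i => $entry) {
        // json_encode prints "\n" and "\r" as escape sequences instead of
        // invisible characters; JSON_UNESCAPED_SLASHES keeps URLs readable.
        $shown = json_encode($entry, JSON_UNESCAPED_SLASHES);
        $report[] = sprintf('[%d] %s (length %d)', $i, $shown, strlen($entry));
    }
    return $report;
}

// Sample data standing in for file_get_contents('urls.txt').
$contents = "http://a.com|http://b.com\n";

var_dump($contents);                  // Is the file content what you expect?
print_r(inspectUrlList($contents));   // Does any entry carry extra characters?
```

If the last entry is one character longer than it looks, the file most likely ends with a newline.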

If that looks right, I would trim it before sending it to the DOM, with trim($urlarray[0]).
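
A small sketch of the trimming step, applied to every entry rather than only the first. The helper name parseUrlList is hypothetical, and the sample string is made up to show the kinds of whitespace a text editor can leave behind.

```php
<?php
// Hypothetical helper (not from the original post): split on "|", strip
// surrounding whitespace from each entry, and drop entries that end up empty.
function parseUrlList(string $contents): array
{
    $urls = array_map('trim', explode('|', $contents));

    // array_values() reindexes the result so $urls[0] is always the first URL.
    return array_values(array_filter($urls, function ($u) {
        return $u !== '';
    }));
}

// Made-up sample with a trailing space, Windows line endings and a final "|".
$urls = parseUrlList("http://a.com |\nhttp://b.com\r\n|");

// $urls[0] can now be passed to $dom->loadHTMLFile() or to cURL.
print_r($urls);
```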

I might even use a regular expression that validates URLs to make sure these really are URLs before sending them to the DOM.
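
As one possible way to do that check, the sketch below uses PHP's built-in filter_var with FILTER_VALIDATE_URL instead of a hand-written regular expression, and additionally restricts the scheme to http or https. The helper name isHttpUrl is hypothetical, and the scheme restriction is an assumption about what the scraper should accept.

```php
<?php
// Hypothetical helper (not from the original post): returns true only for
// syntactically valid http:// or https:// URLs, ignoring surrounding whitespace.
function isHttpUrl(string $candidate): bool
{
    $candidate = trim($candidate);

    // filter_var() returns false when the string is not a well-formed URL.
    if (filter_var($candidate, FILTER_VALIDATE_URL) === false) {
        return false;
    }

    // Assumption: only web URLs are of interest to the scraper.
    $scheme = strtolower((string) parse_url($candidate, PHP_URL_SCHEME));
    return $scheme === 'http' || $scheme === 'https';
}

var_dump(isHttpUrl("http://a.com\n")); // trailing newline is trimmed first
var_dump(isHttpUrl('not a url'));
```

This only checks the syntax of the string; it does not confirm that the host exists or that the page can be fetched.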

Let me know the results and I'll try to help further. Alternatively, post all of your sample code, including urls.txt, and I'll run it locally.

PHP DOM won't accept a URL
