Question

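I'm having trouble extracting keywords from a website (a Wikipedia article). The extracted keywords aren't really keywords: they are taken from the HTML source, not from the text shown on the site.

I use the following code: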

include("Extkeys.php");
[...]
if (empty($keywords)) {
    // The class defined in Extkeys.php is named Extkeys.
    $ekeywords = new Extkeys;
    $keywords = $ekeywords->Keys($webhtml);
}

The code of "Extkeys" is:

<?php
class Extkeys {
    function Keys($webhtml) {
        $webhtml = $this->clean($webhtml);
        $blacklist = 'de,la,los,las,el,ella,nosotros,yo,tu,el,te,mi,del,ellos';
        $sticklist = 'test';
        $minlength = 3;
        $count = 17;

        // Replace punctuation with spaces so words split cleanly.
        $webhtml = preg_replace('/[\.;:|\'|\"|\`|\,|\(|\)|\-]/', ' ', $webhtml);
        $webhtml = preg_replace('/¡/', '', $webhtml);
        $webhtml = preg_replace('/¿/', '', $webhtml);

        // Count how often each lowercased word occurs.
        $keysArray = explode(" ", $webhtml);
        $keysArray = array_count_values(array_map('strtolower', $keysArray));
        $blackArray = explode(",", $blacklist);

        foreach ($blackArray as $blackWord) {
            if (isset($keysArray[trim($blackWord)]))
                unset($keysArray[trim($blackWord)]);
        }
        arsort($keysArray);
        $i = 1;
        $keywords = "";
        foreach ($keysArray as $word => $instances) {
            if ($i > $count) break;
            if (strlen(trim($word)) >= $minlength && is_string($word)) {
                $keywords .= $word . ", ";
                $i++;
            }
        }

        $keywords = rtrim($keywords, ", ");

        return $keywords = $sticklist . '' . $keywords;
    }

    function clean($webhtml) {
        $regex = '/(([_A-Za-z0-9-]+)(\\.[_A-Za-z0-9-]+)*@([A-Za-z0-9-]+)(\\.[A-Za-z0-9-]+)*)/iex';
        $desc = preg_replace($regex, '', $webhtml);
        $webhtml = preg_replace("'<script[^>]*>.*?</script>'si", '', $webhtml);
        $webhtml = preg_replace('/<a href="([^"]+)"[^>]*>([^<]+)<\/a>/is', '\2 (\1)', $webhtml);
        $webhtml = preg_replace('/<!--.+?-->/', '', $webhtml);
        $webhtml = preg_replace('/{.+?}/', '', $webhtml);
        $webhtml = preg_replace('/&nbsp;/', ' ', $webhtml);
        $webhtml = preg_replace('/&amp;/', ' ', $webhtml);
        $webhtml = preg_replace('/&quot;/', ' ', $webhtml);
        $webhtml = strip_tags($webhtml);
        $webhtml = htmlspecialchars($webhtml);
        $webhtml = str_replace(array("\r\n", "\r", "\n", "\t"), " ", $webhtml);

        // Collapse runs of spaces into a single space.
        while (strchr($webhtml, "  ")) {
            $webhtml = str_replace("  ", " ", $webhtml);
        }

        // Make sure every '.' or ',' is followed by a space.
        for ($cnt = 1; $cnt < strlen($webhtml) - 1; $cnt++) {
            if (($webhtml[$cnt] == '.') || ($webhtml[$cnt] == ',')) {
                if ($webhtml[$cnt + 1] != ' ') {
                    $webhtml = substr_replace($webhtml, ' ', $cnt + 1, 0);
                }
            }
        }
        return $webhtml;
    }
}
?>

Here is an example of the extracted keywords:

testfalse, lang, {mw, loader, window, function, true, vector, user, gadget, mediawiki, legacy, options, usebetatoolbar, implement, resourceloader, default

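From the article: http://en.wikipedia.org/wiki/Searchengine

The "Extkeys" code is a copy of code from a tutorial, which I adapted to get it working.

How can I make the code extract keywords from the site's text rather than from the HTML?

Good luck!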

Answer 1

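Assuming I understand your question, I think the following is all you need.

This reads the HTML from a URL (e.g. http://www.whatever.com/page.html) and uses it to generate the keywords, instead of taking the HTML as a parameter.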

function Keys($url) { 
    $webhtml = file_get_contents($url);
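
One caveat with the snippet above: file_get_contents() returns false when the request fails, and that false would then flow into the keyword code. A minimal sketch of a guard, assuming a helper named fetchHtml (a hypothetical name, not part of the original class) and assuming allow_url_fopen is enabled so that URLs can be read:

```php
<?php
// Hypothetical helper, not part of the original Extkeys class:
// load the HTML behind a URL (or local path) and fail loudly on error.
function fetchHtml(string $url): string
{
    // file_get_contents() returns false on failure; the @ silences the
    // warning so the failure is reported as an exception instead.
    $html = @file_get_contents($url);
    if ($html === false) {
        throw new RuntimeException("Could not read: $url");
    }
    return $html;
}
```

With that in place, the existing method could be called as (new Extkeys)->Keys(fetchHtml($url)) without changing its signature.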

Answer 2

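You want to extract the content from the page first and then search it for keywords. That means finding the actual content of the page and removing things like sidebars, footers, and so on. Just google "HTML content extraction"; there are plenty of articles about it.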
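
As a rough illustration of that idea in PHP, here is a sketch that uses the built-in DOM extension to drop elements that rarely hold article text and then returns what is left. The function name visibleText and the list of tags are my own assumptions, not a standard; real content extraction (as boilerpipe does) is considerably smarter. Note also that loadHTML() guesses the encoding, so non-ASCII pages may need an explicit charset.

```php
<?php
// Sketch only: return the text of a page without scripts, styles and
// typical page chrome. The tag list below is an assumption.
function visibleText(string $html): string
{
    if (trim($html) === '') {
        return '';
    }

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parse errors silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $unwanted = ['script', 'style', 'noscript', 'nav', 'header', 'footer', 'aside'];
    foreach ($unwanted as $tag) {
        // getElementsByTagName() returns a live list, so copy it first.
        foreach (iterator_to_array($doc->getElementsByTagName($tag)) as $node) {
            if ($node->parentNode !== null) {
                $node->parentNode->removeChild($node);
            }
        }
    }

    // Prefer the <body>; fall back to the root element if there is none.
    $root = $doc->getElementsByTagName('body')->item(0) ?? $doc->documentElement;
    if ($root === null) {
        return '';
    }

    // Collapse runs of whitespace into single spaces.
    return trim(preg_replace('/\s+/u', ' ', $root->textContent) ?? '');
}
```

The returned string could then be handed to the keyword counter in place of the raw HTML, so identifiers such as mw or loader from inline scripts never reach the word count.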

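I did this once in Java, where there is a library called boilerpipe. I'm not sure whether there is a PHP port or interface; a quick Google search didn't turn up anything. But I'm sure there are similar libraries for PHP.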

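The simplest way to get rid of the HTML, without specifically looking for the page content, is to use a regular expression that removes all the tags, e.g. s/<[^>]+>//g. For a search engine, however, this is probably not the best approach, because you will end up with a lot of junk that can ruin your keyword extraction.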
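
To make the "junk" concrete, here is a small PHP sketch of that regular expression. The function names are hypothetical. The first function is a direct translation of s/<[^>]+>//g; the second is one possible refinement that removes script and style elements together with their bodies before stripping the remaining tags.

```php
<?php
// Direct PHP translation of s/<[^>]+>//g: removes tags, keeps everything else.
function stripTagsNaive(string $html): string
{
    return preg_replace('/<[^>]+>/', '', $html) ?? '';
}

// Refinement (an assumption, not from the answer): drop <script> and
// <style> elements including their contents, then strip the other tags.
function stripTagsAndScripts(string $html): string
{
    $html = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', '', $html) ?? '';
    return stripTagsNaive($html);
}

$html = '<p>Search engine</p><script>mw.loader.load("x");</script>';
echo stripTagsNaive($html), "\n";      // Search enginemw.loader.load("x");
echo stripTagsAndScripts($html), "\n"; // Search engine
```

The first output shows how the JavaScript body survives plain tag stripping, which is consistent with words like mw and loader appearing in the keyword list from the question.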

Edit: Here is an article about content extraction with PHP.

Wrong keywords, extract content from a website. OOP
