Question

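I have a String variable holding the contents of a text file (e.g. an .html file) that I read with fopen(). I then run strip_tags() on it so that I can use the stripped text for an article preview. Before that, though, I need to get the h1 nodeValue and its character count, so that I can replace the zero in the code below with that value and end at 150 + that value.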

$f = fopen($filepath,"r");
$WholeFile = fread($f, filesize($filepath));
fclose($f);
$StrippedFile=strip_tags($WholeFile);
$TextExtract = mb_substr("$StrippedFile", 0,150);

What is the best way to go about this? Is a parser the answer? So far, this is the only case where I will be extracting values from HTML tags.

Answer 1

When you have structured text (HTML, XML, JSON, YAML, etc.), you should always use a proper parser unless you have a very good reason not to.

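In this case you might be able to get away with a regular expression, but the solution will be very brittle and may have problems with character encodings, entities, or whitespace. The other solutions posted here all break in subtle ways. For example, given input like this: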

<html><head><meta http-equiv="content-type" content="text/html;charset=utf-8" />
<title>Page title</title></head>
<body><div><h1 title="attributes or the space in the closing tag may confuse code"
>Title &mdash;    maybe emdash counted as 7 characters</h1 >
<p> and      whitespace counted excessively too. And here's
a utf-8 character that may get split in the middle: ©; creating  
an invalid string.</p></div></body></html>
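To make two of these pitfalls concrete, here is a minimal sketch (not part of the original answer). The strings are hypothetical fragments modelled on the sample above, and the sketch assumes the mbstring extension is loaded. It shows an entity inflating the character count, and byte-based slicing leaving an invalid UTF-8 string.

```php
<?php
// Hypothetical fragment modelled on the heading in the sample above.
$h1 = 'Title &mdash; maybe';

// Counting the raw markup treats the entity as 7 separate characters.
$rawLength = mb_strlen($h1, 'UTF-8');

// Decoding entities first gives the length a reader would actually see.
$decoded = html_entity_decode($h1, ENT_QUOTES, 'UTF-8');
$decodedLength = mb_strlen($decoded, 'UTF-8');

// The copyright sign occupies two bytes in UTF-8 (0xC2 0xA9).
$text = "middle: \xC2\xA9;";

// Byte-based substr() stops after the first byte of the sign.
$cutByBytes = substr($text, 0, 9);

// Character-based mb_substr() keeps the sign intact.
$cutByChars = mb_substr($text, 0, 9, 'UTF-8');

var_dump($rawLength, $decodedLength);                // int(19), int(13)
var_dump(mb_check_encoding($cutByBytes, 'UTF-8'));   // bool(false)
var_dump(mb_check_encoding($cutByChars, 'UTF-8'));   // bool(true)
```

A parser sidesteps both problems because it hands back decoded text nodes rather than raw markup.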

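Here is a solution using DOMDocument and DOMXPath. It should work for all but the very worst HTML, and it gives you a UTF-8 result of at most 150 characters (characters, not bytes) with all entities normalized to their character values.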

$html = '<html><head><meta http-equiv="content-type" content="text/html;charset=utf-8" />
<title>Page title</title></head>
<body><div><h1 title="attributes or the space in the closing tag may confuse code"
>Title &mdash;    maybe emdash counted as 7 characters</h1 >
<p> and      whitespace counted excessively too. And here\'s
a utf-8 character that may get split in the middle: ©; creating  
an invalid string.</p></div></body></html>';


$doc = new DOMDocument();
$doc->loadHTML($html);
// if you have a url or filename, you can use this instead:
// $doc->loadHTMLFile($url);
$xp = new DOMXPath($doc);

// you can easily modify the XPath query to match the "title" of different documents
$titlenode = $xp->query('/html/body//h1[1]');

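// Note: XPath string positions are 1-based, so substring(s, 0, 150) returns
// the first 149 characters; use substring(s, 1, 150) for a full 150.
$xpath = 'normalize-space(substring(
        concat(
            normalize-space(.),
            " ",
            normalize-space(./following-sibling::*)
        ), 0, 150))';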


$excerpt = null;
if ($titlenode->length) {
    $excerpt = $xp->evaluate($xpath, $titlenode->item(0));
}

var_export($excerpt);

This code will output:

'Title — maybe emdash counted as 7 characters and whitespace counted excessively too. And here\'s a utf-8 character that may get split in the middle: ©'

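The basic idea here is to match your h1 (or whatever the title element is) with XPath, then take the string value of that element and of the element that follows it, and truncate the result with XPath. Note that XPath 1.0 converts a node-set to the string value of its first node, so only the first following sibling contributes to the excerpt. Keeping everything inside XPath means you avoid dealing with all the messy character-set and entity issues in PHP.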
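If you specifically want the h1 nodeValue and its character count, as the question asks, a sketch along the following lines may help. It is not part of the original answer: the input string is hypothetical, and it assumes the h1 is the first text inside body, so that skipping its length lands at the start of the article text.

```php
<?php
// Hypothetical input; in practice this would be the contents of your file.
$html = '<html><head><meta http-equiv="content-type" content="text/html;charset=utf-8" />'
      . '</head><body><h1>My &amp; title</h1>'
      . '<p>First paragraph of the article.</p></body></html>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$h1 = $doc->getElementsByTagName('h1')->item(0);

// nodeValue is the element's text with entities already decoded.
$title = $h1 !== null ? trim($h1->nodeValue) : '';
$titleLength = mb_strlen($title, 'UTF-8');

// textContent of body is the page text without tags, title first.
$body = $doc->getElementsByTagName('body')->item(0);
$bodyText = $body !== null ? trim($body->textContent) : '';

// Skip the title, then take the next 150 characters (not bytes).
$extract = mb_substr($bodyText, $titleLength, 150, 'UTF-8');

var_export([$title, $titleLength, $extract]);
```

If the heading is not the first text in the document, the offset no longer lines up; in that case it is safer to read the text of the elements that follow the h1 directly.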

Answer 2

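If you are sure of the contents of the files you will be processing, and you know the title is in an H1, you can slice the string at the position of </h1> (using strstr(), for example, though there are many ways to do it) to split it into two strings.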

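You can then strip the tags from the first part to get the title, and strip the tags from the second part to get the content. This assumes your file contains only one h1, holding the title, placed before the DOM element that contains the article content.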

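Keep in mind that this is not the best way to parse arbitrary articles from the web; for a more general solution, I would look into a dedicated parser class.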

Here is a code example:

$f = fopen($filepath,"r");
$WholeFile = fread($f, filesize($filepath));
fclose($f);
// Modified part
$content = strip_tags(strstr($WholeFile, '</h1>'));
$title = strip_tags(strstr($WholeFile, '</h1>', true)); // The third argument requires PHP 5.3.0 or later
$TextExtract = mb_substr($content, 0,150);
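One thing the snippet above does not cover is a file without a closing </h1>: strstr() then returns false. The following is a minimal sketch of a guarded variant. The function name splitAtH1 is hypothetical, and the scalar type declarations assume PHP 7 or later.

```php
<?php
// Hypothetical helper: splits the markup around the first closing </h1>.
// Returns [title, content], both with tags stripped.
function splitAtH1(string $html): array
{
    $marker = '</h1>';
    // Case-insensitive search, so </H1> is matched as well.
    $pos = stripos($html, $marker);
    if ($pos === false) {
        // No heading found: treat the whole input as content.
        return ['', trim(strip_tags($html))];
    }
    // The marker is plain ASCII, so cutting at these byte offsets
    // cannot split a multi-byte UTF-8 character.
    $title = strip_tags(substr($html, 0, $pos));
    $content = strip_tags(substr($html, $pos + strlen($marker)));
    return [trim($title), trim($content)];
}

list($title, $content) = splitAtH1('<h1>Hello</h1><p>Body text here.</p>');
$TextExtract = mb_substr($content, 0, 150, 'UTF-8');

var_export([$title, $TextExtract]);
```

Like the original snippet, this still counts entities such as &mdash; as several characters, which is one reason the parser-based approach is more robust.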

PHP nodeValue and character count
