使用PHP Simple HTML DOM Parser查找带有类的div,它是纯文本

时间:2018-05-05 15:15:04

标签: php html parsing web-scraping simple-html-dom

我想在工作经验教育和培训之间找到 ft00 课程,并提取包含来自给定html的日期的课程文本

public Collection<V> values() {
    Collection<V> vs = values;
    if (vs == null) {
        vs = new Values();
        values = vs;
    }
    return vs;
}

到目前为止,我可以提取的是工作经验教育和培训之间的所有数据,并且它正常运行,代码如下: -

values()

我很接近,但似乎无法找到解决方案,任何帮助将不胜感激谢谢: - )

2 个答案:

答案 0 :(得分:2)

使用DOMDocumentDOMXPath,您可以像下面这样做,我从未使用过简单的HTML DOM解析器,但我认为它有XPath。

<?php
$dom = new DOMDocument();

$dom->loadHtml('
<p class = "ft00">Introduction</p>
<p class = "ft00">John Smith</p>
<p class = "ft02">Email:</p>
<p class = "ft00">John@gmail.com</p>
<p class = "ft00">Work Experience</p>
<p class = "ft00">27 July 2017</p>
<p class = "ft02">ABC Company</p>
<p class = "ft00">19 May 2018</p>
<p class ="ft02">XYZ Company</p>
<p class = "ft00">EDUCATION AND TRAINING</p>
', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

$xpath = new DOMXPath($dom);

$result = [];
$matching  = false;
foreach ($xpath->query("//p[contains(@class, 'ft00') or contains(@class, 'ft02')]/text()") as $p) {
    if ($p->nodeValue === 'Work Experience' || $matching) {
        $result[] = $p->nodeValue;
        $matching = true;
    }
    if ($p->nodeValue === 'EDUCATION AND TRAINING') {
        break;
    }
}

print_r($result);

<强>结果:

Array
(
    [0] => Work Experience
    [1] => 27 July 2017
    [2] => ABC Company
    [3] => 19 May 2018
    [4] => XYZ Company
    [5] => EDUCATION AND TRAINING
)

https://3v4l.org/0nvr4

答案 1 :(得分:1)

这是正确的工作代码: -

$test = array();
$matching  = false;
$collection = $html->find('p.ft00');
foreach ($collection as $tkey) {
    if ($tkey->plaintext == "WORK EXPERIENCE" || $matching ) {
        $test[] = $tkey->plaintext;
        $matching = true;
    }
    if ( $tkey->plaintext == "EDUCATION AND TRAINING") {
        break;
    }

    }
    var_dump($test);    

输出: -

Array
(
    [0] => Work Experience
    [1] => 27 July 2017
    [2] => 19 May 2018
    [3] => EDUCATION AND TRAINING
)
相关问题