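Remove all HTML tags + content from text

Time: 2014-02-15 10:13:08

Tags: php html regex parsing dom

OK, although it looks simple, I still can't get it right. I have tried RegEx, I have even tried DOM parsing, and I still can't get it to work correctly.

Based on an answer to one of my previous questions (Trying to remove HTML tags (+ content) from String), this is what I ended up with: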

    public static function removeHtmlTags($str) {
        $dom = new DOMDocument();
        $errorState = libxml_use_internal_errors(true);
        $dom->loadHTML($str);

        $xpath = new DOMXPath($dom);
        $node = $xpath->query('//body/p/text()')->item(0);

        if (isset($node->textContent)) $ret = $node->textContent;
        else $ret="";

        libxml_use_internal_errors($errorState);

        return $ret;
    }

It seems to do the trick in most cases, but here is the catch...

This (in case you don't recognize it, it is a Wikipedia infobox):

|conventional_long_name = Italian Republic
|native_name = {{lang|it|''Repubblica italiana<!--italiana is without uppercase; see Italian wiki-->''}}
|common_name = Italy
|nickname(s) = Il Belpaese
|image_flag = Flag of Italy.svg
|image_coat = Italy-Emblem.svg
|symbol_type = Emblem
|image_map = EU-Italy.svg
|map_caption = {{map caption |location_color=dark green |region=Europe |region_color=dark grey |subregion=the [[European Union]] |subregion_color=green |legend=EU-Italy.svg}}
|national_anthem = {{native name|it|[[Il Canto degli Italiani]]}}<br/>{{small|''The Song of the Italians''}} [[File:Inno di Mameli instrumental.ogg|center]]
|official_languages = [[Italian language|Italian]]<sup>a</sup>
|Religion= [[Roman Catholic]]
|capital = {{Coat of arms|Rome}}
|latd=41 |latm=54 |latNS=N |longd=12 |longm=29 |longEW=E
|largest_city = capital
|largest_metropolitan area = {{hlist |[[Milan]] |[[Naples]]}}
|demonym = [[Italians|Italian]]
|government_type = [[Unitary state|Unitary]] [[parliamentary system|parliamentary]] [[constitutional republic]]
|leader_title1 = [[President of Italy|President]]
|leader_name1 = [[Giorgio Napolitano]]
|leader_title2 = [[Prime Minister of Italy|Prime Minister]]
|leader_name2 = [[Enrico Letta]]
|leader_title3 = [[List of Presidents of the Senate of Italy|President of the Senate]]
|leader_name3 = [[Pietro Grasso]]
|leader_title4 = [[List of Presidents of the Italian Chamber of Deputies|President of the Chamber of Deputies]]
|leader_name4 = [[Laura Boldrini]]
|legislature = [[Parliament of Italy|Parliament]]
|upper_house = [[Italian Senate|Senate of the Republic]]
|lower_house = [[Italian Chamber of Deputies|Chamber of Deputies]]
|accessionEUdate = 25 March 1957 (founding member)
|EUseats = 78
|area_rank = 72nd
|area_magnitude = 1 E11
|area_km2 = 301,338
|area_sq_mi = 116,347 <!--Do not remove per [[WP:MOSNUM]]-->
|percent_water = 2.4
|population_census = 59,433,744<ref name="Istat">{{cite web |url=http://www.istat.it/it/files/2012/12/volume_popolazione-legale_XV_censimento_popolazione.pdf|title=Census 2011 - final results |publisher=[[National Institute of Statistics (Italy)|ISTAT]] |accessdate=19 December 2012}}</ref>
|population_census_year = 2011
|population_census_rank = 23rd
|population_estimate = 59,685,227<ref>{{cite web |url=http://www.istat.it/en/archive/94537|title=Resident population and population change|publisher=[[National Institute of Statistics (Italy)|ISTAT]] |accessdate=25 June 2013}}</ref>
|population_estimate_year = 2012
|population_estimate_rank = 23rd
|population_density_rank = 63rd
|population_density_km2 = 197.7
|population_density_sq_mi = 511.6 <!--Do not remove per [[WP:MOSNUM]]-->
|GDP_PPP = $1.848 trillion<ref name=autogenerated1 >{{cite web |url=http://www.imf.org/external/pubs/ft/weo/2013/02/weodata/weorept.aspx?pr.x=25&pr.y=1&sy=2013&ey=2013&scsm=1&ssd=1&sort=country&ds=.&br=1&c=136&s=NGDPD%2CNGDPDPC%2CPPPGDP%2CPPPPC&grp=0&a= |title=Italy |publisher=International Monetary Fund |accessdate=17 October 2013}}</ref>
|GDP_PPP_rank = 11th
|GDP_PPP_year = 2014
|GDP_PPP_per_capita = $30,218<ref name=autogenerated1/>
|GDP_PPP_per_capita_rank = 34th
|GDP_nominal = $2.148 trillion<ref name=autogenerated1/>
|GDP_nominal_rank = 9th
|GDP_nominal_year = 2014
|GDP_nominal_per_capita = $35,123<ref name=autogenerated1/>
|GDP_nominal_per_capita_rank = 27th
|sovereignty_type = [[History of Italy|Formation]]
|established_event1 = [[Italian unification|Unification]]
|established_date1 = 17 March 1861
|established_event2 = [[Italian constitutional referendum, 1946|Republic]]
|established_date2 = 2 June 1946
|Gini_year = 2011
|Gini_change =  <!--increase/decrease/steady-->
|Gini = 31.9 <!--number only-->
|Gini_ref = <ref name=eurogini>{{cite web|title=Gini coefficient of equivalised disposable income (source: SILC)|url=http://appsso.eurostat.ec.europa.eu/nui/show.do?dataset=ilc_di12|publisher=Eurostat Data Explorer|accessdate=13 August 2013}}</ref>
|Gini_rank =
|HDI_year = 2013
|HDI_change = increase <!--increase/decrease/steady-->
|HDI = 0.881 <!--number only-->
|HDI_ref = <ref name="HDI">{{cite web |url=http://hdr.undp.org/en/media/HDR_2011_EN_Table1.pdf |title=Human Development Report 2011 |year=2011 |publisher=United Nations |accessdate=5 November 2011}}</ref>
|HDI_rank = 25th
|currency = Euro ([[Euro sign|€]])<sup>b</sup>
|currency_code = EUR
|country_code =
|time_zone = [[Central European Time|CET]]
|utc_offset = +1
|time_zone_DST = [[Central European Summer Time|CEST]]
|utc_offset_DST = +2
|drives_on = right
|calling_code = [[Telephone numbers in Italy|39]]<sup>c</sup>
|cctld = [[.it]]<sup>d</sup>
|footnote_a = <span style="font-size:100%;">French is co-official in the [[Aosta Valley]]; [[Slovene language|Slovene]] is co-official in the [[province of Trieste]] and the [[province of Gorizia]]; German and [[Ladin language|Ladin]] are co-official in [[South Tyrol]].</span>

|footnote_b = <span style="font-size:100%;">Before 2002, the [[Italian lira|Italian Lira]]. The euro is accepted in [[Campione d'Italia]], but the official currency there is the [[Swiss Franc]].<ref>{{cite web |url=http://www.comune.campione-d-italia.co.it/ |title=Comune di Campione d'Italia |publisher=Comune.campione-d-italia.co.it |date=14 July 2010 |accessdate=30 October 2010}}</ref></span>
|footnote_c = <span style="font-size:100%;">To call [[Campione d'Italia]], it is necessary to use the Swiss code [[+41]].</span>
|footnote_d = <span style="font-size:100%;">The [[.eu]] domain is also used, as it is shared with other [[European Union]] member states.</span>

becomes (after exploding on newlines):

Array
(
    [conventional_long_name] => Italian Republic
    [native_name] => {{lang|it|''Repubblica italiana
    [common_name] => Italy
    [nickname(s)] => Il Belpaese
    [image_flag] => Flag of Italy.svg
    [image_coat] => Italy-Emblem.svg
    [symbol_type] => Emblem
    [image_map] => EU-Italy.svg
    [map_caption] => {{map caption |location_color=dark green |region=Europe |region_color=dark grey |subregion=the [[European Union]] |subregion_color=green |legend=EU-Italy.svg}}
    [national_anthem] => {{native name|it|[[Il Canto degli Italiani]]}}
    [official_languages] => [[Italian language|Italian]]
    [Religion] => [[Roman Catholic]]
    [capital] => {{Coat of arms|Rome}}
    [latd] => 41 |latm=54 |latNS=N |longd=12 |longm=29 |longEW=E
    [largest_city] => capital
    [largest_metropolitan area] => {{hlist |[[Milan]] |[[Naples]]}}
    [demonym] => [[Italians|Italian]]
    [government_type] => [[Unitary state|Unitary]] [[parliamentary system|parliamentary]] [[constitutional republic]]
    [leader_title1] => [[President of Italy|President]]
    [leader_name1] => [[Giorgio Napolitano]]
    [leader_title2] => [[Prime Minister of Italy|Prime Minister]]
    [leader_name2] => [[Enrico Letta]]
    [leader_title3] => [[List of Presidents of the Senate of Italy|President of the Senate]]
    [leader_name3] => [[Pietro Grasso]]
    [leader_title4] => [[List of Presidents of the Italian Chamber of Deputies|President of the Chamber of Deputies]]
    [leader_name4] => [[Laura Boldrini]]
    [legislature] => [[Parliament of Italy|Parliament]]
    [upper_house] => [[Italian Senate|Senate of the Republic]]
    [lower_house] => [[Italian Chamber of Deputies|Chamber of Deputies]]
    [accessionEUdate] => 25 March 1957 (founding member)
    [EUseats] => 78
    [area_rank] => 72nd
    [area_magnitude] => 1 E11
    [area_km2] => 301,338
    [area_sq_mi] => 116,347 
    [percent_water] => 2.4
    [population_census] => 59,433,744
    [population_census_year] => 2011
    [population_census_rank] => 23rd
    [population_estimate] => 59,685,227
    [population_estimate_year] => 2012
    [population_estimate_rank] => 23rd
    [population_density_rank] => 63rd
    [population_density_km2] => 197.7
    [population_density_sq_mi] => 511.6 
    [GDP_PPP] => $1.848 trillion
    [GDP_PPP_rank] => 11th
    [GDP_PPP_year] => 2014
    [GDP_PPP_per_capita] => $30,218
    [GDP_PPP_per_capita_rank] => 34th
    [GDP_nominal] => $2.148 trillion
    [GDP_nominal_rank] => 9th
    [GDP_nominal_year] => 2014
    [GDP_nominal_per_capita] => $35,123
    [GDP_nominal_per_capita_rank] => 27th
    [sovereignty_type] => [[History of Italy|Formation]]
    [established_event1] => [[Italian unification|Unification]]
    [established_date1] => 17 March 1861
    [established_event2] => [[Italian constitutional referendum, 1946|Republic]]
    [established_date2] => 2 June 1946
    [Gini_year] => 2011
    [Gini_change] => 
    [Gini] => 31.9 
    [Gini_ref] => 
    [HDI_year] => 2013
    [HDI_change] => increase 
    [HDI] => 0.881 
    [HDI_ref] => 
    [HDI_rank] => 25th
    [currency] => Euro ([[Euro sign|â¬]])
    [currency_code] => EUR
    [time_zone] => [[Central European Time|CET]]
    [utc_offset] => +1
    [time_zone_DST] => [[Central European Summer Time|CEST]]
    [utc_offset_DST] => +2
    [drives_on] => right
    [calling_code] => [[Telephone numbers in Italy|39]]
    [cctld] => [[.it]]
    [footnote_a] => 
    [footnote_b] => 
    [footnote_c] => 
    [footnote_d] => 
)

What I'm wondering is what happened to this line:

|native_name = {{lang|it|''Repubblica italiana<!--italiana is without uppercase; see Italian wiki-->''}}

Shouldn't the result be:

|native_name = {{lang|it|''Repubblica italiana''}}

Instead, it seems to drop the text that follows the HTML comment.

Any ideas?

1 Answer:

Answer 0: (score: 0)

The road to hell:

$str = substr($str, 1);
$lines = explode("\n|", $str);

$result = array();

$pattern = '~
# subpattern definitions
(?(DEFINE)
    (?<c> <!--.*?--> )      # html comment
    (?<tag>                 # tag (possible nested tags with the same name)
        (   <(\w++)
            (?>[^<]++ | \g<c> | < (?!/?\g{-1}) | (?-2) )*
            </\g{-1}> ) 
    )
    (?<sctag> <\w++[^>]*> ) # self closing tag
)
# main pattern
\g<c> | \g<tag> | \g<sctag> | \s+$
~x';

foreach($lines as $line) {
    $kv = explode(' = ', $line, 2);

    $kv[1] = (isset($kv[1])) ? preg_replace($pattern, '', $kv[1]) : null;

    $result[$kv[0]] = $kv[1];
}
unset($kv, $pattern, $lines, $str);
echo '<pre>' . htmlspecialchars(print_r($result, true)) . '</pre>';
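
A side remark on the splitting step: explode(' = ', $line, 2) only splits when the first "=" has a space on both sides, so lines such as "Religion= [[Roman Catholic]]", "Gini_rank =" or "latd=41 |latm=54 ..." end up with a null value. A more tolerant split could be sketched as follows. The helper name splitInfoboxLine is hypothetical and not part of the code above; it is only a minimal sketch that assumes the first "=" on a line always separates the key from the value.

```php
<?php
// Hypothetical helper (not from the answer above): split "key = value"
// on the first "=", whether or not it is surrounded by spaces.
function splitInfoboxLine(string $line): array
{
    // Limit of 2 keeps any later "=" (for example inside URLs) in the value.
    $parts = preg_split('~\s*=\s*~', $line, 2);

    $key = trim($parts[0]);
    // A line without any "=" yields a null value, as in the original loop.
    $value = isset($parts[1]) ? trim($parts[1]) : null;

    return [$key, $value];
}

list($key, $value) = splitInfoboxLine('Religion= [[Roman Catholic]]');
echo $key . ' => ' . $value . "\n";
```

In the loop above, the pair returned by this helper would stand in for $kv[0] and $kv[1], with the value still passed through preg_replace afterwards.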

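Note 1: Since the string contains non-standard tags (i.e. tags that are not HTML tags), the same tag name may also be used as a self-closing tag. In other words, you can find <ref>....</ref> and <ref/> (or <ref> as a self-closing tag) in the same document. To handle this particular case, you can change the middle line of the tag subpattern definition to: (?>[^<]++ | \g<c> | < (?!/?\g{-1}) | (?-2) | <\g{-1}\b[^>]*?/?> )*

Note 2: If you don't want to use a regex, the way to go is the DOM, but since the <ref> tag doesn't exist in HTML, you would have to write your own DTD that describes this tag (and all the other HTML tags), add it to your string, and use the loadXML method of the DOMDocument class.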
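
As for why the function in the question loses the text after the comment: a likely explanation is that an HTML comment is a node of its own, so the text before and after it become two separate text nodes, and ->item(0) returns only the first one. The sketch below illustrates this on a shortened, hypothetical input. It queries //text() rather than the question's //body/p/text() so that it does not depend on how the parser wraps bare text.

```php
<?php
// Shortened, hypothetical stand-in for the native_name value in the question.
$str = "{{lang|it|''Repubblica italiana<!--see Italian wiki-->''}}";

$dom = new DOMDocument();
$errorState = libxml_use_internal_errors(true);
$dom->loadHTML($str);
libxml_use_internal_errors($errorState);

$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//text()');

// The comment sits between two text nodes, so the first node stops there.
$first = $nodes->item(0)->textContent;

// Concatenating every text node keeps the part after the comment too.
$all = '';
foreach ($nodes as $node) {
    $all .= $node->textContent;
}

echo $first . "\n";
echo $all . "\n";
```

Note that this only removes comments and tags while keeping the text inside elements, so it does not drop the content of <ref>...</ref> the way the regex above does.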