Removing duplicate attributes from an HTML string with a regular expression

Time: 2017-07-06 14:02:37

Tags: regex

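I have an HTML string in which some tags have more than one "href" attribute, and I have to remove one of them. If a tag has more than one href attribute, the empty href attribute must be removed with a regex.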

<p>
  Contrary to popular belief, Lorem Ipsum is not simply random text.It has
  <a title="Test PDF for RTF" href="" title="Test PDF for RTF" href="Test%20PDF%20for%20rtf.pdf">
     Test PDF
  </a>
  roots in a piece of classical Latin literature from 45 BC, making
  <a title="Learn More" href="test.html" title="Learn More" >
    Learn More
  </a>
  it over 2000 years old. Richard McClintock,
  <a title="Test Page" href="" >
    Test Page
  </a>
  Latin professor at Hampden-Sydney College in Virginia,
  <a title="Test PDF for RTF" href="" title="Test PDF for RTF" href="Test%20PDF%20for%20rtf.pdf">
    Test PDF
  </a>
  looked up one of the more obscure Latin words, consectetur
</p>

3 Answers:

Answer 0 (score: 1)

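I think you want this: match the first href only when a line has two hrefs and, as in your comment "I have to keep one href", don't remove an href that is the only one, even if it is empty. If what you want is to remove the duplicate href, you can apply: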

(?=href.+?href)[^"]+""

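This part: (?=href.+?href) is a lookahead that matches a zero-length position just before the first href when href occurs twice on the line, and this part: [^"]+"" matches that empty href="".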

The easiest way to test it is against your input saved in a file:

perl -lne 'print $& while/(?=href.+?href)[^"]+""/g' file  

Output:

href=""
href=""

And to remove them:

perl -lpe 's/(?=href.+?href)[^"]+""/==>Removed<==/g' file

It outputs:

  <p>
  Contrary to popular belief, Lorem Ipsum is not simply random text.It has
  <a title="Test PDF for RTF" ==>Removed<== title="Test PDF for RTF" href="Test%20PDF%20for%20rtf.pdf">
     Test PDF
  </a>
  roots in a piece of classical Latin literature from 45 BC, making
  <a title="Learn More" href="test.html" title="Learn More" >
    Learn More
  </a>
  it over 2000 years old. Richard McClintock,
  <a title="Test Page" href="" >
    Test Page
  </a>
  Latin professor at Hampden-Sydney College in Virginia,
  <a title="Test PDF for RTF" ==>Removed<== title="Test PDF for RTF" href="Test%20PDF%20for%20rtf.pdf">
    Test PDF
  </a>
  looked up one of the more obscure Latin words, consectetur
</p>

You can also apply this pattern with the replacement set to "" (the empty string).
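For readers not using Perl, the same substitution can be sketched in Python. Python is an assumption here, since the question never names a language, and `drop_empty_duplicate_href` is a hypothetical helper name, not something from the answer. The pattern itself is copied unchanged.

```python
import re

# Pattern from the answer: the lookahead requires a second "href" later on
# the same line; [^"]+"" then consumes `href=` followed by an empty value.
EMPTY_DUP_HREF = re.compile(r'(?=href.+?href)[^"]+""')


def drop_empty_duplicate_href(html: str) -> str:
    """Remove an empty href="" only when another href follows on the same line."""
    return EMPTY_DUP_HREF.sub("", html)


# Two tags taken from the question's sample input.
line_dup = ('<a title="Test PDF for RTF" href="" '
            'title="Test PDF for RTF" href="Test%20PDF%20for%20rtf.pdf">')
line_single = '<a title="Test Page" href="" >'

print(drop_empty_duplicate_href(line_dup))     # empty href dropped, real one kept
print(drop_empty_duplicate_href(line_single))  # unchanged: it is the only href
```

Because `.` does not match a newline by default, the two hrefs have to sit on the same line, as they do in the sample. The removal also leaves the surrounding spaces in place, so the tag ends up with a double space where the attribute used to be.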

Answer 1 (score: 0)

A possible solution (removes duplicates and empty href attributes):

 (\w+=".*?")(?=[^>]+\1)|href="" //replace with nothing

It assumes that as long as no > has been seen yet we are still inside the same tag, which may be naive but is probably safe enough.
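A small Python sketch of how this pattern behaves on the question's sample tags. Python and the helper name `strip_duplicates_and_empty_href` are assumptions for illustration only; the regex is the one given above.

```python
import re

# First alternative: an attribute (name="value") that appears again, verbatim,
# before the next ">". Second alternative: any empty href="".
DUP_OR_EMPTY = re.compile(r'(\w+=".*?")(?=[^>]+\1)|href=""')


def strip_duplicates_and_empty_href(html: str) -> str:
    """Replace every match with nothing, as the answer suggests."""
    return DUP_OR_EMPTY.sub("", html)


tag_dup = ('<a title="Test PDF for RTF" href="" '
           'title="Test PDF for RTF" href="Test%20PDF%20for%20rtf.pdf">')
tag_title_dup = '<a title="Learn More" href="test.html" title="Learn More" >'
tag_single = '<a title="Test Page" href="" >'

for tag in (tag_dup, tag_title_dup, tag_single):
    print(strip_duplicates_and_empty_href(tag))
```

Note that the second alternative removes an empty href even when it is the tag's only href, so the "Test Page" link loses its href entirely. Whether that is acceptable depends on how strictly the "keep one href" requirement is read.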

Answer 2 (score: 0)

If a tag has only an empty href, it is left without any href: /\s?href=\"\/?\"/ will match all empty hrefs.

You didn't specify which language you're using the regex in, so it may need a little tweaking.
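As one example of such tweaking, here is a sketch of the same pattern in Python, where the slash delimiters and the backslash escapes on the quotes are not needed. Python and the helper name `strip_empty_hrefs` are assumptions, not part of the answer.

```python
import re

# Python form of /\s?href=\"\/?\"/ : an optional leading whitespace character,
# then href="" or href="/".
EMPTY_HREF = re.compile(r'\s?href="/?"')


def strip_empty_hrefs(html: str) -> str:
    """Remove every empty (or bare "/") href, duplicate or not."""
    return EMPTY_HREF.sub("", html)


tag_dup = ('<a title="Test PDF for RTF" href="" '
           'title="Test PDF for RTF" href="Test%20PDF%20for%20rtf.pdf">')
tag_single = '<a title="Test Page" href="" >'

print(strip_empty_hrefs(tag_dup))
print(strip_empty_hrefs(tag_single))  # the only href is removed as well
```

Unlike the lookahead version in answer 0, this one does not check for a second href, which is exactly the caveat the answer opens with.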