Extracting text between two tags using regex

Time: 2014-09-25 21:35:45

Tags: java xml regex perl parsing

I have a text file in this format:

   <seg id="1"> They are the same thing. Let's shoot them both. </seg>
   <seg id="1"> We can't wait for you to move back either. </seg>
   <seg id="2"> You seem quite uptight. </seg>
   <seg id="3"> Does your wife (who is also your sister) not give it up any more? </seg>
   <seg id="1"> Can domestic violence abusers be rehabilitated? http://usat.ly/1rwvgWf </seg>
   <seg id="1"> Taulia enables Fortune 500 businesses to electronically handle </seg>
   <seg id="2"> all invoicing and payment to their suppliers </seg>

I would like to get the contents of the tags in the following format:

   They are the same thing. Let's shoot them both.
   We can't wait for you to move back either.You seem quite uptight.Does your wife (who is also your sister) not give it up any more?
   Can domestic violence abusers be rehabilitated? http://usat.ly/1rwvgWf
   Taulia enables Fortune 500 businesses to electronically handle all invoicing and payment to their suppliers

As you can see, the contents of seg id="1", seg id="2", and seg id="3" are printed on the same line because they belong to one post. Likewise, the final seg id="1" and seg id="2" are printed on the same line.

I was thinking of using Java and regex, but I would like to know whether there are other ways to get what I need.

5 answers:

Answer 0 (score: 2)

For each line (held in the variable line):

line = line.replaceAll("<.*?>(.*?)</.*?>", "$1");
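  • <.*?> matches the opening tag
  • </.*?> matches the closing tag
  • (.*?) captures everything between them as group 1
  • $1 replaces the whole matched expression with that group.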
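
As a quick check of the steps above, here is a minimal sketch that applies the same replaceAll call to one sample line. The class name StripTagsDemo, the method name stripTags, and the final trim() call are my own additions for illustration and are not part of the answer.

```java
final class StripTagsDemo {
    // Replaces each "<tag ...>text</tag>" pair with the text captured by group 1,
    // then trims the surrounding whitespace (the trim is an added assumption).
    static String stripTags(String line) {
        return line.replaceAll("<.*?>(.*?)</.*?>", "$1").trim();
    }

    public static void main(String[] args) {
        String line = "   <seg id=\"1\"> You seem quite uptight. </seg>";
        System.out.println(stripTags(line));
    }
}
```

Note that the pattern accepts any tag name, so it would also strip elements other than seg if they appeared on the line.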

Answer 1 (score: 2)

If you match with the following pattern, the result will be in capture group 1:

/<seg\b[^>]*>(.*?)<\/seg>/g

Demo
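
The pattern above is written with /.../g delimiters, which Java does not use. A rough Java equivalent, assuming each seg element sits on a single line (by default . does not match line breaks), might look like the sketch below. The class and method names are hypothetical, and the trim() call is an addition.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SegMatcherDemo {
    // Same pattern as in the answer, without the /.../g delimiters.
    private static final Pattern SEG = Pattern.compile("<seg\\b[^>]*>(.*?)</seg>");

    // Collects the trimmed content of capture group 1 for every seg element.
    static List<String> extract(String input) {
        List<String> texts = new ArrayList<>();
        Matcher m = SEG.matcher(input);
        while (m.find()) { // calling find() in a loop plays the role of the g flag
            texts.add(m.group(1).trim());
        }
        return texts;
    }

    public static void main(String[] args) {
        String input = "<seg id=\"1\"> all invoicing </seg>\n<seg id=\"2\"> and payment </seg>";
        extract(input).forEach(System.out::println);
    }
}
```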

Answer 2 (score: 1)

Try the following:

String input = "   <seg id=\"1\"> They are the same thing. Let's shoot them both. </seg>\n   <seg id=\"1\"> We can't wait for you to move back either. </seg>\n   <seg id=\"2\"> You seem quite uptight. </seg>\n   <seg id=\"3\"> Does your wife (who is also your sister) not give it up any more? </seg>\n   <seg id=\"1\"> Can domestic violence abusers be rehabilitated? http://usat.ly/1rwvgWf </seg>";

String[] array = input.replaceAll("\\s*<seg[^>]+>", "").split("</seg>");

If you read the file line by line, the best option is:

line = line.replaceAll("</?seg[^>]*>", "");

If you also want to remove leading and trailing whitespace:

line = line.replaceAll("\\s*</?seg[^>]*>\\s*", "");
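
None of the snippets above joins the segments of one post onto a single line, which is what the question asks for. Here is one possible sketch of that step. It assumes that a segment with id="1" always starts a new post and that segments are joined with a single space. Both are my reading of the sample data, not something the question states. The class name PostJoiner and the pattern with two capture groups are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PostJoiner {
    // Hypothetical pattern: group 1 is the id attribute, group 2 the segment text.
    private static final Pattern SEG =
            Pattern.compile("<seg\\s+id=\"(\\d+)\"\\s*>(.*?)</seg>");

    static List<String> joinPosts(List<String> lines) {
        List<String> posts = new ArrayList<>();
        StringBuilder current = null;
        for (String line : lines) {
            Matcher m = SEG.matcher(line);
            if (!m.find()) {
                continue; // skip lines that contain no seg element
            }
            String text = m.group(2).trim();
            if (current == null || m.group(1).equals("1")) {
                // Assumption: id="1" marks the first segment of a new post.
                if (current != null) {
                    posts.add(current.toString());
                }
                current = new StringBuilder(text);
            } else {
                // Later ids continue the current post on the same line.
                current.append(' ').append(text);
            }
        }
        if (current != null) {
            posts.add(current.toString()); // flush the last post
        }
        return posts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "   <seg id=\"1\"> Taulia enables Fortune 500 businesses to electronically handle </seg>",
                "   <seg id=\"2\"> all invoicing and payment to their suppliers </seg>");
        joinPosts(lines).forEach(System.out::println);
    }
}
```

To run this against a real file, the lines could be loaded with java.nio.file.Files.readAllLines and passed to joinPosts.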

Answer 3 (score: 0)

You could try this; it should help.

use strict;
use warnings;
my $string = qq(<seg id="1"> They are the same thing. Let's shoot them both. </seg>
   <seg id="1"> We can't wait for you to move back either. </seg>
   <seg id="2"> You seem quite uptight. </seg>
   <seg id="3"> Does your wife (who is also your sister) not give it up any more? </seg>
   <seg id="1"> Can domestic violence abusers be rehabilitated? http://usat.ly/1rwvgWf </seg>
   <seg id="1"> Taulia enables Fortune 500 businesses to electronically handle </seg>
   <seg id="2"> all invoicing and payment to their suppliers </seg>);
$string =~ s{<seg(?: [^>]+)?>((?:(?!</?seg[ >]).)*)</seg>}{$1}ig;
print $string;exit;

Answer 4 (score: 0)

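Don't even try. XML is not a regular language [in the technical sense of the term], so regular expressions are the wrong tool for the job. See the famous post here: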

RegEx match open tags except XHTML self-contained tags

Use an XML parser.
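
For illustration, a minimal sketch of the parser route using the DOM API that ships with the JDK. The sample file has no single root element, so the sketch wraps the content in a hypothetical <root> element before parsing. It also assumes the text contains no unescaped & or < characters. The class name SegDomDemo and the "id: text" output format are my own choices.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class SegDomDemo {
    // Returns one "id: text" entry per seg element found in the fragment.
    static List<String> extract(String fragment) {
        // Wrap the fragment so that the parser sees exactly one root element.
        String xml = "<root>" + fragment + "</root>";
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            NodeList segs = doc.getElementsByTagName("seg");
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < segs.getLength(); i++) {
                Element seg = (Element) segs.item(i);
                texts.add(seg.getAttribute("id") + ": " + seg.getTextContent().trim());
            }
            return texts;
        } catch (Exception e) {
            // Parsing fails if the input is not well-formed XML.
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String fragment = "<seg id=\"1\"> You seem quite uptight. </seg>\n"
                + "<seg id=\"2\"> Let's shoot them both. </seg>";
        extract(fragment).forEach(System.out::println);
    }
}
```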
