How do I split a paragraph into sentences?

Time: 2014-01-29 11:57:42

Tags: java regex string split text-segmentation

Please take a look at the following.

String[] sentenceHolder = titleAndBodyContainer.split("\n|\\.(?!\\d)|(?<!\\d)\\.");

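This is how I tried to split a paragraph into sentences, but there is a problem. My paragraph includes dates like Jan. 13, 2014, words like U.S and numbers like 2.2. They all got split by the code above. So basically, this code splits on many dots, whether or not the dot is a full stop.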

I also tried String[] sentenceHolder = titleAndBodyContainer.split(".\n"); and String[] sentenceHolder = titleAndBodyContainer.split("\\.");. Both failed.
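The three attempts can be checked side by side. The sketch below is only an illustration: the class name SplitAttempts, the constant names and the sample sentences are made up for it. In the first pattern, \\.(?!\\d) matches any dot that is not followed by a digit and (?<!\\d)\\. matches any dot that is not preceded by one, so only a dot with digits on both sides survives. In ".\n" the dot is unescaped, so it stands for any character before the newline.

```java
import java.util.Arrays;

class SplitAttempts {
    // The three patterns from the question, unchanged.
    static final String LOOKAROUND = "\n|\\.(?!\\d)|(?<!\\d)\\.";
    static final String ANY_CHAR_THEN_NEWLINE = ".\n"; // unescaped '.' matches any character
    static final String EVERY_DOT = "\\.";

    // Hypothetical helper so each pattern can be tried on the same text.
    static String[] split(String text, String regex) {
        return text.split(regex);
    }

    public static void main(String[] args) {
        // Made-up sample sentence containing an abbreviation, a date and a decimal.
        String text = "Visit the U.S on Jan. 13 for 2.2 miles.";

        // Splits inside "U.S" and after "Jan." but leaves "2.2" alone.
        System.out.println(Arrays.toString(split(text, LOOKAROUND)));

        // Splits at every dot, including the one inside "2.2".
        System.out.println(Arrays.toString(split(text, EVERY_DOT)));

        // Removes the character before the newline, even though it is a letter.
        System.out.println(Arrays.toString(split("no dot here\nnext line", ANY_CHAR_THEN_NEWLINE)));
    }
}
```

On this sample the first pattern keeps 2.2 together but still breaks U.S and Jan. 13, which suggests the abbreviations, not the decimals, are what that pattern cannot handle.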

How do I "properly" split a paragraph into sentences?

3 answers:

Answer 0 (score: 14)

You can try this:

String str = "This is how I tried to split a paragraph into a sentence. But, there is a problem. My paragraph includes dates like Jan.13, 2014 , words like U.S and numbers like 2.2. They all got split by the above code.";

Pattern re = Pattern.compile("[^.!?\\s][^.!?]*(?:[.!?](?!['\"]?\\s|$)[^.!?]*)*[.!?]?['\"]?(?=\\s|$)", Pattern.MULTILINE | Pattern.COMMENTS);
Matcher reMatcher = re.matcher(str);
while (reMatcher.find()) {
    System.out.println(reMatcher.group());
}

Output:

This is how I tried to split a paragraph into a sentence.
But, there is a problem.
My paragraph includes dates like Jan.13, 2014 , words like U.S and numbers like 2.2.
They all got split by the above code.
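For reuse, the same pattern can be wrapped in a small method that returns the sentences as a list. This is a minimal sketch: the class name RegexSentenceSplitter and the method sentences are hypothetical, and the pattern and flags are copied from the snippet above without change.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexSentenceSplitter {
    // Same pattern and flags as in the answer above.
    private static final Pattern SENTENCE = Pattern.compile(
            "[^.!?\\s][^.!?]*(?:[.!?](?!['\"]?\\s|$)[^.!?]*)*[.!?]?['\"]?(?=\\s|$)",
            Pattern.MULTILINE | Pattern.COMMENTS);

    // Hypothetical helper: every regex match is collected as one sentence.
    static List<String> sentences(String text) {
        List<String> result = new ArrayList<>();
        Matcher matcher = SENTENCE.matcher(text);
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // A dot with no whitespace after it ("2.2") stays inside the sentence.
        sentences("Hello world. It costs 2.2 dollars.").forEach(System.out::println);

        // A dot followed by whitespace ends the match, so "Jan. 13" is cut in two.
        sentences("Due on Jan. 13.").forEach(System.out::println);
    }
}
```

The pattern treats punctuation followed by whitespace or end of input as a sentence end. That is why the sample above writes Jan.13 without a space: with the space from the question (Jan. 13) the date would still be split after "Jan.".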

Answer 1 (score: 1)

String[] sentenceHolder = titleAndBodyContainer.split("(?i)(?<=[.?!])\\S+(?=[a-z])");

Try this; it worked for me.
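It may help to test this pattern on your own text before relying on it. The sketch below is only an illustration (the class name AnswerOneCheck and the sample strings are made up). The delimiter here is a run of non-whitespace characters that starts right after . ? or ! and is followed by a letter, so ordinary prose with a space after each full stop does not seem to be split at all, while text without that space loses the characters that were matched.

```java
import java.util.Arrays;

class AnswerOneCheck {
    // Pattern copied from the answer above.
    static final String REGEX = "(?i)(?<=[.?!])\\S+(?=[a-z])";

    // Hypothetical helper that applies the pattern with String.split.
    static String[] split(String text) {
        return text.split(REGEX);
    }

    public static void main(String[] args) {
        // A space follows each full stop, so \S+ cannot start there: nothing is split.
        System.out.println(Arrays.toString(split("Hello world. It costs 2.2 dollars.")));

        // No space after the dot: "Tw" is matched as the delimiter and dropped.
        System.out.println(Arrays.toString(split("One.Two three")));
    }
}
```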

Answer 2 (score: 0)

This splits the paragraph at . ? !

String[] a = str.split("\\.|\\?|\\!");

You can add any other symbol after \\ and use | to separate the alternatives.
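As a rough illustration of that pattern and of adding one more alternative, here is a small sketch. The class name PunctuationSplit, the semicolon variant and the sample strings are assumptions made for the example. Note that split drops the punctuation itself, and that a plain \\. alternative also cuts decimals such as 2.2.

```java
import java.util.Arrays;

class PunctuationSplit {
    // Pattern from the answer above.
    static final String ANSWER_REGEX = "\\.|\\?|\\!";
    // Illustrative extension: one more alternative appended with '|'.
    static final String WITH_SEMICOLON = "\\.|\\?|\\!|;";

    // Hypothetical helper so both patterns can be tried on the same text.
    static String[] split(String text, String regex) {
        return text.split(regex);
    }

    public static void main(String[] args) {
        // The delimiters are removed and leading spaces stay on the pieces.
        System.out.println(Arrays.toString(split("Hi there! How are you? Fine.", ANSWER_REGEX)));

        // The dot inside "2.2" is treated like any other dot.
        System.out.println(Arrays.toString(split("It costs 2.2 dollars.", ANSWER_REGEX)));

        // The extended pattern also splits at ';'.
        System.out.println(Arrays.toString(split("a;b.c", WITH_SEMICOLON)));
    }
}
```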