Splitting paragraphs into sentences - a special case

Time: 2015-11-08 09:45:54

Tags: java

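I am new to Java programming. I want to split the paragraphs in a file into sentences and write them to a different file. There should also be a mechanism to determine which sentence came from which paragraph. The code I have used so far is shown below. But this code breaks the first sentence below into the three pieces that follow it: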

Former Secretary of Finance Dr. P.B. Jayasundera is being questioned by the police Financial Crime Investigation Division.

Former Secretary of Finance Dr.
P.B.
Jayasundera is being questioned by the police Financial Crime Investigation Division.

How can I correct this? Thanks in advance.

import java.io.*;

class trial4 {
    public static void main(String args[]) throws IOException {
        FileReader fr = new FileReader("input.txt");
        BufferedReader br = new BufferedReader(fr);
        String s;
        OutputStream out = new FileOutputStream("output10.txt");
        String token[];

        while ((s = br.readLine()) != null) {
            // Split after every '.', '!' or '?' that is followed by a space;
            // this is also what cuts "Dr. P.B. Jayasundera" apart.
            token = s.split("(?<=[.!?])\\s* ");
            for (int i = 0; i < token.length; i++) {
                byte buf[] = token[i].getBytes();
                for (int j = 0; j < buf.length; j = j + 1) {
                    out.write(buf[j]);
                    if (j == buf.length - 1)
                        out.write('\n');   // newline after the last byte of each token
                }
            }
        }
        br.close();
        out.close();
    }
}


I have gone through all the similar questions posted on StackOverflow, but those answers did not help me solve this problem.

2 Answers:

Answer 0: (score: 0)

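How about using a negative lookbehind in combination with a replace? Simply put: replace every sentence ending that does not have a "special case" in front of it with that sentence ending followed by a newline.

A list of "known abbreviations" will be needed. There is no guarantee about how long these can be, or about how short a word at the end of a sentence might be. (See? "be" is already quite short!)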

import java.io.*;

class trial4 {
    public static void main(String args[]) throws IOException {
        // try-with-resources closes (and flushes) both files
        try (BufferedReader br = new BufferedReader(new FileReader("input.txt"));
             PrintStream out = new PrintStream(new FileOutputStream("output10.txt"))) {

            String s = br.readLine();
            while (s != null) {
                out.print(      //print, not println: newlines come only from the replacement
                    s.replaceAll("(?i)"             //Make the match case insensitive
                        + "(?<!"                    //Negative lookbehind
                        +   "(\\W\\w)|"             //Single non-word followed by word character (P.B.)
                        +   "(\\W\\d{1,2})|"        //one or two digits (dates!)
                        +   "(\\W(dr|mr|mrs|ms))"   //List of known abbreviations
                        + ")"                       //End of lookbehind
                        + "([!?\\.])"               //Match end-of-sentence (capture group 5)
                        , "$5"                      //Replace with end-of-sentence found
                            + System.lineSeparator())); //Add newline after it
                s = br.readLine();
            }
        }
    }
}
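
To see the lookbehind idea at work on the sentence from the question without touching any files, here is a minimal sketch. It is a variation on the code above, not the answerer's exact pattern: the class name, the method name splitSentences and the four-entry abbreviation list are assumptions made for illustration, and it uses \b in place of \W so that an abbreviation at the very start of a line is also recognised. It splits on the whitespace after a terminator instead of inserting newlines.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class AbbreviationAwareSplitter {
    // Hypothetical abbreviation list; extend it for your own texts.
    private static final List<String> ABBREVIATIONS = Arrays.asList("dr", "mr", "mrs", "ms");

    // Matches whitespace that follows '.', '!' or '?', unless the terminator
    // ends a single letter (P.B.), a one- or two-digit number, or a known abbreviation.
    private static final Pattern SENTENCE_GAP = Pattern.compile(
            "(?i)"
            + "(?<!\\b\\w[.!?]"                                          // single letter, e.g. "B."
            + "|\\b\\d{1,2}[.!?]"                                        // one or two digits, e.g. "8."
            + "|\\b(?:" + String.join("|", ABBREVIATIONS) + ")[.!?])"    // known abbreviation
            + "(?<=[.!?])"                                               // a terminator must precede
            + "\\s+");                                                   // the gap that is removed

    static List<String> splitSentences(String paragraph) {
        List<String> sentences = new ArrayList<>();
        for (String part : SENTENCE_GAP.split(paragraph.trim())) {
            if (!part.isEmpty()) {
                sentences.add(part);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "Former Secretary of Finance Dr. P.B. Jayasundera is being questioned "
                + "by the police Financial Crime Investigation Division.";
        for (String sentence : splitSentences(text)) {
            System.out.println(sentence);
        }
    }
}
```

The same limits apply as in the answer: a sentence that really ends in a single letter or a short number ("We chose plan B.") is not split, and any abbreviation missing from the list ("Prof.", "Nov.") still causes a break.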

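Answer 1: (score: 0)

As mentioned in the comments, it is reasonably hard to split text into sentences without formalizing the requirements first. Take a look at BreakIterator, especially its SentenceInstance. You could also roll your own BreakIterator, since it does the same job as the regexp, only at a more abstract level. Or try to find a third-party solution such as http://deeplearning4j.org/sentenceiterator.html, which can be trained to tokenize your input.

BreakIterator example: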

import java.text.BreakIterator;
import java.util.Locale;

class BreakIteratorExample {
    public static void main(String[] args) {
        String str = "Former Secretary of Finance Dr. P.B. Jayasundera is being questioned by the police Financial Crime Investigation Division.";

        BreakIterator bilus = BreakIterator.getSentenceInstance(Locale.US);
        bilus.setText(str);

        int last  = bilus.first();
        int count = 0;

        // Walk the boundaries: each [first, last) pair is one sentence
        while (BreakIterator.DONE != last)
        {
            int first = last;
            last = bilus.next();

            if (BreakIterator.DONE != last)
            {
                String sentence = str.substring(first, last);
                System.out.println("Sentence:" + sentence);
                count++;
            }
        }
        System.out.println("" + count + " sentences found.");
    }
}
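
The question also asks for a way to tell which paragraph each sentence came from, which the example above does not cover. A minimal sketch follows, assuming that each non-empty input line is one paragraph (as in the question's readLine loop). The class name, the method name tagSentences and the "paragraph.sentence" label format are hypothetical choices, not part of any library.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class ParagraphSentenceTagger {
    // Returns one entry per sentence, prefixed with "<paragraph>.<sentence>" and a tab.
    static List<String> tagSentences(List<String> paragraphs) {
        List<String> tagged = new ArrayList<>();
        BreakIterator boundaries = BreakIterator.getSentenceInstance(Locale.US);
        int paragraphNo = 0;

        for (String paragraph : paragraphs) {
            if (paragraph.trim().isEmpty()) {
                continue;                       // blank lines are not counted as paragraphs
            }
            paragraphNo++;
            boundaries.setText(paragraph);

            int sentenceNo = 0;
            int start = boundaries.first();
            for (int end = boundaries.next(); end != BreakIterator.DONE;
                     start = end, end = boundaries.next()) {
                String sentence = paragraph.substring(start, end).trim();
                if (!sentence.isEmpty()) {
                    sentenceNo++;
                    tagged.add(paragraphNo + "." + sentenceNo + "\t" + sentence);
                }
            }
        }
        return tagged;
    }

    public static void main(String[] args) {
        List<String> paragraphs = Arrays.asList(
                "Hello world. How are you?",
                "",
                "Fine, thanks.");
        for (String line : tagSentences(paragraphs)) {
            System.out.println(line);
        }
    }
}
```

Each returned line could be written to the output file as is; the label in front of the tab records the source paragraph and the position of the sentence within it. The splitting itself is only as good as BreakIterator's rules, so abbreviations may still need the special handling discussed in the other answer.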