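Splitting text into sentences

Date: 2017-11-22 11:15:09

Tags: c#

I've run into a problem while trying to parse text into sentences. Everything works when the text is formatted like this (random text):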

  

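Much did had call new drew that kept. Limits expect wonder law she. Now has you views woman noisy match money rooms.

The program parses the text into 3 sentences.

But as soon as there is a line break in the middle of a sentence, my program splits the text incorrectly.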

  

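Much did had call new drew that kept. Limits expect (new line here) wonder law she. Now has you views woman noisy match money rooms.

The program parses the text into 4 sentences.

My code: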

public static void ReadData()
{
    char[] sentenceSeparators = {'.', '!', '?'};

    using (StreamReader reader = new StreamReader(dataFile))
    {
        string line = null;

        while (null != (line = reader.ReadLine()))
        {
            var split = line.Split(sentenceSeparators, StringSplitOptions.RemoveEmptyEntries);

            foreach (var i in split)
            {
                Console.WriteLine(i);
            }
        }
    }
}

Input #1:

Much did had call new drew that kept. Limits expect wonder law she.
Now has you views woman noisy match money rooms.

Output #1:

Much did had call new drew that kept
Limits expect wonder law she
Now has you views woman noisy match money rooms

Input #2:

 Much did had call new drew that kept. Limits expect 
 wonder law she.
 Now has you views woman noisy match money rooms.

Output #2:

Much did had call new drew that kept
Limits expect
wonder law she
Now has you views woman noisy match money rooms

2 answers:

Answer 0 (score: 1)

That's because you are using ReadLine, so every line break ends the string you are splitting. Use ReadToEnd instead:

public static void ReadData()
{
    char[] sentenceSeparators = {'.', '!', '?'};

    using (StreamReader reader = new StreamReader(dataFile))
    {
        string line = reader.ReadToEnd();

        var split = line.Split(sentenceSeparators, StringSplitOptions.RemoveEmptyEntries);

        foreach (var i in split)
        {
            Console.WriteLine(i);
        }
    }
}
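One thing to keep in mind: reading the whole text at once leaves the line break inside the second sentence, and each fragment keeps its leading space. A minimal sketch of one way to tidy that up, assuming you want every sentence printed on a single line (SentenceSplitter and SplitSentences are hypothetical names, not part of the original code):

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class SentenceSplitter
{
    private static readonly char[] SentenceSeparators = { '.', '!', '?' };

    // Hypothetical helper: splits the whole text on sentence punctuation,
    // then collapses every run of whitespace (line breaks included) into
    // a single space and trims the ends of each fragment.
    public static string[] SplitSentences(string text)
    {
        return text
            .Split(SentenceSeparators, StringSplitOptions.RemoveEmptyEntries)
            .Select(s => Regex.Replace(s, @"\s+", " ").Trim())
            .Where(s => s.Length > 0) // drop fragments that were only whitespace
            .ToArray();
    }
}
```

It could be called as SentenceSplitter.SplitSentences(reader.ReadToEnd()), followed by the same foreach loop as above.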

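Answer 1 (score: 1)

As already mentioned, if you don't want \n to affect your split, don't read the file line by line. Here is a version that does the job in one line: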

string[] split = File.ReadAllText(dataFile).Split(sentenceSeparators, StringSplitOptions.RemoveEmptyEntries);

Also: the console output is misleading. It shows the "bad" sentence on 2 lines, but in the split array that sentence occupies a single element!

Console.WriteLine(split.Length); // displays 3, provided nothing (such as a trailing newline) follows the final period
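To see this for yourself, here is a small sketch that makes the line breaks visible instead of letting the console act on them. SplitInspector and ShowLineBreaks are hypothetical names used only for illustration; the helper takes the text as a string so it can be tried without a file:

```csharp
using System;
using System.Linq;

public static class SplitInspector
{
    private static readonly char[] SentenceSeparators = { '.', '!', '?' };

    // Hypothetical helper: performs the same split as above, then replaces
    // each real line break with the two visible characters \r or \n, so
    // every array element prints on exactly one console line.
    public static string[] ShowLineBreaks(string text)
    {
        return text
            .Split(sentenceSeparators: SentenceSeparators)
            .ToArray();
    }

    private static string[] Split(this string text, char[] sentenceSeparators)
    {
        return text
            .Split(sentenceSeparators, StringSplitOptions.RemoveEmptyEntries)
            .Select(s => s.Replace("\r", "\\r").Replace("\n", "\\n"))
            .ToArray();
    }
}
```

Printing the result of SplitInspector.ShowLineBreaks(File.ReadAllText(dataFile)) shows one element per line. Note that RemoveEmptyEntries only drops zero-length strings, so a newline after the final period would show up as one extra element containing just the line break.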