How to find the 10 most common words in a text

Time: 2016-11-21 06:26:30

Tags: c# text

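I have arbitrary text in a txt file and I need to find the 10 most common words. How should I do this? I think I should read the text sentence by sentence (from period to period) and put the sentences into an array, but I don't know how to do that.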

4 answers:

Answer 0 (score: 9)

You can do this with LINQ. Try something like this:

var words = "two one three one three one";
var orderedWords = words
  .Split(' ')
  .GroupBy(x => x)
  .Select(x => new { 
    KeyField = x.Key, 
    Count = x.Count() })
  .OrderByDescending(x => x.Count)
  .Take(10);
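
To run the query above against a file and print the result, something along these lines might work. This is only a sketch: the TopWordsDemo class and the FormatTopWords and PrintTopWords helpers are names made up for illustration, and splitting on whitespace only means punctuation stays attached to the words.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class TopWordsDemo
{
    // Hypothetical helper: formats the 10 most frequent words of `text`
    // as "word: count" lines, most frequent first.
    public static List<string> FormatTopWords(string text)
    {
        return text
            .Split(new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries)
            .GroupBy(x => x)
            .Select(x => new { KeyField = x.Key, Count = x.Count() })
            .OrderByDescending(x => x.Count)
            .Take(10)
            .Select(x => $"{x.KeyField}: {x.Count}")
            .ToList();
    }

    // Hypothetical helper: reads the whole file at `path` and prints the lines.
    public static void PrintTopWords(string path)
    {
        foreach (var line in FormatTopWords(File.ReadAllText(path)))
            Console.WriteLine(line);
    }
}
```

For the sample string from the answer, FormatTopWords("two one three one three one") gives the lines "one: 3", "three: 2" and "two: 1".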

Answer 1 (score: 2)

Read all the data into a string and split it into an array of words.

Example:

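using System;
using System.Collections.Generic;
using System.Linq;

char[] delimiterChars = { ' ', ',', '.', ':', '\t' };
string text = "one\ttwo three:four,five six seven";

// Drop the empty entries that appear between adjacent delimiters.
string[] words = text.Split(delimiterChars, StringSplitOptions.RemoveEmptyEntries);

// Count how many times each word occurs.
var dict = new Dictionary<string, int>();
foreach (var value in words)
{
    if (dict.ContainsKey(value))
        dict[value]++;
    else
        dict[value] = 1;
}

// A dictionary cannot be indexed by position, so sort the pairs
// by count (highest first) and print the top 10.
foreach (var pair in dict.OrderByDescending(p => p.Value).Take(10))
{
    Console.WriteLine($"{pair.Key}: {pair.Value}");
}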

Note that the entries have to be sorted by count, largest first, before the top ones are printed.

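Answer 2 (score: 1)

The hardest part of this task is the initial splitting of the text into words. In a natural language (e.g. English), the notion of a word is quite complex: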

Forget-me-not     // 1 word (a nice blue flower) 
Do not Forget me! // 4 words
Cannot            // 1 word or shall we split "cannot" into "can" + "not"?
May not           // 2 words
George W. Bush    // Is "W" a word?
W.A.S.P.          // ...If it is, is it equal to "W" in the "W.A.S.P"?
Donald Trump      // Homonyms: name
Spades is a trump // ...and a special suit in a game of cards 
It's an IT; it is // "It" and "IT" are different (IT is an acronym), "It" and "it" are same

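Another problem: you probably want to treat "It" and "it" as the same word, but "IT" as a different one (an acronym). As a first attempt, I suggest something like this: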

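using System.Globalization;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

var top10words = File
  .ReadLines(@"C:\MyFile.txt")
  // Match against the current line; the dash goes last in the
  // character class so it is read as a literal, not a range.
  .SelectMany(line => Regex
    .Matches(line, @"[A-Za-z'-]+")
    .OfType<Match>()
    .Select(match => CultureInfo.InvariantCulture.TextInfo.ToTitleCase(match.Value)))
  .GroupBy(word => word)
  .Select(chunk => new {
     word = chunk.Key,
     count = chunk.Count()})
  .OrderByDescending(item => item.count)
  .ThenBy(item => item.word)
  .Take(10);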

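In my solution I have assumed that:

  • Words can contain only the letters A..Z and a..z, - (dash) and ' (apostrophe)
  • ToTitleCase is used to separate all-caps acronyms from regular words ("It" and "it" are treated as the same word, while "IT" is treated as a different one)
  • In case of a tie (two or more words with the same frequency), the tie is broken alphabetically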
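
The tokenizing and title-casing step can be tried on its own with a small sketch like the one below. The WordNormalizer class and its Tokenize method are hypothetical names used only for this illustration; the pattern and the ToTitleCase call follow the assumptions listed above.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordNormalizer
{
    // Hypothetical helper: extracts runs of letters, dashes and apostrophes,
    // then title-cases each match with the invariant culture.
    public static List<string> Tokenize(string line)
    {
        TextInfo textInfo = CultureInfo.InvariantCulture.TextInfo;
        return Regex
            .Matches(line, @"[A-Za-z'-]+")
            .Cast<Match>()
            .Select(match => textInfo.ToTitleCase(match.Value))
            .ToList();
    }
}
```

Tokenize("it IT It") yields "It", "IT", "It": ToTitleCase leaves words that are entirely uppercase alone, so both spellings of the pronoun end up in one group while the acronym stays separate. One caveat of this pattern is that a dash or apostrophe standing alone also matches, so such tokens may need to be filtered out.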

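Answer 3 (score: 0)

Here is a method I wrote that combines the answers from Aldi Renaldi Gunawan and JanneP. I think the delimiter characters depend on your use case. You can pass 10 for the numWords parameter.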

public static Dictionary<string, int> WordCount(string text, int numWords = int.MaxValue)
{
    var delimiterChars = new char[] { ' ', ',', ':', '\t', '\"', '\r', '{', '}', '[', ']', '=', '/' };
    return text
        .Split(delimiterChars)
        .Where(x => x.Length > 0)
        .Select(x => x.ToLower())
        .GroupBy(x => x)
        .Select(x => new { Word = x.Key, Count = x.Count() })
        .OrderByDescending(x => x.Count)
        .Take(numWords)
        .ToDictionary(x => x.Word, x => x.Count);
}
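
One thing to keep in mind with the method above: the documentation for Dictionary<TKey, TValue> says the order in which items are enumerated is undefined, so the descending order is not guaranteed to survive ToDictionary. If the order matters, a variant like the following sketch might be an option. WordStats and WordCountOrdered are hypothetical names, and passing the delimiters in as a parameter and breaking ties by ordinal string order are assumptions of this sketch, not part of the method above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WordStats
{
    // Hypothetical variant of WordCount: the caller supplies the delimiters,
    // and the result is a list, so the descending order is kept.
    public static List<KeyValuePair<string, int>> WordCountOrdered(
        string text, char[] delimiters, int numWords = int.MaxValue)
    {
        return text
            .Split(delimiters, StringSplitOptions.RemoveEmptyEntries)
            .Select(x => x.ToLowerInvariant())
            .GroupBy(x => x)
            .Select(x => new KeyValuePair<string, int>(x.Key, x.Count()))
            .OrderByDescending(x => x.Value)
            // Break ties by ordinal string order so the output is deterministic.
            .ThenBy(x => x.Key, StringComparer.Ordinal)
            .Take(numWords)
            .ToList();
    }
}
```

For example, WordStats.WordCountOrdered("b a B c a b", new[] { ' ' }, 2) returns the pairs ("b", 3) and ("a", 2), in that order.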