Counting word frequency after stemming

Asked: 2012-07-20 22:49:25

Tags: c# c#-4.0

Suppose I have the following string:

"present present present presenting presentation do  do doing " 

I sort the words in the string by frequency, in descending order:

I'm using GroupBy count 
present    3
do         2
doing      1
presenting 1
presentation 1

Then I stem the words:

using array [ , ] or any other structure

present  3
do       2
do       1
present  1
present  1

Finally, I want to recount the stemmed words into a dictionary, so the output should be:

present 5
do      3

Can anyone help? Thanks in advance.

2 answers:

Answer 0 (score: 1)

        //Using a List instead of a Dictionary to allow duplicate keys:
        List<KeyValuePair<string, int>> words = new List<KeyValuePair<string, int>>();

        string text = "present present present presenting presentation do  do doing";
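        //RemoveEmptyEntries skips the empty token produced by the double space:
        var ws = text.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);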

        //Passing the words into the list:
        words = (from w in ws
                 group w by w into wsGroups
                 select new KeyValuePair<string, int>(
                     wsGroups.Key, wsGroups.Count()
                     )
                 ).ToList<KeyValuePair<string, int>>();

        //Ordering by descending count (OrderBy returns a new sequence, so reassign):
        words = words.OrderByDescending(w => w.Value).ToList();

        //Stemming the words (stemword is assumed to be your own stemming function):
        words = (from w in words
                 select new KeyValuePair<string, int>
                     (
                         stemword(w.Key),
                         w.Value
                     )).ToList<KeyValuePair<string, int>>();

        //Summing the counts of words that share a stem and putting them into a Dictionary:
        var wordsRef = (from w in words
                        group w by w.Key into groups
                        select new
                        {
                            count = groups.Sum(g => g.Value),
                            word = groups.Key
                        }).ToDictionary(w => w.word, w => w.count);

Answer 1 (score: 0)

LINQ GroupBy or Aggregate are good ways to compute counts like these.
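As a rough sketch of the GroupBy route: grouping directly on the stem gives the final counts in one pass, with no intermediate per-word list. The names StemmedCounts, Count and ToyStem are made up for this illustration, and ToyStem is only a toy suffix stripper standing in for a real stemmer (such as a Porter implementation), which the question does not show.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class StemmedCounts
{
    // Hypothetical toy stemmer, for illustration only: strips a couple of
    // common suffixes. Replace it with whatever real stemmer you use.
    public static string ToyStem(string word)
    {
        foreach (var suffix in new[] { "ation", "ing" })
        {
            if (word.Length > suffix.Length &&
                word.EndsWith(suffix, StringComparison.Ordinal))
            {
                return word.Substring(0, word.Length - suffix.Length);
            }
        }
        return word;
    }

    // Splits the text on spaces, groups the tokens by their stem,
    // and returns the number of tokens in each group.
    public static Dictionary<string, int> Count(string text, Func<string, string> stem)
    {
        return text
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .GroupBy(stem)
            .ToDictionary(g => g.Key, g => g.Count());
    }
}
```

With the string from the question and this toy stemmer, StemmedCounts.Count(text, StemmedCounts.ToyStem) should map present to 5 and do to 3. A Dictionary does not guarantee any ordering, so sort at the point of use, for example with counts.OrderByDescending(kv => kv.Value).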

If you want to do it by hand... it looks like you want two sets of results, one non-stemmed and one stemmed:

void incrementCount(Dictionary<string, int> counts, string word)
{
  // ContainsKey checks for the key; a word seen for the first time starts at 1.
  if (counts.ContainsKey(word))
  {
    counts[word]++;
  }
  else
  {
    counts.Add(word, 1);
  }
}

var stemmedCount = new Dictionary<string, int>();
var nonStemmedCount = new Dictionary<string, int>();

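// words is the sequence of tokens from the input; Stem is your stemming function (not shown).
foreach (var word in words)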
{
  incrementCount(stemmedCount, Stem(word));
  incrementCount(nonStemmedCount, word);
}
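For reference, here is a self-contained sketch of the same by-hand approach, using TryGetValue so each word needs only one lookup before the write. ManualCounts, Increment and CountBoth are names invented for this example, and the stem parameter is a placeholder for whatever stemming function you actually have.

```csharp
using System;
using System.Collections.Generic;

static class ManualCounts
{
    // Adds 1 to the count for word, starting from 0 if it is not present yet.
    public static void Increment(Dictionary<string, int> counts, string word)
    {
        int current;
        counts.TryGetValue(word, out current); // current is 0 when the key is missing
        counts[word] = current + 1;
    }

    // Fills both dictionaries in a single pass over the words.
    // stem is an assumption: any function that maps a word to its stem.
    public static void CountBoth(
        IEnumerable<string> words,
        Func<string, string> stem,
        Dictionary<string, int> stemmed,
        Dictionary<string, int> nonStemmed)
    {
        foreach (var word in words)
        {
            Increment(stemmed, stem(word));
            Increment(nonStemmed, word);
        }
    }
}
```

Passing the stemmer in as a delegate keeps the counting logic independent of any particular stemming library, which also makes it easy to try with a trivial lambda first.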