Finding phrases that are used multiple times in a string

Time: 2013-09-05 16:58:57

Tags: c# algorithm text

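Given a text file, it is easy to count word occurrences by using a Dictionary and so identify the most frequently used words. But how do you find commonly used phrases, where a "phrase" is a set of two or more consecutive words?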

For example, here is some sample text:

  

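Except oral wills, every will shall be in writing, but may be handwritten or typewritten. The will shall contain the testator's signature or by some other person in the testator's conscious presence and at the testator's express direction. The will shall be attested and subscribed in the conscious presence of the testator, by two or more competent witnesses, who saw the testator subscribe, or heard the testator acknowledge the testator's signature.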

     

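For purposes of this section, conscious presence means within the range of any of the testator's senses, excluding the sense of sight or sound that is sensed by telephonic, electronic, or other distant communication.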

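How can I determine that the phrases "conscious presence" (3 times) and "testator's signature" (2 times) appear more than once, other than by brute-force searching for every set of two or three words?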

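I'll be writing this in c#, so c# code would be great, but I can't even identify a good algorithm, so I'll settle for any code or even pseudocode that solves this.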

3 answers:

Answer 0: (score: 5)

Give this a try. It is by no means foolproof, but it should get the job done for now.

Yes, this only matches two-word combinations, does not strip punctuation, and is brute force. No, the ToList is not necessary.

string text = "that big long text block";

var splitBySpace = text.Split(' ');

var doubleWords = splitBySpace
    .Select((x, i) => new { Value = x, Index = i })
    .Where(x => x.Index != splitBySpace.Length - 1)
    .Select(x => x.Value + " " + splitBySpace.ElementAt(x.Index + 1)).ToList();

var duplicates = doubleWords
    .GroupBy(x => x)
    .Where(x => x.Count() > 1)
    .Select(x => new { x.Key, Count = x.Count() }).ToList();

I got the following results:

[image: screenshot of the results]


Here is my attempt at getting combinations of more than 2 words. Again, the same caveats as before apply.

List<string> multiWords = new List<string>();

//i is the number of words to combine
//in this case, 2-6 words
for (int i = 2; i <= 6; i++)
{
    multiWords.AddRange(splitBySpace
        .Select((x, index) => new { Value = x, Index = index })
        // Keep only start indices that leave room for a full i-word phrase.
        .Where(x => x.Index <= splitBySpace.Length - i)
        .Select(x => CombineItems(splitBySpace, x.Index, x.Index + i - 1)));
}

var duplicates = multiWords
    .GroupBy(x => x)
    .Where(x => x.Count() > 1)
    .Select(x => new { x.Key, Count = x.Count() }).ToList();

private string CombineItems(IEnumerable<string> source, int startIndex, int endIndex)
{
    return string.Join(" ", source.Where((x, i) => i >= startIndex && i <= endIndex).ToArray());
}

The results this time:

[image: screenshot of the results]

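Now I just want to say that there is a good chance my code contains a bug. I have not tested it thoroughly, so make sure you test it before using it.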
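As a sketch of how the caveats above might be addressed, the following counts phrases of several lengths in one dictionary pass, trimming surrounding punctuation and ignoring case. The class name `PhraseCounter`, the method `CountRepeatedPhrases`, and its `minWords`/`maxWords` parameters are hypothetical names chosen for illustration, and the punctuation list is an assumption rather than a complete one.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PhraseCounter
{
    // Hypothetical helper: counts every run of minWords..maxWords consecutive
    // words and returns only the phrases that occur more than once.
    public static Dictionary<string, int> CountRepeatedPhrases(string text, int minWords, int maxWords)
    {
        // Assumed punctuation set; extend as needed for other input.
        char[] punctuation = { ',', '.', ';', ':', '!', '?', '"', '(', ')' };

        // Split on whitespace, trim surrounding punctuation, lower-case each word.
        string[] words = text
            .Split(new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(w => w.Trim(punctuation).ToLowerInvariant())
            .Where(w => w.Length > 0)
            .ToArray();

        var counts = new Dictionary<string, int>();
        for (int n = minWords; n <= maxWords; n++)
        {
            // i + n <= words.Length keeps every phrase exactly n words long.
            for (int i = 0; i + n <= words.Length; i++)
            {
                string phrase = string.Join(" ", words, i, n);
                counts.TryGetValue(phrase, out int seen);
                counts[phrase] = seen + 1;
            }
        }

        return counts.Where(kv => kv.Value > 1)
                     .ToDictionary(kv => kv.Key, kv => kv.Value);
    }
}
```

This is still brute force over every start position. It also reports overlapping phrases separately (a repeated three-word phrase implies its two-word parts repeat too), so some filtering of the results is likely needed.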

Answer 1: (score: 5)

Thought I'd have a quick go at this. I'm not sure it isn't the brute-force approach you were trying to avoid, but:

using System;
using System.Collections.Generic;
using System.Linq;

static void Main(string[] args)
{
    string txt = @"Except oral wills, every will shall be in writing, 
but may be handwritten or typewritten. The will shall contain the testator's 
signature or by some other person in the testator's conscious presence and at the
testator's express direction . The will shall be attested and subscribed in the
conscious presence of the testator, by two or more competent witnesses, who saw the
testator subscribe, or heard the testator acknowledge the testator's signature.

For purposes of this section, conscious presence means within the range of any of the
testator's senses, excluding the sense of sight or sound that is sensed by telephonic,
electronic, or other distant communication.";

    //split string using common separators - could add more or use regex.
    string[] words = txt.Split(',', '.', ';', ' ', '\n', '\r');

    //trim each string and get rid of any empty ones
    words = words.Select(t => t.Trim()).Where(t => t != string.Empty).ToArray();

    const int MaxPhraseLength = 20;

    Dictionary<string, int> Counts = new Dictionary<string,int>();

    for (int phraseLen = MaxPhraseLength; phraseLen >= 2; phraseLen--)
    {
        //stop early enough that every phrase has exactly phraseLen words
        for (int i = 0; i + phraseLen <= words.Length; i++)
        {
            //get the phrase to match based on phraselen
            string[] phrase = GetPhrase(words, i, phraseLen);
            string sphrase = string.Join(" ", phrase);

            Console.WriteLine("Phrase : {0}", sphrase);

            int index = FindPhraseIndex(words, i+phrase.Length, phrase);

            if (index > -1)
            {
                Console.WriteLine("Phrase : {0} found at {1}", sphrase, index);

                if(!Counts.ContainsKey(sphrase))
                    Counts.Add(sphrase, 1);

                Counts[sphrase]++;
            }
        }
    }

    foreach (var foo in Counts)
    {
        Console.WriteLine("[{0}] - {1}", foo.Key, foo.Value);
    }

    Console.ReadKey();
}

static string[] GetPhrase(string[] words, int startpos, int len)
{
    return words.Skip(startpos).Take(len).ToArray();
}

static int  FindPhraseIndex(string[] words, int startIndex, string[] matchWords)
{
    for (int i = startIndex; i < words.Length; i++)
    {
        int j;

        for(j=0; j<matchWords.Length && (i+j)<words.Length; j++)
            if(matchWords[j]!=words[i+j])
                break;

        if (j == matchWords.Length)
            return i; //index where the match starts, not where the search started
    }

    return -1;
}
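The comment in the code above mentions that a regex could replace the separator list. A minimal sketch of that idea follows, assuming a word is any run of letters, digits, or apostrophes. `Tokenizer` and `Tokenize` are hypothetical names, and the pattern is an assumption that may need adjusting for other input (for example, quoted words keep their apostrophes).

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class Tokenizer
{
    // Hypothetical helper: returns each run of letters, digits, or
    // apostrophes as one word, so "testator's" stays a single token.
    public static string[] Tokenize(string text)
    {
        return Regex.Matches(text, @"[\p{L}\p{Nd}']+")
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }
}
```

The resulting array could be used in place of the `words` array built with `Split` and `Trim`.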

Answer 2: (score: 0)

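If I were doing this, I would probably start with the brute-force approach, but it sounds like you want to avoid that. A two-phase approach could count every word, take the top few results (starting with only the few most frequent words), and then search for and count only the phrases that contain those popular words. That way you don't spend time searching through every phrase.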
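The two-phase idea above could be sketched roughly as follows. `TwoPhasePhraseCounter`, `Count`, and the `topWords` and `phraseLength` parameters are hypothetical names for illustration, and the input is assumed to be an array of already-cleaned words.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TwoPhasePhraseCounter
{
    // Hypothetical helper: topWords and phraseLength are illustrative parameters.
    public static Dictionary<string, int> Count(string[] words, int topWords, int phraseLength)
    {
        // Phase 1: count single words and keep the most frequent ones.
        var popular = new HashSet<string>(
            words.GroupBy(w => w)
                 .OrderByDescending(g => g.Count())
                 .Take(topWords)
                 .Select(g => g.Key));

        // Phase 2: count only phrases containing at least one popular word.
        var counts = new Dictionary<string, int>();
        for (int i = 0; i + phraseLength <= words.Length; i++)
        {
            bool hasPopular = false;
            for (int j = i; j < i + phraseLength; j++)
            {
                if (popular.Contains(words[j])) { hasPopular = true; break; }
            }
            if (!hasPopular) continue;

            string phrase = string.Join(" ", words, i, phraseLength);
            counts.TryGetValue(phrase, out int seen);
            counts[phrase] = seen + 1;
        }

        // Keep only phrases seen more than once.
        return counts.Where(kv => kv.Value > 1)
                     .ToDictionary(kv => kv.Key, kv => kv.Value);
    }
}
```

Note that this sketch still visits every start position, so the saving is mainly in how many phrases are stored. It can also miss repeated phrases made up entirely of less frequent words.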

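I have a feeling the CS people will correct me and say that this would actually take more time than straight brute force. Maybe some linguists have methods for detecting phrases or something.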

Good luck!