Question

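I'm trying to generate all possible syllable combinations for a given word. How a syllable is identified isn't relevant here; it's generating all the combinations that is giving me trouble. I think this is probably something I could do recursively in a few lines (though any other way is fine), but I can't get it to work. Can anyone help?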

    // how to test a syllable, just for the purpose of this example
    bool IsSyllable(string possibleSyllable) 
    {
        return Regex.IsMatch(possibleSyllable, "^(mis|und|un|der|er|stand)$");
    }

    List<string> BreakIntoSyllables(string word)
    {
       // the code here is what I'm trying to write 
       // if 'word' is "misunderstand" , I'd like this to return
       //  => {"mis","und","er","stand"},{ "mis","un","der","stand"}
       // and for any other combinations to be not included
    }

Answer 1

Try starting with this:

var word = "misunderstand";

Func<string, bool> isSyllable =
    t => Regex.IsMatch(t, "^(mis|und|un|der|er|stand)$");

var query =
    from i in Enumerable.Range(0, word.Length)
    from l in Enumerable.Range(1, word.Length - i)
    let part = word.Substring(i, l)
    where isSyllable(part)
    select part;

That returns:

[Image: misunderstand-results]

Does that at least help you get started?

EDIT: I thought about this a bit more and came up with a couple of queries:

Func<string, IEnumerable<string[]>> splitter = null;
splitter =
    t =>
        from n in Enumerable.Range(1, t.Length - 1)
        let s = t.Substring(0, n)
        let e = t.Substring(n)
        from g in (new [] { new [] { e } }).Concat(splitter(e))
        select (new [] { s }).Concat(g).ToArray();

var query =
    from split in (new [] { new [] { word } }).Concat(splitter(word))
    where split.All(part => isSyllable(part))
    select split;

Now query returns this:

[Image: misunderstanding-results2]

Let me know if that nails it.
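As a minimal sketch, the two queries above can be folded into the method shape from the question. The class name SyllableSplitter, the helper SplitAll and the List<string[]> return type are assumptions made for illustration (the question declares List<string>, which cannot hold several splits), and the code assumes a non-empty word.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class SyllableSplitter
{
    // Example-only syllable test, copied from the question.
    public static bool IsSyllable(string possibleSyllable)
    {
        return Regex.IsMatch(possibleSyllable, "^(mis|und|un|der|er|stand)$");
    }

    // Enumerates every way of cutting t into two or more non-empty pieces.
    // Assumes t is non-empty: Enumerable.Range throws on a negative count.
    static IEnumerable<string[]> SplitAll(string t)
    {
        return
            from n in Enumerable.Range(1, t.Length - 1)
            let s = t.Substring(0, n)
            let e = t.Substring(n)
            from g in (new[] { new[] { e } }).Concat(SplitAll(e))
            select (new[] { s }).Concat(g).ToArray();
    }

    // Keeps only the splits in which every piece passes IsSyllable.
    public static List<string[]> BreakIntoSyllables(string word)
    {
        return (new[] { new[] { word } }).Concat(SplitAll(word))
            .Where(split => split.All(IsSyllable))
            .ToList();
    }

    public static void Main()
    {
        foreach (var split in BreakIntoSyllables("misunderstand"))
        {
            Console.WriteLine(string.Join("-", split));
        }
    }
}
```

Note that this enumerates every one of the 2^(n-1) ways to cut an n-character word before filtering, so it is only practical for short words.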

Answer 2

This type of problem is normally solved with Tries. I'll base my Trie implementation on How to create a trie in c# (but note that I have rewritten it).

var trie = new Trie(new[] { "un", "que", "stio", "na", "ble", "qu", "es", "ti", "onable", "o", "nable" });
//var trie = new Trie(new[] { "u", "n", "q", "u", "e", "s", "t", "i", "o", "n", "a", "b", "l", "e", "un", "qu", "es", "ti", "on", "ab", "le", "nq", "ue", "st", "io", "na", "bl", "unq", "ues", "tio", "nab", "nqu", "est", "ion", "abl", "que", "stio", "nab" });

var word = "unquestionable";

var parts = new List<List<string>>();

Split(word, 0, trie, trie.Root, new List<string>(), parts);

//

public static void Split(string word, int index, Trie trie, TrieNode node, List<string> currentParts, List<List<string>> parts)
{   
    // Found a syllable. We have to split: one way we take that syllable and continue from it (and it's done in this if).
    // Another way we ignore this possible syllable and we continue searching for a longer word (done after the if)
    if (node.IsTerminal)
    {
        // Add the syllable to the current list of syllables
        currentParts.Add(node.Word);

        // "covered" the word with syllables
        if (index == word.Length)
        {
            // Here we make a copy of the parts of the word. This because the currentParts list is a "working" list and is modified every time.
            parts.Add(new List<string>(currentParts));
        }
        else
        {
            // There are remaining letters in the word. We restart the scan for more syllables, restarting from the root.
            Split(word, index, trie, trie.Root, currentParts, parts);
        }

        // Remove the syllable from the current list of syllables
        currentParts.RemoveAt(currentParts.Count - 1);
    }

    // We have covered all the word with letters. No more work to do in this subiteration
    if (index == word.Length)
    {
        return;
    }

    // Here we try to find the edge corresponding to the current character

    TrieNode nextNode;

    if (!node.Edges.TryGetValue(word[index], out nextNode))
    {
        return;
    }

    Split(word, index + 1, trie, nextNode, currentParts, parts);
}

public class Trie
{
    public readonly TrieNode Root = new TrieNode();

    public Trie()
    {
    }

    public Trie(IEnumerable<string> words)
    {
        this.AddRange(words);
    }

    public void Add(string word)
    {
        var currentNode = this.Root;

        foreach (char ch in word)
        {
            TrieNode nextNode;

            if (!currentNode.Edges.TryGetValue(ch, out nextNode))
            {
                nextNode = new TrieNode();
                currentNode.Edges[ch] = nextNode;
            }

            currentNode = nextNode;
        }

        currentNode.Word = word;
    }

    public void AddRange(IEnumerable<string> words)
    {
        foreach (var word in words)
        {
            this.Add(word);
        }
    }
}

public class TrieNode
{
    public readonly Dictionary<char, TrieNode> Edges = new Dictionary<char, TrieNode>();
    public string Word { get; set; }

    public bool IsTerminal
    {
        get
        {
            return this.Word != null;
        }
    }
}

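word is the string you are interested in, and parts will contain the list of lists of possible syllables (it would probably be more correct to make it a List<string[]>, and that is quite easy to do: instead of parts.Add(new List<string>(currentParts)); write parts.Add(currentParts.ToArray()); and change every List<List<string>> to List<string[]>).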
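A condensed sketch of that List<string[]> variant might look like the following. The names ArraySplitDemo and Build are made up for this example, the Trie class is collapsed into a single Build helper, and the two early returns of Split are merged into one condition; the traversal itself is meant to be the same as in the code above.

```csharp
using System;
using System.Collections.Generic;

public class TrieNode
{
    public readonly Dictionary<char, TrieNode> Edges = new Dictionary<char, TrieNode>();
    public string Word; // non-null only on nodes where a syllable ends
}

public static class ArraySplitDemo
{
    // Builds a trie from the syllables (condensed form of Trie.Add).
    public static TrieNode Build(IEnumerable<string> syllables)
    {
        var root = new TrieNode();
        foreach (var syllable in syllables)
        {
            var node = root;
            foreach (char ch in syllable)
            {
                TrieNode next;
                if (!node.Edges.TryGetValue(ch, out next))
                {
                    next = new TrieNode();
                    node.Edges[ch] = next;
                }
                node = next;
            }
            node.Word = syllable;
        }
        return root;
    }

    // Same traversal as Split, but each result is stored as a string[].
    public static void Split(string word, int index, TrieNode root, TrieNode node,
                             List<string> currentParts, List<string[]> parts)
    {
        if (node.Word != null)
        {
            currentParts.Add(node.Word);
            if (index == word.Length)
            {
                parts.Add(currentParts.ToArray()); // snapshot of the working list
            }
            else
            {
                Split(word, index, root, root, currentParts, parts);
            }
            currentParts.RemoveAt(currentParts.Count - 1);
        }

        // Also try to extend the current candidate into a longer syllable.
        TrieNode nextNode;
        if (index < word.Length && node.Edges.TryGetValue(word[index], out nextNode))
        {
            Split(word, index + 1, root, nextNode, currentParts, parts);
        }
    }

    public static void Main()
    {
        var root = Build(new[] { "un", "que", "stio", "na", "ble", "qu", "es", "ti", "onable", "o", "nable" });
        var parts = new List<string[]>();
        Split("unquestionable", 0, root, root, new List<string>(), parts);
        foreach (var p in parts)
        {
            Console.WriteLine(string.Join("-", p));
        }
    }
}
```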

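I'll add a variant of Enigmativity's answer that discards wrong syllables immediately instead of post-filtering them later. If you like it, you should give him a +1, because this variant wouldn't have been possible without his idea. But note that it's still a hack. The "right" solution is to use a Trie :-)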

Func<string, bool> isSyllable = t => Regex.IsMatch(t, "^(un|que|stio|na|ble|qu|es|ti|onable|o|nable)$");

Func<string, IEnumerable<string[]>> splitter = null;
splitter =
    t =>
        (
        from n in Enumerable.Range(1, t.Length - 1)
        let s = t.Substring(0, n)
        where isSyllable(s)
        let e = t.Substring(n)
        let f = splitter(e)
        from g in f
        select (new[] { s }).Concat(g).ToArray()
        )
        .Concat(isSyllable(t) ? new[] { new string[] { t } } : new string[0][]);

var parts = splitter(word).ToList();

Explanation:

        from n in Enumerable.Range(1, t.Length - 1)
        let s = t.Substring(0, n)
        where isSyllable(s)

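We compute every possible leading piece of the word, from length 1 up to the length of the word minus 1, and check whether it is a syllable. Non-syllables are weeded out right away. The full word as a single syllable is checked later.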

        let e = t.Substring(n)
        let f = splitter(e)

We search for the syllables of the remaining part of the string.

        from g in f
        select (new[] { s }).Concat(g).ToArray()

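We chain the syllables we found onto the "current" syllable. Note that we are creating many useless arrays. If we accept an IEnumerable<IEnumerable<string>> as the result, we can drop the ToArray.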
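A sketch of what dropping the ToArray could look like, written as a named method rather than a lambda. LazySplitter is a hypothetical name, and the code assumes a non-empty input string.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class LazySplitter
{
    public static bool IsSyllable(string t)
    {
        return Regex.IsMatch(t, "^(un|que|stio|na|ble|qu|es|ti|onable|o|nable)$");
    }

    // Same recursion as the splitter lambda, but every split stays a lazy
    // sequence instead of being copied into an array. Assumes t is non-empty.
    public static IEnumerable<IEnumerable<string>> Split(string t)
    {
        var longer =
            from n in Enumerable.Range(1, t.Length - 1)
            let s = t.Substring(0, n)
            where IsSyllable(s)
            from g in Split(t.Substring(n))
            select (new[] { s }).Concat(g); // no ToArray here

        // The whole word counts as a split of its own when it is a syllable.
        IEnumerable<IEnumerable<string>> whole = IsSyllable(t)
            ? new[] { new[] { t } }
            : new string[0][];

        return longer.Concat(whole);
    }

    public static void Main()
    {
        foreach (var split in Split("unquestionable"))
        {
            Console.WriteLine(string.Join("-", split));
        }
    }
}
```

The trade-off is that each inner sequence is re-evaluated every time it is enumerated, so callers that read a split more than once may still want to materialize it.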

(We could merge several lines and remove many of the lets, for example

        from g in splitter(t.Substring(n))
        select (new[] { s }).Concat(g).ToArray()

but we won't, for the sake of clarity.)

And we concatenate the "current" syllable with the syllables we found.

        .Concat(isSyllable(t) ? new[] { new string[] { t } } : new string[0][]);

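Here we could restructure the query a little so that it doesn't use this Concat or create empty arrays, but it gets a bit more complex (we could rewrite the whole lambda function as isSyllable(t) ? new[] { new string[] { t } }.Concat(oldLambdaFunction) : oldLambdaFunction).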

Finally, if the whole word is a syllable, we add the whole word as a syllable. Otherwise we Concat an empty array (so in effect no Concat).
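The rewrite hinted at in the parenthesis could be sketched like this. ConditionalSplitter is a made-up name and the local variable longer plays the role of oldLambdaFunction; a non-empty input is assumed.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class ConditionalSplitter
{
    public static bool IsSyllable(string t)
    {
        return Regex.IsMatch(t, "^(un|que|stio|na|ble|qu|es|ti|onable|o|nable)$");
    }

    // Assumes t is non-empty.
    public static IEnumerable<string[]> Split(string t)
    {
        // The part that used to be the whole lambda ("oldLambdaFunction").
        IEnumerable<string[]> longer =
            from n in Enumerable.Range(1, t.Length - 1)
            let s = t.Substring(0, n)
            where IsSyllable(s)
            from g in Split(t.Substring(n))
            select (new[] { s }).Concat(g).ToArray();

        // Prepend the whole word only when it is a syllable; no empty array is created.
        return IsSyllable(t)
            ? new[] { new[] { t } }.Concat(longer)
            : longer;
    }

    public static void Main()
    {
        foreach (var split in Split("unquestionable"))
        {
            Console.WriteLine(string.Join("-", split));
        }
    }
}
```

One visible difference from the Concat version: the whole-word split, when there is one, now comes first in the results rather than last.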

Answer 3

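To be honest, you may have a problem scaling this. I'm not sure how big your dataset is, but a solution based on a simple 'is this a syllable?' test needs to call your 'syllable detection' routine roughly O(n * n) times for each word, where n = the number of characters in the word (if that means nothing to you, it means it is likely to be slow for large datasets!). This doesn't take into account the scalability of the detection algorithm itself, which might also slow down as you add more syllables.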

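I know you said the process for identifying what is or isn't a syllable isn't relevant, but suppose you could change it to work more like autocomplete, i.e. you pass in the beginning of a syllable and it tells you all the syllables that are possible from that point. That would be much more scalable. If performance gets out of hand, have a look at replacing it with a trie.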
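One way to picture the autocomplete-style lookup is the sketch below. PrefixSplitter and SyllablesAt are hypothetical names, a HashSet stands in for whatever detection routine (or trie) is actually used, and the per-position cache is an addition of this sketch rather than something the answers above describe.

```csharp
using System;
using System.Collections.Generic;

public static class PrefixSplitter
{
    // Hypothetical "autocomplete" lookup: every known syllable that starts
    // at position start of word. A trie could answer this without
    // building each candidate substring.
    public static IEnumerable<string> SyllablesAt(HashSet<string> syllables, string word, int start)
    {
        for (int length = 1; start + length <= word.Length; length++)
        {
            string candidate = word.Substring(start, length);
            if (syllables.Contains(candidate))
            {
                yield return candidate;
            }
        }
    }

    // All ways to cover word[start..] with syllables.
    // memo caches the answer per start index so each position is expanded once.
    public static List<List<string>> Split(HashSet<string> syllables, string word, int start,
                                           Dictionary<int, List<List<string>>> memo)
    {
        List<List<string>> cached;
        if (memo.TryGetValue(start, out cached))
        {
            return cached;
        }

        var result = new List<List<string>>();
        if (start == word.Length)
        {
            result.Add(new List<string>()); // nothing left to cover: one empty split
        }
        else
        {
            foreach (var syllable in SyllablesAt(syllables, word, start))
            {
                foreach (var rest in Split(syllables, word, start + syllable.Length, memo))
                {
                    var split = new List<string> { syllable };
                    split.AddRange(rest);
                    result.Add(split);
                }
            }
        }

        memo[start] = result;
        return result;
    }

    public static void Main()
    {
        var syllables = new HashSet<string> { "mis", "und", "un", "der", "er", "stand" };
        var splits = Split(syllables, "misunderstand", 0, new Dictionary<int, List<List<string>>>());
        foreach (var split in splits)
        {
            Console.WriteLine(string.Join("-", split));
        }
    }
}
```

With the cache, the lookup runs once per start position, which keeps the number of syllable tests at roughly n * n for a word of n characters; the number of splits returned can still grow quickly for permissive syllable sets.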

Generating combinations of substrings from a string

3 answers: