Question

If I have a string "blueberrymuffinsareinsanelydelicious", what is the most efficient way to parse it so that I am left with ["blueberry", "muffins", "are", "insanely", "delicious"]?

I already have my wordlist (Mac's /usr/share/dict/words), but how do I make sure the full word is stored in my array, i.e. blueberry rather than the two separate words blue and berry?

Answer 1

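Here is a recursive method that finds the correct sentence in 0.4 s on my slow laptop.

It first imports almost 100K English words and sorts them by decreasing size.
For each word, it checks whether text starts with it.
If it does, it removes word from the start of text, keeps word in an array and calls itself recursively.
If text is empty, a complete sentence has been found.
It uses a lazy array so that it stops at the first sentence found.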

text = "blueberrymuffinsareinsanelydeliciousbecausethey'rereallymoistandcolorful"

dictionary = File.readlines('/usr/share/dict/american-english')
                 .map(&:chomp)
                 .sort_by{ |w| -w.size }

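def find_words(text, possible_words, sentence = [])
  # All of the text has been consumed: sentence is a complete solution.
  return sentence if text.empty?
  # lazy makes select/map evaluate one candidate at a time, so find can
  # stop at the first branch that returns a sentence instead of nil.
  possible_words.lazy.select{ |word|
    text.start_with?(word)
  }.map{ |word|
    find_words(text[word.size..-1], possible_words, sentence + [word])
  }.find(&:itself)
end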

p find_words(text, dictionary)
#=> ["blueberry", "muffins", "are", "insanely", "delicious", "because", "they're", "really", "moist", "and", "colorful"]
p find_words('someword', %w(no way to find a combination))
#=> nil
p find_words('culdesac', %w(culd no way to find a combination cul de sac))
#=> ["cul", "de", "sac"]
p find_words("carrotate", dictionary)
#=> ["carrot", "ate"]

To speed up the lookups, it would be better to use a Trie.
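The Trie idea could be sketched roughly as follows. This is not the answerer's code: the names TrieNode, build_trie, prefix_lengths and segment are hypothetical, the word list is a tiny stand-in for the real dictionary, and the dead_ends cache is an extra assumption added to avoid re-exploring positions that are already known to fail. The point is that one walk down the trie finds every dictionary word starting at a given position, instead of testing all ~100K words with start_with?.

```ruby
# Illustrative sketch only; all names here are hypothetical.
class TrieNode
  attr_reader :children
  attr_accessor :terminal

  def initialize
    @children = {}     # char => TrieNode
    @terminal = false  # true if a dictionary word ends at this node
  end
end

def build_trie(words)
  root = TrieNode.new
  words.each do |word|
    next if word.empty?
    node = root
    word.each_char { |ch| node = (node.children[ch] ||= TrieNode.new) }
    node.terminal = true
  end
  root
end

# Lengths of every dictionary word that starts at text[start], longest first.
def prefix_lengths(root, text, start)
  lengths = []
  node = root
  (start...text.size).each do |i|
    node = node.children[text[i]]
    break unless node
    lengths << (i - start + 1) if node.terminal
  end
  lengths.reverse
end

# Returns the first segmentation found (trying longer words first), or nil.
def segment(text, root, start = 0, dead_ends = {})
  return [] if start == text.size
  return nil if dead_ends[start]

  prefix_lengths(root, text, start).each do |len|
    rest = segment(text, root, start + len, dead_ends)
    return [text[start, len]] + rest if rest
  end
  dead_ends[start] = true  # no segmentation exists from this position
  nil
end

trie = build_trie(%w[blue berry blueberry muffin muffins are insane insanely
                     delicious cul de sac culd carrot car rot ate])

p segment("blueberrymuffinsareinsanelydelicious", trie)
#=> ["blueberry", "muffins", "are", "insanely", "delicious"]
p segment("culdesac", trie)
#=> ["cul", "de", "sac"]
p segment("someword", trie)
#=> nil
```

Like the recursive method above, this tries the longest candidate first and backtracks when a branch cannot consume the whole string, which is why "culdesac" still resolves even though "culd" is in the word list.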

Answer 2

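There are cases where several interpretations are possible and picking the best one can be problematic, but you can always tackle the problem with a fairly naive algorithm: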

WORDS = %w[
  blueberry
  blue
  berry
  fin
  fins
  muffin
  muffins
  are
  insane
  insanely
  in
  delicious
  deli
  us
].sort_by do |word|
  [ -word.length, word ]
end

WORD_REGEXP = Regexp.union(*WORDS)

def best_fit(string)
  string.scan(WORD_REGEXP)
end

This will parse your example:

best_fit("blueberrymuffinsareinsanelydelicious")
# => ["blueberry", "muffins", "are", "insanely", "delicious"]

Note that this skips any components that do not match.
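To make the skipping visible, here is a small sketch. The strict_fit helper and the shortened word list are hypothetical additions, not part of the answer; the helper simply compares the joined matches with the input and returns nil when characters were dropped.

```ruby
# Small illustrative word list, sorted longest-first as in the answer.
words = %w[blueberry blue berry muffins muffin are].sort_by { |w| [-w.length, w] }
word_regexp = Regexp.union(*words)

# Hypothetical helper: like best_fit, but returns nil if any
# characters of the input were skipped by scan.
def strict_fit(string, regexp)
  parts = string.scan(regexp)
  parts.join == string ? parts : nil
end

p "blueberryXYZmuffins".scan(word_regexp)
#=> ["blueberry", "muffins"]
p strict_fit("blueberryXYZmuffins", word_regexp)
#=> nil
p strict_fit("blueberrymuffins", word_regexp)
#=> ["blueberry", "muffins"]
```

Because scan never reconsiders an earlier match, a check like this can also return nil for a string that does have a valid split (for example when a longer word is matched first and leaves an unmatched remainder), so it detects skipped text rather than proving that no segmentation exists.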

Parse a string with no spaces into an array of individual words
