
Question

I'm looking for a Ruby gem (preferably) that splits domain names into the words they are made of.

whatwomenwant.com => 3 words, "what", "women", "want".

It would be great if it could also ignore things like numbers and gibberish.

Answer 1

You'll need a word list, such as those produced by Project Gutenberg or those available in the source for ispell &c. You can then use the following code to break a domain into words:

WORD_LIST = [
  'experts',
  'expert',
  'exchange',
  'sex',
  'change',
]

def words_that_phrase_begins_with(phrase)
  WORD_LIST.find_all do |word|
    phrase.start_with?(word)
  end
end

def phrase_to_words(phrase, words = [], word_list = [])
  if phrase.empty?
    word_list << words
  else
    words_that_phrase_begins_with(phrase).each do |word|
      remainder = phrase[word.size..-1]
      phrase_to_words(remainder, words + [word], word_list)
    end
  end
  word_list
end

p phrase_to_words('expertsexchange')
# => [["experts", "exchange"], ["expert", "sex", "change"]]

If it is given a phrase that contains any unrecognized word, it returns an empty array:

p phrase_to_words('expertsfoo')
# => []
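
The question also asks about ignoring numbers and gibberish. One possible way to get there, sketched here under my own assumptions (`WORDS` and `best_split` are hypothetical names, not part of the code above), is to let the splitter skip a character it cannot match and keep the split that skips the fewest characters:

```ruby
# Hypothetical sketch: split a phrase into known words, skipping characters
# (digits, gibberish) that cannot be matched by any known word.
WORDS = %w[what women want expert experts exchange].freeze

# Returns [skipped_count, words] for the best split of phrase starting at
# index `from`, where "best" means the fewest skipped characters.
# Results are memoized per start index so each position is solved once.
def best_split(phrase, from = 0, memo = {})
  return [0, []] if from >= phrase.size
  return memo[from] if memo.key?(from)

  # Option 1: treat the current character as noise and skip it.
  skipped, rest = best_split(phrase, from + 1, memo)
  best = [skipped + 1, rest]

  # Option 2: consume any known word that starts at this position.
  WORDS.each do |word|
    next unless phrase[from, word.size] == word
    skipped, rest = best_split(phrase, from + word.size, memo)
    candidate = [skipped, [word] + rest]
    best = candidate if candidate[0] < best[0]
  end

  memo[from] = best
end

p best_split('what2women4want').last
# => ["what", "women", "want"]
p best_split('xqwhatwomenwant99').last
# => ["what", "women", "want"]
```

Unlike phrase_to_words, this sketch returns a single split rather than every possible one, so it trades completeness for tolerance of noise.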

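This will be slow if the word list is long. You can speed the algorithm up by preprocessing the word list into a tree. The preprocessing itself takes time, so whether it is worth it depends on how many domains you want to test.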

Here is some code that turns the word list into a tree:

def add_word_to_tree(tree, word)
  first_letter = word[0..0].to_sym
  remainder = word[1..-1]
  tree[first_letter] ||= {}
  if remainder.empty?
    tree[first_letter][:word] = true
  else
    add_word_to_tree(tree[first_letter], remainder)
  end
end

def make_word_tree
  root = {}
  WORD_LIST.each do |word|
    add_word_to_tree(root, word)
  end
  root
end

def word_tree
  @word_tree ||= make_word_tree
end

This produces a tree that looks like this:

{:c=>{:h=>{:a=>{:n=>{:g=>{:e=>{:word=>true}}}}}}, :s=>{:e=>{:x=>{:word=>true}}}, :e=>{:x=>{:c=>{:h=>{:a=>{:n=>{:g=>{:e=>{:word=>true}}}}}}, :p=>{:e=>{:r=>{:t=>{:word=>true, :s=>{:word=>true}}}}}}}}

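It looks like Lisp, doesn't it? Each node in the tree is a hash. Each hash key is either a letter, whose value is another node, or the symbol :word, whose value is true. A node that contains :word marks the end of a word.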

Modifying words_that_phrase_begins_with to use the new tree structure makes it faster:

def words_that_phrase_begins_with(phrase)
  node = word_tree
  words = []
  phrase.each_char.with_index do |c, i|
    node = node[c.to_sym]
    break if node.nil?
    words << phrase[0..i] if node[:word]
  end
  words
end
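
To see what the tree lookup returns, here is a compact, self-contained restatement of the idea. It is only a sketch: `DEMO_WORDS`, `build_tree`, `DEMO_TREE` and `prefixes` are names I made up so the snippet runs on its own, and the tree is built iteratively rather than recursively.

```ruby
# Hypothetical stand-alone demo of the tree lookup described above.
DEMO_WORDS = %w[experts expert exchange sex change].freeze

# Build a nested-hash tree: one level per letter, :word => true marks
# the node where a complete word ends.
def build_tree(words)
  words.each_with_object({}) do |word, root|
    node = word.each_char.inject(root) { |n, c| n[c.to_sym] ||= {} }
    node[:word] = true
  end
end

DEMO_TREE = build_tree(DEMO_WORDS)

# Walk the tree one character at a time and collect every known word
# that the phrase starts with.
def prefixes(phrase, tree = DEMO_TREE)
  node = tree
  found = []
  phrase.each_char.with_index do |c, i|
    node = node[c.to_sym]
    break if node.nil?
    found << phrase[0..i] if node[:word]
  end
  found
end

p prefixes('expertsexchange')
# => ["expert", "experts"]
p prefixes('sexchange')
# => ["sex"]
p prefixes('foo')
# => []
```

Note that the tree walk yields shorter words first, whereas the linear scan yields them in word-list order, so phrase_to_words may return the same splits in a different order.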

Answer 2

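I don't know of a gem for this, but if I had to solve this problem, I would download a dictionary of English words and read up on text-search algorithms.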

When there is more than one way to split the letters (as in sepp2k's expertsexchange), two hints can help you choose:

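1. Sort your dictionary by, for example, word popularity. A split made up of the most popular words is then the more likely one.
2. Visit the home page of the site at that domain and search its content for your candidate words. I don't think you'll find "sex" on some experts' page. But... hmm... experts can be so different ;)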
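
The first hint could be sketched roughly as follows. The frequencies are made up purely for illustration, and `WORD_FREQUENCY`, `split_score` and `rank_splits` are hypothetical names; real numbers would come from whatever frequency-sorted dictionary you use. The idea is to score each candidate split by how common its words are and prefer the highest score:

```ruby
# Made-up frequencies (occurrences per million words), for illustration only.
WORD_FREQUENCY = {
  'experts'  => 40.0,
  'expert'   => 60.0,
  'exchange' => 80.0,
  'sex'      => 90.0,
  'change'   => 300.0
}.freeze

TOTAL_WORDS = 1_000_000.0

# Score a split as the sum of the log-probabilities of its words.
# This equals the log of the product of the individual probabilities.
def split_score(words)
  words.sum { |word| Math.log(WORD_FREQUENCY.fetch(word) / TOTAL_WORDS) }
end

# Order candidate splits from most to least likely.
def rank_splits(splits)
  splits.sort_by { |words| -split_score(words) }
end

candidates = [%w[expert sex change], %w[experts exchange]]
p rank_splits(candidates).first
# => ["experts", "exchange"]
```

Because every extra word multiplies in another small probability, this kind of score tends to favor splits with fewer, longer words, which is usually the reading a human would choose.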

Answer 3

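Update

I've been working on this challenge and came up with the following code. Please refactor it if I'm doing something wrong :-)

Benchmark:

Run time: 11 seconds
f file: 13,000 lines of domain names
w file: 2,000 words (to check against)

Code: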

# Read both files; chomp strips the trailing newlines, which would
# otherwise keep words.include? from ever matching
lines = File.open('resource/domainlist.txt', 'r') { |f| f.readlines.map(&:chomp) }
words = File.open('resource/commonwords.txt', 'r') { |w| w.readlines.map(&:chomp) }

results = {}

lines.each do |line|
  # Only get the .com domains
  next unless line =~ /^.*,[a-z]+\.com.*$/i
  # Strip the .com off the domain
  domain = line.sub(/^.*,([a-z]+)\.com.*$/i, '\\1')
  # Only handle domain names longer than 3 and shorter than 15 characters
  next unless domain.size > 3 && domain.size < 15
  # Start with words from 2 letters on, so ignoring 1 letter words like 'a'
  (2..domain.size).each do |word_size|
    # Slide a window of word_size characters along the domain
    (0..domain.size - word_size).each do |i|
      word = domain[i, word_size]
      # Check the word against our list
      next unless words.include?(word)
      # Add all the found words to the hash
      (results[domain] ||= []) << word
    end
  end
end
p results

Extracting the individual existing words from a domain name
