I'm looking for a Ruby gem (preferably) that will split domain names into their component words.
whatwomenwant.com => 3 words, "what", "women", "want".
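It would be great if it could ignore things like numbers and gibberish.
Answer 0 (score: 3)
You need a word list, such as those produced by Project Gutenberg or those available in the source for ispell and the like. Then you can use the following code to break a domain into words: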
    WORD_LIST = [
      'experts',
      'expert',
      'exchange',
      'sex',
      'change',
    ]

    # Return every word in WORD_LIST that the phrase starts with.
    def words_that_phrase_begins_with(phrase)
      WORD_LIST.find_all do |word|
        phrase.start_with?(word)
      end
    end

    # Recursively collect every way of splitting the phrase into known words.
    def phrase_to_words(phrase, words = [], word_list = [])
      if phrase.empty?
        word_list << words
      else
        words_that_phrase_begins_with(phrase).each do |word|
          remainder = phrase[word.size..-1]
          phrase_to_words(remainder, words + [word], word_list)
        end
      end
      word_list
    end

    p phrase_to_words('expertsexchange')
    # => [["experts", "exchange"], ["expert", "sex", "change"]]
If given a phrase that contains any unrecognized word, it returns an empty array:

    p phrase_to_words('expertsfoo')
    # => []
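Because a single unrecognized fragment empties the whole result, it may help to strip the TLD, digits, and punctuation before splitting. The following is only a minimal sketch: `normalize_domain` is a hypothetical helper, not part of the answer's code, and it assumes the input has no subdomain such as `www.`.

```ruby
# Hypothetical helper (an assumption, not from the answer above):
# reduce a raw domain name to the lowercase letters of its first label.
def normalize_domain(domain)
  # Keep only the part before the first dot, e.g. drop ".com".
  label = domain.downcase.split('.').first.to_s
  # Drop digits, hyphens, and anything else that is not a-z.
  label.gsub(/[^a-z]/, '')
end

p normalize_domain('WhatWomenWant.com')     # => "whatwomenwant"
p normalize_domain('experts-exchange2.com') # => "expertsexchange"
```

The cleaned string can then be passed to `phrase_to_words`. Gibberish made of letters would still produce an empty array, since no word in the list matches it.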
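If the word list is long, this will be slow, because every call to words_that_phrase_begins_with scans the entire list. You can speed the algorithm up by preprocessing the word list into a tree. The preprocessing itself takes time, so whether it is worth it depends on how many domains you want to test.
Here is some code that turns the word list into a tree: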
    # Insert a word into the tree one letter at a time, marking the node
    # reached by the last letter with :word => true.
    def add_word_to_tree(tree, word)
      first_letter = word[0..0].to_sym
      remainder = word[1..-1]
      tree[first_letter] ||= {}
      if remainder.empty?
        tree[first_letter][:word] = true
      else
        add_word_to_tree(tree[first_letter], remainder)
      end
    end

    def make_word_tree
      root = {}
      WORD_LIST.each do |word|
        add_word_to_tree(root, word)
      end
      root
    end

    # Build the tree once and cache it.
    def word_tree
      @word_tree ||= make_word_tree
    end
This produces a tree that looks like this:

    {:c=>{:h=>{:a=>{:n=>{:g=>{:e=>{:word=>true}}}}}},
     :s=>{:e=>{:x=>{:word=>true}}},
     :e=>{:x=>{:c=>{:h=>{:a=>{:n=>{:g=>{:e=>{:word=>true}}}}}},
               :p=>{:e=>{:r=>{:t=>{:word=>true, :s=>{:word=>true}}}}}}}}
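It looks like Lisp, doesn't it? Each node in the tree is a hash. Each hash key is either a letter, whose value is another node, or the symbol :word, whose value is true. A node that has :word marks the end of a word.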
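To make the node format concrete, here is a small sketch that checks whether a string is a complete word in such a tree. `SAMPLE_TREE` and `word_in_tree?` are hypothetical names introduced only for this illustration.

```ruby
# A hand-built tree holding just "sex" and "set", in the node format
# described above (illustrative data, not the answer's word list).
SAMPLE_TREE = {
  :s => { :e => { :x => { :word => true },
                  :t => { :word => true } } }
}

# Hypothetical helper: walk the tree one letter at a time and report
# whether the final node is marked as a word.
def word_in_tree?(tree, word)
  node = tree
  word.each_char do |c|
    node = node[c.to_sym]
    return false if node.nil?   # no branch for this letter
  end
  node[:word] == true
end

p word_in_tree?(SAMPLE_TREE, 'sex')  # => true
p word_in_tree?(SAMPLE_TREE, 'se')   # => false (a prefix, not a word)
p word_in_tree?(SAMPLE_TREE, 'sea')  # => false (no :a branch)
```

Each lookup touches at most one node per letter, regardless of how many words the tree holds.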
Modifying words_that_phrase_begins_with to use the new tree structure makes it faster:
    def words_that_phrase_begins_with(phrase)
      node = word_tree
      words = []
      # Follow the phrase down the tree, recording each prefix that is a word.
      phrase.each_char.with_index do |c, i|
        node = node[c.to_sym]
        break if node.nil?
        words << phrase[0..i] if node[:word]
      end
      words
    end
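One way to gain confidence in the tree version is to check that it finds the same prefixes as the plain list scan. The sketch below is self-contained and uses hypothetical names (`WORDS`, `build_tree`, `prefixes_by_scan`, `prefixes_by_tree`); `build_tree` is an iterative variant that should yield the same node shape as the recursive code above.

```ruby
WORDS = %w[experts expert exchange sex change]

# Build the nested-hash tree iteratively.
def build_tree(words)
  root = {}
  words.each do |word|
    node = root
    word.each_char { |c| node = (node[c.to_sym] ||= {}) }
    node[:word] = true
  end
  root
end

# List version: test every word against the start of the phrase.
def prefixes_by_scan(phrase, words)
  words.select { |w| phrase.start_with?(w) }
end

# Tree version: walk the phrase and stop at the first missing branch.
def prefixes_by_tree(phrase, tree)
  node = tree
  found = []
  phrase.each_char.with_index do |c, i|
    node = node[c.to_sym]
    break if node.nil?
    found << phrase[0..i] if node[:word]
  end
  found
end

WORD_TREE = build_tree(WORDS)

p prefixes_by_scan('expertsexchange', WORDS)     # => ["experts", "expert"]
p prefixes_by_tree('expertsexchange', WORD_TREE) # => ["expert", "experts"]
```

The two versions return the same words in a different order: the scan follows the order of the list, while the tree walk reports shorter prefixes first. That changes only the order of the splits that phrase_to_words produces, not which splits are found.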
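Answer 1 (score: 1)
I don't know of a gem for this, but if I had to solve this problem, I would download an English word dictionary and read up on text search algorithms.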
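As a rough sketch of the first step, a downloaded dictionary with one word per line could be loaded like this. `load_words` is a hypothetical helper, and the file path in the comment is an assumption (it is common on Unix-like systems but not guaranteed to exist).

```ruby
require 'set'
require 'stringio'

# Hypothetical helper: read one word per line from any IO-like object,
# keeping only purely alphabetic entries of at least min_length letters.
def load_words(io, min_length = 2)
  io.each_line
    .map { |line| line.strip.downcase }
    .select { |w| w.size >= min_length && w =~ /\A[a-z]+\z/ }
    .to_set
end

# With a real file (path is an assumption):
#   words = File.open('/usr/share/dict/words') { |f| load_words(f) }

# In-memory stand-in for a dictionary file:
words = load_words(StringIO.new("What\nwomen\nwant\na\nO'Brien\n"))
p words.include?('women')  # => true
p words.size               # => 3
```

A Set is used here so that membership tests stay fast as the dictionary grows.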
When there is more than one way to split the letters (as in sepp2k's expertsexchange example), you can use two hints:
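Answer 2 (score: 1)
I've been working on this challenge and came up with the following code. Please refactor it if I'm doing something wrong :-)
Run time: 11 seconds
f-file: 13,000 lines of domain names
w-file: 2,000 words (to check against)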
    # Read both files, chomping each line so that trailing newlines do not
    # prevent substrings from matching dictionary words.
    f = File.open('resource/domainlist.txt', 'r')
    lines = f.readlines.map(&:chomp)
    f.close
    w = File.open('resource/commonwords.txt', 'r')
    words = w.readlines.map(&:chomp)
    w.close

    results = {}

    lines.each do |line|
      # Only get the .com domains, with the .com stripped off
      name = line[/^.*,([a-z]+)\.com.*$/i, 1]
      next if name.nil?
      # Only keep names longer than 3 and shorter than 15 characters
      next unless name.size > 3 && name.size < 15
      # Start with words from 2 letters on, so ignoring 1 letter words like 'a'
      (2..name.size).each do |word_size|
        # Slide a window of word_size characters along the name
        (0..name.size - word_size).each do |i|
          word = name[i, word_size]
          # Add every substring found in the word list to the hash
          (results[name] ||= []) << word if words.include?(word)
        end
      end
    end

    p results