Question

I have two arrays:

sentences_ary = ['This is foo', 'bob is cool'] 

words_ary = ['foo', 'lol', 'something']

I want to check whether any element of sentences_ary matches any word in words_ary.

I can get the check working for a single word, but not with words_ary.

#This is working
['This is foo', 'bob is cool'].any? { |s| s.match(/foo/)}

But I can't get it working with an array of words used as regexes. This is what I keep ending up with:

# This is not working    
['This is foo', 'bob is cool'].any? { |s| ['foo', 'lol', 'something'].any? { |w| w.match(/s/) } }

I am using this in an if condition.

Answer 1

You can use Regexp.union and Enumerable#grep:

sentences_ary.grep(Regexp.union(words_ary))
#=> ["This is foo"]
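Since the question mentions an if condition, a boolean may be more convenient than the list of matching sentences. A minimal sketch, reusing the arrays from the question (the names pattern and any_match are only illustrative):

```ruby
sentences_ary = ['This is foo', 'bob is cool']
words_ary = ['foo', 'lol', 'something']

# Build a single alternation from the word list; string arguments are escaped.
pattern = Regexp.union(words_ary) # => /foo|lol|something/

# Becomes true as soon as one sentence contains one of the words.
any_match = sentences_ary.any? { |s| s.match?(pattern) }

puts 'found a match' if any_match
```

Note that this matches substrings, so a sentence containing "foolish" would also count as a hit for "foo".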

Answer 2

RegexpTrie improves on this:

require 'regexp_trie'

sentences_ary = ['This is foo', 'This is foolish', 'bob is cool', 'foo bar', 'bar foo']
words_ary = ['foo', 'lol', 'something']

words_regex = /\b(?:#{RegexpTrie.union(words_ary, option: Regexp::IGNORECASE).source})\b/i
# => /\b(?:(?:foo|lol|something))\b/i

sentences_ary.any?{ |s| s[words_regex] } # => true
sentences_ary.find{ |s| s[words_regex] } # => "This is foo"
sentences_ary.select{ |s| s[words_regex] } # => ["This is foo", "foo bar", "bar foo"]

You have to be careful how you construct the regex pattern, or you can get false-positive hits. That can be a hard bug to track down.

sentences_ary = ['This is foo', 'This is foolish', 'bob is cool', 'foo bar', 'bar foo']
words_ary = ['foo', 'lol', 'something']
words_regex = /\b(?:#{ Regexp.union(words_ary).source })\b/ # => /\b(?:foo|lol|something)\b/
sentences_ary.any?{ |s| s[words_regex] } # => true
sentences_ary.find{ |s| s[words_regex] } # => "This is foo"
sentences_ary.select{ |s| s[words_regex] } # => ["This is foo", "foo bar", "bar foo"]

The resulting /\b(?:foo|lol|something)\b/ pattern checks for word boundaries, so it finds whole words and not just substrings.
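To make the false-positive risk concrete, here is a small sketch comparing the bare union with the word-boundary version. The variable names loose and strict are only illustrative:

```ruby
sentences_ary = ['This is foo', 'This is foolish', 'bob is cool']
words_ary = ['foo', 'lol', 'something']

# Bare union: matches the words anywhere, including inside longer words.
loose = Regexp.union(words_ary)

# Same alternation wrapped in word boundaries: whole words only.
strict = /\b(?:#{loose.source})\b/

loose_hits  = sentences_ary.grep(loose)  # => ["This is foo", "This is foolish"]
strict_hits = sentences_ary.grep(strict) # => ["This is foo"]
```

The non-capturing group matters here: without it, the leading \b would apply only to the first alternative and the trailing \b only to the last.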

Also, note the use of source. This is very important, because leaving it out can cause bugs that are very hard to find. Compare these two regexes:

/#{ Regexp.union(words_ary).source }/ # => /foo|lol|something/
/#{ Regexp.union(words_ary) }/        # => /(?-mix:foo|lol|something)/

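Notice how the second one carries the embedded flags (?-mix:...). They set the flags for the enclosed pattern, independently of the surrounding pattern. The inner pattern can then behave differently from the surrounding one, producing results you don't expect.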
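One way this can bite, sketched with the same word list: an outer /i flag does not reach a pattern that was interpolated without source, because the embedded (?-mix:...) group switches case-insensitivity back off. The names with_source and without_source are only illustrative:

```ruby
words_ary = ['foo', 'lol', 'something']
union = Regexp.union(words_ary)

# Interpolating the source string: the outer /i applies to the whole pattern.
with_source = /\b(?:#{union.source})\b/i

# Interpolating the Regexp itself: it is embedded as (?-mix:foo|lol|something),
# which turns the i flag off again inside that group.
without_source = /\b#{union}\b/i

upper_with    = 'This is FOO'.match?(with_source)    # => true
upper_without = 'This is FOO'.match?(without_source) # => false
```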

Even the Regexp.union documentation shows this situation, but it doesn't mention why it can be bad:

Regexp.union(/dogs/, /cats/i)        #=> /(?-mix:dogs)|(?i-mx:cats)/

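Notice that in this case the two patterns have different flags. On our team we use union often, but I'm always careful to check how it's being used during peer reviews. I got bitten by this once, and it was tough to figure out what was wrong, so I'm very sensitive to it. Also, while union accepts patterns, as in the example, I recommend not passing them and instead using an array of words, or the patterns as strings, to keep those pesky flags from sneaking in. There is a time and place for them, but knowing about this lets us control when they get used.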
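A short sketch of the difference between the two kinds of argument. The variable names are illustrative, and the word list is borrowed from the documentation example above:

```ruby
# Strings in: a plain alternation with no embedded flags.
from_strings = Regexp.union(%w[dogs cats])

# Regexps in: each one is wrapped together with its own flags.
from_regexps = Regexp.union(/dogs/, /cats/i)

from_strings.source # => "dogs|cats"
from_regexps.source # => "(?-mix:dogs)|(?i-mx:cats)"

# The per-pattern flags travel with each alternative.
cats_upper = 'CATS'.match?(from_regexps) # => true
dogs_upper = 'DOGS'.match?(from_regexps) # => false
```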

Read the Regexp documentation several times; there is a lot to learn, and much of it will be overwhelming the first few times through.

And, for extra credit, read:

Answer 3

Another way:

def good_sentences(sentences_ary, words_ary)
  sentences_ary.select do |s|
    (s.downcase.gsub(/[^a-z\s]/,'').split & words_ary).any?
  end
end

For example:

sentences_ary = ['This is foo', 'bob is cool']
words_ary = ['foo', 'lol', 'something']

good_sentences(sentences_ary, words_ary)
  #=> ["This is foo"]

For letter case:

words_ary = ['this', 'lol', 'something']
good_sentences(sentences_ary, words_ary)
  #=> ["This is foo"]

For punctuation:

sentences_ary = ['This is Foo!', 'bob is very "cool" indeed!']
words_ary = ['foo', 'lol', 'cool']
good_sentences(sentences_ary, words_ary)
  #=> ["This is Foo!", "bob is very \"cool\" indeed!"]

Regex matching between two arrays of strings
