
Time: 2014-09-21 21:55:23

Tags: ruby regex markov-chains

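I'm currently working on a Markov chain text generator in Ruby. It takes in a body of text (the "corpus") and generates new text based on it. The problem I need to solve right now is writing a Regexp that returns an array of strings, each containing the number of words I specify. In other words, I want to grab a fixed number of words (specified by the user), repeatedly, across the whole string.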

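Going off another application I've seen, I'm using something like /(([.,?"();\-!':—^\w]+ ){#{depth}})/, where #{depth} interpolates how many words I want at a time. With a depth of 2, this should grab two words at a time while allowing a subset of special characters, and that is the part that is tripping me up. So the overall question is: how do I dynamically specify the number of space-separated words I want, while also allowing a set of special characters inside those words?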


# Regex
@match_regex = /(([.,?"();\-!':—^\w]+ ){2})/
s = input.scan(@match_regex).to_a
puts s.inspect

# Input
Within weeks they planned a meeting. She sent him poetry along with her itinerary,
having worked in a business meeting to excuse the opportunity. He prepared flowers
and a banner of welcome on his hearth. 

# Output - seems to be grabbing last word again for some reason
[["Within weeks ", "weeks "], ["they planned ", "planned "], ["a meeting. ", "meeting. "],
["She sent ", "sent "], ["him poetry ", "poetry "], ["along with ", "with "],
["her itinerary, ", "itinerary, "], ["having worked ", "worked "], ["in a ", "a "],
["business meeting ", "meeting "], ["to excuse ", "excuse "],
["the opportunity. ", "opportunity. "], ["He prepared ", "prepared "], ["flowers and ", "and "],
["a banner ", "banner "], ["of welcome ", "welcome "], ["on his ", "his "]]

# Desired output. I'm not picky if it has trailing spaces or not as I can always trim that
["Within weeks", "they planned", "a meeting.", "She sent", "him poetry", "along with",
"her itinerary,", "having worked", "in a", "business meeting", "to excuse", "the opportunity.",
"He prepared", "flowers and", "a banner", "of welcome", "on his"]


2 answers:

Answer 0 (score: 0)



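To get what you want, you need to remove those capture groups. When a pattern contains capture groups, String#scan returns one array of captures per match, and a repeated group keeps only its last iteration, which is why the last word shows up a second time. The first group can simply be removed; the second should become a non-capturing group, which you get by starting the parenthesis with ?:. The expression you need is therefore: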

@match_regex = /(?:[.,?"();\-!':—^\w]+ ){2}/
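As a quick check of the point above, here is a minimal sketch that builds the same kind of pattern with the word count interpolated. The helper name `word_chunks` and the `sample` string are made up for illustration, and the non-ASCII dash from the original character class is left out to keep the example ASCII-only.

```ruby
# Hypothetical helper: return chunks of `depth` space-terminated words.
def word_chunks(input, depth)
  # (?: ... ) groups without capturing, so scan yields flat strings.
  # Assumption: the non-ASCII dash from the original class is left out here.
  pattern = /(?:[.,?"();\-!':^\w]+ ){#{depth}}/
  input.scan(pattern).map(&:strip)
end

sample = "Within weeks they planned a meeting. She sent him poetry "

# With capture groups, scan returns one array of captures per match.
p "a b c d ".scan(/((\w+ ){2})/)
# Without them, it returns the matched text itself.
p "a b c d ".scan(/(?:\w+ ){2}/)

p word_chunks(sample, 2)
p word_chunks(sample, 3)
```

Note that a word only counts if it is followed by a space, so a final word at the very end of the string, or one followed by a newline, is not matched by this pattern.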

Answer 1 (score: 0)


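def split_it(text, num_words, special_chars)
  # The method body is missing from the source; this line is a reconstruction
  # that reproduces the outputs shown below.
  text.scan(/(?:[\w#{special_chars}]+\s){#{num_words}}/)
end
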
text =<<_
Within weeks they planned a meeting. She sent him poetry along with her itinerary,
having worked in a business meeting to excuse the opportunity. He prepared flowers
and a banner of welcome on his hearth.
_

special_chars = ".,?\"();\\-!':"

split_it(text, 2, special_chars)
  #=> ["Within weeks ", "they planned ", "a meeting. ", "She sent ", "him poetry ",
  #    "along with ", "her itinerary,\n", "having worked ", "in a ",
  #    "business meeting ", "to excuse ", "the opportunity. ", "He prepared ",
  #    "flowers\nand ", "a banner ", "of welcome ", "on his "]
split_it(text, 3, special_chars)
  #=> ["Within weeks they ", "planned a meeting. ", "She sent him ",
  #    "poetry along with ", "her itinerary,\nhaving ", "worked in a ",
  #    "business meeting to ", "excuse the opportunity. ", "He prepared flowers\n",
  #    "and a banner ", "of welcome on "]

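Note the \\- in special_chars. If you wrote only \-, the string would contain a bare -, which would appear between the brackets in the regex; Ruby would then expect you to be defining a range (here ;-!) and would raise an exception. The extra backslash makes \- appear between the brackets, telling Ruby it is a literal -. @Amadan points out that the escape is not needed if - is at the beginning or end of the string.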
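The escaping behaviour described above can be checked with a small sketch. The helper `word_pattern` and the three variable names are hypothetical, introduced only for this illustration; the sketch compares the single-backslash string, the double-backslash string, and a string with the hyphen moved to the end.

```ruby
# Hypothetical helper: build the word pattern from a string of extra characters.
def word_pattern(extra_chars)
  Regexp.new("[\\w#{extra_chars}]+")
end

# "\-" in a double-quoted string is just "-", so the class contains ";-!",
# which Ruby reads as a range whose end sorts before its start.
unescaped = ".,?\"();\-!':"
# "\\-" keeps the backslash, so the class contains a literal hyphen.
escaped   = ".,?\"();\\-!':"
# A hyphen placed last in the class needs no escape.
trailing  = ".,?\"();!':-"

# Record the error class if the pattern fails to compile, nil otherwise.
unescaped_error =
  begin
    word_pattern(unescaped)
    nil
  rescue RegexpError => e
    e.class
  end

p unescaped_error
p "well-known!"[word_pattern(escaped)]
p "a-b c"[word_pattern(trailing)]
```

If the character set comes from user input, Regexp.escape is another way to get the hyphen and other metacharacters escaped before interpolation.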

