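Regex to select sentences of a specific length

Time: 2016-04-29 21:33:20

Tags: regex r

I need to extract sentences that contain a specific word from a block of text. I have this: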

[A-Z][^\\.;\\?\\!]*(word)[^\\.;\\?\\!]*

But I also need the sentence to have a specific length, say 30 to 250 symbols. I know this seems easy, but I can't figure out how to do it.

So the input might be:

Welcome to RegExr v2.1 by gskinner.com, proudly **hosted** by Media Temple! A full Reference & Help is available in the Library, or watch the video Tutorial hosted by Media Temple which are so amazingly awesome that just looking at the name I get a boner instantly, and I am really serious right now, it's that exciting if you didn't get it.

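The text above contains 2 sentences: one is 76 symbols long and the other is 266. Both sentences contain the word hosted, which is our selection word. So the regex should match only the first sentence. The output should be: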

Welcome to RegExr v2.1 by gskinner.com, proudly **hosted** by Media Temple

Thanks in advance.

2 answers:

Answer 0 (score: 1)

I assume you are parsing English text.

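You can use an NLP library to split the text into sentences, and then keep only the ones that contain the word and have the required length. I took an excerpt from Wikipedia's Ernest Hemingway biography, extracted the sentences containing the word "1940", and then applied a second grep to keep only those within the length limit.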

> require(tm)
> require(openNLP)
> text <- as.String("Ernest Hemingway wrote For Whom the Bell Tolls in Havana, Cuba; Key West, Florida; and Sun Valley, Idaho in 1939. In Cuba, he lived in the Hotel Ambos-Mundos where he worked on the manuscript. The novel was finished in July 1940 and published in October.It is based on Hemingway's experiences during the Spanish Civil War and features an American protagonist, named Robert Jordan, who fights with Spanish soldiers for the Republicans. The characters in the novel include those who are purely fictional, those based on real people but fictionalized, and those who were actual figures in the war. Set in the Sierra de Guadarrama mountain range between Madrid and Segovia, the action takes place during four days and three nights. For Whom the Bell Tolls became a Book of the Month Club choice, sold half a million copies within months, was nominated for a Pulitzer Prize, and became a literary triumph for Hemingway. Published on 21 October 1940, the first edition print run was 75,000 copies priced at $2.75.")
> sentence_token_annotator <- Maxent_Sent_Token_Annotator()
> sentence.boundaries <- annotate(text, sentence_token_annotator)
> sentences <- text[sentence.boundaries]
> sentences
[1] "Ernest Hemingway wrote For Whom the Bell Tolls in Havana, Cuba; Key West, Florida; and Sun Valley, Idaho in 1939."                                                                                                                                   
[2] "In Cuba, he lived in the Hotel Ambos-Mundos where he worked on the manuscript."                                                                                                                                                                      
[3] "The novel was finished in July 1940 and published in October.It is based on Hemingway's experiences during the Spanish Civil War and features an American protagonist, named Robert Jordan, who fights with Spanish soldiers for the Republicans.[8]"
[4] "The characters in the novel include those who are purely fictional, those based on real people but fictionalized, and those who were actual figures in the war."                                                                                     
[5] "Set in the Sierra de Guadarrama mountain range between Madrid and Segovia, the action takes place during four days and three nights."                                                                                                                
[6] "For Whom the Bell Tolls became a Book of the Month Club choice, sold half a million copies within months, was nominated for a Pulitzer Prize, and became a literary triumph for Hemingway."                                                          
[7] "Published on 21 October 1940, the first edition print run was 75,000 copies priced at $2.75."                                                                                                                                                        
> with_word = grep("1940", sentences, fixed = TRUE, value = TRUE)
> with_word
[1] "The novel was finished in July 1940 and published in October.It is based on Hemingway's experiences during the Spanish Civil War and features an American protagonist, named Robert Jordan, who fights with Spanish soldiers for the Republicans.[8]"
[2] "Published on 21 October 1940, the first edition print run was 75,000 copies priced at $2.75."                                                                                                                                                        
> with_word[grep("^.{30,100}$", with_word)]
[1] "Published on 21 October 1940, the first edition print run was 75,000 copies priced at $2.75."

In your case, use your own word and a {30,250} limiting quantifier to get the sentences you need.
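
As a rough sketch of the same two-step idea in base R only: `filter_sentences` is a hypothetical helper name (it is not part of tm or openNLP), and the sentences are typed in by hand rather than produced by the annotator.

```r
# Hypothetical helper (not from any package): keep the sentences that
# contain `word` literally and whose length lies in [min_len, max_len].
filter_sentences <- function(sentences, word, min_len = 30, max_len = 250) {
  # Step 1: literal (non-regex) search for the word
  has_word <- grepl(word, sentences, fixed = TRUE)
  # Step 2: length check with a limiting quantifier over the whole string
  len_ok <- grepl(sprintf("^.{%d,%d}$", min_len, max_len), sentences)
  sentences[has_word & len_ok]
}

# Hand-written sample sentences; in practice they would come from a
# sentence splitter such as the openNLP annotator shown above.
sentences <- c(
  "Welcome to RegExr v2.1 by gskinner.com, proudly hosted by Media Temple!",
  "Short hosted one.",
  "A full Reference & Help is available in the Library."
)
picked <- filter_sentences(sentences, "hosted")
print(picked)
```

With these three sample strings only the first one should be kept: the second contains the word but is too short, and the third does not contain it.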

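Note that the required sentences can also be found in a single operation, but that needs a more complex PCRE regex with a lookahead:

> my_sent <- grep("(?s)^(?=.{30,100}$).*1940.*$", sentences, value = TRUE, perl = TRUE)
> my_sent
[1] "Published on 21 October 1940, the first edition print run was 75,000 copies priced at $2.75."

The "(?s)^(?=.{30,100}$).*1940.*$" regex requires the string to have 30 to 100 characters from start to end (set your own limits) and to contain the word 1940. The ^ anchor makes the lookahead measure the whole string rather than just a trailing part of it. Note that if your word contains regex metacharacters, they must be escaped with \\.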

I just tested it on your data:

> with_word = grep("(?s)^(?=.{30,250}$).*\\bhosted\\b.*$", sentences, perl = TRUE, value = TRUE)
> with_word
[1] "proudly hosted by Media Temple!"
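
If the search word may contain regex metacharacters, one possible way to avoid escaping each of them by hand is PCRE's \Q...\E literal quoting. The sketch below assumes the word itself never contains "\E"; `length_word_pattern` is a hypothetical helper name and the second sample sentence is made up for illustration.

```r
# Hypothetical helper: build the single-pass pattern for a literal word.
# \Q...\E makes PCRE treat everything in between as literal text
# (assumption: the word does not itself contain "\E").
length_word_pattern <- function(word, min_len = 30, max_len = 100) {
  sprintf("(?s)^(?=.{%d,%d}$).*\\Q%s\\E.*$", min_len, max_len, word)
}

sentences <- c(
  "Published on 21 October 1940, the first edition print run was 75,000 copies priced at $2.75.",
  "It cost $2.75."  # made-up sentence, too short to pass the length check
)

# "$" and "." are metacharacters, but inside \Q...\E they match literally
hits <- grep(length_word_pattern("$2.75"), sentences, value = TRUE, perl = TRUE)
print(hits)
```

Only the first sentence should come back here, since the second one contains "$2.75" but is shorter than 30 characters.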

Answer 1 (score: 0)

You can use a positive lookahead

(?=[\p{Any}]{30,250}.*)
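
A small check of how a lookahead of this shape behaves in R may help. In the sketch below `.` with the (?s) flag stands in for [\p{Any}], and the anchored variant is an assumption added for comparison, not part of this answer. Because the lookahead is followed by .* and has no anchors, it seems to enforce only the lower bound of 30 characters.

```r
short_text <- strrep("a", 20)   # below the 30-character lower bound
ok_text    <- strrep("a", 100)  # inside the 30..250 range
long_text  <- strrep("a", 300)  # above the 250-character upper bound
samples    <- c(short_text, ok_text, long_text)

# Same shape as the lookahead above, with . (under (?s)) in place of [\p{Any}]
unanchored <- "(?s)(?=.{30,250}.*)"
# Assumption for comparison: ^ and $ added so that both bounds apply
anchored   <- "(?s)^(?=.{30,250}$)"

res_unanchored <- grepl(unanchored, samples, perl = TRUE)
res_anchored   <- grepl(anchored, samples, perl = TRUE)
print(res_unanchored)
print(res_anchored)
```

The unanchored form accepts the 300-character string as well, while the anchored form accepts only the 100-character one, so anchoring is probably needed if the upper limit matters.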