R: Index only the first occurrence of a pattern after another pattern

Time: 2015-10-21 20:22:47

Tags: regex r string indexing

I have a vector of strings like this one (part of a larger set of strings):

a <- c("My string",
       "characters",
       "sentence",
       "text.",
       "My string word sentence word.",
       "Other thing word sentence characters.",
       "My string word sentence numbers.",
       "Other thing",
       "word.",
       "sentence",
       "text.",
       "Other thing word. characters sentence.",
       "Different string word text.",
       "Different string.",
       "word.",
       "sentence.",
       "My string",
       "word",
       "sentence",
       "things.",
       "My string word sentence blah.")

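As you can see, the vector contains several expressions. Some of them sit in a single element, while others are split across several elements (which is fine). Note also that some of them contain more than one period, either within a single string or across the split strings. What I want is to extract the expressions that start with My string and end with a period, where the period is either in the same element (if the whole expression is in a single string) or at the end of the last element of an expression that started with My string.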

The first approach I thought of was to index all elements that contain My string:

> b <- grep(pattern = "My string", x = a, fixed = TRUE)
> b
[1]  1  5  7 17 21

Then, index all elements that end with a period:

> c <- grep(pattern = "\\.$", x = a)
> c
 [1]  4  5  6  7  9 11 12 13 14 15 16 20 21

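Finally, get the position of only the FIRST period after each expression that starts with My string (whether the expression is in a single element or spread across elements). From there it would be easy to subset the elements I need and end up with something like this: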

d <- c("My string",
       "characters",
       "sentence",
       "text.",
       "My string word sentence word.",
       "My string word sentence numbers.",
       "My string",
       "word",
       "sentence",
       "things.",
       "My string word sentence blah.")

Can anyone help with the last step (getting the position of only the FIRST period after each expression that starts with My string)?

2 answers:

Answer 0 (score: 2)

Here is an alternative approach using dplyr:

library(dplyr)

a <- c("My string",
       "characters",
       "sentence",
       "text.",
       "My string word sentence word.",
       "Other thing word sentence characters.",
       "My string word sentence numbers.",
       "Other thing",
       "word.",
       "sentence",
       "text.",
       "Other thing word. characters sentence.",
       "Different string word text.",
       "Different string.",
       "word.",
       "sentence.",
       "My string",
       "word",
       "sentence",
       "things.",
       "My string word sentence blah.")

data.frame(a = a,
           stringsAsFactors = FALSE) %>%
  mutate(period = grepl("[.]", a), 
         sentence_id = lag(cumsum(period), default = 0)) %>%
  group_by(sentence_id) %>%
  mutate(retain = any(grepl("My string", a))) %>%
  ungroup() %>%
  filter(retain)

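The process is to identify the elements that contain a period and use those indices to mark where a new sentence begins. This gives us a sentence_id to group by, and then we only need to look for the string "My string" within each group.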
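To make the grouping step easier to follow, here is a minimal base R sketch of the same idea. The vector `x` is a shortened, hypothetical example rather than the data from the question, and `c(0, head(cumsum(period), -1))` is assumed to stand in for `lag(cumsum(period), default = 0)`.

```r
# Hypothetical toy input, shorter than the vector in the question
x <- c("My string", "text.", "Other thing", "word.", "My string word.")

# TRUE where an element contains a period
period <- grepl("[.]", x)

# Running count of periods, shifted by one position so that the element
# holding the period stays in the sentence it closes
sentence_id <- c(0, head(cumsum(period), -1))

# Keep every element whose sentence contains "My string" somewhere
has_start <- grepl("My string", x, fixed = TRUE)
retain <- sentence_id %in% sentence_id[has_start]

x[retain]
```

Note that this keeps a sentence if "My string" appears anywhere in it, not only at its start, which matches what the dplyr pipeline does.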

Answer 1 (score: 1)

I think something like this will do what you want:

b <- grep(pattern = "My string", x = a, fixed = TRUE)
c <- grep(pattern = "\\.$", x = a)

# find first period for each start string
e <- sapply(b, function(x) head(c[c>=x],1))

# extract ranges
d <- a[unlist(Map(`:`, b,e))]

#  [1] "My string"                       
#  [2] "characters"                      
#  [3] "sentence"                        
#  [4] "text."                           
#  [5] "My string word sentence word."   
#  [6] "My string word sentence numbers."
#  [7] "My string"                       
#  [8] "word"                            
#  [9] "sentence"                        
# [10] "things."                         
# [11] "My string word sentence blah."
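
One case the code above does not cover is a "My string" element that is never followed by a period. There, `head(c[c >= x], 1)` is empty, so building the range with `:` fails. The following is only a sketch of one way to guard against that. The input `x` and the names `starts`, `ends`, `first_end` and `closed` are hypothetical and not part of the original answer.

```r
# Hypothetical input: the last "My string" is never closed by a period
x <- c("My string", "text.", "Other thing.", "My string", "word")

starts <- grep("My string", x, fixed = TRUE)
ends   <- grep("\\.$", x)

# First period at or after each start; NA when there is none
first_end <- vapply(starts, function(s) ends[ends >= s][1], integer(1))

# Keep only the starts that are actually closed by a period
closed <- !is.na(first_end)
result <- x[unlist(Map(`:`, starts[closed], first_end[closed]))]
result
```

Using `vapply` with `integer(1)` keeps `first_end` a plain integer vector, so the unclosed expression shows up as `NA` and can be dropped before the ranges are built.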