Extracting strings and their positions with dplyr / tidyr

Time: 2017-09-13 18:42:55

Tags: r dplyr tidyr stringr

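The input data frame has three id columns and one raw_text column. u_id corresponds to a user, doc_id to a document belonging to that user, and sent_id to a sentence within that user's document.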


Suppose we need to extract all dates and their positions from raw_text. The sample data:

df <- data.frame(u_id=c(1,1,1,1,1,2,2,2),
                 doc_id=c(1,1,1,2,2,1,1,2),
                 sent_id=c(1,2,3,1,2,1,2,1),
                 text=c("admission date: 2001-4-19 discharge date: 2002-5-23 service:",
                               "pertinent results: 2105-4-16 05:02pm gap-14 
                               2105-4-16 04:23pm rdw-13.1 2105-4-16 .",
                               "method exists and the former because calls to the corresponding",
                        "admission date: 2001-4-19 discharge date: 2002-5-23 service:",
                        "pertinent results: 2105-4-16 05:02pm gap-14 
                        2105-4-16 04:23pm rdw-13.1 2105-4-16 .",
                        "method exists and the former because calls to the corresponding",
                        "method exists and the former because calls to the corresponding",
                        "method exists and the former because calls to the corresponding"))

The extracted dates end up in data.frame format. Is there a way to get the string locations into a data.frame as well, matched to their ids and strings? My approach so far:

#define a regex for date
date<-"([0-9]{2,4})[- . /]([0-9]{1,4})[- . /]([0-9]{2,4})"

#libraries
library(dplyr)
library(tidyr)   # provides unnest(), used below
library(stringr)

#extract dates
df_i<-df %>% 
  mutate(i=str_extract_all(text,date)) %>% 
  mutate(date=lapply(i, function(x) if(identical(x, character(0))) NA_character_ else x)) %>% 
  unnest(date)

#extract date locations
df_ii<-str_locate_all(df$text,date)
n<-max(sapply(df_ii, nrow))
date_loc<-as.data.frame(do.call(rbind, lapply(df_ii, function (x) 
  rbind(x, matrix(, n-nrow(x), ncol(x))))))

1 answer:

Answer 0 (score: 4)

You can do it like this:

regex = "\\b[0-9]+[-][0-9]+[-][0-9]+\\b"
df_i = str_extract_all(df$text, regex) 
df_ii = str_locate_all(df$text, regex) 

output1 = Map(function(x, y, z){
  if(length(y) == 0){
    y = NA
  }
  if(nrow(z) == 0){
    z = rbind(z, list(start = NA, end = NA))
  }
  data.frame(id = x, date = y, z)
}, df$id, df_i, df_ii) %>%
  do.call(rbind,.) %>%
  merge(df, .)

Or, sticking with pipe-only syntax:

regex = "[0-9]+[-][0-9]+[-][0-9]+"

output1 = df %>%
  {list(.$id, str_extract_all(.$text, regex), 
       str_locate_all(.$text, regex))} %>%
  {Map(function(x, y, z){
    if(length(y) == 0){
      y = NA
    }
    if(nrow(z) == 0){
      z = rbind(z, list(start = NA, end = NA))
    }
    data.frame(id = x, date = y, z)
  }, .[[1]], .[[2]], .[[3]])} %>%
  do.call(rbind, .) %>%
  merge(df, .)

Result:

  id
1  1
2  1
3  2
4  2
5  2
6  3
                                                                                                                 text
1                                                        admission date: 2001-4-19 discharge date: 2002-5-23 service:
2                                                        admission date: 2001-4-19 discharge date: 2002-5-23 service:
3 pertinent results: 2105-4-16 05:02pm gap-14 \n                               2105-4-16 04:23pm rdw-13.1 2105-4-16 .
4 pertinent results: 2105-4-16 05:02pm gap-14 \n                               2105-4-16 04:23pm rdw-13.1 2105-4-16 .
5 pertinent results: 2105-4-16 05:02pm gap-14 \n                               2105-4-16 04:23pm rdw-13.1 2105-4-16 .
6                                                     method exists and the former because calls to the corresponding
       date start end
1 2001-4-19    17  25
2 2002-5-23    43  51
3 2105-4-16    20  28
4 2105-4-16    77  85
5 2105-4-16   104 112
6      <NA>    NA  NA
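
The answer's code indexes a single id column, while the data frame in the question carries u_id, doc_id and sent_id. A minimal sketch of the same idea for that three-column layout follows. The shortened two-row df and the names dates, locs, rows and result are illustrative assumptions, not part of the original answer.

```r
library(stringr)

# Shortened, illustrative version of the question's data frame (assumption)
df <- data.frame(
  u_id = c(1, 1), doc_id = c(1, 1), sent_id = c(1, 2),
  text = c("admission date: 2001-4-19 discharge date: 2002-5-23 service:",
           "method exists and the former because calls to the corresponding"),
  stringsAsFactors = FALSE
)
regex <- "[0-9]+-[0-9]+-[0-9]+"

dates <- str_extract_all(df$text, regex)  # list of character vectors
locs  <- str_locate_all(df$text, regex)   # list of start/end matrices

# One small data frame per input row; rows without a match are padded with NA
rows <- Map(function(i, d, loc) {
  if (length(d) == 0) {
    d   <- NA_character_
    loc <- matrix(NA_integer_, nrow = 1, ncol = 2,
                  dimnames = list(NULL, c("start", "end")))
  }
  data.frame(u_id = df$u_id[i], doc_id = df$doc_id[i], sent_id = df$sent_id[i],
             date = d,
             start = unname(loc[, "start"]),
             end = unname(loc[, "end"]),
             stringsAsFactors = FALSE)
}, seq_len(nrow(df)), dates, locs)

result <- do.call(rbind, rows)
rownames(result) <- NULL
result
```

Iterating over row indices rather than over one id vector keeps all three id columns attached to each match, so no merge() step is needed afterwards.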

Notes:

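  1. Your regex wrongly starts a match at "13.1" in "rdw-13.1 2105-4-16" because you put spaces inside [- . /]. date<-"([0-9]{2,4})[-./]([0-9]{1,4})[-./]([0-9]{2,4})" should do the job.
  2. mutate lets you use a variable you have just created within the same call, so df_i does not need two separate mutate calls.
  3. For my pipe-only solution, {} is needed around list() and Map() to override the %>% default of feeding the output of the previous step into the first argument of the next function. For example: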

    df %>%
          list(.$id, str_extract_all(.$text, regex), 
                     str_locate_all(.$text, regex))
    

    becomes:

    list(df, df$id, str_extract_all(df$text, regex), 
                    str_locate_all(df$text, regex))
    

    which is not what we want.
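
The first-argument behaviour described in the note can be checked on a toy vector. This is a minimal sketch: x, without_braces and with_braces are illustrative names, and magrittr is loaded directly because it is the package that provides the %>% pipe re-exported by dplyr.

```r
library(magrittr)  # provides the %>% pipe that dplyr re-exports

x <- c(10, 20, 30)

# Without braces: the dot appears only inside a nested call, so x is still
# inserted as the first argument, i.e. list(x, length(x))
without_braces <- x %>% list(length(.))

# With braces: the expression is evaluated as written, i.e. list(length(x))
with_braces <- x %>% {list(length(.))}

length(without_braces)
length(with_braces)
```

The first list has two elements (x itself plus its length) and the second has one, which is the same effect that makes the braces necessary around list() and Map() in the pipe-only solution.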

    Edit:

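    The OP updated his df to include rows whose text contains no dates. This broke my original solution, because some elements of the lists returned by str_extract_all and str_locate_all then have length 0 and nrow 0 respectively. I fixed this by adding two if statements: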

    if(length(y) == 0){
      y = NA
    }
    if(nrow(z) == 0){
      z = rbind(z, list(start = NA, end = NA))
    }
    

    This sets date = NA and adds a row of NAs for start and end for the rows without dates, which lets id be bound to a row in the data.frame step.
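
Returning to note 1, the effect of the spaces inside the separator class can be seen on the fragment quoted there. This is a minimal sketch; x, with_spaces and no_spaces are illustrative names, and the two patterns are copied from the question and from note 1.

```r
library(stringr)

x <- "rdw-13.1 2105-4-16"

# Separator class from the question: the spaces inside [- . /] are literal,
# so a space is accepted as a separator between number groups
with_spaces <- "([0-9]{2,4})[- . /]([0-9]{1,4})[- . /]([0-9]{2,4})"

# Separator class from note 1: only -, . and / are accepted
no_spaces <- "([0-9]{2,4})[-./]([0-9]{1,4})[-./]([0-9]{2,4})"

str_extract(x, with_spaces)
str_extract(x, no_spaces)
```

With the spaces in place, the first call should return a span that begins at "13.1" and swallows part of the real date, while the second should return "2105-4-16".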