Finding names in a txt file

Date: 2015-10-13 12:05:18

Tags: r dataframe names

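I have a long text in a txt file (T1.txt). I want to find all the (English) names in the file, together with the 2 words before each name and the 2 words after it. For example, I have the following text: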

    "Hello world!, my name is Mr. A.B. Morgan (in short) and it is nice to meet you."
    Orange Silver paid 100$ for his gift.
    I'll call Dina H. in two hours.

I would like to get the following data frame:

   > df1
       Before         Name         After
  1   name is     A. B. Morgan  in short
  2               Orange Silver paid 100$
  3   I'll call   Dina H.       in two

1 answer:

Answer 0 (score: 1)

This is neither perfect nor pretty, but it's a start:

text1 <- c("Hello world!, my name is Mr. A.B. Morgan (in short) and it is nice to meet you.")
text2 <- c("Orange Silver paid 100$ for his gift.")
text3 <- c("I'll call Dina H. in two hours.")

library(stringr)

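find_names_and_BA <- function(x) {
  # Skip the first character so a sentence-initial capital is not picked up.
  matches <- str_extract_all(str_sub(x, 2), "[A-Z]\\S+")[[1]]

  # Fall back to the full string when that leaves fewer than two candidates.
  if (length(matches) < 2) {
    matches <- str_extract_all(x, "[A-Z]\\S+")[[1]]
  }

  # Join the capitalised tokens and locate them in the original string.
  # Note: name_match is interpreted as a regular expression here.
  name_match <- paste(matches, collapse = " ")
  loc <- str_locate(x, name_match)
  beg_of_match <- loc[1]
  end_of_match <- loc[2]

  # Words up to the first character of the name, and from its last character on.
  start_words <- str_extract_all(str_sub(x, 1, beg_of_match), "\\w+")[[1]]
  end_words   <- str_extract_all(str_sub(x, end_of_match), "\\w+")[[1]]

  # Drop the partial word that overlaps the name; keep two words on each side.
  before <- paste(tail(start_words, 3)[1:2], collapse = " ")
  after  <- paste(head(end_words, 3)[2:3], collapse = " ")

  data.frame(Before = before, Name = name_match, After = after)
}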

dplyr::bind_rows(find_names_and_BA(text1),
                 find_names_and_BA(text2),
                 find_names_and_BA(text3))

# Source: local data frame [3 x 3]
# 
#    Before            Name     After
#     (chr)           (chr)     (chr)
# 1 name is Mr. A.B. Morgan  in short
# 2    O NA   Orange Silver  paid 100
# 3 ll call         Dina H. two hours
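
Rows 2 and 3 differ from the requested data frame for reasons that can be read off the code. The `\\w+` pattern splits "I'll" into "I" and "ll" and drops the "$". When the name starts the sentence, `tail(start_words, 3)[1:2]` indexes past a one-element vector and yields `NA`. When the name ends in a period, the first following word is not a fragment of the name, so discarding it shifts the "after" window by one word. One possible way around this, offered only as a rough sketch, is to split on whitespace and treat the longest run of capitalised tokens as the name. The function `find_name_context`, its `n` and `titles` parameters, and the honorific list are hypothetical names introduced here, not part of the original answer, and the capitalisation heuristic is an assumption that will misfire on ordinary capitalised words.

```r
text1 <- "Hello world!, my name is Mr. A.B. Morgan (in short) and it is nice to meet you."
text2 <- "Orange Silver paid 100$ for his gift."
text3 <- "I'll call Dina H. in two hours."

# Hypothetical helper (not from the original answer): whitespace tokenisation
# plus a "longest run of capitalised tokens" heuristic for the name.
find_name_context <- function(x, n = 2, titles = c("Mr.", "Mrs.", "Ms.", "Dr.")) {
  tokens <- strsplit(trimws(x), "\\s+")[[1]]
  is_cap <- grepl("^[A-Z]", tokens, perl = TRUE)

  # Run-length encode the flags to find consecutive capitalised tokens.
  runs     <- rle(is_cap)
  ends     <- cumsum(runs$lengths)
  starts   <- ends - runs$lengths + 1
  cap_runs <- which(runs$values)
  if (length(cap_runs) == 0) return(NULL)  # no capitalised token at all

  # Assumption: the longest capitalised run is the name (first one wins ties).
  best  <- cap_runs[which.max(runs$lengths[cap_runs])]
  first <- starts[best]
  last  <- ends[best]

  # Context is taken around the whole run, before any honorific is dropped.
  before <- tokens[seq_len(first - 1)]
  after  <- tokens[seq_len(length(tokens) - last) + last]

  # Drop a leading honorific from the name when something else follows it.
  name_first <- if (last > first && tokens[first] %in% titles) first + 1 else first

  # Remove parentheses from the context words; other punctuation is kept.
  clean <- function(w) gsub("[()]", "", w)

  data.frame(
    Before = paste(clean(tail(before, n)), collapse = " "),
    Name   = paste(tokens[name_first:last], collapse = " "),
    After  = paste(clean(head(after, n)), collapse = " "),
    stringsAsFactors = FALSE
  )
}

res <- do.call(rbind, lapply(c(text1, text2, text3), find_name_context))
print(res)
```

On the three sample sentences this gives "name is" / "A.B. Morgan" / "in short", an empty Before with "Orange Silver" / "paid 100$", and "I'll call" / "Dina H." / "in two". It uses only base R, so it can be compared side by side with the stringr version above. Proper name detection would need a named-entity recogniser rather than a capitalisation rule.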