Extract matching keywords from text

Date: 2018-05-30 06:23:10

Tags: r grep stringr

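I am looking for some help extracting keywords from text. I have two data frames: the first has a description column, and the second has a single column containing keywords.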

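I want to search the description field for the keywords in dataframe2 and create a new column in dataframe1 containing the matched keywords. If several keywords match, the new column should list all of them separated by commas, as shown below.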

Dataframe2

Keywords
New
FUND
EVENT 
Author
book

Dataframe1

ID  NAME    Month   DESCRIPTION              Keywords
12  x1       Jan    funding recived            fund
23  x2       Feb    author of the book     author, book
14  x3       Mar    new year event         new, event

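Also, I need the keyword even when the description contains a longer word that includes it. For example, for "funding" the new column should contain the keyword "fund".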

2 answers:

Answer 0 (score: 4)

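We can use regex_left_join from fuzzyjoin, which treats each keyword as a regular expression matched against DESCRIPTION, and then group_by the original columns and paste the matched keywords together (here with toString).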

library(fuzzyjoin)
library(dplyr)
df1 %>% 
   regex_left_join(df2, by = c('DESCRIPTION' = 'Keywords'), 
              ignore_case = TRUE) %>% 
   group_by(ID, NAME, Month, DESCRIPTION) %>% 
   summarise(Keywords = toString(unique(tolower(Keywords))))
# A tibble: 3 x 5
# Groups:   ID, NAME, Month [?]
#     ID NAME  Month DESCRIPTION        Keywords    
#  <int> <chr> <chr> <chr>              <chr>       
#1    12 x1    Jan   funding recived    fund        
#2    14 x3    Mar   new year event     new, event  
#3    23 x2    Feb   author of the book author, book

Data

df1 <- structure(list(ID = c(12L, 23L, 14L), NAME = c("x1", "x2", "x3"
), Month = c("Jan", "Feb", "Mar"), DESCRIPTION = c("funding recived", 
"author of the book", "new year event")), .Names = c("ID", "NAME", 
"Month", "DESCRIPTION"), class = "data.frame", row.names = c(NA, 
-3L))

df2 <- structure(list(Keywords = c("New", "FUND", "EVENT", "Author", 
"book")), .Names = "Keywords", class = "data.frame", row.names = c(NA, 
-5L))
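
For readers who cannot install fuzzyjoin, the same idea (each keyword used as a case-insensitive pattern that may match anywhere in the description) can be sketched in base R. This is only an illustration, not part of the original answer; the helper name match_keywords is hypothetical, and the data is copied from above.

```r
# Sample data copied from this answer
df1 <- data.frame(
  ID = c(12L, 23L, 14L),
  NAME = c("x1", "x2", "x3"),
  Month = c("Jan", "Feb", "Mar"),
  DESCRIPTION = c("funding recived", "author of the book", "new year event"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  Keywords = c("New", "FUND", "EVENT", "Author", "book"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not from any package): return the lower-cased
# keywords found anywhere in one description, joined with ", ".
match_keywords <- function(description, keywords) {
  # Each keyword is used as a case-insensitive regular expression,
  # so "FUND" also matches inside "funding".
  hit <- vapply(
    keywords,
    function(k) grepl(k, description, ignore.case = TRUE),
    logical(1)
  )
  toString(tolower(keywords[hit]))
}

# Apply the helper to every description
df1$Keywords <- vapply(
  df1$DESCRIPTION, match_keywords, character(1),
  keywords = df2$Keywords, USE.NAMES = FALSE
)

df1$Keywords
# "fund"  "author, book"  "new, event"
```

Because the keywords are interpreted as regular expressions, a keyword containing characters such as "." or "(" would need escaping, or grepl(..., fixed = TRUE) applied to lower-cased inputs.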

Answer 1 (score: 1)

One solution is to use stringr::str_detect to check whether each of the Keywords is present in DESCRIPTION.

library(stringr)

df1$Keywords <- mapply(function(x)paste(df2$Keywords[str_detect(x, tolower(df2$Keywords))],
                                        collapse = ","), df1$DESCRIPTION)

df1
#   ID NAME Month        DESCRIPTION    Keywords
# 1 12   x1   Jan    funding recived        FUND
# 2 23   x2   Feb author of the book Author,book
# 3 14   x3   Mar     new year event   New,EVENT
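
Note that the code above lower-cases only the keywords, so it relies on the descriptions already being lower case, and it returns the keywords in their original case. A possible variant, offered only as a sketch, lower-cases both sides and uses toString to produce the ", " separator asked for in the question. The mixed-case descriptions and the extra non-matching row below are made up for illustration.

```r
library(stringr)

# Made-up variation of the sample data: mixed case plus a row with no match
df1 <- data.frame(
  DESCRIPTION = c("Funding recived", "Author of the BOOK",
                  "new year event", "no match here"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  Keywords = c("New", "FUND", "EVENT", "Author", "book"),
  stringsAsFactors = FALSE
)

# Lower-case the keywords once; they are used as regular expressions
kw <- tolower(df2$Keywords)

df1$Keywords <- vapply(df1$DESCRIPTION, function(x) {
  # str_detect recycles the single description against every keyword
  found <- str_detect(tolower(x), kw)
  toString(kw[found])
}, character(1), USE.NAMES = FALSE)

df1$Keywords
# "fund"  "author, book"  "new, event"  ""
```

Rows without any matching keyword get an empty string; replace it with NA afterwards if that is more convenient.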

Data:

df1 <- read.table(text = 
"ID  NAME    Month   DESCRIPTION      
12  x1       Jan    'funding recived'   
23  x2       Feb    'author of the book'
14  x3       Mar    'new year event'",
header = TRUE, stringsAsFactors = FALSE)

df2 <- read.table(text = 
"Keywords
New
FUND
EVENT 
Author
book",
header = TRUE, stringsAsFactors = FALSE)