Join two data frames by searching for and matching strings

Time: 2016-09-20 10:49:27

Tags: r dataframe

I have two data frames

DF1

+-------+---------+  
|   Id  |  Title  |
+-------+---------+  
|   1   |   AAA   |
|   2   |   BBB   |
|   3   |   CCC   |
+-------+---------+

DF2

+-------+---------------+------------------------------------+
|   Id  |      Sub      |               Body                 |
+-------+---------------+------------------------------------+  
|   1   |   some sub1   | some mail body AAA some text here  |
|   2   |   some sub2   | some text here BBB continues here  |
|   3   |   some sub3   | some text AAA present here         |
|   4   |   some sub4   | AAA string is present here also    |
|   5   |   some sub5   | CCC string is present here         |
+-------+---------------+------------------------------------+

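I want to match the Title column of df1 against the Body column of df2. If a title string is present in the Body column, the two rows should be joined, and the output data frame should be: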

DF3

+----------+---------------+------------------------------------+
|   Title  |      Sub      |               Body                 |
+----------+---------------+------------------------------------+  
|   AAA    |   some sub1   | some mail body AAA some text here  |
|   BBB    |   some sub2   | some text here BBB continues here  |
|   AAA    |   some sub3   | some text AAA present here         |
|   AAA    |   some sub4   | AAA string is present here also    |
|   CCC    |   some sub5   | CCC string is present here         |
+----------+---------------+------------------------------------+

1 answer:

Answer 0: (score: 1)

One solution could look like this, although more experienced R users will probably come up with a better answer:

# set up test data
df1 <- data.frame(stringsAsFactors = FALSE,
                  id = 1:3,
                  title = c('AAA', 'BBB', 'CCC'))
df2 <- data.frame(stringsAsFactors = FALSE,
                  id = 1:5,
                  sub = c('some sub1', 'some sub2', 'some sub3', 'some sub4', 'some sub5'),
                  body = c('some mail body AAA some text here',
                           'some text here BBB continous here',
                           'some text AAA present here',
                           'AAA string is present here also',
                           'CCC string is present here'))

# join data frames: for each title, take the sub/body columns of the
# df2 rows whose body contains that title and prepend the title
df.list <- lapply(seq_len(nrow(df1)), function(idx) {
  cbind(title = df1$title[idx],
        df2[grepl(df1$title[idx], df2$body), c('sub', 'body')])
})
# stack the per-title results into a single data frame
do.call('rbind', df.list)

This produces the following output:

  title       sub                              body
1   AAA some sub1 some mail body AAA some text here
3   AAA some sub3        some text AAA present here
4   AAA some sub4   AAA string is present here also
2   BBB some sub2 some text here BBB continous here
5   CCC some sub5        CCC string is present here
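
One caveat with the grepl call above: it treats each title as a regular expression, so a title containing characters such as '.' or '(' may match more than intended. The following is a minimal sketch of the difference, assuming the titles are meant as literal substrings. The sample strings and the names regex_hits and literal_hits are made up for illustration and are not part of the question's data.

```r
# Hypothetical data: the title contains '.', a regex metacharacter
title <- "A.A"
body  <- c("text AXA here", "text A.A here")

# Default: the pattern is a regular expression, so '.' matches any character
regex_hits <- grepl(title, body)

# fixed = TRUE: the pattern is matched as a literal string
literal_hits <- grepl(title, body, fixed = TRUE)
```

If the titles should be matched literally, passing fixed = TRUE to grepl in the join code is probably the safer choice.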

Update in response to a comment:

If we cannot rely on every title matching at least one row in df2, you may want to do something like this:

# set up test data ('AAA BB' does not occur in any body)
df1 <- data.frame(stringsAsFactors = FALSE,
                  id = 1:4,
                  title = c('AAA', 'AAA BB', 'BBB', 'CCC'))
df2 <- data.frame(stringsAsFactors = FALSE,
                  id = 1:5,
                  sub = c('some sub1', 'some sub2', 'some sub3', 'some sub4', 'some sub5'),
                  body = c('some mail body AAA some text here',
                           'some text here BBB continous here',
                           'some text AAA present here',
                           'AAA string is present here also',
                           'CCC string is present here'))

# return the matching df2 rows for one title, or NULL when nothing matches
MergeByTitle <- function(title.idx) {
  df2.hits <- df2[grepl(df1$title[title.idx], df2$body), c('sub', 'body')]
  if (nrow(df2.hits) > 0) {
    cbind(title = df1$title[title.idx], df2.hits)
  } else {
    NULL  # rbind() skips NULL entries
  }
}

# join data frames
df.list <- lapply(seq_len(nrow(df1)), MergeByTitle)
do.call('rbind', df.list)
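
The same join can probably also be written without an explicit loop over titles. The following is a sketch of an alternative, not the approach from the answer itself: it builds every (title, body) combination with merge(..., by = NULL), which yields a Cartesian product, and keeps only the combinations where the title occurs in the body. The names pairs, hit and df3 are chosen here for illustration, and fixed = TRUE assumes the titles are literal substrings rather than regular expressions.

```r
# Same test data as in the update above
df1 <- data.frame(id = 1:4,
                  title = c("AAA", "AAA BB", "BBB", "CCC"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(id = 1:5,
                  sub = paste0("some sub", 1:5),
                  body = c("some mail body AAA some text here",
                           "some text here BBB continous here",
                           "some text AAA present here",
                           "AAA string is present here also",
                           "CCC string is present here"),
                  stringsAsFactors = FALSE)

# Cartesian product: one row per (title, sub/body) combination
pairs <- merge(df1["title"], df2[c("sub", "body")], by = NULL)

# TRUE where the title occurs literally in the body of the same row
hit <- unname(mapply(grepl, pairs$title, pairs$body,
                     MoreArgs = list(fixed = TRUE)))

# Keep only the matching combinations
df3 <- pairs[hit, ]
```

Titles without any match, such as 'AAA BB', simply contribute no rows, so no special case is needed. The intermediate product has nrow(df1) * nrow(df2) rows, so this variant is only reasonable for fairly small inputs.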