Matching two data tables (VLOOKUP, dplyr, match(), left_join) while keeping the row count

Asked: 2017-12-07 17:28:40

Tags: r merge dplyr left-join

I want to add the tech_distance table to my first_occurrences table. The two data tables:

head(first_occurrences)
# A tibble: 6 x 4
# Groups:   Main, Second [6]
   year  Main Second  occurrence
   <int> <chr>  <chr>      <int>
1  1991  C09D   C08F          1
2  2002  A47C   A47D          1
3  2002  G10K   H05K          1
4  2004  G06G   C07K          1
5  2015  B64F   B64D          1
6  2015  H02G   B29C          1


head(tech_distance)
# A tibble: 6 x 2
    Main  tech_distance
   <fctr>         <dbl>
1   C09D           0.3
2   A47C           0.0
3   G10K           0.5
4   G06G           0.5
5   B64F           0.0
6   H02G           0.5 

This is the result I want to get:

head(first_occurrences)
   Main year Second occurrence tech_distance
 1 A01B 2004   E21B          1           0.7
 2 A01B 2004   E21B          1           0.5
 3 A01B 2004   E21B          1           0.7
 4 A01B 2004   E21B          1           0.5
 5 A01B 2004   E21B          1           0.5
 6 A01B 2004   E21B          1           1.0

I used mutate from dplyr:

first_occurrences <- data %>% 
 select(year = X3,Main = X7,Second = X8) %>% 
 group_by(Main,Second) %>% 
 mutate(occurrence = n(), tech_distance) %>% 
 filter(occurrence >= 0, occurrence <= 1, !(Main == Second)) 

But I got this error:

 Error in mutate_impl(.data, dots) : 
 Column `tech_distance` must be length 24 (the group size) or one, not 2

So I tried using merge():

first_occurrences <- merge(first_occurrences, tech_distance, by.x = "Main", by.y = "Main", all.x=T)

This seems to work, but I end up with a huge number of rows (240,217 entries):

 str(first_occurrences)
 'data.frame':  240217 obs. of  5 variables:
  $ Main         : chr  "A01B" "A01B" "A01B" "A01B" ...
  $ year         : int  2004 2004 2004 2004 2004 2004 2004 2004 2004 2004 ...
  $ Second       : chr  "E21B" "E21B" "E21B" "E21B" ...
  $ occurrence   : int  1 1 1 1 1 1 1 1 1 1 ...
  $ tech_distance: num  0.7 0.5 0.7 0.5 0.5 1 0.5 0.7 0.3 0 ...

Before the merge, the datasets were:

str(first_occurrences)
Classes ‘grouped_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 8015 obs. of  4 variables:
 $ year      : int  1991 2002 2002 2004 2015 2015 2015 2015 2015 2015 ...
 $ Main      : chr  "C09D" "A47C" "G10K" "G06G" ...
 $ Second    : chr  "C08F" "A47D" "H05K" "C07K" ...
 $ occurrence: int  1 1 1 1 1 1 1 1 1 1 ...

str(tech_distance)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   8015 obs. of  2 variables:
 $ Main         : Factor w/ 815 levels "A01B","A01C",..: 345 62 684 651 265 749 328 735 173 788 ...
 $ tech_distance: num  0.3 0 0.5 0.5 0 0.5 0.5 0 0.5 0.5 ...

Does anyone know how to merge two data frames while keeping the same number of rows?

2 answers:

Answer 0 (score: 1)

Based on the comments above:

If tech_distance varies by more than one thing, e.g. Main and Second, I would actually create a new column and then use it to perform the left_join:

    library(dplyr)
    # Assumes tech_distance also has Second and year columns
    first_occurrences <- mutate(first_occurrences, ID = paste0(Main, "_", Second, "_", year))
    tech_distance <- mutate(tech_distance, ID = paste0(Main, "_", Second, "_", year))
    combined_data <- dplyr::left_join(first_occurrences, tech_distance, by = "ID")

To reorder the columns, you can simply use select(), listing the column names in the order you want and adding -ID to drop the helper column.
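
As an alternative to pasting a helper ID, left_join can match on several columns directly by passing a character vector to `by`. The sketch below is only an illustration on made-up toy data: it assumes a hypothetical tech_distance that also carries Second and year, which the table shown in the question does not.

```r
library(dplyr)

# Hypothetical toy data: here tech_distance also has Second and year,
# unlike the two-column table shown in the question
first_occurrences <- tibble(
  year = c(2004L, 2015L), Main = c("G06G", "B64F"),
  Second = c("C07K", "B64D"), occurrence = 1L
)
tech_distance <- tibble(
  year = c(2004L, 2015L), Main = c("G06G", "B64F"),
  Second = c("C07K", "B64D"), tech_distance = c(0.5, 0.0)
)

# Join on all three key columns at once, then put the columns in the wanted order
combined_data <- first_occurrences %>%
  left_join(tech_distance, by = c("Main", "Second", "year")) %>%
  select(Main, year, Second, occurrence, tech_distance)
```

Joining on the columns themselves avoids carrying an extra ID column and avoids duplicated Main.x / Main.y columns in the result.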

For anyone else who may be reading this:

Assuming tech_distance is specific to each Main and nothing else, I would use:

combined_data <- dplyr::left_join(first_occurrences, tech_distance, by = "Main")
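
A minimal sketch of that join, using made-up toy data rather than the asker's real tables: when each Main appears only once in the lookup table, left_join returns exactly as many rows as the left table. Converting the factor key to character first is my own addition, since Main is chr in one table and a factor in the other.

```r
library(dplyr)

# Toy stand-ins for the real tables (values are illustrative only)
first_occurrences <- tibble(
  year       = c(1991L, 2002L, 2002L),
  Main       = c("C09D", "A47C", "C09D"),
  Second     = c("C08F", "A47D", "H05K"),
  occurrence = 1L
)
tech_distance <- tibble(
  Main          = factor(c("C09D", "A47C")),
  tech_distance = c(0.3, 0.0)
)

# The key is a factor here and character in the other table; align the types
tech_distance <- mutate(tech_distance, Main = as.character(Main))

# Each Main occurs once in tech_distance, so no left row is duplicated
combined_data <- left_join(first_occurrences, tech_distance, by = "Main")
nrow(combined_data) == nrow(first_occurrences)
```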

Answer 1 (score: 0)

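Are the values in the Main column all unique? If they are, you get a one-to-one match and your result will have 8015 rows. If there are duplicates, you get a one-to-many match and end up with more rows.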
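
One way to check this, sketched on made-up toy data: count how often each key occurs in the lookup table, then reduce it to one row per Main before joining. Averaging the distances per Main is purely an assumption for illustration; whether that (or keeping the first row with distinct(Main, .keep_all = TRUE)) is appropriate depends on what the duplicates mean in the real data.

```r
library(dplyr)

# Toy lookup table with a repeated key (illustrative values only)
tech_distance <- tibble(
  Main          = c("A01B", "A01B", "C09D"),
  tech_distance = c(0.7, 0.5, 0.3)
)

# Keys that occur more than once will multiply rows in a join
duplicated_keys <- tech_distance %>% count(Main) %>% filter(n > 1)

# Assumption: collapse duplicates by averaging the distance per Main
lookup <- tech_distance %>%
  group_by(Main) %>%
  summarise(tech_distance = mean(tech_distance))

# 0 means no key is repeated, so a left join on Main keeps the row count
anyDuplicated(lookup$Main)
```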