Question

I want to add the "tech_distance" table to my "first_occurrences" table. The two data tables:

head(first_occurrences)
# A tibble: 6 x 4
# Groups:   Main, Second [6]
   year  Main Second  occurrence
   <int> <chr>  <chr>      <int>
1  1991  C09D   C08F          1
2  2002  A47C   A47D          1
3  2002  G10K   H05K          1
4  2004  G06G   C07K          1
5  2015  B64F   B64D          1
6  2015  H02G   B29C          1


head(tech_distance)
# A tibble: 6 x 2
    Main  tech_distance
   <fctr>         <dbl>
1   C09D           0.3
2   A47C           0.0
3   G10K           0.5
4   G06G           0.5
5   B64F           0.0
6   H02G           0.5

This is the result I want to get:

head(first_occurrences)
   Main year Second occurrence tech_distance
 1 A01B 2004   E21B          1           0.7
 2 A01B 2004   E21B          1           0.5
 3 A01B 2004   E21B          1           0.7
 4 A01B 2004   E21B          1           0.5
 5 A01B 2004   E21B          1           0.5
 6 A01B 2004   E21B          1           1.0

I used mutate from dplyr:

first_occurrences <- data %>% 
 select(year = X3,Main = X7,Second = X8) %>% 
 group_by(Main,Second) %>% 
 mutate(occurrence = n(), tech_distance) %>% 
 filter(occurrence >= 0, occurrence <= 1, !(Main == Second))

But I got this error:

 Error in mutate_impl(.data, dots) : 
 Column `tech_distance` must be length 24 (the group size) or one, not 2

So I tried using merge():

first_occurrences <- merge(first_occurrences, tech_distance, by.x = "Main", by.y = "Main", all.x=T)

This seems to work, but I end up with a huge number of rows (240,217 entries):

 str(first_occurrences)
 'data.frame':  240217 obs. of  5 variables:
  $ Main         : chr  "A01B" "A01B" "A01B" "A01B" ...
  $ year         : int  2004 2004 2004 2004 2004 2004 2004 2004 2004 2004 ...
  $ Second       : chr  "E21B" "E21B" "E21B" "E21B" ...
  $ occurrence   : int  1 1 1 1 1 1 1 1 1 1 ...
  $ tech_distance: num  0.7 0.5 0.7 0.5 0.5 1 0.5 0.7 0.3 0 ...

Before the merge, the datasets were:

str(first_occurrences)
Classes ‘grouped_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 8015 obs. of  4 variables:
 $ year      : int  1991 2002 2002 2004 2015 2015 2015 2015 2015 2015 ...
 $ Main      : chr  "C09D" "A47C" "G10K" "G06G" ...
 $ Second    : chr  "C08F" "A47D" "H05K" "C07K" ...
 $ occurrence: int  1 1 1 1 1 1 1 1 1 1 ...

str(tech_distance)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   8015 obs. of  2 variables:
 $ Main         : Factor w/ 815 levels "A01B","A01C",..: 345 62 684 651 265 749 328 735 173 788 ...
 $ tech_distance: num  0.3 0 0.5 0.5 0 0.5 0.5 0 0.5 0.5 ...

Does anyone know how to merge the two data frames while keeping the same number of rows?

Answer 1

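Based on the comments above:

If tech_distance varies by more than one thing, e.g. Main and Second, I would create a new key column in both tables and then use it for the left_join.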

    library(dplyr)
    # Assumes tech_distance also has Second and year columns (not shown in the question)
    first_occurrences <- mutate(first_occurrences, ID = paste0(Main, "_", Second, "_", year))
    tech_distance <- mutate(tech_distance, ID = paste0(Main, "_", Second, "_", year))
    combined_data <- dplyr::left_join(first_occurrences, tech_distance, by = "ID")

To reorder the columns, just use select() with the column names in the order you want, followed by -ID to drop the helper column.

For anyone else who might be reading this:

Assuming tech_distance is specific to each Main and nothing else, I would use:

combined_data <- dplyr::left_join(first_occurrences, tech_distance, by = "Main")
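
A minimal sketch of that join on toy data (the values below are invented for illustration and are not the asker's data). The str() output in the question shows Main as a factor in tech_distance and as character in first_occurrences, so this sketch converts the key first so both sides have the same type; tech_distance_chr is just a name made up for the converted copy.

```r
library(dplyr)

# Toy data, invented for illustration only
first_occurrences <- data.frame(
  year = c(1991L, 2002L, 2002L),
  Main = c("C09D", "A47C", "G10K"),
  Second = c("C08F", "A47D", "H05K"),
  occurrence = c(1L, 1L, 1L),
  stringsAsFactors = FALSE
)
tech_distance <- data.frame(
  Main = factor(c("C09D", "A47C", "G10K")),
  tech_distance = c(0.3, 0.0, 0.5)
)

# Make the join key character on both sides
tech_distance_chr <- mutate(tech_distance, Main = as.character(Main))

# Each Main appears once in the lookup, so the row count is unchanged
combined_data <- left_join(first_occurrences, tech_distance_chr, by = "Main")
nrow(combined_data) == nrow(first_occurrences)
```

This only keeps the row count because every Main value is unique in the toy lookup table; with repeated keys the result grows.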

Answer 2

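Are the values in the Main column all unique? If they are, you get a one-to-one match and your result will have 8015 rows. If there are duplicates, you get a one-to-many match and end up with more rows.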
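
A small sketch of how one might check this, using made-up toy values rather than the real data. It counts how often each Main appears in the lookup table, shows the row growth from a one-to-many join, and then collapses the lookup to one row per Main before joining. Averaging with mean() is only an assumption for illustration; whether it makes sense depends on what tech_distance actually measures. The names dup_keys, many, lookup and one are hypothetical.

```r
library(dplyr)

# Toy lookup table with a repeated key (values invented for illustration)
tech_distance <- data.frame(
  Main = c("A01B", "A01B", "C09D"),
  tech_distance = c(0.7, 0.5, 0.3),
  stringsAsFactors = FALSE
)
first_occurrences <- data.frame(
  year = c(2004L, 1991L),
  Main = c("A01B", "C09D"),
  Second = c("E21B", "C08F"),
  occurrence = c(1L, 1L),
  stringsAsFactors = FALSE
)

# Keys that appear more than once in the lookup table
dup_keys <- tech_distance %>% count(Main) %>% filter(n > 1)

# One-to-many join: the A01B row is repeated once per matching lookup row
many <- left_join(first_occurrences, tech_distance, by = "Main")

# Collapse to one value per Main first (assumption: the mean is acceptable)
lookup <- tech_distance %>%
  group_by(Main) %>%
  summarise(tech_distance = mean(tech_distance))
one <- left_join(first_occurrences, lookup, by = "Main")

c(nrow(first_occurrences), nrow(many), nrow(one))
```

If dup_keys comes back empty, the join is one-to-one and the row count stays the same without any collapsing.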

Matching two data tables (Vlookup, dplyr, match(), left_join) while keeping the number of rows
