Question

I have two data frames:

info
Fname  Lname
Henry      H
 Rose      R
Jacob      T
 John      O
 Fred      Y
Simon      S
  Gay      T

and

students
Fname  Lname  Age  Height  Subject Result
Henry      H   12      15 Math;Sci      P
 Rose      R   11      18 Math;Sci      P
Jacob      T   11      15 Math;Sci      P
Henry      H   11      14 Math;Sci      P
 John      O   12      13 Math;Sci      P
 John      O   13      16 Math;Sci      F
 Fred      Y   11      16      Sci      P
Simon      S   12      10 Eng;Math      P
  Gay      T   12      11 Math;Sci      F
 Rose      R   15      18 Math;Sci      P
 Fred      Y   12      16 Math;Sci      P

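I want to do a JOIN that takes every name from info and pulls the related metadata from students, but keeps only the row with the highest Age when Fname and Lname are equal. My output should look like this: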

Final
Fname Lname Age Height  Subject Result
Henry     H  12     15 Math;Sci      P
 Rose     R  15     18 Math;Sci      P
Jacob     T  11     15 Math;Sci      P
 John     O  13     16 Math;Sci      F
 Fred     Y  12     16 Math;Sci      P
Simon     S  12     10 Eng;Math      P
  Gay     T  12     11 Math;Sci      F

I tried sqldf but have had no luck so far. I just can't get the identifiers right. Is there another way to get my output?

Answer 1

This may be a less elegant way, using base R.

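First, merge the frames on the names (although doing so makes little sense in this example, since info is really just a list of the names already in the students frame). The code that builds both data frames is at the end of this answer.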

merged_df <- merge(students,info,by=c("Fname","Lname"))

Finally, aggregate, here grouping by the names only. You can add any other categorical or factor variables to the grouping.

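## take the maximum Age per name; aggregating Age alone avoids pairing
## a max Age with a max Height that come from different rows
merged_df_max <- aggregate(
                merged_df["Age"],
                by = list(Fname = merged_df$Fname,
                          Lname = merged_df$Lname),
                FUN = max, na.rm = TRUE)

## add back details to the merged df by matching on name and max Age
result <- merge(merged_df_max, students, by = c("Fname", "Lname", "Age"))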
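As an alternative to the aggregate-then-merge round trip, the same "oldest row per name" selection can be sketched in base R by sorting and dropping duplicates. This is only an illustration, not part of the original answer: the small students frame below is a cut-down stand-in for the question's data, and the names ordered and oldest are made up for the example.

```r
# Cut-down stand-in for the students data frame from the question.
students <- data.frame(
  Fname  = c("Henry", "Henry", "John", "John", "Rose", "Rose"),
  Lname  = c("H", "H", "O", "O", "R", "R"),
  Age    = c(12, 11, 12, 13, 11, 15),
  Height = c(15, 14, 13, 16, 18, 18),
  stringsAsFactors = FALSE
)

# Sort so that, within each name, the oldest row comes first.
ordered <- students[order(students$Fname, students$Lname, -students$Age), ]

# duplicated() flags every row after the first per (Fname, Lname) pair,
# so negating it keeps exactly one row per person: the oldest.
oldest <- ordered[!duplicated(ordered[c("Fname", "Lname")]), ]
rownames(oldest) <- NULL
print(oldest)
```

Because whole rows are kept, Height and any other columns always come from the same row as the maximum Age, and a tie on Age still yields a single row per name.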

Creating the data frames from inline text:

## load data
lines <-"
Fname,Lname,Age,Height,Subject,Result
Henry,H,12,15,Math;Sci,P
Rose,R,11,18,Math;Sci,P
Jacob,T,11,15,Math;Sci,P
Henry,H,11,14,Math;Sci,P
John,O,12,13,Math;Sci,P
John,O,13,16,Math;Sci,F
Fred,Y,11,16,Sci,P
Simon,S,12,10,Eng;Math,P
Gay,T,12,11,Math;Sci,F
Rose,R,15,18,Math;Sci,P
Fred,Y,12,16,Math;Sci,P
"

lines2 <-"
Fname,Lname
Henry,H
Rose,R
Jacob,T
John,O
Fred,Y
Simon,S
Gay,T
"

con <- textConnection(lines)
students <- read.csv(con,sep=',')
con2 <- textConnection(lines2)
info <- read.csv(con2,sep=',')
close(con)
close(con2)

Answer 2

Using dplyr:

library(dplyr)

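# join on both name columns explicitly, then keep the oldest row(s) per name
info %>% left_join(students, by = c("Fname", "Lname")) %>%
    group_by(Fname, Lname) %>%
    filter(Age == max(Age))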
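One thing worth knowing about the filter approach is that it keeps every row tied at the group maximum. The sketch below uses hypothetical data (Ann and Bob are not in the question) to show the difference, and assumes dplyr 1.0.0 or later for slice_max().

```r
library(dplyr)

# Hypothetical data: Ann has two rows tied at the maximum Age.
students <- data.frame(
  Fname  = c("Ann", "Ann", "Bob"),
  Lname  = c("A", "A", "B"),
  Age    = c(12, 12, 11),
  Height = c(15, 16, 14)
)

# filter() keeps every row that matches the group maximum, ties included.
with_ties <- students %>%
  group_by(Fname, Lname) %>%
  filter(Age == max(Age)) %>%
  ungroup()

# slice_max() with with_ties = FALSE keeps a single row per group.
one_each <- students %>%
  group_by(Fname, Lname) %>%
  slice_max(Age, n = 1, with_ties = FALSE) %>%
  ungroup()

nrow(with_ties)  # 3 rows: both Ann rows plus Bob
nrow(one_each)   # 2 rows: one per name
```

For the data in the question no name has a tie on the maximum Age, so both forms would give the same seven rows.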

Answer 3

Try this:

library(sqldf)
sqldf("select Fname, Lname, max(Age) Age, Height, Subject, Result 
       from info left join students using (Fname, Lname)
       group by Fname, Lname")

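We use a left join in case a student in info has no data in students. In the question, info and students contain the same students, so we could omit the word left from the query and still get the same result. Note also that, because the same set of students appears in both info and students, we do not need info at all. The following query is identical to the previous one except for the from line, and gives the same answer for the data provided: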

sqldf("select Fname, Lname, max(Age) Age, Height, Subject, Result 
       from students
       group by Fname, Lname")
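To make the left-versus-inner distinction concrete, here is a small sketch with hypothetical data in which one name ("Zoe Z", not in the question) appears in info but not in students. It assumes sqldf is running on its default SQLite backend, where the non-aggregated columns in a query with max() are taken from the row that holds the maximum; other backends may reject or handle such a query differently.

```r
library(sqldf)

# Hypothetical data: "Zoe Z" appears in info but has no row in students.
info <- data.frame(Fname = c("John", "Zoe"), Lname = c("O", "Z"))
students <- data.frame(
  Fname  = c("John", "John"),
  Lname  = c("O", "O"),
  Age    = c(12, 13),
  Result = c("P", "F")
)

# Left join: every name in info is kept; unmatched names get NA values.
left_res <- sqldf("select Fname, Lname, max(Age) Age, Result
                   from info left join students using (Fname, Lname)
                   group by Fname, Lname")

# Inner join: names without a match in students are dropped.
inner_res <- sqldf("select Fname, Lname, max(Age) Age, Result
                    from info join students using (Fname, Lname)
                    group by Fname, Lname")

print(left_res)
print(inner_res)
```

Here left_res has a row for Zoe with missing Age and Result, while inner_res contains only John.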

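Note: For reproducibility, the following builds the info and students data frames. When asking questions on SO in the future, please provide this yourself.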

Lines_info <- "
Fname  Lname
Henry      H
 Rose      R
Jacob      T
 John      O
 Fred      Y
Simon      S
  Gay      T
"
Lines_students <- "
Fname  Lname  Age  Height  Subject Result
Henry      H   12      15 Math;Sci      P
 Rose      R   11      18 Math;Sci      P
Jacob      T   11      15 Math;Sci      P
Henry      H   11      14 Math;Sci      P
 John      O   12      13 Math;Sci      P
 John      O   13      16 Math;Sci      F
 Fred      Y   11      16      Sci      P
Simon      S   12      10 Eng;Math      P
  Gay      T   12      11 Math;Sci      F
 Rose      R   15      18 Math;Sci      P
 Fred      Y   12      16 Math;Sci      P
"

info <- read.table(text = Lines_info, header = TRUE)
students <- read.table(text = Lines_students, header = TRUE)

INNER JOIN with a MAX condition
