Select specific columns from a data frame using data from another table

Date: 2016-06-01 15:45:35

Tags: r dataframe

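I have a table with about 8,000 observations and 65 variables, and a second table with 35 observations and 11 variables.

The larger table looks like this: [image: portion of the larger table]

The smaller table looks like this: [image: portion of the smaller table]

As you can see, the first column of the smaller table contains some of the column names of the larger table. Rather than typing out every column I want, is there a more compact way to have R build a table that contains only the columns of the larger table that are named in the smaller table?

Any help is much appreciated!

Update: Thanks to the answerer for the data. I would also like to know whether the columns selected from large.df can be returned in the order in which their names appear in small.df.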

large.df <- data.frame(A=rnorm(5), B=abs(rnorm(5, sd=0.08)),
             C=rnorm(5), D=abs(rnorm(5, sd=0.08)))


        A           B          C          D
1  0.2367193 0.002297593 -0.1958682 0.03877595
2 -1.2419638 0.034031808  0.3253622 0.02578829
3 -0.2718915 0.188627689  0.4844783 0.04405741
4 -0.6587699 0.024045926 -1.1209473 0.09849541
5  1.7890422 0.055520325  0.1093612 0.11637796

samll.df <- data.frame(Category = c("D","B"))
samll.df

  Category
1        D
2        B

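I want the output columns to be 'D','B' rather than 'B','D'. My real example has ~35 columns, so something more compact than typing the column names in the desired order would be ideal. Thanks.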

2 Answers:

Answer 0 (score: 1)

Use %in%

  > a <- data.frame(A=1:10,B=11:20,C=1:10)   # Small data frame
  > b <- data.frame(A=1:10,D=11:20,C=21:30,E=41:50) # Big data frame

  # Column names common are A and C
  > R <- b[,names(b) %in% names(a)]
  > R
      A  C
  1   1 21
  2   2 22
  3   3 23
  4   4 24
  5   5 25
  6   6 26
  7   7 27
  8   8 28
  9   9 29
  10 10 30
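
In the question, the wanted names sit in a column of the small table rather than in its column names, so the same %in% idea could be applied to that column instead. A minimal sketch, assuming hypothetical frames `big` and `small` (with a `Category` column as in the question); none of these names come from the answer above:

```r
# Hypothetical data: `big` holds the columns, `small$Category` names the wanted ones
big <- data.frame(A = 1:3, B = 4:6, C = 7:9, D = 10:12)
small <- data.frame(Category = c("D", "B"), stringsAsFactors = FALSE)

# Logical mask over the column names of `big`
keep <- names(big) %in% small$Category

# drop = FALSE keeps a data frame even if only one column matches
res_in <- big[, keep, drop = FALSE]
names(res_in)
```

A logical mask like this keeps the columns in the order they have in `big` (here B, then D), not in the order listed in `small$Category`, so on its own it does not address the ordering request in the update.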

Answer 1 (score: 0)

cols.small_table<-as.character(samll.df$Category)

Solution 1: same order as small.df

# order columns in large.df based on cols.small_table and subset data
large.df[ , match(cols.small_table, names(large.df))]
           D           B
1 0.03877595 0.002297593
2 0.02578829 0.034031808
3 0.04405741 0.188627689
4 0.09849541 0.024045926
5 0.11637796 0.055520325
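
Since a data frame can also be indexed directly by a character vector, the reordering might be sketched without match() as well. This is an illustration under assumptions, not the answerer's code: `big`, `small` and `wanted` are hypothetical names, and the %in% filter is added here only to skip names that do not exist in the large table (match() would return NA for those).

```r
# Hypothetical data; "Z" is deliberately not a column of `big`
big <- data.frame(A = 1:3, B = 4:6, C = 7:9, D = 10:12)
small <- data.frame(Category = c("D", "Z", "B"), stringsAsFactors = FALSE)

# Keep only names that exist in `big`, preserving the order given in `small`
wanted <- small$Category[small$Category %in% names(big)]

# Character indexing returns the columns in the order of `wanted`
res_ordered <- big[, wanted, drop = FALSE]
names(res_ordered)
```

Here the result has the columns D and B, in that order, which is the ordering asked for in the update.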

Solution 2

# Keep the columns in large table based on match in small table 
large.df[ , which(names(large.df) %in% cols.small_table)] 
            B          D
1 0.002297593 0.03877595
2 0.034031808 0.02578829
3 0.188627689 0.04405741
4 0.024045926 0.09849541
5 0.055520325 0.11637796

# Remove the columns in large table based on match in small table
large.df[ , -which(names(large.df) %in% cols.small_table)] 

           A          C
1  0.2367193 -0.1958682
2 -1.2419638  0.3253622
3 -0.2718915  0.4844783
4 -0.6587699 -1.1209473
5  1.7890422  0.1093612
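
One caveat with the -which() form may be worth illustrating: when no name matches, which() returns an empty vector, and a negated empty index selects no columns instead of all of them. A small sketch with hypothetical names (`big`, `none`), comparing it with plain logical negation:

```r
big <- data.frame(A = 1:3, B = 4:6, C = 7:9, D = 10:12)
none <- c("X", "Y")  # hypothetical names, none of them present in `big`

# which() gives integer(0); negating it is still integer(0), so nothing is selected
via_which <- big[, -which(names(big) %in% none), drop = FALSE]

# Logical negation keeps every column when nothing matches
via_not <- big[, !(names(big) %in% none), drop = FALSE]

ncol(via_which)  # 0
ncol(via_not)    # 4
```

When at least one name matches, both forms drop the same columns, so the difference only shows up in the no-match case.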

Data

large.df <- data.frame(A=rnorm(5), B=abs(rnorm(5, sd=0.08)),
                 C=rnorm(5), D=abs(rnorm(5, sd=0.08)))


            A           B          C          D
1  0.2367193 0.002297593 -0.1958682 0.03877595
2 -1.2419638 0.034031808  0.3253622 0.02578829
3 -0.2718915 0.188627689  0.4844783 0.04405741
4 -0.6587699 0.024045926 -1.1209473 0.09849541
5  1.7890422 0.055520325  0.1093612 0.11637796

samll.df <- data.frame(Category = c("D","B"))
samll.df

  Category
1        D
2        B