Question

I have a data frame (df) that looks like the following (with many more columns and rows):

Cell_Cluster     ARB2     DRAB2A    FOXP2 ....
C18|O11.F2       2.234    0.315     3.325
C18|010.J2       0.215    1.215    -0.310
C18|S92.C1      -0.562    4.624     1.426
C20|O11.F2       1.150   -1.326     3.135
C20|S93.C2      -1.135    3.001    -2.932 
C21|010.J2       2.125    1.250     0.013
.
.
.

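The columns after Cell_Cluster are all different genes. What I want to do is group by Cell_Cluster (to be precise, by all the characters before the "|") and then add a column within each group that represents the mean of each gene. How can I do this?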

Answer 1

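We assume the input data frame is the one shown reproducibly in the Note at the end.

Now suppose you want to add an extra column, mean, to the original data frame so that every row in a group carries the mean of all numeric columns in that group. Because every row has the same number of numeric values, the mean of all those numbers equals the mean of the group's rowMeans. So we can take rowMeans first and then average those row means within each group. For example, look at rows 4 and 5: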

# mean of all elements in rows 4 and 5
mean(c(1.15, -1.326, 3.135, -1.135, 3.001, -2.932))
## [1] 0.3155

# take mean of row 4 and then mean of row 5 and then mean of those 2 means
mean(c(mean(c(1.15, -1.326, 3.135)), mean(c(-1.135, 3.001, -2.932))))
## [1] 0.3155
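The same identity can be checked for every group at once. This is only a sketch: the data frame is rebuilt inline so the block runs on its own, and the names `grp`, `pooled` and `by_rows` are mine. It relies on every row having the same number of non-missing gene values; with NAs or ragged rows the two routes can differ.

```r
# Example data from the Note, rebuilt here so the block is self-contained
DF <- data.frame(
  Cell_Cluster = c("C18|O11.F2", "C18|010.J2", "C18|S92.C1",
                   "C20|O11.F2", "C20|S93.C2", "C21|010.J2"),
  ARB2   = c(2.234, 0.215, -0.562, 1.150, -1.135, 2.125),
  DRAB2A = c(0.315, 1.215, 4.624, -1.326, 3.001, 1.250),
  FOXP2  = c(3.325, -0.310, 1.426, 3.135, -2.932, 0.013)
)

# Group key: everything before the "|"
grp <- sub("\\|.*", "", DF$Cell_Cluster)

# Route 1: pool every gene value in a group, then average
pooled <- sapply(split(DF[-1], grp), function(d) mean(unlist(d)))

# Route 2: average each row first, then average the row means per group
by_rows <- tapply(rowMeans(DF[-1]), grp, mean)

# Compare the two routes group by group
all.equal(unname(pooled), as.vector(by_rows))
```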

This uses no packages.

transform(DF, mean = ave(rowMeans(DF[-1]), sub("\\|.*","",Cell_Cluster), FUN = mean))

giving:

  Cell_Cluster   ARB2 DRAB2A  FOXP2     mean
1   C18|O11.F2  2.234  0.315  3.325 1.386889
2   C18|010.J2  0.215  1.215 -0.310 1.386889
3   C18|S92.C1 -0.562  4.624  1.426 1.386889
4   C20|O11.F2  1.150 -1.326  3.135 0.315500
5   C20|S93.C2 -1.135  3.001 -2.932 0.315500
6   C21|010.J2  2.125  1.250  0.013 1.129333
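The question can also be read as asking for a separate group mean per gene rather than one pooled number. A minimal base R sketch under that reading follows; the `_mean` suffix and the names `out`, `grp` and `genes` are my own choices, not something from the question.

```r
# Example data from the Note, rebuilt here so the block is self-contained
DF <- data.frame(
  Cell_Cluster = c("C18|O11.F2", "C18|010.J2", "C18|S92.C1",
                   "C20|O11.F2", "C20|S93.C2", "C21|010.J2"),
  ARB2   = c(2.234, 0.215, -0.562, 1.150, -1.135, 2.125),
  DRAB2A = c(0.315, 1.215, 4.624, -1.326, 3.001, 1.250),
  FOXP2  = c(3.325, -0.310, 1.426, 3.135, -2.932, 0.013)
)

grp   <- sub("\\|.*", "", DF$Cell_Cluster)  # group key: text before "|"
genes <- names(DF)[-1]                      # every column except Cell_Cluster

# ave() returns, for each row, the mean of that gene within the row's group
out <- DF
out[paste0(genes, "_mean")] <- lapply(DF[genes], ave, grp)

out
```

Each original gene column gets a companion column holding its within-group mean, so the result has as many rows as the input.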

Note

Lines <- "
Cell_Cluster     ARB2     DRAB2A    FOXP2
C18|O11.F2       2.234    0.315     3.325
C18|010.J2       0.215    1.215    -0.310
C18|S92.C1      -0.562    4.624     1.426
C20|O11.F2       1.150   -1.326     3.135
C20|S93.C2      -1.135    3.001    -2.932 
C21|010.J2       2.125    1.250     0.013"
DF <- read.table(text = Lines, header = TRUE, as.is = TRUE, strip.white = TRUE)

Answer 2

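If you want to average over every gene in a group (rather than over a single column), it may help to reshape the data to long format first. You can do this with either the tidyr or the data.table package.

`tidyr` approach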

library(tidyverse)
gene <-
  read_table("Cell_Cluster     ARB2     DRAB2A    FOXP2
C18|O11.F2       2.234    0.315     3.325
C18|010.J2       0.215    1.215    -0.310
C18|S92.C1      -0.562    4.624     1.426
C20|O11.F2       1.150   -1.326     3.135
C20|S93.C2      -1.135    3.001    -2.932 
C21|010.J2       2.125    1.250     0.013")

gather(key, value) reshapes the data to long format. You can specify which columns to gather.

(gene1 <- 
  gene %>% 
  gather(-Cell_Cluster, key = key, value = value)) # gather except Cell_Cluster
#> # A tibble: 18 x 3
#>    Cell_Cluster key     value
#>    <chr>        <chr>   <dbl>
#>  1 C18|O11.F2   ARB2    2.23 
#>  2 C18|010.J2   ARB2    0.215
#>  3 C18|S92.C1   ARB2   -0.562
#>  4 C20|O11.F2   ARB2    1.15 
#>  5 C20|S93.C2   ARB2   -1.14 
#>  6 C21|010.J2   ARB2    2.12 
#>  7 C18|O11.F2   DRAB2A  0.315
#>  8 C18|010.J2   DRAB2A  1.22 
#>  9 C18|S92.C1   DRAB2A  4.62 
#> 10 C20|O11.F2   DRAB2A -1.33 
#> 11 C20|S93.C2   DRAB2A  3.00 
#> 12 C21|010.J2   DRAB2A  1.25 
#> 13 C18|O11.F2   FOXP2   3.32 
#> 14 C18|010.J2   FOXP2  -0.31 
#> 15 C18|S92.C1   FOXP2   1.43 
#> 16 C20|O11.F2   FOXP2   3.14 
#> 17 C20|S93.C2   FOXP2  -2.93 
#> 18 C21|010.J2   FOXP2   0.013

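Since you want to group by the part of Cell_Cluster before the | (if I understand correctly), you can separate that column into two parts by splitting on \\|.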

gene1 %>% 
  separate(Cell_Cluster, into = c("cell", "cluster"), 
           sep = "\\|", remove = FALSE)
#> # A tibble: 18 x 5
#>    Cell_Cluster cell  cluster key     value
#>    <chr>        <chr> <chr>   <chr>   <dbl>
#>  1 C18|O11.F2   C18   O11.F2  ARB2    2.23 
#>  2 C18|010.J2   C18   010.J2  ARB2    0.215
#>  3 C18|S92.C1   C18   S92.C1  ARB2   -0.562
#>  4 C20|O11.F2   C20   O11.F2  ARB2    1.15 
#>  5 C20|S93.C2   C20   S93.C2  ARB2   -1.14 
#>  6 C21|010.J2   C21   010.J2  ARB2    2.12 
#>  7 C18|O11.F2   C18   O11.F2  DRAB2A  0.315
#>  8 C18|010.J2   C18   010.J2  DRAB2A  1.22 
#>  9 C18|S92.C1   C18   S92.C1  DRAB2A  4.62 
#> 10 C20|O11.F2   C20   O11.F2  DRAB2A -1.33 
#> 11 C20|S93.C2   C20   S93.C2  DRAB2A  3.00 
#> 12 C21|010.J2   C21   010.J2  DRAB2A  1.25 
#> 13 C18|O11.F2   C18   O11.F2  FOXP2   3.32 
#> 14 C18|010.J2   C18   010.J2  FOXP2  -0.31 
#> 15 C18|S92.C1   C18   S92.C1  FOXP2   1.43 
#> 16 C20|O11.F2   C20   O11.F2  FOXP2   3.14 
#> 17 C20|S93.C2   C20   S93.C2  FOXP2  -2.93 
#> 18 C21|010.J2   C21   010.J2  FOXP2   0.013
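If only the prefix is needed, the split can also be done on a plain character vector in base R. A small sketch, with `ids` as a made-up subset of the example labels:

```r
ids <- c("C18|O11.F2", "C20|S93.C2", "C21|010.J2")

# Regex route: drop the "|" and everything after it.
# The bar is escaped because an unescaped "|" means alternation in a regex.
prefix_regex <- sub("\\|.*", "", ids)

# Fixed-string route: treat "|" literally and keep the first piece
prefix_fixed <- vapply(strsplit(ids, "|", fixed = TRUE),
                       function(p) p[1], character(1))

identical(prefix_regex, prefix_fixed)
```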

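Now you can compute the mean for each group. You need to append a column rather than collapse rows, so use dplyr::mutate().

With spread(key, value), you can return to the original wide format.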

gene %>% 
  gather(-Cell_Cluster, key = key, value = value) %>% 
  separate(Cell_Cluster, into = c("cell", "cluster"), 
           sep = "\\|", remove = FALSE) %>% 
  group_by(cell) %>% # group by cell column
  mutate(M = mean(value)) %>% # make mean column
  spread(key, value) %>% 
  ungroup() %>% # do not need cell and cluster column, so remove them
  select(-cell, -cluster)

#> # A tibble: 6 x 5
#>   Cell_Cluster     M   ARB2 DRAB2A  FOXP2
#>   <chr>        <dbl>  <dbl>  <dbl>  <dbl>
#> 1 C18|010.J2   1.39   0.215  1.22  -0.31 
#> 2 C18|O11.F2   1.39   2.23   0.315  3.32 
#> 3 C18|S92.C1   1.39  -0.562  4.62   1.43 
#> 4 C20|O11.F2   0.315  1.15  -1.33   3.14 
#> 5 C20|S93.C2   0.315 -1.14   3.00  -2.93 
#> 6 C21|010.J2   1.13   2.12   1.25   0.013
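In current tidyr, gather() and spread() are superseded by pivot_longer() and pivot_wider(). A sketch of the same pipeline with those functions follows. The data frame is rebuilt inline so the block runs on its own, and the prefix is taken with sub() instead of separate(), which is my substitution rather than the answer's method.

```r
library(dplyr)
library(tidyr)

gene <- data.frame(
  Cell_Cluster = c("C18|O11.F2", "C18|010.J2", "C18|S92.C1",
                   "C20|O11.F2", "C20|S93.C2", "C21|010.J2"),
  ARB2   = c(2.234, 0.215, -0.562, 1.150, -1.135, 2.125),
  DRAB2A = c(0.315, 1.215, 4.624, -1.326, 3.001, 1.250),
  FOXP2  = c(3.325, -0.310, 1.426, 3.135, -2.932, 0.013)
)

result <- gene %>%
  # long format: one row per (Cell_Cluster, gene) pair
  pivot_longer(-Cell_Cluster, names_to = "key", values_to = "value") %>%
  # group key: everything before the "|"
  mutate(cell = sub("\\|.*", "", Cell_Cluster)) %>%
  group_by(cell) %>%
  mutate(M = mean(value)) %>%   # pooled mean of all gene values in the group
  ungroup() %>%
  select(-cell) %>%
  # back to wide format: one column per gene again
  pivot_wider(names_from = key, values_from = value)

result
```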

You can see the M column, which holds the mean computed for each group.

`data.table` approach

Gene data can be large, so data.table may be a better fit.

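Instead of tidyr::gather(), use data.table::melt():
- id.vars
- variable.name
Instead of tidyr::separate(), use data.table::tstrsplit():
- to split on the regular expression \\|, the code below passes perl = TRUE (strsplit()'s default regex engine accepts this pattern as well)
Instead of tidyr::spread(), use data.table::dcast():
- formula: on the left-hand side, put the id column plus any added variables; on the right-hand side, put the column that holds the original variable names
- value.var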
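To see what tstrsplit() returns on its own, here is a tiny sketch. The two labels in `ids` are taken from the example data, and `fixed = TRUE` is used here so the bar is matched literally without any escaping; it is forwarded to strsplit().

```r
library(data.table)

ids <- c("C18|O11.F2", "C20|S93.C2")

# tstrsplit() splits each string and transposes the result:
# one list element per piece, each as long as the input
parts <- tstrsplit(ids, "|", fixed = TRUE)

# The pieces can be assigned to new columns by reference with :=
DT <- data.table(Cell_Cluster = ids)
DT[, c("cell", "cluster") := tstrsplit(Cell_Cluster, "|", fixed = TRUE)]

DT
```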

All at once:

library(data.table)  # needed for data.table(), melt(), tstrsplit() and dcast()

gene %>% 
  data.table() %>% 
  melt(id.vars = "Cell_Cluster", variable.name = "key") %>% # gather
  .[,
    c("cell", "cluster") := tstrsplit(Cell_Cluster, split = "\\|", perl = TRUE)] %>% # split Cell_Cluster
  .[,
    M := mean(value), # average value column
    by = cell] %>% # group by cell
  dcast(Cell_Cluster + M ~ key, value.var = "value") # spread

#>    Cell_Cluster     M   ARB2 DRAB2A  FOXP2
#> 1:   C18|010.J2 1.387  0.215  1.215 -0.310
#> 2:   C18|O11.F2 1.387  2.234  0.315  3.325
#> 3:   C18|S92.C1 1.387 -0.562  4.624  1.426
#> 4:   C20|O11.F2 0.315  1.150 -1.326  3.135
#> 5:   C20|S93.C2 0.315 -1.135  3.001 -2.932
#> 6:   C21|010.J2 1.129  2.125  1.250  0.013
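If the long format is not needed for anything else, the pooled group mean can probably be computed without reshaping at all. A sketch under that assumption, using .SD restricted to the gene columns; `DT` and `gene_cols` are names I made up, and the data is rebuilt inline so the block runs on its own.

```r
library(data.table)

DT <- data.table(
  Cell_Cluster = c("C18|O11.F2", "C18|010.J2", "C18|S92.C1",
                   "C20|O11.F2", "C20|S93.C2", "C21|010.J2"),
  ARB2   = c(2.234, 0.215, -0.562, 1.150, -1.135, 2.125),
  DRAB2A = c(0.315, 1.215, 4.624, -1.326, 3.001, 1.250),
  FOXP2  = c(3.325, -0.310, 1.426, 3.135, -2.932, 0.013)
)

gene_cols <- setdiff(names(DT), "Cell_Cluster")

# For each prefix group, flatten all gene values into one vector and average;
# := writes the result back to every row of the group by reference
DT[, M := mean(unlist(.SD)),
   by = .(cell = sub("\\|.*", "", Cell_Cluster)),
   .SDcols = gene_cols]

DT
```

This keeps the original row order and column layout, with M appended as the last column.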

The data.table version is faster on this example:

microbenchmark::microbenchmark(
  DPLYR = {
    gene %>% 
  gather(-Cell_Cluster, key = key, value = value) %>% 
  separate(Cell_Cluster, into = c("cell", "cluster"), 
           sep = "\\|", remove = FALSE) %>% 
  group_by(cell) %>% 
  mutate(M = mean(value)) %>% 
  spread(key, value) %>% 
  ungroup() %>% 
  select(-cell, -cluster)
  },
  DATATABLE = {
    gene %>% 
  data.table() %>% 
  melt(id.vars = "Cell_Cluster", variable.name = "key") %>% 
  .[,
    c("cell", "cluster") := tstrsplit(Cell_Cluster, split = "\\|", perl = TRUE)] %>% 
  .[,
    M := mean(value),
    by = cell] %>%
  dcast(Cell_Cluster + M ~ key, value.var = "value")
  },
  times = 50
)
#> Unit: milliseconds
#>       expr  min    lq mean median    uq   max neval
#>      DPLYR 8.55 10.15 11.7  11.39 12.53 20.22    50
#>  DATATABLE 3.39  3.94  4.8   4.77  5.46  7.69    50
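These timings come from a six-row table, so they mostly reflect fixed overhead. To see how the two approaches scale, one could rerun the benchmark on a larger synthetic table. The sketch below only builds such a table; the sizes, the seed, the `K` cluster labels and the `GENE` column names are all arbitrary assumptions, and the values are random noise, not real expression data.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

n_cells    <- 200  # hypothetical number of distinct prefixes before "|"
n_clusters <- 50   # hypothetical number of rows per prefix
n_genes    <- 100  # hypothetical number of gene columns

big_gene <- data.frame(
  # labels of the form "C<cell>|K<cluster>", mimicking the example layout
  Cell_Cluster = paste0("C", rep(seq_len(n_cells), each = n_clusters),
                        "|K", rep(seq_len(n_clusters), times = n_cells)),
  # random values standing in for expression measurements
  matrix(rnorm(n_cells * n_clusters * n_genes), ncol = n_genes,
         dimnames = list(NULL, paste0("GENE", seq_len(n_genes))))
)

dim(big_gene)
```

Passing `big_gene` in place of `gene` in the microbenchmark call above would then give timings on 10,000 rows and 100 gene columns.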
