Summarize data while excluding duplicates in one column

Time: 2019-07-03 16:41:40

Tags: r data.table

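I am trying to streamline an analysis that used two SQL queries, ideally reducing it to one. To do this, I combined the biomass data with the size-class data in a single SQL query, which created duplicates. This happens because biomass is already a sum: it is the total biomass for each taxa_name by site, so in my new table it is the value on the "one" side of a one-to-many relationship and gets repeated on every size-class row.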

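To get rid of the two SQL queries, I did the work with two data.table operations followed by a join. An alternative is to do the calculation and remove duplicates twice. Is there a way to avoid both of these approaches using only data.table?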

Example data

library(data.table)

# The dput() output carried a stale .internal.selfref pointer that cannot be parsed;
# it is dropped here and the data.frame is converted with setDT() instead.
testdf <- setDT(structure(list(spcode = c(10008L, 10008L, 10002L, 10002L, 10006L, 10008L, 10008L, 10002L, 10002L, 10011L, 10002L, 10002L, 10006L, 10006L, 10006L), abundance = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 4L, 2L), biomass = c(0.2, 0.2, 0.1, 0.1, 0.1, 0.3, 0.3, 0.1, 0.1, 0.5, 0.1, 0.1, 0.5, 0.5, 0.5), size_class = c(21L, 20L, 14L, 10L, 14L, 21L, 23L, 16L, 13L, 17L, 12L, 5L, 9L, 10L, 11L), site = c(907L, 907L, 907L, 907L, 907L, 914L, 914L, 914L, 914L, 914L, 910L, 910L, 910L, 910L, 910L), taxa_name = c("Hippoglossina stomata", "Hippoglossina stomata", "Symphurus atricaudus", "Symphurus atricaudus", "Microstomus pacificus", "Hippoglossina stomata", "Hippoglossina stomata", "Symphurus atricaudus", "Symphurus atricaudus", "Parophrys vetulus", "Symphurus atricaudus", "Symphurus atricaudus", "Microstomus pacificus", "Microstomus pacificus", "Microstomus pacificus"), lnXabun = c(21L, 20L, 14L, 10L, 14L, 21L, 23L, 16L, 26L, 17L, 12L, 5L, 9L, 40L, 22L)), row.names = c(NA, -15L), class = "data.frame"))

Calculations

# biomass
bm <- testdf
bm <- bm[, .(site = unique(site)),
   by = list(spcode, taxa_name, biomass)][, totbm := sum(biomass), by = list(spcode)][!duplicated(spcode), c(1,5)]

    > bm
   spcode totbm
1:  10008   0.5
2:  10002   0.3
3:  10006   0.6
4:  10011   0.5

Next, compute the abundance summary and then join the two on spcode.

# abundance
testdf <- testdf[, .(totabn = sum(lnXabun), n = sum(abundance), minlngth = min(size_class), maxlngth = max(size_class)),
      by = list(spcode, taxa_name)]

# join
testdf[bm, on = 'spcode', bm := i.totbm]

> testdf
   spcode             taxa_name totabn n minlngth maxlngth  bm
1:  10008 Hippoglossina stomata     85 4       20       23 0.5
2:  10002  Symphurus atricaudus     83 7        5       16 0.3
3:  10006 Microstomus pacificus     85 8        9       14 0.6
4:  10011     Parophrys vetulus     17 1       17       17 0.5

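The output of testdf above is my desired output. My other attempts rely on two !duplicated calls. In my head, I would like to be able to use [, totbm := sum(biomass), by = list(unique(site), spcode)] within the abundance calculation, but that does not work.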

testdf[, .(site = (site), biomass = biomass, totabn = sum(lnXabun), n = sum(abundance), minlngth = min(size_class), maxlngth = max(size_class)), by = list(spcode, taxa_name)][, totbm := sum(biomass), by = list(unique(site), spcode)]
Error in `[.data.table`(testdf[, .(site = (site), biomass = biomass, totabn = sum(lnXabun),  : The items in the 'by' or 'keyby' list are length (3,15). Each must be length 15; the same length as there are rows in x (after subsetting if i is provided).

Alternative approach:

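# Start from the original 15-row testdf (before the abundance step above reassigns it);
# by this point bm holds only spcode and totbm, so it cannot be used here.
alt <- testdf[, .(site = site, taxa_name = taxa_name, biomass = biomass, totabn = sum(lnXabun), n = sum(abundance), minlngth = min(size_class), maxlngth = max(size_class)),
              by = list(spcode)]
alt <- alt[!duplicated(alt, by = c("site", "spcode"))]  # first de-duplication: one row per site/spcode
alt[, totbm := sum(biomass), by = list(spcode)]
alt[!duplicated(alt, by = "spcode"), c(1,3,5:9)]        # second de-duplication: one row per spcode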

2 answers:

Answer 0: (score: 3)

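As I mentioned in the comments, I am not a fan of tables with redundant data, but here is one way to solve the problem. Basically, instead of using some kind of "unique" function, add an index number within each site/taxa_name group so that every biomass value except the first in the group can be set to 0. The sum by spcode/taxa_name then works as expected. Of course, this assumes that each site/taxa_name group corresponds to exactly one biomass value.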

testdf <- data.table(spcode = c(10008L, 10008L, 10002L, 10002L, 10006L, 10008L, 10008L, 10002L, 10002L, 10011L, 10002L, 10002L, 10006L, 10006L, 10006L), 
                         abundance = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 4L, 2L), 
                         biomass = c(0.2, 0.2, 0.1, 0.1, 0.1, 0.3, 0.3, 0.1, 0.1, 0.5, 0.1, 0.1, 0.5, 0.5, 0.5), 
                         size_class = c(21L, 20L, 14L, 10L, 14L, 21L, 23L, 16L, 13L, 17L, 12L, 5L, 9L, 10L, 11L), 
                         site = c(907L, 907L, 907L, 907L, 907L, 914L, 914L, 914L, 914L, 914L, 910L, 910L, 910L, 910L, 910L), 
                         taxa_name = c("Hippoglossina stomata", "Hippoglossina stomata", "Symphurus atricaudus", "Symphurus atricaudus", "Microstomus pacificus", "Hippoglossina stomata", "Hippoglossina stomata", "Symphurus atricaudus", "Symphurus atricaudus", "Parophrys vetulus", "Symphurus atricaudus", "Symphurus atricaudus", "Microstomus pacificus", "Microstomus pacificus", "Microstomus pacificus"), 
                         lnXabun = c(21L, 20L, 14L, 10L, 14L, 21L, 23L, 16L, 26L, 17L, 12L, 5L, 9L, 40L, 22L))

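# Number the rows within each site/taxa_name group
testdf[, biomassIdx := 1:.N, by = c('site', 'taxa_name')]
# Zero out every repeated biomass value so that each group contributes only once
testdf[biomassIdx > 1, biomass := 0]
testdf[, .(totabn = sum(lnXabun), n = sum(abundance), minlngth = min(size_class), maxlngth = max(size_class), bm = sum(biomass)),
        by = list(spcode, taxa_name)]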
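A variant of the same idea that leaves the biomass column untouched can be sketched as follows. It is not part of the original answer: it sums biomass only over the first row of each site inside the grouped `j` expression, and it rests on the same assumption that each site/taxa_name group carries exactly one biomass value. The result name `res` is arbitrary.

```r
library(data.table)

# Same example data as in the question.
testdf <- data.table(
  spcode = c(10008L, 10008L, 10002L, 10002L, 10006L, 10008L, 10008L, 10002L,
             10002L, 10011L, 10002L, 10002L, 10006L, 10006L, 10006L),
  abundance = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 4L, 2L),
  biomass = c(0.2, 0.2, 0.1, 0.1, 0.1, 0.3, 0.3, 0.1, 0.1, 0.5, 0.1, 0.1,
              0.5, 0.5, 0.5),
  size_class = c(21L, 20L, 14L, 10L, 14L, 21L, 23L, 16L, 13L, 17L, 12L, 5L,
                 9L, 10L, 11L),
  site = c(907L, 907L, 907L, 907L, 907L, 914L, 914L, 914L, 914L, 914L, 910L,
           910L, 910L, 910L, 910L),
  taxa_name = c("Hippoglossina stomata", "Hippoglossina stomata",
                "Symphurus atricaudus", "Symphurus atricaudus",
                "Microstomus pacificus", "Hippoglossina stomata",
                "Hippoglossina stomata", "Symphurus atricaudus",
                "Symphurus atricaudus", "Parophrys vetulus",
                "Symphurus atricaudus", "Symphurus atricaudus",
                "Microstomus pacificus", "Microstomus pacificus",
                "Microstomus pacificus"),
  lnXabun = c(21L, 20L, 14L, 10L, 14L, 21L, 23L, 16L, 26L, 17L, 12L, 5L, 9L,
              40L, 22L)
)

# Within each spcode/taxa_name group, !duplicated(site) keeps the first row
# of every site, so each site's biomass is counted once.
# Assumption: one biomass value per site/taxa_name combination.
res <- testdf[, .(totabn   = sum(lnXabun),
                  n        = sum(abundance),
                  minlngth = min(size_class),
                  maxlngth = max(size_class),
                  bm       = sum(biomass[!duplicated(site)])),
              by = .(spcode, taxa_name)]
res
```

Because nothing is assigned with `:=`, testdf keeps its original biomass values and can be reused for other summaries.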

Answer 1: (score: 1)

Unless I am missing something, you are overcomplicating things a little. Just take the distinct rows and then summarize:

bm <- testdf[, .SD[1L], by = list(spcode, taxa_name, biomass, site) # distinct
             ][, .(totbm = sum(biomass)), by = "spcode"] # summary
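To reach the seven-column output requested in the question, the biomass table still has to be attached to the abundance summary. The sketch below combines the distinct-then-summarize step with the update join from the question; the name `abn` is introduced here for illustration only.

```r
library(data.table)

# Same example data as in the question.
testdf <- data.table(
  spcode = c(10008L, 10008L, 10002L, 10002L, 10006L, 10008L, 10008L, 10002L,
             10002L, 10011L, 10002L, 10002L, 10006L, 10006L, 10006L),
  abundance = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 4L, 2L),
  biomass = c(0.2, 0.2, 0.1, 0.1, 0.1, 0.3, 0.3, 0.1, 0.1, 0.5, 0.1, 0.1,
              0.5, 0.5, 0.5),
  size_class = c(21L, 20L, 14L, 10L, 14L, 21L, 23L, 16L, 13L, 17L, 12L, 5L,
                 9L, 10L, 11L),
  site = c(907L, 907L, 907L, 907L, 907L, 914L, 914L, 914L, 914L, 914L, 910L,
           910L, 910L, 910L, 910L),
  taxa_name = c("Hippoglossina stomata", "Hippoglossina stomata",
                "Symphurus atricaudus", "Symphurus atricaudus",
                "Microstomus pacificus", "Hippoglossina stomata",
                "Hippoglossina stomata", "Symphurus atricaudus",
                "Symphurus atricaudus", "Parophrys vetulus",
                "Symphurus atricaudus", "Symphurus atricaudus",
                "Microstomus pacificus", "Microstomus pacificus",
                "Microstomus pacificus"),
  lnXabun = c(21L, 20L, 14L, 10L, 14L, 21L, 23L, 16L, 26L, 17L, 12L, 5L, 9L,
              40L, 22L)
)

# Distinct spcode/taxa_name/biomass/site rows, then total biomass per spcode.
bm <- testdf[, .SD[1L], by = list(spcode, taxa_name, biomass, site)
             ][, .(totbm = sum(biomass)), by = "spcode"]

# Abundance summary per species (the hypothetical name `abn` avoids
# overwriting testdf).
abn <- testdf[, .(totabn   = sum(lnXabun),
                  n        = sum(abundance),
                  minlngth = min(size_class),
                  maxlngth = max(size_class)),
              by = .(spcode, taxa_name)]

# Update join: add total biomass to the abundance table by reference.
abn[bm, on = "spcode", bm := i.totbm]
abn
```

This still uses two summaries and a join, so it shortens the biomass step rather than removing the join itself.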