Question

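I am summarizing a large amount of sensor data. I need to extract (1) the maximum run length of a specific category and (2) summary statistics for all variables within each run.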

Example data:

library(dplyr)

fruit <- as.factor(c('apple','apple','banana','banana','banana','guava','guava','guava','guava','apple','apple','apple','banana','guava'))
duration <- c(1,2,1,2,3,1,2,3,4,1,2,3,1,1)
set.seed(14)
temp <- round(runif(14, 80.0, 105.0))
test <- data.frame(duration, fruit, temp)

#Example Data Frame
duration  fruit   temp
 1        apple   86
 2        apple   96
 1        banana  104
 2        banana  94
 3        banana  105
 1        guava   93
 2        guava   103
 3        guava   91
 4        guava   92
 1        apple   90
 2        apple   102
 3        apple   84
 1        banana  92
 1        guava   101

I can accomplish #1 by comparing each row with the adjacent row to see whether they differ:

test %>% filter((lead(`fruit`) != `fruit`)| is.na(lead(`fruit`)) )

However, this result keeps only a single temp entry per run (the one from the run's last row), and I would like to be able to compute various summaries of the temp data within each run, such as the mean.

Where I want to end up is a frame with one row per run, holding the fruit, the run length, and those temp summaries.

Any thoughts on how to do this efficiently?

Answer 1

We can create a run identifier with lag and cumsum, then compute the statistics for each group.

library(dplyr)

test %>%
  group_by(group = cumsum(fruit != lag(fruit, default = first(fruit)))) %>%
  summarise(fruit = first(fruit), 
            duration = n(), 
            mean_temp = mean(temp)) %>%
  select(-group)

#  fruit  duration mean_temp
#  <fct>     <int>     <dbl>
#1 apple         2      91  
#2 banana        3     101  
#3 guava         4      94.8
#4 apple         3      92  
#5 banana        1      92  
#6 guava         1     101
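
To make the grouping step concrete, here is a minimal base R sketch of what the lag/cumsum expression computes on the question's fruit column. It deliberately avoids dplyr so it runs on its own, and the names `changed` and `group` are only illustrative, not part of the answer's code.

```r
# Same fruit values as in the question, as a plain character vector
fruit <- c("apple", "apple", "banana", "banana", "banana",
           "guava", "guava", "guava", "guava",
           "apple", "apple", "apple", "banana", "guava")

# TRUE wherever a value differs from the one before it.
# The first element has no predecessor, so it is set to FALSE,
# which mirrors lag(fruit, default = first(fruit)).
changed <- c(FALSE, fruit[-1] != fruit[-length(fruit)])

# Each change bumps the running total, so every row in a run
# ends up with the same id.
group <- cumsum(changed)
print(group)
# 0 0 1 1 1 2 2 2 2 3 3 3 4 5
```

Because the ids only increase when the value changes, the second apple run gets a different id from the first one, which is what keeps separate runs of the same fruit apart.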

Alternatively, the group_by line can be replaced with one that uses data.table::rleid to create the groups:

group_by(group = data.table::rleid(fruit))

Or using rle:

group_by(group = with(rle(as.character(fruit)), rep(seq_along(values), lengths)))
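
As a small illustration of the rle variant above, the following base R sketch shows what rle returns for the question's fruit values and how repeating the run index by the run lengths turns that into one id per row. The names `runs` and `group` are assumptions made for the example.

```r
fruit <- c("apple", "apple", "banana", "banana", "banana",
           "guava", "guava", "guava", "guava",
           "apple", "apple", "apple", "banana", "guava")

# rle() needs an atomic vector, which is why the answer wraps the
# factor column in as.character(); here fruit is already character.
runs <- rle(fruit)
print(runs$values)   # "apple" "banana" "guava" "apple" "banana" "guava"
print(runs$lengths)  # 2 3 4 3 1 1

# Repeat each run's index by that run's length to get a per-row id
group <- rep(seq_along(runs$values), runs$lengths)
print(group)
# 1 1 2 2 2 3 3 3 3 4 4 4 5 6
```

The run lengths in `runs$lengths` are the same numbers that appear in the duration column of the summarised output.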

Or using data.table:

library(data.table)
setDT(test)[, .(duration = .N, fruit = fruit[1L], 
                mean_temp = mean(temp)), by = rleid(fruit)]
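
The approaches above give one row per run. For the first part of the question (the maximum run length of a given category), one possible follow-up is to take the per-run table and reduce it again by fruit. The sketch below does this in base R under a few assumptions: the temp values are copied from the data frame printed in the question rather than regenerated with set.seed, and the names `per_run` and `max_run` are hypothetical.

```r
# Data as printed in the question (temp values typed in by hand)
test <- data.frame(
  fruit = c("apple", "apple", "banana", "banana", "banana",
            "guava", "guava", "guava", "guava",
            "apple", "apple", "apple", "banana", "guava"),
  temp = c(86, 96, 104, 94, 105, 93, 103, 91, 92, 90, 102, 84, 92, 101),
  stringsAsFactors = FALSE
)

# One id per row, shared by all rows of the same run
runs <- rle(test$fruit)
run_id <- rep(seq_along(runs$lengths), runs$lengths)

# One row per run: category, run length, mean temp within the run
per_run <- data.frame(
  fruit = runs$values,
  duration = runs$lengths,
  mean_temp = as.vector(tapply(test$temp, run_id, mean)),
  stringsAsFactors = FALSE
)
print(per_run)

# Longest run observed for each fruit
max_run <- tapply(per_run$duration, per_run$fruit, max)
print(max_run)
```

With the dplyr pipeline from the answer, the same second step could presumably be written as a further group_by(fruit) followed by summarise, but that variant is not shown in the original answer.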

Summarize by categorical runs
