Question

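I am trying to find historical peaks in item sales over consecutive years. My problem is that some items were sold in the past and have since been discontinued, but they still need to be part of the analysis.

I have looked into some for loops in R, but I am not sure how to sum over consecutive years and compare each sum with the other local maxima in the same dataset.

For example: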

Year      Item            Sales
2001      Trash Can       100
2002      Trash Can       125
2003      Trash Can       90
2004      Trash Can       97
2002      Red Balloon     23
2003      Red Balloon     309
2004      Red Balloon     67
2005      Red Balloon     8
1998      Blue Bottle     600
1999      Blue Bottle     565

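Based on the data above, if I want the 2-year sales peak, I would like the output to be Blue Bottle 1165 (sum of 1998 and 1999), Red Balloon 376 (sum of 2003 and 2004), and Trash Can 225 (sum of 2001 and 2002). However, if I want a 3-year peak, Blue Bottle would be ineligible because it only has 2 years of data.

If I want the 3-year sales peak, I would like the output to be Red Balloon 399 (sum of 2002 through 2004) and Trash Can 315 (sum of 2001 through 2003).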

Answer 1

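In SQL, you can use window functions. For the qualifying two-year sales (only rows that have a preceding year for the same item):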

    select item, year, two_year_sales
    from (select t.*,
                 sum(sales) over (partition by item order by year rows between 1 preceding and current row) as two_year_sales,
                 row_number() over (partition by item order by year) as seqnum
          from t
         ) t
    where seqnum >= 2;

And to get the peak for each item:

select t.*   
from (select item, two_year_sales, year,
             max(two_year_sales) over (partition by item) as max_two_year_sales
      from (select t.*,
                   sum(sales) over (partition by item order by year rows between 1 preceding and current row) as two_year_sales,
                   row_number() over (partition by item order by year) as seqnum
            from t
           ) t
      where seqnum >= 2
     ) t
where two_year_sales = max_two_year_sales;
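
For readers working in R rather than SQL, the same window logic can be sketched in base R. This is only an illustration, assuming each item has exactly one row per year with no gaps; the data frame name `sales_tbl` and its lowercase column names are made up here to mirror the query.

```r
# Sample data from the question (data frame and column names are assumptions)
sales_tbl <- data.frame(
  year  = c(2001, 2002, 2003, 2004, 2002, 2003, 2004, 2005, 1998, 1999),
  item  = rep(c("Trash Can", "Red Balloon", "Blue Bottle"), times = c(4, 4, 2)),
  sales = c(100, 125, 90, 97, 23, 309, 67, 8, 600, 565)
)

# Sort so that "preceding row" means "preceding year" within each item
sales_tbl <- sales_tbl[order(sales_tbl$item, sales_tbl$year), ]

# seqnum: row number within each item, like row_number() over (partition by item order by year)
sales_tbl$seqnum <- ave(sales_tbl$sales, sales_tbl$item, FUN = seq_along)

# two_year_sales: current row plus the preceding row of the same item
prev_sales <- ave(sales_tbl$sales, sales_tbl$item,
                  FUN = function(x) c(0, head(x, -1)))
sales_tbl$two_year_sales <- sales_tbl$sales + prev_sales

# Keep only rows with a full two-year window, then pick the per-item maximum
q <- sales_tbl[sales_tbl$seqnum >= 2, ]
q$max_two_year_sales <- ave(q$two_year_sales, q$item, FUN = max)
peak <- q[q$two_year_sales == q$max_two_year_sales,
          c("item", "year", "two_year_sales")]
print(peak)
```

The `year` reported for each item is the last year of the winning window, which matches `rows between 1 preceding and current row` in the query.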

Answer 2

A solution in R using the tidyverse and RcppRoll:

#Loading the packages and your data as a `tibble`
library("RcppRoll")
library("dplyr")

tbl <- tribble(
  ~Year,     ~Item,          ~Sales,
  2001,      "Trash Can",       100,
  2002,      "Trash Can",       125,
  2003,      "Trash Can",       90,
  2004,      "Trash Can",       97,
  2002,      "Red Balloon",     23,
  2003,      "Red Balloon",     309,
  2004,      "Red Balloon",      67,
  2005,      "Red Balloon",     8,
  1998,      "Blue Bottle",     600,
  1999,      "Blue Bottle",     565
)

# Set the number of consecutive years
n <- 2

# Compute the rolling sums (assumes data to be sorted) and take max
res <- tbl %>% 
  group_by(Item) %>% 
  mutate(rollingsum = roll_sumr(Sales, n)) %>% 
  summarize(best_sum = max(rollingsum, na.rm = TRUE))
print(res)
## A tibble: 3 x 2
#  Item        best_sum
#  <chr>          <dbl>
#1 Blue Bottle     1165
#2 Red Balloon      376
#3 Trash Can        225

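Setting n <- 3 gives a different res. Blue Bottle has only two years of data, so all of its rolling sums are NA and max(..., na.rm = TRUE) returns -Inf (with a warning):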

print(res)
## A tibble: 3 x 2
#  Item        best_sum
#  <chr>          <dbl>
#1 Blue Bottle     -Inf
#2 Red Balloon      399
#3 Trash Can        315
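
If the -Inf row is unwanted, one option is to return NA for items with fewer than n years and drop those rows. The following is a minimal base R sketch of that idea, not the RcppRoll approach used above; `peak_sum` and `n_years` are hypothetical names introduced for illustration, and the data is assumed to have one row per consecutive year for each item.

```r
# Largest sum over n consecutive observations of x.
# Returns NA (instead of -Inf) when x has fewer than n observations.
peak_sum <- function(x, n) {
  if (length(x) < n) return(NA_real_)
  cs <- cumsum(c(0, x))
  # Sum of each full window: cs[i + n] - cs[i]
  max(cs[(n + 1):length(cs)] - cs[1:(length(cs) - n)])
}

# Sample data from the question
tbl <- data.frame(
  Year  = c(2001, 2002, 2003, 2004, 2002, 2003, 2004, 2005, 1998, 1999),
  Item  = rep(c("Trash Can", "Red Balloon", "Blue Bottle"), times = c(4, 4, 2)),
  Sales = c(100, 125, 90, 97, 23, 309, 67, 8, 600, 565)
)
tbl <- tbl[order(tbl$Item, tbl$Year), ]  # rolling sums need sorted years

n_years <- 3
best <- tapply(tbl$Sales, tbl$Item, peak_sum, n = n_years)

# Drop items that do not have enough years of data
res3 <- data.frame(Item = names(best), best_sum = as.vector(best))
res3 <- res3[!is.na(res3$best_sum), ]
print(res3)
```

With `n_years <- 3` this keeps only Red Balloon and Trash Can, which is the output the question asks for.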

Answer 3

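I can only help with the SQL part: use GROUP BY together with HAVING. HAVING filters out every item that does not have the specified minimum number of years of historical data.

Check whether this query can be adapted to your requirements.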

SELECT 
     item
     , count(*) as num_years
     , sum(Sales) as local_max 
from [your_table] 
where year between [year_ini] and [year_end]
group by item 
having count(*) >= [number_of_years]
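
To try the same filter in R, here is a rough base R sketch of the query. The values chosen for `year_ini`, `year_end` and `number_of_years` are assumptions picked for illustration. Note that, like the query, it sums one fixed year range rather than sliding a window, so it has to be rerun for each candidate range to find the actual peak.

```r
# Sample data from the question
sales <- data.frame(
  Year  = c(2001, 2002, 2003, 2004, 2002, 2003, 2004, 2005, 1998, 1999),
  Item  = rep(c("Trash Can", "Red Balloon", "Blue Bottle"), times = c(4, 4, 2)),
  Sales = c(100, 125, 90, 97, 23, 309, 67, 8, 600, 565)
)

# Placeholder values standing in for [year_ini], [year_end], [number_of_years]
year_ini <- 2002
year_end <- 2004
number_of_years <- 3

# where year between year_ini and year_end
in_range <- sales[sales$Year >= year_ini & sales$Year <= year_end, ]

# group by item: count(*) and sum(Sales)
num_years <- tapply(in_range$Sales, in_range$Item, length)
local_max <- tapply(in_range$Sales, in_range$Item, sum)
out <- data.frame(item = names(local_max),
                  num_years = as.vector(num_years),
                  local_max = as.vector(local_max))

# having count(*) >= number_of_years
out <- out[out$num_years >= number_of_years, ]
print(out)
```

For the range 2002 to 2004, Trash Can comes out as 312, not its true 3-year peak of 315 (2001 to 2003), which shows why a single fixed range is not enough on its own.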

Answer 4

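Read the data dat (shown reproducibly in the Note at the end) into a zoo series with one column per Item, then convert it to a ts series tt (which fills in the missing years with NA). Then use rollsumr to take the sum of every k consecutive years for each Item, find the maximum for each Item, stack the result into a data frame and omit any NA rows. The function Max is like max(x, na.rm = TRUE), except that if x is all NA it returns NA instead of -Inf and does not issue a warning. stack outputs the Item column second, so use 2:1 to reverse the columns, and add nicer names.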

library(zoo)

Max <- function(x) if (all(is.na(x))) NA else max(x, na.rm = TRUE)

peak <- function(data, k) {
  tt <- as.ts(read.zoo(data, split = "Item"))
  s <- na.omit(stack(apply(rollsumr(tt, k), 2, Max)))
  setNames(s[2:1], c("Item", "Sum"))
}

peak(dat, 2)
##          Item  Sum
## 1 Blue Bottle 1165
## 2 Red Balloon  376
## 3   Trash Can  225

peak(dat, 3)
##          Item Sum
## 2 Red Balloon 399
## 3   Trash Can 315

Note

The input, in reproducible form, is assumed to be:

dat <- 
structure(list(Year = c(2001L, 2002L, 2003L, 2004L, 2002L, 2003L, 
2004L, 2005L, 1998L, 1999L), Item = c("Trash Can", "Trash Can", 
"Trash Can", "Trash Can", "Red Balloon", "Red Balloon", "Red Balloon", 
"Red Balloon", "Blue Bottle", "Blue Bottle"), Sales = c(100L, 
125L, 90L, 97L, 23L, 309L, 67L, 8L, 600L, 565L)), row.names = c(NA, 
-10L), class = "data.frame")
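
The NA-filling step matters when an item has a gap in its years, which the sample data does not. As a small base R sketch of why, using a hypothetical single-item series with no row for 2002 (this data is made up for illustration and is not from the question):

```r
# Hypothetical single-item data with no row for 2002 (not from the question)
x <- data.frame(Year = c(2000, 2001, 2003, 2004), Sales = c(10, 50, 60, 5))
k <- 2

# Row-based windows treat 2001 and 2003 as neighbours
row_based <- max(rowSums(embed(x$Sales, k)))

# Fill the missing year with NA first, so windows spanning the gap become NA
full   <- data.frame(Year = seq(min(x$Year), max(x$Year)))
filled <- merge(full, x, by = "Year", all.x = TRUE)
gap_aware <- max(rowSums(embed(filled$Sales, k)), na.rm = TRUE)

print(c(row_based = row_based, gap_aware = gap_aware))
```

A purely row-based rolling sum reports 110 here by adding 2001 and 2003, which are not consecutive years, while the gap-aware version reports 65 (2003 and 2004).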

