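Replacing outliers with the mean in multi-level (grouped) data in R using dplyr

Date: 2018-11-29 14:59:32

Tags: r dplyr time-series outliers

My df contains sales data for different customers, but it has some outliers. I want to replace the outliers (values more than 2 SD above or below the mean, i.e. outside μ ± 2σ) with the mean for their customer_id.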


Could someone help me do this with dplyr? Note: all "0" values and all sales values outside μ ± 2σ need to be replaced with the mean for their customer_id.

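1 answer:

Answer 0: (score: 0)

Another way with dplyr :)

I'm not entirely sure whether you want the cutoff based on the global mean or computed per customer, so there are 2 versions.

Edit: to also check for < mean - 2sd and for == 0, change the first argument of ifelse to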

sales > mean(sales) + 2*sd(sales) | sales < mean(sales) - 2*sd(sales) | sales == 0
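
The three-part condition can be wrapped in a small helper so that the mutate() call stays readable. This is only a minimal sketch in base R; the names is_outlier_or_zero and replace_with_mean are made up for illustration and are not part of dplyr.

```r
# Hypothetical helper: TRUE where x is 0 or lies outside mean(x) +/- k * sd(x)
is_outlier_or_zero <- function(x, k = 2) {
  m <- mean(x)
  s <- sd(x)
  x > m + k * s | x < m - k * s | x == 0
}

# Hypothetical helper: replace flagged values with the mean of x.
# The mean is computed from all values of x, including the flagged ones.
replace_with_mean <- function(x, k = 2) {
  ifelse(is_outlier_or_zero(x, k), mean(x), x)
}

# Sales of customer 9000A, taken from the sample data in this answer
sales_9000A <- c(20, 40, 0, 42, 56, 90, 500, 23, 60)
cleaned_9000A <- replace_with_mean(sales_9000A)
print(cleaned_9000A)
```

Inside a grouped pipeline this would presumably be used as mutate(sales = replace_with_mean(sales)) after group_by(customer_id), so that mean and sd are computed per customer.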

Code

library(dplyr)

# version to check for > global mean + 2 * global sd
# if a sales value exceeds the global cutoff, it gets replaced by the customer mean
# (test_data is the data frame from the question)
test_data2 = 
  test_data %>% group_by(customer_id) %>% 
  mutate(sales = ifelse(sales > mean(test_data$sales) + 2*sd(test_data$sales), mean(sales), sales))

# version to check for mean per customer + 2 * sd per customer
# if sales-value > customer cutoff sales-value gets replaced by customer mean
test_data2 = 
  test_data %>% group_by(customer_id) %>% 
  mutate(sales = ifelse(sales > mean(sales) + 2*sd(sales), mean(sales), sales))



### check if this is what we want

# calc global mean + global sd + cutoff global
mean(test_data$sales)
sd(test_data$sales)
mean(test_data$sales) + 2*sd(test_data$sales)

# calc mean, sd, cutoff for each customer
test_data %>% group_by(customer_id) %>% summarise(mean = mean(sales), sd = sd(sales), cutoff = mean + 2*sd(sales))



test_data$sales2 = test_data2$sales

test_data %>% filter(customer_id == "80A09")
test_data %>% filter(customer_id == "9000A")
test_data %>% filter(customer_id == "Y90BC")

With separate check code, and without distinguishing between the two versions:

df = structure(list(
  Date = c("6/29/2014", "7/6/2014", "7/13/2014", "7/20/2014", "7/27/2014",
           "8/3/2014", "8/10/2014", "8/17/2014", "8/24/2014",
           "6/29/2014", "7/6/2014", "7/13/2014", "7/20/2014", "7/27/2014",
           "8/3/2014", "8/10/2014", "8/17/2014", "8/24/2014",
           "7/6/2014", "7/13/2014", "7/20/2014", "7/27/2014",
           "8/3/2014", "8/10/2014", "8/17/2014", "8/24/2014"),
  customer_id = c("9000A", "9000A", "9000A", "9000A", "9000A", "9000A", "9000A", "9000A", "9000A",
                  "80A09", "80A09", "80A09", "80A09", "80A09", "80A09", "80A09", "80A09", "80A09",
                  "Y90BC", "Y90BC", "Y90BC", "Y90BC", "Y90BC", "Y90BC", "Y90BC", "Y90BC"),
  sales = c(20L, 40L, 0L, 42L, 56L, 90L, 500L, 23L, 60L,
            200L, 234L, 500L, 450L, 0L, 900L, 459L, 347L, 895L,
            380L, 390L, 432L, 320L, 400L, 10L, 0L, 1000L)
), class = "data.frame", row.names = c(NA, -26L))



test_data = df %>% group_by(customer_id) %>% 
  mutate(sales = ifelse(sales > mean(sales) + 2*sd(sales) | sales < mean(sales) - 2*sd(sales) | sales == 0,
                        mean(sales), sales))
test_data$sales_old = df$sales

df %>% group_by(customer_id) %>% summarise(mean = mean(sales), sd = sd(sales), cutoff = mean + 2*sd(sales))


test_data %>% filter(customer_id == "80A09" & sales != sales_old)
test_data %>% filter(customer_id == "9000A" & sales != sales_old)
test_data %>% filter(customer_id == "Y90BC" & sales != sales_old)

Output:

> df %>% group_by(customer_id) %>% summarise(mean = mean(sales), sd = sd(sales), cutoff = mean + 2*sd(sales))
# A tibble: 3 x 4
  customer_id  mean    sd cutoff
  <chr>       <dbl> <dbl>  <dbl>
1 80A09       443.   301.  1045.
2 9000A        92.3  155.   402.
3 Y90BC       366.   310.   986.
> test_data %>% filter(customer_id == "80A09" & sales != sales_old)
# A tibble: 1 x 4
# Groups:   customer_id [1]
  Date      customer_id sales sales_old
  <chr>     <chr>       <dbl>     <int>
1 7/27/2014 80A09        443.         0
> test_data %>% filter(customer_id == "9000A" & sales != sales_old)
# A tibble: 2 x 4
# Groups:   customer_id [1]
  Date      customer_id sales sales_old
  <chr>     <chr>       <dbl>     <int>
1 7/13/2014 9000A        92.3         0
2 8/10/2014 9000A        92.3       500
> test_data %>% filter(customer_id == "Y90BC" & sales != sales_old)
# A tibble: 2 x 4
# Groups:   customer_id [1]
  Date      customer_id sales sales_old
  <chr>     <chr>       <dbl>     <int>
1 8/17/2014 Y90BC        366.         0
2 8/24/2014 Y90BC        366.      1000
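
As a cross-check that does not depend on dplyr, the per-customer cutoffs can be recomputed in base R with tapply(). This is just a sketch that retypes the sample data as two plain vectors; the object names grp_mean, grp_sd and cutoffs are chosen here for illustration.

```r
# Sample data from above, reduced to the two columns needed here
customer_id <- rep(c("9000A", "80A09", "Y90BC"), times = c(9, 9, 8))
sales <- c(20, 40, 0, 42, 56, 90, 500, 23, 60,
           200, 234, 500, 450, 0, 900, 459, 347, 895,
           380, 390, 432, 320, 400, 10, 0, 1000)

# Per-customer mean and standard deviation
grp_mean <- tapply(sales, customer_id, mean)
grp_sd   <- tapply(sales, customer_id, sd)

# Lower and upper cutoffs (mean -/+ 2 * sd) for each customer
cutoffs <- data.frame(
  customer_id = names(grp_mean),
  mean  = as.vector(grp_mean),
  lower = as.vector(grp_mean - 2 * grp_sd),
  upper = as.vector(grp_mean + 2 * grp_sd)
)
print(cutoffs)
```

For this sample the lower cutoff comes out negative for every customer (the sd is large relative to the mean), so the "sales < mean - 2*sd" branch never fires here; only the zeros and the two values above the upper cutoff are replaced, which matches the output above.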