Can you parallelize panel operations in R?

Time: 2020-05-28 16:55:40

Tags: r dplyr multidplyr

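In my R script I use the pmdplyr functions `mutate_cascade()` and `tlag()` to mutate my data. The data contains more than 3 million records, so the code is very slow, but it works. To speed it up, I tried adding multidplyr's parallel processing. That raises an error: "All columns in a tibble must be vectors. x Column `.` is a `multidplyr_party_df` object." Is this because a pmdplyr `pibble` cannot be run on a multidplyr cluster? I am new to both pmdplyr and multidplyr, so maybe I am just doing something wrong?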

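I receive a merged dataset, `import_data`, containing the variables `uuid`, `location_id`, `import_date`, `customer_name`, and `total_value`. Anomalous imports cause massive spikes in `total_value`, so my code tries to even out nearly impossible fluctuations in that value (relative to each customer):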

```
# Setup is not shown in the original post; the lines below are an assumption.
library(dplyr)
library(multidplyr)
library(pmdplyr)
cluster <- new_cluster(2)  # worker count is an assumption

# Make the packages available on every worker
cluster_library(cluster, "dplyr")
cluster_library(cluster, "pmdplyr")

mydata <- import_data %>%
  mutate(
    time_var = time_variable(import_date),
    id_var = id_variable(uuid, location_id),
    total_value_imported = total_value
  ) %>%
  arrange(uuid, location_id, import_date) %>%
  group_by(uuid, location_id) %>%
  partition(cluster) %>%
  # The error below is raised at this step
  pibble(
    .i = id_var,
    .t = time_var,
    .d = 0
  ) %>%
  mutate_cascade(
    total_value = case_when(
      total_value > (tlag(total_value, .n = 1, .quick = TRUE) * 10)
        ~ tlag(total_value, .n = 1, .quick = TRUE),
      total_value < (tlag(total_value, .n = 1, .quick = TRUE) / 10)
        ~ tlag(total_value, .n = 1, .quick = TRUE),
      TRUE ~ total_value_imported
    )
  ) %>%
  collect()
```

```
> rlang::last_error()
<error/tibble_error_column_scalar_type>
All columns in a tibble must be vectors.
x Column `.` is a `multidplyr_party_df` object.
Backtrace:
  1. dplyr::mutate(...)
  1. dplyr::arrange(., uuid, location_id, import_date)
  1. dplyr::group_by(., uuid, location_id)
  1. multidplyr::partition(., cluster)
  8. pmdplyr::pibble(., .i = id_var, .t = time_var, .d = 0)
  9. tibble::tibble(...)
 10. tibble:::tibble_quos(xs[!is_null], .rows, .name_repair)
 11. tibble:::check_valid_col(res, col_names[[j]], j)
 12. tibble:::check_valid_cols(set_names(list(x), name))
Run `rlang::last_trace()` to see the full context.
> rlang::last_trace()
<error/tibble_error_column_scalar_type>
All columns in a tibble must be vectors.
x Column `.` is a `multidplyr_party_df` object.
Backtrace:
     █
  1. └─`%>%`(...)
  2.   ├─base::withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
  3.   └─base::eval(quote(`_fseq`(`_lhs`)), env, env)
  4.     └─base::eval(quote(`_fseq`(`_lhs`)), env, env)
  5.       └─`_fseq`(`_lhs`)
  6.         └─magrittr::freduce(value, `_function_list`)
  7.           └─function_list[[i]](value)
  8.             └─pmdplyr::pibble(., .i = id_var, .t = time_var, .d = 0)
  9.               └─tibble::tibble(...)
 10.                 └─tibble:::tibble_quos(xs[!is_null], .rows, .name_repair)
 11.                   └─tibble:::check_valid_col(res, col_names[[j]], j)
 12.                     └─tibble:::check_valid_cols(set_names(list(x), name))
```
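
For reference, the smoothing rule described in the question can be sketched in base R, without pmdplyr or multidplyr. This is only an illustration under stated assumptions: `smooth_spikes()` and the `toy` data frame are hypothetical names with made-up values, the rows are assumed to be already sorted by date within each group, and the loop is one reading of the rule (compare each value with the previous, already corrected value), not pmdplyr's implementation.

```r
# Hypothetical toy data standing in for import_data (values are made up).
toy <- data.frame(
  uuid = c("a", "a", "a", "a", "b", "b", "b"),
  import_date = as.Date("2020-01-01") + c(0, 1, 2, 3, 0, 1, 2),
  total_value = c(100, 120, 5000, 130, 50, 2, 55)
)

# Replace a value with the previous (already smoothed) value when it is
# more than `factor` times larger or smaller than that previous value.
smooth_spikes <- function(x, factor = 10) {
  out <- x
  for (i in seq_along(x)[-1]) {
    prev <- out[i - 1]
    if (!is.na(prev) && !is.na(x[i]) &&
        (x[i] > prev * factor || x[i] < prev / factor)) {
      out[i] <- prev
    }
  }
  out
}

# ave() applies the function within each uuid group and keeps row order.
toy$total_value_smoothed <- ave(toy$total_value, toy$uuid, FUN = smooth_spikes)
```

Each group is processed independently of the others, and that independence is the property any by-group parallel approach would rely on. Whether such a split works together with `pibble()` is exactly what the question asks, and this sketch does not answer it.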


0 answers:

No answers yet.