How can I force Hive to distribute data evenly across reducers?

Date: 2017-02-10 18:25:10

Tags: r hadoop mapreduce hive reducers

Imagine I want to send the Iris dataset, which I have as a Hive table, to different reducers in order to run the same task on it in parallel in R. I can execute my R script through the transform function, and use a lateral view explode to take the Cartesian product of the Iris dataset with an array containing my "partition" variable, as in the query below:

    set source_table = iris;

    set x_column_names = "sepallenght|sepalwidth|petallength|petalwidth";
    set y_column_name = "species";
    set output_dir = "/r_output";
    set model_name ="paralelism_test";    
    set param_var = params;
    set param_array = array(1,2,3);

    set mapreduce.job.reduces=3;

    select transform(id, sepallenght, sepalwidth, petallength, petalwidth, species, ${hiveconf:param_var})
    using 'controlScript script.R ${hiveconf:x_column_names}${hiveconf:y_column_name}${hiveconf:output_dir}${hiveconf:model_name}${hiveconf:param_var}'
    as (script_result string)
    from 
    (select * 
    from ${hiveconf:source_table} 
    lateral view explode ( ${hiveconf:param_array} ) temp_table 
    as ${hiveconf:param_var}
    distribute by ${hiveconf:param_var}
    ) data_query;
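To make the mechanics concrete, here is a minimal Python sketch (not Hive itself, and the row values are made up) of what the lateral view explode over the parameter array does: every input row is emitted once per parameter value, tagged with that value.

```python
from itertools import product

# Stand-ins for Iris rows; in the real query each tuple would hold the id,
# the four measurement columns, and the species.
rows = [("row1",), ("row2",)]
params = [1, 2, 3]  # the values of param_array

# The Cartesian product replicates every row once per parameter value.
exploded = [row + (p,) for row, p in product(rows, params)]

print(len(exploded))   # 2 rows x 3 params = 6 output rows
print(exploded[0])     # ('row1', 1)
```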

I invoke it through a memory-control wrapper script, so please disregard that for the sake of objectivity. What my script.R returns is the list of the unique parameters it received (the "params" column, filled with the "param_var" array values) and the number of rows in the partition it got, like this:

    #The aim of this script is to validate the parallel computation of R scripts through Hive.

    compute_model <- function(data){
      paste("parameter ", unique(data[ncol(data)]), ", ", nrow(data), "lines")
    }

    main <- function(args){

      #Reading the input parameters.
      #These inputs were passed along the transform's "using" clause, on Hive.
      x_column_names <- as.character(unlist(strsplit(gsub(' ', '', args[1]), '\\|')))
      y_column_name <- as.character(args[2])
      target_dir <- as.character(args[3])
      model_name <- as.character(args[4])
      param_var_name <- as.character(args[5])

      #Reading the data table from stdin, where Hive streams the rows of the partition.
      f <- file("stdin")
      open(f)
      data <- tryCatch({
        as.data.frame(
          read.table(f, header=FALSE, sep='\t', stringsAsFactors=TRUE, dec='.')
        )},
        warning = function(w) cat(w),
        error = function(e) stop(e),
        finally = close(f)
      )

      #Computes the model. Here, the model can be any computation.
      instance_result <- as.character(compute_model(data))

      #Writes the result to "stdout" separated by '\t'. This output must be a data frame
      #where each column represents a Hive table column.
      write.table(instance_result,
                  quote = FALSE,
                  row.names = FALSE,
                  col.names = FALSE,
                  sep = "\t",
                  dec = '.'
      )
    }

    #Main code
    ###############################################################

    main(commandArgs(trailingOnly=TRUE))

What I want Hive to do is replicate the Iris dataset evenly across these reducers. It works fine when I put sequential values in my param_array variable, but for values like array(10,100,1000,10000) with mapreduce.job.reduces=4, or array(-5,-4,-3,-2,-1,0,1,2,3,4,5) with mapreduce.job.reduces=11, some reducers receive no data at all while others receive more than one key.
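This skew is consistent with how Hadoop's default partitioner assigns keys: reducer = hash(key) mod numReducers, and for Java integers the hash is the value itself. The small Python sketch below (an assumption about Hive's integer hashing, modeled with the raw value) reproduces the reported behavior for array(10,100,1000,10000) with 4 reducers:

```python
from collections import defaultdict

def assign(params, num_reducers):
    """Model of the default hash partitioner: reducer = hash(key) mod numReducers.
    For non-negative integers the value itself is used as its hash, which
    mirrors Java's Integer.hashCode()."""
    buckets = defaultdict(list)
    for p in params:
        buckets[p % num_reducers].append(p)
    return dict(buckets)

print(assign([10, 100, 1000, 10000], 4))
# 100, 1000 and 10000 all land on reducer 0; reducers 1 and 3 stay empty.
```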

The question is: is there a way to make sure Hive assigns each partition to a different reducer?

Am I being clear? It may look silly to do it this way, but I want to run a grid search on Hadoop, and I have constraints against using other technologies better suited to this task.
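For what it's worth, one workaround sketch (an assumption on my part, not something the Hive docs guarantee): distribute by a dense 0..n-1 index instead of the raw parameter value, e.g. via posexplode, so that index mod numReducers hits every reducer exactly once when the reducer count equals the array size. The query below is illustrative only and reuses the names from the question, with the script arguments elided:

```sql
-- Illustrative sketch: posexplode yields the array position alongside the
-- value, and distributing by the position (0..n-1) maps each parameter to
-- a distinct reducer when mapreduce.job.reduces equals the array size.
set mapreduce.job.reduces=4;

select transform(id, sepallenght, sepalwidth, petallength, petalwidth, species, params)
using 'controlScript script.R ...'
as (script_result string)
from (
  select d.*, pos, params
  from iris d
  lateral view posexplode(array(10, 100, 1000, 10000)) temp_table as pos, params
  distribute by pos
) data_query;
```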

Thanks!

0 Answers:

No answers yet