How do I group data by a field in Spark?

Date: 2016-10-31 10:11:08

Tags: java apache-spark cassandra

I want to read two columns from my database, group the rows by the first column, and then insert the result into another table using Spark. My program is written in Java. I tried the following:


public static void aggregateSessionEvents(org.apache.spark.SparkContext sparkContext) {
    com.datastax.spark.connector.japi.rdd.CassandraJavaPairRDD<String, String> logs = javaFunctions(sparkContext)
            .cassandraTable("dove", "event_log", mapColumnTo(String.class), mapColumnTo(String.class))
            .select("session_id", "event");
    logs.groupByKey();
    com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions(logs).writerBuilder("dove", "event_aggregation", null).saveToCassandra();
    sparkContext.stop();
}

This gives me the error:

The method cassandraTable(String, String, RowReaderFactory<T>) in the type SparkContextJavaFunctions is not applicable for the arguments (String, String, RowReaderFactory<String>, mapColumnTo(String.class))

How can I fix this?

2 Answers:

Answer 0 (score: 1)

Change this:

.cassandraTable("dove", "event_log", mapColumnTo(String.class), mapColumnTo(String.class))

to:

.cassandraTable("dove", "event_log", mapColumnTo(String.class), mapColumnTo(String.class))

You were passing an extra argument.
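
Note that with only one RowReaderFactory the call returns a CassandraJavaRDD<String> rather than a CassandraJavaPairRDD, so the declared type on the left-hand side has to change as well. The first snippet below is the minimal fix this answer describes; the second, using mapRowToTuple, is my own sketch of how both columns could be read as a pair in one call, not part of the original answer:

    import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;
    import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapColumnTo;
    import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapRowToTuple;

    // Minimal fix per this answer: one RowReaderFactory only. This compiles,
    // but maps just the first selected column into the RDD.
    CassandraJavaRDD<String> logs = javaFunctions(sparkContext)
            .cassandraTable("dove", "event_log", mapColumnTo(String.class))
            .select("session_id", "event");

    // Assumption: to read both columns as a pair in a single pass,
    // mapRowToTuple yields a CassandraJavaRDD<Tuple2<String, String>>.
    CassandraJavaRDD<Tuple2<String, String>> pairs = javaFunctions(sparkContext)
            .cassandraTable("dove", "event_log", mapRowToTuple(String.class, String.class))
            .select("session_id", "event");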

Answer 1 (score: 0)

To group data by a field, follow these steps:

  1. Retrieve the data from the table into a JavaRDD.
  2. Extract the required columns as a pair, with the key first and the remaining data second.
  3. Aggregate the values with reduceByKey according to your requirements.
  4. The result can then be inserted into another table or used for further processing (a write-back sketch follows the code below).

    public static void aggregateSessionEvents(SparkContext sparkContext) {
        // Read the whole table into an RDD of Data beans.
        JavaRDD<Data> datas = javaFunctions(sparkContext).cassandraTable("test", "data",
                mapRowTo(Data.class));
        // Pair each row up as (key, value).
        JavaPairRDD<String, String> pairDatas = datas
                .mapToPair(data -> new Tuple2<>(data.getKey(), data.getValue()));
        // reduceByKey returns a new RDD; the result must be kept, since RDDs
        // are immutable and pairDatas is not modified in place.
        JavaPairRDD<String, String> aggregated = pairDatas
                .reduceByKey((value1, value2) -> value1 + "," + value2);
        sparkContext.stop();
    }
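
Step 4, writing the aggregated result out, is not shown in the code above. As a minimal sketch, assuming a target table test.data_aggregated with text columns key and value (the table and column names here are invented for illustration), the write could look like this:

    import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;
    import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapTupleToRow;
    import static com.datastax.spark.connector.japi.CassandraJavaUtil.someColumns;

    // Assumed target table: test.data_aggregated(key text PRIMARY KEY, value text).
    javaFunctions(aggregated.rdd())
            .writerBuilder("test", "data_aggregated",
                    mapTupleToRow(String.class, String.class))
            .withColumnSelector(someColumns("key", "value"))
            .saveToCassandra();

In that case sparkContext.stop() should be called only after the save completes.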