Convert a DataFrame to RDD[Tuple] in Spark

Asked: 2016-04-27 11:58:32

Tags: scala apache-spark data-migration

I have a Spark DataFrame that I want to store in Cassandra. I don't want to use parallelize() with a case class and then store that in Cassandra, because I want to generalize this code and read the schema from a configuration file. Each DataFrame may have a different number of columns.
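For context, here is a Spark-free sketch of what "reading the schema from a configuration file" looks like in my setup. The `key=col1,col2,...` line format and the `SchemaConfig` helper are purely illustrative stand-ins; the real code uses Typesafe Config's `getStringList` as shown further below.

```scala
// Minimal stand-in for a config lookup: parses "key=col1,col2,..." lines
// into a map from key to column list. Purely illustrative; the real code
// reads the list via Typesafe Config's getStringList.
object SchemaConfig {
  def parse(lines: Seq[String]): Map[String, List[String]] =
    lines.flatMap { line =>
      line.split("=", 2) match {
        case Array(key, cols) => Some(key.trim -> cols.split(",").map(_.trim).toList)
        case _                => None // skip malformed lines
      }
    }.toMap
}
```

With this, `SchemaConfig.parse(Seq("cassandraSchemaKey=id,name,email"))("cassandraSchemaKey")` yields the column list `List(id, name, email)`.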

I am trying to convert the DataFrame to an RDD[Tuple]. This is what I have tried:

val rddTuple = dataframe.map(r => r match {
  case r if r.size == 3 => (r.getAs(0), r.getAs(1), r.getAs(2))
  case r if r.size == 4 => (r.getAs(0), r.getAs(1), r.getAs(2), r.getAs(3))
  case r if r.size == 5 => (r.getAs(0), r.getAs(1), r.getAs(2), r.getAs(3), r.getAs(4))
  case r if r.size == 6 => (r.getAs(0), r.getAs(1), r.getAs(2), r.getAs(3), r.getAs(4), r.getAs(5))
  case r if r.size == 7 => (r.getAs(0), r.getAs(1), r.getAs(2), r.getAs(3), r.getAs(4), r.getAs(5), r.getAs(6))
  case r if r.size == 8 => (r.getAs(0), r.getAs(1), r.getAs(2), r.getAs(3), r.getAs(4), r.getAs(5), r.getAs(6), r.getAs(7))
})
val cassColumn = entities.getStringList("cassandraSchemaKey").asScala // reads the schema from the config file
rddTuple.saveToCassandra("usermgmt", "channel", SomeColumns(cassColumn :_*))

This results in the following error:

Exception in thread "main" java.lang.IllegalArgumentException: Some primary key columns are missing in RDD or have not been selected: id
at com.datastax.spark.connector.writer.DefaultRowWriter.checkMissingPrimaryKeyColumns(DefaultRowWriter.scala:44)
at com.datastax.spark.connector.writer.DefaultRowWriter.<init>(DefaultRowWriter.scala:71)
at com.datastax.spark.connector.writer.DefaultRowWriter$$anon$2.rowWriter(DefaultRowWriter.scala:109)
at com.datastax.spark.connector.writer.DefaultRowWriter$$anon$2.rowWriter(DefaultRowWriter.scala:107)
at com.datastax.spark.connector.writer.TableWriter$.apply(TableWriter.scala:171)
at com.datastax.spark.connector.RDDFunctions.saveToCassandra(RDDFunctions.scala:23)
at Main$.main(Main.scala:56)
at Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)

However, when I use it for a single table with a fixed number of columns, it works fine:

val rddTuple = dataframe.map(r => (r.getAs(0), r.getAs(1), r.getAs(2), r.getAs(3), r.getAs(4), r.getAs(5)))
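I suspect the difference lies in the element type: with a single fixed arity the RDD has a concrete TupleN element type, whereas the match over row sizes widens every branch to the branches' least upper bound, losing the arity information. A Spark-free sketch of that typing effect (the `toTuple` helper is hypothetical, only for illustration):

```scala
// Branches returning tuples of different arities are widened to their
// least upper bound (Product), so the static type keeps no arity/column
// information. `toTuple` is a hypothetical helper, not my real code.
object TupleWidening {
  def toTuple(values: Seq[Any]): Product = values match {
    case Seq(a, b)    => (a, b)
    case Seq(a, b, c) => (a, b, c)
    case other        => throw new IllegalArgumentException(s"unsupported arity: ${other.size}")
  }
}
```

At runtime `TupleWidening.toTuple(Seq(1, "x", true)).productArity` is still 3, but the static type is only `Product`, so any code that relies on the compile-time tuple type no longer sees individual columns.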

Also, is there another way to store a DataFrame into Cassandra?

0 Answers:

There are no answers yet.