scala - Which Spark HBase connector should I use?

Asked: 2016-12-01 11:00:00

Tags: scala apache-spark hbase google-cloud-dataproc google-cloud-bigtable

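Our stack consists of Google Dataproc (Spark 2.0) and Google Bigtable (HBase 1.2.0), and I am looking for a connector that works with these versions.

For the connectors I have found, it is not clear to me whether they support Spark 2.0 and the new DataSet API: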

spark-hbase: https://github.com/apache/hbase/tree/master/hbase-spark
spark-hbase-connector: https://github.com/nerdammer/spark-hbase-connector
hortonworks-spark/shc: https://github.com/hortonworks-spark/shc

The project is written in Scala 2.11 and built with SBT.

Thanks for your help

2 Answers:

Answer 0: (score: 7)

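Update: SHC now seems to work with Spark 2 and the Table API. See https://github.com/GoogleCloudPlatform/cloud-bigtable-examples/tree/master/scala/bigtable-shc

Original answer:

I don't believe any of these (or any other existing connector) will do everything you want today.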

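spark-hbase is probably the right solution once it is released (HBase 1.4?), but it currently builds only at head and is still working on Spark 2 support.
spark-hbase-connector seems to support only the RDD APIs, but since those are more stable, it might be somewhat helpful.
hortonworks-spark/shc probably won't work, because I believe it only supports Spark 1 and uses the older HTable APIs, which do not work with BigTable.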

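I would recommend just using the HBase MapReduce APIs with RDD methods like newAPIHadoopRDD (or possibly spark-hbase-connector?), then manually converting the RDDs into DataSets. This approach is much easier in Scala or Java than in Python.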
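As a rough illustration of that approach, here is a minimal sketch that reads a table through TableInputFormat with newAPIHadoopRDD and then converts the result to a DataSet. The table name "users", the column family "cf", the qualifier "name", and the User case class are all made up for the example. It also assumes the connection settings (HBase or Bigtable) are picked up from an hbase-site.xml on the classpath, which is not shown here.

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical row type; defined at top level so Spark can derive an encoder for it.
case class User(rowKey: String, name: String)

object HBaseToDataset {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("hbase-to-dataset").getOrCreate()
    import spark.implicits._

    // Assumption: connection details come from hbase-site.xml on the classpath.
    val conf = HBaseConfiguration.create()
    conf.set(TableInputFormat.INPUT_TABLE, "users") // placeholder table name

    // Full table scan exposed as an RDD of (row key, Result) pairs.
    val rdd = spark.sparkContext.newAPIHadoopRDD(
      conf,
      classOf[TableInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result])

    // Extract plain values right away: Result is not serializable,
    // and the key object may be reused by the record reader.
    val users: Dataset[User] = rdd.map { case (key, result) =>
      val raw = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("name"))
      val name = Option(raw).map(bytes => Bytes.toString(bytes)).orNull
      User(Bytes.toString(key.copyBytes()), name)
    }.toDS()

    users.show()
    spark.stop()
  }
}
```

Mapping to a case class before calling toDS() keeps the HBase types out of the DataSet, so everything downstream is ordinary Spark code.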

This is an area that the HBase community is working to improve, and Google Cloud Dataproc will incorporate those improvements as they happen.

Answer 1: (score: 1)

In addition to the above answer, using newAPIHadoopRDD means that you fetch all the data from HBase, and from then on it is all core Spark. You won't get any HBase-specific APIs such as Filters. As for the current spark-hbase, only snapshots are available.