I have the following code:
// Imports needed by this snippet (Spark core/SQL/streaming and spark-streaming-kafka-0-8):
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{SQLContext, SaveMode, SparkSession}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

def main(args: Array[String]) {
  val conf = new SparkConf()
    .setAppName("Fleet")
    .set("spark.executor.memory", "1g")
    .set("spark.driver.memory", "2g")
    .set("spark.submit.deployMode", "cluster")
    .set("spark.executor.instances", "4")
    .set("spark.executor.cores", "3")
    .set("spark.cores.max", "12")
    .set("spark.driver.cores", "4")
    .set("spark.ui.port", "4040")
    .set("spark.streaming.backpressure.enabled", "true")
    .set("spark.streaming.kafka.maxRatePerPartition", "30")

  val spark = SparkSession
    .builder
    .appName("Fleet")
    .config("spark.cassandra.connection.host", "192.168.0.40")
    .config("spark.cassandra.connection.port", "9042")
    .config("spark.submit.deployMode", "cluster")
    .master("local[*]")
    .getOrCreate()

  val sc = SparkContext.getOrCreate(conf)
  val ssc = new StreamingContext(sc, Seconds(10))
  val sqlContext = new SQLContext(sc)
  import sqlContext.implicits._ // required for rdd.toDF() below

  val topics = Map("historyfleet" -> 1)
  val kafkaStream = KafkaUtils.createStream(ssc, "192.168.0.40:2181", "fleetgroup", topics)

  kafkaStream.foreachRDD { rdd =>
    // Each batch arrives as an RDD of (key, message) pairs
    val dfs = rdd.toDF()
    dfs.show()
    dfs.write
      .format("org.apache.spark.sql.cassandra")
      .options(Map("table" -> "test", "keyspace" -> "test_db"))
      .mode(SaveMode.Append)
      .save()
  }

  ssc.start()
  ssc.awaitTermination()
}
I can run this program locally from Eclipse, but when I try to execute it via spark-submit on the cluster, it fails with the following error:
ERROR 2018-05-21 13:00:27,009 org.apache.spark.deploy.DseSparkSubmitBootstrapper: Failed to start or submit Spark application
java.lang.RuntimeException: com.datastax.bdp.fs.model.NoSuchFileException: File not found: /tmp/hive/
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522) ~[hive-exec-1.2.1.spark2.jar:1.2.1.spark2]
at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:189) ~[spark-hive_2.11-2.0.2.16.jar:2.0.2.16]
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) ~[na:1.8.0_161]
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) ~[na:1.8.0_161]
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) ~[na:1.8.0_161]
at java.lang.reflect.Constructor.newInstance(Constructor.java:423) ~[na:1.8.0_161]
at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:258) ~[spark-hive_2.11-2.0.2.16.jar:2.0.2.16]
at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:359) ~[spark-hive_2.11-2.0.2.16.jar:2.0.2.16]
My goal is to consume records from the Kafka stream and push the data to Cassandra. Thanks.
Answer 0 (score: 0)
You need to increase the replication factor (RF) of the dsefs keyspace to a value greater than 1. In addition, the dsefs keyspace (like any other keyspace) works best with NetworkTopologyStrategy. Here is a command that changes the strategy and sets RF = 3:
ALTER KEYSPACE dsefs WITH replication = {'class':'NetworkTopologyStrategy', '<YOUR DC HERE>': '3'}
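To check the replication settings before and after the change, you can query the system_schema.keyspaces table (available in Cassandra 3.0+ / DSE 5.x, which matches the Spark 2.0.2.16 jars in the stack trace):
SELECT keyspace_name, replication FROM system_schema.keyspaces WHERE keyspace_name = 'dsefs';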
After altering the keyspace, you need to run a repair on every node:
nodetool repair dsefs
Besides that, you can remove /tmp/hive from DSEFS and recreate it using the dse fs shell:
dse fs
mkdir -p -m 733 /tmp/hive
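If you want to verify the result, the directory should now be visible from the same DSEFS shell (ls is a standard DSEFS shell command; the exact listing output may differ by DSE version):
ls /tmp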