I am trying to get the row count of the table bank_accounts, with the condition source_system_name = "SAP" and period_year = "2017".
To do that, I came up with the following code:
import java.io.FileInputStream
import java.util.Properties

import org.apache.log4j.{Level, LogManager, Logger}
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object PartitionRetrieval {
  // Timeouts raised to work around the "Executor heartbeat timed out" errors.
  val conf = new SparkConf().setAppName("Spark-JDBC")
    .set("spark.executor.heartbeatInterval", "120s")
    .set("spark.network.timeout", "12000s")
  val log = LogManager.getLogger("Spark-JDBC Program")
  Logger.getLogger("org").setLevel(Level.ERROR)

  // Connection details are read from a local properties file.
  val conFile = "/home/user/ReconTest/inputdir/testconnection.properties"
  val properties = new Properties()
  properties.load(new FileInputStream(conFile))
  val connectionUrl = properties.getProperty("gpDevUrl")
  val devUserName = properties.getProperty("devUserName")
  val devPassword = properties.getProperty("devPassword")
  val driverClass = properties.getProperty("gpDriverClass")
  val tableName = "dev.banknumbers"

  // Make sure the JDBC driver is on the classpath before starting Spark.
  try {
    Class.forName(driverClass).newInstance()
  } catch {
    case cnf: ClassNotFoundException =>
      log.error("Driver class: " + driverClass + " not found")
      System.exit(1)
    case e: Exception =>
      // Pass the exception to the logger; e.printStackTrace() returns Unit.
      log.error("Exception while loading driver class: " + driverClass, e)
      System.exit(1)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().config(conf).master("yarn").enableHiveSupport().getOrCreate()
    val gpTable = spark.read.format("jdbc").option("url", connectionUrl)
      .option("dbtable", tableName)
      .option("user", devUserName)
      .option("password", devPassword).load()
    val rc = gpTable.filter(gpTable("source_system_name") === "GEA_CENTERPIECE" && gpTable("period_year") === "2017").count()
    println("gpTable Count: " + rc)
  }
}
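The jar runs for 5 minutes and then gives me the result. The output is 21222313. If I run the same thing as a query in a workbench tool, I get the result within 5 seconds. Earlier, I was getting these errors: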
18/07/24 10:10:50 ERROR YarnScheduler: Lost executor 2 on ip-10-230-137-10.ec2.internal: Executor heartbeat timed out after 120041 ms
18/07/24 10:10:52 ERROR YarnScheduler: Lost executor 2 on ip-10-230-137-10.ec2.internal: Container container_e540_1532132067680_0344_01_000003 exited from explicit termination request.
It ran fine after I added the following settings:
set("spark.executor.heartbeatInterval","120s")
set("spark.network.timeout","12000s")
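I am just practicing Spark, and this is only a count query, so why does it run so slowly on Spark? Should I specify the filter predicate in some other way, or change any other parameter in the code, to make it run faster?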
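One variation that might be worth comparing, as a rough sketch rather than a confirmed fix: pass the whole count as a subquery in the "dbtable" option, so the database does the counting and only one row comes back over JDBC. The suspicion is that the DataFrame version fetches every matching row through a single connection before counting, but I have not verified that here. The object name PushdownCount, its two helper functions, and the aliases row_count and cnt are made up for illustration; the column names and options are the ones from the code above.

```scala
import org.apache.spark.sql.SparkSession

object PushdownCount {
  // Builds a subquery usable as the JDBC "dbtable" option, so the database
  // evaluates both the WHERE clause and the count(*).
  // The values are interpolated directly, so they must come from trusted input.
  def countQuery(table: String, sourceSystem: String, periodYear: String): String =
    s"(SELECT count(*) AS row_count FROM $table " +
      s"WHERE source_system_name = '$sourceSystem' AND period_year = '$periodYear') AS cnt"

  // Loads the single-row result of the subquery and returns it as a Long.
  def pushedDownCount(spark: SparkSession, url: String, user: String,
                      password: String, query: String): Long =
    spark.read.format("jdbc")
      .option("url", url)
      .option("dbtable", query)
      .option("user", user)
      .option("password", password)
      .load()
      .head()
      .getAs[Number](0) // count(*) may arrive as Long or Decimal, depending on the driver
      .longValue()
}
```

In main, this would replace the filter and count lines with something like: val rc = PushdownCount.pushedDownCount(spark, connectionUrl, devUserName, devPassword, PushdownCount.countQuery(tableName, "GEA_CENTERPIECE", "2017")).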