Question

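I am using Spark 1.6 on EMR 4.3 to query ~15TB of data belonging to a table in a Hive metastore (backed by gzipped Parquet files in S3). For my cluster I have one r3.8xlarge master node and 15 r3.8xlarge core nodes (3.6TB RAM, 9.6TB SSD).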

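The ~15TB of data is spread over roughly 9 billion rows. Each row has ~15 columns storing strings of length 5-50, plus one column containing an array of ~30 strings, each 10-20 characters long. Only about 1 million unique strings are stored in the arrays. All I am trying to do is count the unique strings in the array column, but I keep running out of memory with "OutOfMemoryError: unable to create new native thread" on the executors. Because of the out-of-memory errors, tasks fail, the executors get disabled, and then the job fails.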

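It works when I query 5-10TB of data. I must not be understanding correctly what gets stored in memory (which is what I am trying to figure out). By the way, on the cluster above I set: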

spark.executor.memory 30g
spark.executor.cores 5
spark.executor.instances 90 // 6 instances per r3.8xlarge host
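
As a rough sanity check on these settings, the arithmetic can be written out as below. This is only a sketch: the per-host figures (244 GiB RAM and 32 vCPUs for an r3.8xlarge) are assumptions taken from AWS's published instance specs, the object name ExecutorBudget is made up for illustration, and the numbers ignore any per-executor off-heap overhead.

```scala
// Back-of-the-envelope budget for the executor settings above.
// Assumption: each r3.8xlarge host has 244 GiB RAM and 32 vCPUs.
object ExecutorBudget {
  val hosts = 15            // core nodes in the cluster
  val executorsPerHost = 6  // from the comment on spark.executor.instances
  val executorMemoryGb = 30 // spark.executor.memory
  val executorCores = 5     // spark.executor.cores
  val hostRamGb = 244       // assumed r3.8xlarge RAM
  val hostVcpus = 32        // assumed r3.8xlarge vCPUs

  val totalExecutors: Int = hosts * executorsPerHost           // matches spark.executor.instances
  val heapPerHostGb: Int = executorsPerHost * executorMemoryGb // executor heap requested per host
  val totalHeapGb: Int = totalExecutors * executorMemoryGb     // executor heap across the cluster
  val coresPerHost: Int = executorsPerHost * executorCores     // task slots per host

  def main(args: Array[String]): Unit = {
    println(s"executors in total: $totalExecutors")
    println(s"heap per host: $heapPerHostGb GB of $hostRamGb GB")
    println(s"heap in total: $totalHeapGb GB")
    println(s"cores per host: $coresPerHost of $hostVcpus")
  }
}
```

Under these assumptions the executors together hold about 2.7TB of heap, well below the ~15TB of compressed input, so the job only fits if most of the data is filtered or aggregated away as it streams through.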

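I did not expect Spark SQL to store intermediate tables in memory. Since there are no more than 1M unique strings, I thought the strings with their counts should easily fit in memory. Here is the query: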

val initial_df = sqlContext.sql("select unique_strings_col from Table where timestamp_partition between '2016-09-20T07:00:00Z' and '2016-09-23T07:00:00Z'")
initial_df.registerTempTable("initial_table") // ~15TB compressed data to read in from S3

val unique_strings_df = sqlContext.sql("select posexplode(unique_strings_col) as (string_pos, string) from initial_table").select($"string_pos", $"string")
unique_strings_df.registerTempTable("unique_strings_table")  // ~70% initial data remaining at this point

val strings_count_df = sqlContext.sql("select string, count(*) as unique_string_count from unique_strings_table where string_pos < 21 group by string order by unique_string_count desc") // ~50% initial data remaining at this point
strings_count_df.write.parquet("s3://mybucket/counts/2016-09-20-2016-09-23")

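The compressed Parquet files are small (about 5MB each). It seems like they could be read one at a time, filtered, and have their counts accumulated. What am I missing?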

Answer 1

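It turns out I needed enough disk + memory space to store the initial RDD. If I do more up-front filtering in the initial RDD before creating the temp table, I am able to run the query successfully. Yay!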
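
The answer does not say which filters were moved up front, so the following is only one possible sketch of the idea, meant for a Spark 1.6 spark-shell session on EMR where sqlContext is a HiveContext. It assumes the array-position filter from the question is the one being pushed into the first query; the temp table name filtered_strings_table and the lateral view alias ex are made up for illustration.

```scala
// Sketch only: explode and filter in the first query, so the temp table
// holds just the strings that will actually be counted.
// Assumption: the filter moved up front is string_pos < 21 from the question.
val filtered_df = sqlContext.sql("""
  select string
  from Table
  lateral view posexplode(unique_strings_col) ex as string_pos, string
  where timestamp_partition between '2016-09-20T07:00:00Z' and '2016-09-23T07:00:00Z'
    and string_pos < 21
""")
filtered_df.registerTempTable("filtered_strings_table") // hypothetical table name

// Count each remaining string, most frequent first, as in the original query.
val strings_count_df = sqlContext.sql(
  "select string, count(*) as unique_string_count " +
  "from filtered_strings_table group by string order by unique_string_count desc")
strings_count_df.write.parquet("s3://mybucket/counts/2016-09-20-2016-09-23")
```

Whether this particular rewrite is enough will depend on the data; the point is simply that less data survives the first step.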

Spark SQL 1.6.0 - Massive memory usage for simple query
