Spark Scala 'take(10)' operation takes too long

Time: 2021-02-12 19:56:31

Tags: scala apache-spark amazon-emr

I have the following code in my super simple Spark Scala application:

    ...
    val t3 = System.currentTimeMillis
    println("VertexRDD created in " + (t3 - t2) + " ms")
    vertRDD.cache
    val t4 = System.currentTimeMillis
    println("VertexRDD size : "+vertRDD.partitions.size)

    println("VertexRDD cached in " + (t4 - t3) + " ms")
    vertRDD.take(10).foreach(println)
    println("VertexRDD size : "+vertRDD.partitions.size)
    ...

I submit my application to an EMR Apache Spark cluster with this command:

    spark-submit --deploy-mode cluster --master yarn --num-executors 4 --executor-memory 6g --driver-memory 6g --class com.****.TestSpark s3://****.jar

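About vertRDD: it holds 250k records in total (I read them from a database; it is about 25 MB of data).

As you can see from the code, I cache the RDD a few lines before calling the line below (#175):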

vertRDD.take(10).foreach(println) - line #175 of my app

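When I look at the Spark history, I can see that memory and the other resources are far from fully used: while this line executes, only about 60 MB is in use although several GB are available. Even so, the execution always takes more than 15 minutes, and in some cases it does not finish at all and the cluster ends up "terminated with errors".

The EMR cluster I am running has one m5.2xlarge master and four m5.2xlarge core nodes, and it still fails in many cases! I can't understand WTF!

Update. After digging around in the EMR console, I can see that most of the time it is busy with garbage collection.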


I also see that YARN blocked one of the 2 workers; here is the log from it:

2021-02-12T20:17:01.404+0000: [GC (Allocation Failure) [PSYoungGen: 126976K->9341K(147968K)] 126976K->9357K(486912K), 0.0076611 secs] [Times: user=0.04 sys=0.00, real=0.01 secs] 
2021-02-12T20:17:02.068+0000: [GC (Allocation Failure) [PSYoungGen: 136317K->9547K(147968K)] 136333K->9579K(486912K), 0.0079604 secs] [Times: user=0.03 sys=0.02, real=0.01 secs] 
2021-02-12T20:17:02.317+0000: [GC (Metadata GC Threshold) [PSYoungGen: 80014K->8203K(147968K)] 80046K->8243K(486912K), 0.0047442 secs] [Times: user=0.02 sys=0.00, real=0.00 secs] 
2021-02-12T20:17:02.321+0000: [Full GC (Metadata GC Threshold) [PSYoungGen: 8203K->0K(147968K)] [ParOldGen: 40K->7927K(195584K)] 8243K->7927K(343552K), [Metaspace: 20290K->20290K(1067008K)], 0.0239302 secs] [Times: user=0.10 sys=0.01, real=0.02 secs] 
2021-02-12T20:17:02.885+0000: [GC (Allocation Failure) [PSYoungGen: 126976K->4351K(195584K)] 134903K->12286K(391168K), 0.0042397 secs] [Times: user=0.00 sys=0.00, real=0.01 secs] 
2021-02-12T20:17:03.438+0000: [GC (Allocation Failure) [PSYoungGen: 195327K->9196K(258560K)] 203262K->17139K(454144K), 0.0076206 secs] [Times: user=0.02 sys=0.01, real=0.01 secs] 
2021-02-12T20:17:03.511+0000: [GC (Metadata GC Threshold) [PSYoungGen: 45869K->4857K(301568K)] 53813K->12800K(497152K), 0.0045228 secs] [Times: user=0.02 sys=0.00, real=0.01 secs] 
2021-02-12T20:17:03.515+0000: [Full GC (Metadata GC Threshold) [PSYoungGen: 4857K->0K(301568K)] [ParOldGen: 7943K->10963K(274944K)] 12800K->10963K(576512K), [Metaspace: 33870K->33868K(1079296K)], 0.0268540 secs] [Times: user=0.09 sys=0.00, real=0.02 secs] 
2021-02-12T20:17:04.638+0000: [GC (Allocation Failure) [PSYoungGen: 289792K->11772K(301568K)] 300755K->24419K(576512K), 0.0113583 secs] [Times: user=0.03 sys=0.01, real=0.01 secs] 
2021-02-12T20:17:07.984+0000: [GC (Metadata GC Threshold) [PSYoungGen: 273980K->14305K(448000K)] 286626K->27278K(722944K), 0.0115704 secs] [Times: user=0.05 sys=0.01, real=0.02 secs] 
2021-02-12T20:17:07.995+0000: [Full GC (Metadata GC Threshold) [PSYoungGen: 14305K->0K(448000K)] [ParOldGen: 12972K->23489K(372736K)] 27278K->23489K(820736K), [Metaspace: 53854K->52909K(1099776K)], 0.1044483 secs] [Times: user=0.57 sys=0.02, real=0.10 secs] 
2021-02-12T20:17:10.207+0000: [GC (Allocation Failure) [PSYoungGen: 433664K->16376K(462848K)] 457153K->62952K(835584K), 0.0293058 secs] [Times: user=0.17 sys=0.02, real=0.03 secs] 
2021-02-12T20:17:12.893+0000: [GC (Allocation Failure) [PSYoungGen: 462840K->27642K(481280K)] 509416K->328728K(854016K), 0.2258796 secs] [Times: user=1.57 sys=0.22, real=0.23 secs] 
2021-02-12T20:17:13.119+0000: [Full GC (Ergonomics) [PSYoungGen: 27642K->0K(481280K)] [ParOldGen: 301086K->317625K(916480K)] 328728K->317625K(1397760K), [Metaspace: 63821K->63816K(1110016K)], 1.6353318 secs] [Times: user=10.11 sys=0.08, real=1.64 secs] 
2021-02-12T20:17:15.068+0000: [GC (Allocation Failure) [PSYoungGen: 453632K->75168K(579584K)] 771257K->523874K(1496064K), 0.0906250 secs] [Times: user=0.59 sys=0.13, real=0.09 secs] 
2021-02-12T20:17:15.514+0000: [GC (Allocation Failure) [PSYoungGen: 528800K->2329K(671232K)] 977506K->451043K(1587712K), 0.0152511 secs] [Times: user=0.11 sys=0.00, real=0.01 secs] 
2021-02-12T20:17:15.945+0000: [GC (Allocation Failure) [PSYoungGen: 543001K->76277K(669696K)] 991715K->983751K(1586176K), 0.1116201 secs] [Times: user=0.54 sys=0.35, real=0.12 secs] 
2021-02-12T20:17:16.057+0000: [Full GC (Ergonomics) [PSYoungGen: 76277K->0K(669696K)] [ParOldGen: 907474K->523576K(1430528K)] 983751K->523576K(2100224K), [Metaspace: 65321K->65321K(1110016K)], 0.9539858 secs] [Times: user=7.26 sys=0.01, real=0.95 secs] 
2021-02-12T20:17:17.427+0000: [GC (Allocation Failure) [PSYoungGen: 540672K->7657K(679936K)] 1064248K->531242K(2110464K), 0.0102141 secs] [Times: user=0.06 sys=0.00, real=0.01 secs] 
2021-02-12T20:17:17.914+0000: [GC (Allocation Failure) [PSYoungGen: 637929K->102391K(760832K)] 1161514K->1215807K(2191360K), 0.1063215 secs] [Times: user=0.58 sys=0.20, real=0.10 secs] 
2021-02-12T20:17:18.020+0000: [Full GC (Ergonomics) [PSYoungGen: 102391K->0K(760832K)] [ParOldGen: 1113416K->454233K(1679872K)] 1215807K->454233K(2440704K), [Metaspace: 65779K->65764K(1112064K)], 0.0906173 secs] [Times: user=0.39 sys=0.00, real=0.09 secs] 
2021-02-12T20:17:18.733+0000: [GC (Allocation Failure) [PSYoungGen: 630272K->17588K(888832K)] 1084505K->471830K(2568704K), 0.0175248 secs] [Times: user=0.03 sys=0.01, real=0.02 secs] 
2021-02-12T20:17:19.399+0000: [GC (Allocation Failure) [PSYoungGen: 778420K->29288K(900608K)] 1232662K->483537K(2580480K), 0.0225306 secs] [Times: user=0.05 sys=0.03, real=0.02 secs] 
2021-02-12T20:17:20.012+0000: [GC (Allocation Failure) [PSYoungGen: 790120K->18446K(962560K)] 1244369K->472704K(2642432K), 0.0210335 secs] [Times: user=0.04 sys=0.01, real=0.02 secs] 
2021-02-12T20:17:20.738+0000: [GC (Allocation Failure) [PSYoungGen: 866830K->18574K(975360K)] 1321088K->472840K(2655232K), 0.0235178 secs] [Times: user=0.07 sys=0.01, real=0.02 secs] 
2021-02-12T20:17:21.412+0000: [GC (Allocation Failure) [PSYoungGen: 866958K->31878K(1034240K)] 1321224K->486152K(2714112K), 0.0243945 secs] [Times: user=0.04 sys=0.04, real=0.03 secs] 
2021-02-12T20:17:22.599+0000: [GC (Allocation Failure) [PSYoungGen: 964742K->53206K(1047040K)] 1419016K->507488K(2726912K), 0.0283320 secs] [Times: user=0.08 sys=0.03, real=0.03 secs] 
2021-02-12T20:17:23.132+0000: [GC (Allocation Failure) [PSYoungGen: 986070K->23551K(1113088K)] 1440352K->477840K(2792960K), 0.0177533 secs] [Times: user=0.06 sys=0.00, real=0.02 secs] 
2021-02-12T20:17:23.604+0000: [GC (Allocation Failure) [PSYoungGen: 1037311K->28486K(1121280K)] 1491600K->482783K(2801152K), 0.0183161 secs] [Times: user=0.03 sys=0.03, real=0.02 secs] 
2021-02-12T20:17:24.024+0000: [GC (Allocation Failure) [PSYoungGen: 1042246K->36085K(1196032K)] 1496543K->490390K(2875904K), 0.0191460 secs] [Times: user=0.04 sys=0.03, real=0.02 secs] 
2021-02-12T20:17:24.584+0000: [GC (Allocation Failure) [PSYoungGen: 1139957K->50496K(1199616K)] 1594262K->504809K(2879488K), 0.0207042 secs] [Times: user=0.05 sys=0.01, real=0.02 secs] 
2021-02-12T20:17:25.046+0000: [GC (Allocation Failure) [PSYoungGen: 1154368K->47787K(1273344K)] 1608681K->502108K(2953216K), 0.0271859 secs] [Times: user=0.07 sys=0.03, real=0.02 secs] 
2021-02-12T20:17:25.520+0000: [GC (Allocation Failure) [PSYoungGen: 1225899K->50015K(1271296K)] 1680220K->504344K(2951168K), 0.0199173 secs] [Times: user=0.06 sys=0.01, real=0.02 secs] 
2021-02-12T20:17:26.012+0000: [GC (Allocation Failure) [PSYoungGen: 1228127K->28438K(1347584K)] 1682456K->482776K(3027456K), 0.0222568 secs] [Times: user=0.04 sys=0.02, real=0.03 secs] 
2021-02-12T20:17:26.519+0000: [GC (Allocation Failure) [PSYoungGen: 1290518K->21046K(1350656K)] 1744856K->475392K(3030528K), 0.0208783 secs] [Times: user=0.04 sys=0.01, real=0.02 secs] 
2021-02-12T20:17:27.004+0000: [GC (Allocation Failure) [PSYoungGen: 1283126K->51072K(1436672K)] 1737472K->505426K(3116544K), 0.0248668 secs] [Times: user=0.06 sys=0.03, real=0.03 secs] 
2021-02-12T20:17:27.523+0000: [GC (Allocation Failure) [PSYoungGen: 1401216K->49452K(1437184K)] 1855570K->503966K(3117056K), 0.0230231 secs] [Times: user=0.07 sys=0.00, real=0.03 secs] 
2021-02-12T20:17:28.038+0000: [GC (Allocation Failure) [PSYoungGen: 1399596K->42078K(1528832K)] 1854110K->496648K(3208704K), 0.0247465 secs] [Times: user=0.06 sys=0.02, real=0.02 secs] 
2021-02-12T20:17:28.670+0000: [GC (Allocation Failure) [PSYoungGen: 1491038K->24493K(1531392K)] 1945608K->479087K(3211264K), 0.0582659 secs] [Times: user=0.15 sys=0.00, real=0.06 secs] 
2021-02-12T20:17:29.633+0000: [GC (Allocation Failure) [PSYoungGen: 1473453K->31079K(1612800K)] 1928047K->486008K(3292672K), 0.0336889 secs] [Times: user=0.05 sys=0.02, real=0.04 secs] 
2021-02-12T20:17:30.843+0000: [GC (Allocation Failure) [PSYoungGen: 1575783K->46063K(1622528K)] 2030712K->501032K(3302400K), 0.0422580 secs] [Times: user=0.09 sys=0.01, real=0.04 secs] 
2021-02-12T20:17:32.433+0000: [GC (Allocation Failure) [PSYoungGen: 1590767K->24292K(1703424K)] 2045736K->480558K(3383296K), 0.0506315 secs] [Times: user=0.08 sys=0.02, real=0.05 secs] 
2021-02-12T20:17:34.324+0000: [GC (Allocation Failure) [PSYoungGen: 1659108K->24958K(1710592K)] 2115374K->481281K(3390464K), 0.0576808 secs] [Times: user=0.13 sys=0.00, real=0.06 secs] 
Heap
 PSYoungGen      total 1710592K, used 1467342K [0x0000000740000000, 0x00000007b2400000, 0x00000007c0000000)
  eden space 1634816K, 88% used [0x0000000740000000,0x0000000798093f40,0x00000007a3c80000)
  from space 75776K, 32% used [0x00000007a3c80000,0x00000007a54dfb78,0x00000007a8680000)
  to   space 73728K, 0% used [0x00000007adc00000,0x00000007adc00000,0x00000007b2400000)
 ParOldGen       total 1679872K, used 456322K [0x0000000640000000, 0x00000006a6880000, 0x0000000740000000)
  object space 1679872K, 27% used [0x0000000640000000,0x000000065bda0b40,0x00000006a6880000)
 Metaspace       used 71040K, capacity 76834K, committed 76948K, reserved 1116160K
  class space    used 9093K, capacity 9852K, committed 9876K, reserved 1048576K
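
As a rough way to read a log like the one above, the `real=` values can be totaled to see how much wall-clock time the JVM actually spent paused in collections. This is a minimal sketch, assuming the Java 8 ParallelGC line format shown in the log; `GcPauseSummary` and its members are hypothetical names made up for the illustration.

```scala
object GcPauseSummary {
  // Matches the wall-clock pause reported at the end of each log line, e.g. "real=0.23 secs".
  private val RealPause = """real=(\d+\.\d+) secs""".r

  final case class Summary(collections: Int, fullCollections: Int, totalPauseSecs: Double)

  // Counts the collections in the given log lines and sums their wall-clock pauses.
  // Lines without a "real=" entry (such as the final Heap dump) are ignored.
  def summarize(lines: Seq[String]): Summary = {
    val pauses = lines.flatMap { line =>
      RealPause.findFirstMatchIn(line).map { m =>
        (line.contains("[Full GC"), m.group(1).toDouble)
      }
    }
    Summary(pauses.size, pauses.count(_._1), pauses.map(_._2).sum)
  }

  def main(args: Array[String]): Unit = {
    // Two lines copied from the log above; in practice, read the whole file instead.
    val sample = Seq(
      "2021-02-12T20:17:12.893+0000: [GC (Allocation Failure) [PSYoungGen: 462840K->27642K(481280K)] 509416K->328728K(854016K), 0.2258796 secs] [Times: user=1.57 sys=0.22, real=0.23 secs]",
      "2021-02-12T20:17:13.119+0000: [Full GC (Ergonomics) [PSYoungGen: 27642K->0K(481280K)] [ParOldGen: 301086K->317625K(916480K)] 328728K->317625K(1397760K), [Metaspace: 63821K->63816K(1110016K)], 1.6353318 secs] [Times: user=10.11 sys=0.08, real=1.64 secs]"
    )
    val s = summarize(sample)
    println(f"${s.collections} collections (${s.fullCollections} full), ${s.totalPauseSecs}%.2f s paused")
  }
}
```

Comparing the total pause with the time span covered by the log gives a quick estimate of how much of the run is really spent in garbage collection.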

I am still in WTF mode: why can't it handle 25 MB of data...

2 answers:

Answer 0 (score: 1)

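Your job seems to create a large number of JVM objects. The answer has three parts:

  1. Reduce the parallelism by passing --executor-cores 4 and increase the memory with --executor-memory 8g

  2. Pass extra JVM options to the driver and the executors to switch the garbage collector to G1: --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC" --conf "spark.driver.extraJavaOptions=-XX:+UseG1GC"

  3. Make sure you are running on YARN: --master yarn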
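
One way to confirm that the GC flag from step 2 actually reached a JVM is to ask that JVM which collectors it is running. This is a minimal sketch that relies on the standard `java.lang.management` API and on the HotSpot collector names (G1 reports "G1 Young Generation" and "G1 Old Generation", the parallel collector reports "PS Scavenge" and "PS MarkSweep"); `GcCheck` and its methods are hypothetical names.

```scala
import java.lang.management.ManagementFactory

object GcCheck {
  // Names of the garbage collectors active in the JVM that runs this code.
  def collectorNames(): Seq[String] = {
    val beans = ManagementFactory.getGarbageCollectorMXBeans
    (0 until beans.size).map(i => beans.get(i).getName)
  }

  // Assumption: HotSpot naming, where both G1 collector beans start with "G1".
  def usesG1(names: Seq[String]): Boolean = names.exists(_.startsWith("G1"))

  def main(args: Array[String]): Unit = {
    val names = collectorNames()
    println(s"Collectors in this JVM: ${names.mkString(", ")}")
    println(s"G1 in use: ${usesG1(names)}")
  }
}
```

Called from the application's main method, this only reports on the driver JVM. To check an executor, the same `collectorNames()` call would have to run inside a task.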

Answer 1 (score: 0)

Actions such as take or count trigger execution of the DAG, and that takes some time to run. You can do the following to reduce the time:

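  1. Cache or persist intermediate results
  2. If the data is small, just run on a single machine
  3. Use CloudWatch to monitor your EMR cluster and check the available YARN memory and the container pending ratio during the run; these show whether your job is short of resources.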
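
To make the point about actions concrete: calling `cache` only marks an RDD for caching, and nothing is computed until the first action runs, so the time measured around the first `take(10)` includes building the RDD itself. Below is a minimal sketch, assuming a local SparkSession and a synthetic RDD standing in for vertRDD (both are illustrative, not the asker's actual setup); `timed` is a hypothetical helper.

```scala
import org.apache.spark.sql.SparkSession

object CacheTimingSketch {
  // Hypothetical helper: evaluates `body` and returns its result with the elapsed milliseconds.
  def timed[A](body: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = body
    (result, (System.nanoTime() - start) / 1000000L)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cache-timing-sketch")
      .master("local[*]") // assumption: local test run; remove when submitting with --master yarn
      .getOrCreate()

    // Synthetic stand-in for vertRDD: 250k small records.
    val rdd = spark.sparkContext
      .parallelize(1 to 250000, numSlices = 8)
      .map(i => (i.toLong, s"vertex-$i"))

    val (_, markMs) = timed(rdd.cache())     // only sets the storage level, no job runs
    val (n, fillMs) = timed(rdd.count())     // first action: computes every partition and fills the cache
    val (head, takeMs) = timed(rdd.take(10)) // read back from the cached partitions

    println(s"cache() call: $markMs ms")
    println(s"count() = $n: $fillMs ms")
    println(s"take(10) returned ${head.length} rows: $takeMs ms")

    spark.stop()
  }
}
```

Timing the steps separately like this shows whether the slow part is computing the RDD (for example, the database read) or the `take(10)` call itself.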