TB数据写入实木复合地板

时间:2019-01-18 01:54:01

标签: apache-spark hbase

我们在hbase中有2.5TB数据,区域大小是5g或10g,hbase表中有450个区域。并且我们需要转换为spark-sql。以及下面使用的方法:

1.snapshot hbase表。

2。通过newHadoopAPIRDD读取hfile

3。写入实木复合地板。

val hconf = HBaseConfiguration.create()
hconf.set("hbase.rootdir", "/hbase")
hconf.set("hbase.zookeeper.quorum", HbaseToSparksqlBySnapshotParam.zookeeperQurum)
hconf.set(TableInputFormat.SCAN, convertScanToString(scan))

val job = Job.getInstance(hconf)
val path = new Path("/snapshot")
val snapshotName = HbaseToSparksqlBySnapshotParam.snapshotName

TableSnapshotInputFormat.setInput(job, snapshotName, path)
val hbaseRDD = spark.sparkContext.newAPIHadoopRDD(job.getConfiguration, classOf[TableSnapshotInputFormat],classOf[ImmutableBytesWritable], classOf[Result]) 
val rdd = hbaseRDD.map{
    case(_,result) => 
    ...
    Row
 }
val df = spark.createDataFrame(rdd, schema)
df.write.parquet("/test")

num-executors executor-memory executor-cores   run_time
16              6               2                  error following
5               15              8                  6hours

我不知道如何设置参数(num-executors,executor-memory,executor-cores),并且运行速度更快。当我只得到一个区域(10g)时,我将参数用作num-executors 1,executor-memory 3g,executor-cores 1,它运行14分钟。

我使用spark2.1.0

错误:

19/01/18 00:55:26 INFO TaskSetManager: Finished task 386.0 in stage 0.0 (TID 331) in 1187343 ms on hdh68 (executor 5) (331/470)
19/01/18 00:55:31 INFO TaskSetManager: Starting task 423.0 in stage 0.0 (TID 363, hdh68, executor 1, partition 423, NODE_LOCAL, 6905 bytes)
19/01/18 00:55:31 INFO TaskSetManager: Finished task 383.0 in stage 0.0 (TID 328) in 1427677 ms on hdh68 (executor 1) (332/470)
19/01/18 00:57:36 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 10.
19/01/18 00:57:36 INFO DAGScheduler: Executor lost: 10 (epoch 0)
19/01/18 00:57:36 INFO BlockManagerMasterEndpoint: Trying to remove executor 10 from BlockManagerMaster.
19/01/18 00:57:36 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(10, hdh68, 40679, None)
19/01/18 00:57:36 INFO BlockManagerMaster: Removed 10 successfully in removeExecutor
19/01/18 00:57:36 INFO DAGScheduler: Shuffle files lost for executor: 10 (epoch 0)
19/01/18 00:57:36 WARN DFSClient: Slow ReadProcessor read fields took 114044ms (threshold=30000ms); ack: seqno: 49 reply: 0 downstreamAckTimeNanos: 0, targets: [DatanodeInfoWithStorage[10.41.2.68:50010,DS-26a72d43-2e16-41a9-9a71-99593f14ab6f,DISK]]
19/01/18 00:57:39 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container killed by YARN for exceeding memory limits. 6.6 GB of 6.6 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
19/01/18 00:57:39 ERROR YarnScheduler: Lost executor 10 on hdh68: Container killed by YARN for exceeding memory limits. 6.6 GB of 6.6 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
19/01/18 00:57:39 INFO BlockManagerMasterEndpoint: Trying to remove executor 10 from BlockManagerMaster.
19/01/18 00:57:39 INFO BlockManagerMaster: Removal of executor 10 requested
19/01/18 00:57:39 INFO YarnSchedulerBackend$YarnDriverEndpoint: Asked to remove non-existent executor 10
19/01/18 00:57:52 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(null) (10.41.2.68:49771) with ID 17
19/01/18 00:57:52 INFO TaskSetManager: Starting task 387.1 in stage 0.0 (TID 364, hdh68, executor 17, partition 387, NODE_LOCAL, 6906 bytes)
19/01/18 00:57:52 INFO TaskSetManager: Starting task 417.1 in stage 0.0 (TID 365, hdh68, executor 17, partition 417, NODE_LOCAL, 6906 bytes)
19/01/18 00:57:53 INFO BlockManagerMasterEndpoint: Registering block manager hdh68:38247 with 3.0 GB RAM, BlockManagerId(17, hdh68, 38247, None)
19/01/18 00:58:42 INFO TaskSetManager: Starting task 426.0 in stage 0.0 (TID 366, hdh68, executor 13, partition 426, NODE_LOCAL, 6905 bytes)
19/01/18 00:58:42 INFO TaskSetManager: Finished task 396.0 in stage 0.0 (TID 341) in 1014645 ms on hdh68 (executor 13) (333/470)
java.io.IOException: Broken pipe
        at sun.nio.ch.FileChannelImpl.transferTo0(Native Method)
        at sun.nio.ch.FileChannelImpl.transferToDirectlyInternal(FileChannelImpl.java:428)
        at sun.nio.ch.FileChannelImpl.transferToDirectly(FileChannelImpl.java:493)
        at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:608)
        at io.netty.channel.DefaultFileRegion.transferTo(DefaultFileRegion.java:139)
        at org.apache.spark.network.protocol.MessageWithHeader.transferTo(MessageWithHeader.java:121)
        at io.netty.channel.socket.nio.NioSocketChannel.doWriteFileRegion(NioSocketChannel.java:287)
        at io.netty.channel.nio.AbstractNioByteChannel.doWrite(AbstractNioByteChannel.java:237)
        at io.netty.channel.socket.nio.NioSocketChannel.doWrite(NioSocketChannel.java:314)
        at io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:802)
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.flush0(AbstractNioChannel.java:313)
        at io.netty.channel.AbstractChannel$AbstractUnsafe.flush(AbstractChannel.java:770)
        at io.netty.channel.DefaultChannelPipeline$HeadContext.flush(DefaultChannelPipeline.java:1256)
        at io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:781)
        at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:773)
        at io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:754)
        at io.netty.channel.ChannelOutboundHandlerAdapter.flush(ChannelOutboundHandlerAdapter.java:115)
        at io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:781)
        at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:773)
        at io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:754)
        at io.netty.channel.ChannelDuplexHandler.flush(ChannelDuplexHandler.java:117)
        at io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:781)
        at io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:807)
        at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:818)
        at io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:799)
        at io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:835)
        at io.netty.channel.DefaultChannelPipeline.writeAndFlush(DefaultChannelPipeline.java:1017)
        at io.netty.channel.AbstractChannel.writeAndFlush(AbstractChannel.java:256)
        at org.apache.spark.network.server.TransportRequestHandler.respond(TransportRequestHandler.java:194)
        at org.apache.spark.network.server.TransportRequestHandler.processStreamRequest(TransportRequestHandler.java:150)
        at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:111)
        at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:119)
        at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51)
        at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:353)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:346)
        at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367)
        at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:353)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:353)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:346)
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1294)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:353)
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:911)
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:652)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:575)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:489)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:451)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:140)
        at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
        at java.lang.Thread.run(Thread.java:748)
19/01/18 00:59:58 INFO TaskSetManager: Starting task 428.0 in stage 0.0 (TID 367, hdh68, executor 17, partition 428, NODE_LOCAL, 6906 bytes)
19/01/18 00:59:58 WARN TaskSetManager: Lost task 387.1 in stage 0.0 (TID 364, hdh68, executor 17): java.nio.channels.ClosedChannelException
        at org.apache.spark.network.client.StreamInterceptor.channelInactive(StreamInterceptor.java:60)
        at org.apache.spark.network.util.TransportFrameDecoder.channelInactive(TransportFrameDecoder.java:179)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:251)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:237)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:230)
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1289)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:251)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:237)
        at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:893)
        at io.netty.channel.AbstractChannel$AbstractUnsafe$7.run(AbstractChannel.java:691)
        at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:408)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:455)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:140)
        at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
        at java.lang.Thread.run(Thread.java:748)

19/01/18 01:00:04 INFO TaskSetManager: Starting task 387.2 in stage 0.0 (TID 368, hdh68, executor 14, partition 387, NODE_LOCAL, 6906 bytes)
19/01/18 01:00:04 INFO TaskSetManager: Finished task 385.0 in stage 0.0 (TID 330) in 1471628 ms on hdh68 (executor 14) (334/470)
19/01/18 01:00:12 INFO TaskSetManager: Starting task 429.0 in stage 0.0 (TID 369, hdh68, executor 8, partition 429, NODE_LOCAL, 6905 bytes)
19/01/18 01:00:12 INFO TaskSetManager: Finished task 392.0 in stage 0.0 (TID 337) in 1220925 ms on hdh68 (executor 8) (335/470)
19/01/18 01:00:27 INFO TaskSetManager: Starting task 430.0 in stage 0.0 (TID 370, hdh68, executor 11, partition 430, NODE_LOCAL, 6906 bytes)
19/01/18 01:01:16 INFO TaskSetManager: Finished task 390.0 in stage 0.0 (TID 335) in 1353136 ms on hdh68 (executor 14) (337/470)
19/01/18 01:01:30 INFO TaskSetManager: Starting task 432.0 in stage 0.0 (TID 372, hdh68, executor 5, partition 432, NODE_LOCAL, 6906 bytes)
19/01/18 01:01:30 INFO TaskSetManager: Finished task 393.0 in stage 0.0 (TID 338) in 1244707 ms on hdh68 (executor 5) (338/470)
java.io.IOException: Broken pipe
        at sun.nio.ch.FileChannelImpl.transferTo0(Native Method)
        at sun.nio.ch.FileChannelImpl.transferToDirectlyInternal(FileChannelImpl.java:428)
        at sun.nio.ch.FileChannelImpl.transferToDirectly(FileChannelImpl.java:493)
        at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:608)
        at io.netty.channel.DefaultFileRegion.transferTo(DefaultFileRegion.java:139)
        at org.apache.spark.network.protocol.MessageWithHeader.transferTo(MessageWithHeader.java:121)
        at io.netty.channel.socket.nio.NioSocketChannel.doWriteFileRegion(NioSocketChannel.java:287)
        at io.netty.channel.nio.AbstractNioByteChannel.doWrite(AbstractNioByteChannel.java:237)
        at io.netty.channel.socket.nio.NioSocketChannel.doWrite(NioSocketChannel.java:314)
        at io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:802)
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.forceFlush(AbstractNioChannel.java:319)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:646)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:575)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:489)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:451)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:140)
        at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
        at java.lang.Thread.run(Thread.java:748)
19/01/18 01:01:59 INFO TaskSetManager: Starting task 434.0 in stage 0.0 (TID 374, hdh68, executor 17, partition 434, NODE_LOCAL, 6905 bytes)
19/01/18 01:01:59 WARN TaskSetManager: Lost task 417.1 in stage 0.0 (TID 365, hdh68, executor 17): java.nio.channels.ClosedChannelException
        at org.apache.spark.network.client.StreamInterceptor.channelInactive(StreamInterceptor.java:60)
        at org.apache.spark.network.util.TransportFrameDecoder.channelInactive(TransportFrameDecoder.java:179)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:251)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:237)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:230)
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1289)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:251)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:237)
        at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:893)
        at io.netty.channel.AbstractChannel$AbstractUnsafe$7.run(AbstractChannel.java:691)
        at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:408)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:455)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:140)
        at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
        at java.lang.Thread.run(Thread.java:748)

更新:

cluser是伪分布式的。内存Tota 200g,VCores Total64。和root.root队列资源,该队列没有其他应用程序:

Used Resources: <memory:171776, vCores:44>
Num Active Applications:    7
Num Pending Applications:   0
Min Resources:  <memory:0, vCores:0>
Max Resources:  <memory:204800, vCores:64>
Steady Fair Share:  <memory:54614, vCores:0>
Instantaneous Fair Share:   <memory:102400, vCores:0>
Preemptable:    true

0 个答案:

没有答案