Question

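Basically, I want multiple tasks running on the same node/executor to read data from shared memory. For that I need an initialization function that loads the data into memory before any task starts. If Spark provided a hook for executor startup, I could put this initialization code in that callback, and tasks would run only after the startup has completed.

So my question is: does Spark provide such a hook? If not, how else can I achieve the same thing?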

Answer 1

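You don't have to run multiple instances of the application to run multiple jobs (that is, one application instance per Spark job). Multiple threads can use the same SparkSession object to submit Spark jobs in parallel.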

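So it could work like this:

The application starts and runs the initialization function to load the shared data into memory, say into an object of a SharedData class.
A SparkSession is created.
A thread pool is created, and each thread has access to the (SparkSession, SharedData) objects.
Each thread creates Spark jobs using the shared SparkSession and SharedData.
Depending on your use case, the application does one of the following:
- waits for all jobs to complete and then closes the SparkSession
- waits in a loop for new requests to arrive and creates new Spark jobs as needed, using threads from the thread pool.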
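
The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: `SharedData`, `SharedSessionSketch`, the `local[2]` master and the toy computation are all hypothetical stand-ins, and a real application would load real data and point at a real cluster.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._
import org.apache.spark.sql.SparkSession

// Hypothetical holder for the data loaded once when the application starts.
final case class SharedData(lookup: Map[String, Int])

object SharedSessionSketch {
  def runJobs(): Seq[Long] = {
    // 1. Load the shared data once (hard-coded here for illustration).
    val shared = SharedData(Map("a" -> 1, "b" -> 2))

    // 2. Create a single SparkSession (local mode is an assumption).
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("shared-session-sketch")
      .getOrCreate()

    // 3. Create a thread pool whose threads all see (spark, shared).
    val pool = Executors.newFixedThreadPool(2)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)

    try {
      // 4. Each Future runs on a pool thread and submits its own Spark job.
      val jobs = Seq("a", "b").map { key =>
        Future {
          val factor = shared.lookup(key) // read on the driver side
          spark.sparkContext.parallelize(1 to 10).map(_ * factor).sum().toLong
        }
      }
      // 5. Wait for all jobs to complete, then shut everything down.
      Await.result(Future.sequence(jobs), 5.minutes)
    } finally {
      pool.shutdown()
      spark.stop()
    }
  }
}
```

Note that in this sketch the shared data lives in the driver JVM; only the small value captured by each closure is shipped to the executors.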

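The SparkContext (sparkSession.sparkContext) is useful when you want to assign a job description with setJobDescription, or assign jobs to a group with setJobGroup so that related jobs can be cancelled together with cancelJobGroup. You can also tune the priority of jobs that use the same pool; for details see https://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application.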
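
As a rough illustration of the job-group calls mentioned here, the sketch below tags a slow job with a group from a worker thread and cancels that group from the main thread. The group id `report-42`, the pool name `reports`, the `local[2]` master and the sleep durations are all assumptions made for the example; the pool is not configured in any allocation file, so Spark would give it default settings.

```scala
import java.util.concurrent.atomic.AtomicBoolean
import org.apache.spark.SparkException
import org.apache.spark.sql.SparkSession

object JobGroupSketch {
  // Returns true if the job ended because its group was cancelled.
  def runAndCancel(): Boolean = {
    val spark = SparkSession.builder()
      .master("local[2]")                     // assumption: local test run
      .appName("job-group-sketch")
      .config("spark.scheduler.mode", "FAIR") // enables scheduler pools
      .getOrCreate()
    val sc = spark.sparkContext
    val cancelled = new AtomicBoolean(false)

    val worker = new Thread(() => {
      // Both settings are thread-local: they apply to jobs submitted
      // from this thread only.
      sc.setJobGroup("report-42", "slow demo job", interruptOnCancel = true)
      sc.setLocalProperty("spark.scheduler.pool", "reports")
      try {
        // Deliberately slow job: about 5 seconds per partition.
        sc.parallelize(1 to 100, 2).map { i => Thread.sleep(100); i }.count()
      } catch {
        case _: SparkException => cancelled.set(true)
      }
    })

    worker.start()
    Thread.sleep(2000)             // let the job start running
    sc.cancelJobGroup("report-42") // cancels every job in the group
    worker.join()
    spark.stop()
    cancelled.get
  }
}
```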

Answer 2

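Spark's solution for "shared data" is broadcast: you load the data once in the driver application, and Spark serializes it and sends it to each executor (once). If a task uses that data, Spark makes sure it is there before the task executes. For example: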

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    object MySparkTransformation {
      def transform(rdd: RDD[String], sc: SparkContext): RDD[Int] = {
        // loadDataOnce() is your own loader; it is not defined here
        val mySharedData: Map[String, Int] = loadDataOnce()
        val broadcast = sc.broadcast(mySharedData)
        rdd.map(r => broadcast.value(r))
      }
    }

Alternatively, if you want to avoid reading the data into driver memory and sending it over to the executors, you can use a lazy value in a Scala object to create a value that is populated once per JVM, which in Spark's case means once per executor. For example:

    import org.apache.spark.rdd.RDD

    // must be an object, otherwise will be serialized and sent from driver
    object MySharedResource {
      lazy val mySharedData: Map[String, Int] = loadDataOnce()
    }

    // If you use mySharedData in a Spark transformation,
    // the "local" copy in each executor will be used:
    object MySparkTransformation {
      def transform(rdd: RDD[String]): RDD[Int] = {
        // Spark won't include MySharedResource.mySharedData in the
        // serialized task sent from driver, since it's "static"
        rdd.map(r => MySharedResource.mySharedData(r))
      }
    }

In practice, each executor will have one copy of mySharedData.
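
The "once per JVM" behaviour of a lazy value in a Scala object can be checked without Spark at all. The sketch below is a plain-Scala illustration, not part of the answer's code: `LazyOnceSketch`, `loadCount` and the hard-coded map are hypothetical, and the threads stand in for tasks running concurrently inside one executor JVM.

```scala
import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object LazyOnceSketch {
  // Counts how many times the loader actually runs.
  val loadCount = new AtomicInteger(0)

  // Stand-in for an expensive load (file, database, ...).
  private def loadDataOnce(): Map[String, Int] = {
    loadCount.incrementAndGet()
    Map("a" -> 1, "b" -> 2)
  }

  // Initialized on first access; Scala guards the initialization,
  // so concurrent first accesses still load the data only once.
  lazy val mySharedData: Map[String, Int] = loadDataOnce()

  // Several threads read the shared value concurrently,
  // much like tasks sharing one executor JVM.
  def lookupFromManyThreads(): Seq[Int] = {
    val lookups = (1 to 8).map(_ => Future(mySharedData("b")))
    Await.result(Future.sequence(lookups), 10.seconds)
  }
}
```

After calling `LazyOnceSketch.lookupFromManyThreads()`, `LazyOnceSketch.loadCount.get` should be 1 even though eight lookups ran.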

Is there a hook for Executor Startup in Spark?
