Spark (Scala): conditionally write JSON to multiple dynamic output locations

Date: 2021-03-23 00:07:29

Tags: scala apache-spark spark-streaming

I have a problem to solve where data comes in from Kinesis as JSON in the following shape:

{
 datatype: "datatype_1"
 id : "id_1"
 data : {...}
}

Every record in the stream then needs to pass through a lookup function, taking the datatype and id as arguments, to find the unique set of locations the item should be written to as JSON.

def get_locations(id: String, datatype: String): Array[String] = //custom logic here

The resulting array looks like:

 [ "s3:///example_bucket/example_folder_1", "s3:///example_bucket2/example_folder_2"]

My question is how to most efficiently group the records from the stream by datatype and id and write them out to the various S3 locations. I was hoping to do something like the following:

        sparkSession.readStream.format("kinesis")
          .option("streamName", kinesis_stream_name)
          .option("initialPosition", "latest")
          .option("region", aws_region)
          .load()
          //more transforms
          .select(
            col("datatype"),
            col("id"),
            col("data")
          )

// Not sure how I can do what's below

//          .write.partitionBy("id", "datatype")
//          .format("json")
//          .option("compression","gzip")
//          .save(get_locations("id","datatype")) //saving to all locations in result array
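
As written, the commented-out block would not work: a streaming DataFrame has to be written through writeStream, and save accepts a single path string rather than the Array[String] returned by get_locations. One common way to handle this kind of dynamic fan-out is foreachBatch (available since Spark 2.4). Below is a minimal sketch, assuming the streaming DataFrame built above is bound to a val named stream and reusing the get_locations lookup; the checkpoint location is a placeholder:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.col

    // Sketch: fan each micro-batch out to the locations returned by get_locations.
    // Assumes the streaming DataFrame built above is bound to a val named `stream`.
    stream.writeStream
      .foreachBatch { (batch: DataFrame, batchId: Long) =>
        batch.persist()
        // The distinct (datatype, id) pairs present in this micro-batch.
        val keys = batch.select("datatype", "id").distinct().collect()
        keys.foreach { row =>
          val datatype = row.getString(0)
          val id       = row.getString(1)
          val subset   = batch.filter(col("datatype") === datatype && col("id") === id)
          // Write the subset to every location the lookup returns.
          get_locations(id, datatype).foreach { location =>
            subset.write
              .mode("append")
              .format("json")
              .option("compression", "gzip")
              .save(location)
          }
        }
        batch.unpersist()
      }
      .option("checkpointLocation", "s3://example_bucket/checkpoints/") // placeholder path
      .start()

Filtering per key scans the micro-batch once per (datatype, id) pair, which is why the batch is persisted first; for a small number of keys this is usually acceptable.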

1 Answer:

Answer 0 (score: 0)

As a best practice, I suggest creating the buckets from your code at runtime; you can use the AWS S3 API for Node.js or for whatever language your runtime uses.

As you said in the comments, you are getting the parameters at runtime. Still, to answer your question, here is a function that creates a bucket with the id in its name (you can change this to whatever format you prefer) and returns its S3 path; inside that bucket you will then end up with files laid out according to the partitions of the DataFrame when it is saved:

import java.util
import com.amazonaws.regions.Regions
import com.amazonaws.services.s3.model.{AmazonS3Exception, Bucket}
import com.amazonaws.services.s3.{AmazonS3, AmazonS3ClientBuilder}

// You can of course configure the region to the adequate one instead of the default.
val s3: AmazonS3 = AmazonS3ClientBuilder.standard.withRegion(Regions.DEFAULT_REGION).build

object CreateBucket {

  // Return the existing bucket with this name, or null if none is found.
  def getBucket(bucket_name: String): Bucket = {
    var named_bucket = null.asInstanceOf[Bucket]
    val buckets: util.List[Bucket] = s3.listBuckets
    import scala.collection.JavaConversions._
    for (b <- buckets) {
      if (b.getName.equals(bucket_name)) named_bucket = b
    }
    named_bucket
  }

  // Create the bucket, or return the existing one if it already exists.
  def createBucket(bucket_name: String): Bucket = {
    var b = null.asInstanceOf[Bucket]
    if (s3.doesBucketExistV2(bucket_name)) {
      System.out.format("Bucket %s already exists.\n", bucket_name)
      b = getBucket(bucket_name)
    }
    else try b = s3.createBucket(bucket_name)
    catch {
      case e: AmazonS3Exception =>
        System.err.println(e.getErrorMessage)
    }
    b
  }
}

// Change the bucket naming scheme here if you like.
def get_locations(id: String, datatype: String): String = {
  val bucket_name = "bucket_" + id
  CreateBucket.createBucket(bucket_name)
  // Return the S3 path for Spark to write to.
  "s3://" + bucket_name
}

    //I don't know how you will get those parameters
    

    var id = " "
    var datatype = " "
    df.write.partitionBy("id", "datatype")
      .format("json")
      .option("compression", "gzip")
      .save(get_locations(id, datatype))

Don't forget to add the dependency to Maven or build.sbt, using the AWS SDK version you already have:

 <dependency>
    <groupId>com.amazonaws</groupId>
    <artifactId>aws-java-sdk-s3</artifactId>
    <version>1.11.979</version>
 </dependency>
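
For build.sbt, the equivalent line (with the same version as above) would be:

    // build.sbt equivalent of the Maven dependency above; match the version to your AWS SDK.
    libraryDependencies += "com.amazonaws" % "aws-java-sdk-s3" % "1.11.979"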