I have a problem to solve where data comes in from Kinesis as JSON in the following shape:
{
  "datatype": "datatype_1",
  "id": "id_1",
  "data": { ... }
}
Every record on the stream then needs to pass through a lookup function, with the datatype and id passed as parameters, to find the unique set of locations the item should be written to as JSON, i.e.
def get_locations(id: String, datatype: String): Array[String] = ??? // custom logic here
The resulting array would look like:
[ "s3://example_bucket/example_folder_1", "s3://example_bucket2/example_folder_2" ]
My question is: what is the most efficient way to group the records from the stream by datatype and id and write them to the various S3 locations? I was hoping to do something like the following:
sparkSession.readStream.format("kinesis")
.option("streamName", kinesis_stream_name)
.option("initialPosition", "latest")
.option("region", aws_region)
.load()
//more transforms
.select(
col("datatype"),
col("id"),
col("data")
)
// Not sure how I can do what's below
// .write.partitionBy("id", "datatype")
// .format("json")
// .option("compression","gzip")
// .save(get_locations("id","datatype")) //saving to all locations in result array
Answer 0 (score: 0):
As a best practice, I would suggest creating the buckets from your code at runtime; you can use the AWS S3 API for Node.js or for whatever language your runtime is written in.
As you said in the comments, you are getting the parameters at runtime. To answer your question, here is a function that creates a bucket whose name contains the id (you can change the naming format to whatever you like); inside that bucket, when the DataFrame is saved, you will get one file per partition:
import java.util
import com.amazonaws.regions.Regions
import com.amazonaws.services.s3.model.{AmazonS3Exception, Bucket}
import com.amazonaws.services.s3.{AmazonS3, AmazonS3ClientBuilder}
def get_locations(id: String, datatype: String): String = {
  // Configure the region as appropriate for your deployment.
  val s3: AmazonS3 = AmazonS3ClientBuilder.standard.withRegion(Regions.DEFAULT_REGION).build
  object CreateBucket {
    // Look up an existing bucket by name among the buckets owned by this account.
    def getBucket(bucket_name: String): Bucket = {
      import scala.collection.JavaConverters._
      val buckets: util.List[Bucket] = s3.listBuckets
      buckets.asScala.find(_.getName == bucket_name).orNull
    }

    // Return the bucket, creating it first if it does not already exist.
    def createBucket(bucket_name: String): Bucket = {
      if (s3.doesBucketExistV2(bucket_name)) {
        System.out.format("Bucket %s already exists.\n", bucket_name)
        getBucket(bucket_name)
      } else {
        try s3.createBucket(bucket_name)
        catch {
          case e: AmazonS3Exception =>
            System.err.println(e.getErrorMessage)
            null
        }
      }
    }
  }
  // Change the bucket naming scheme here if you like.
  val bucket_name = "bucket_" + id
  val my_bucket = CreateBucket.createBucket(bucket_name)
  // Return the S3 path of the bucket so it can be passed to save().
  "s3://" + my_bucket.getName
}
//I don't know how you will get those parameters
var id = " "
var datatype = " "
df.write.partitionBy("id", "datatype")
  .format("json")
  .option("compression", "gzip")
  .save(get_locations(id, datatype))
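Note that df.write only works on a static DataFrame; calling it on the streaming DataFrame returned by readStream will throw. One way to bridge the gap, sketched below, is foreachBatch, which hands each micro-batch to you as an ordinary DataFrame that you can split by key and write out. This sketch assumes the question's original signature, get_locations(id, datatype): Array[String], and that the number of distinct (id, datatype) pairs per micro-batch is small; the checkpoint path is a hypothetical placeholder:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

val query = df.writeStream
  // Hypothetical checkpoint location; use a real path in your bucket.
  .option("checkpointLocation", "s3://example_bucket/checkpoints/")
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    batch.persist() // the batch is scanned once per (id, datatype) group
    val keys = batch.select("id", "datatype").distinct().collect()
    keys.foreach { row =>
      val (id, datatype) = (row.getString(0), row.getString(1))
      val group = batch.filter(col("id") === id && col("datatype") === datatype)
      // Write the same group to every location the lookup returns.
      get_locations(id, datatype).foreach { path =>
        group.write.mode("append").format("json")
          .option("compression", "gzip").save(path)
      }
    }
    batch.unpersist()
  }
  .start()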
Don't forget to add the dependency to your Maven pom.xml or build.sbt, matching the AWS SDK version you already use:
<dependency>
<groupId>com.amazonaws</groupId>
<artifactId>aws-java-sdk-s3</artifactId>
<version>1.11.979</version>
</dependency>
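For build.sbt, the same artifact and version would be:
libraryDependencies += "com.amazonaws" % "aws-java-sdk-s3" % "1.11.979"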