Question

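I am using Confluent's KafkaAvroDeserializer to deserialize Avro objects sent through Kafka. I want to write the received data to a Parquet file. I would like to be able to append data to the same Parquet file, and to create Parquet output with partitions.

I managed to create a Parquet file with AvroParquetWriter, but I have not found a way to add partitions or to append to the same file.

Before using Avro, I used Spark to write Parquet. With Spark, writing partitioned Parquet in append mode was trivial. Should I try to create RDDs from my Avro objects and use Spark to create the Parquet files?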

Answer 1

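"I want to write the Parquet files to HDFS"

Personally, I would not use Spark for this.

Instead, I would use the HDFS Kafka Connector. Here is a config file that can get you started.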

name=hdfs-sink
# List of topics to read
topics=test_hdfs
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
# increase to be the sum of the partitions for all connected topics
tasks.max=1
# the folder where core-site.xml and hdfs-site.xml exist
hadoop.conf.dir=/etc/hadoop
# the namenode url, defined as fs.defaultFS in the core-site.xml
hdfs.url=hdfs://hdfs-namenode.example.com:9000
# number of messages per file
flush.size=10
# The format to write the message values
format.class=io.confluent.connect.hdfs.parquet.ParquetFormat
# Setup Avro parser
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://schema-registry.example.com:8081
value.converter.schemas.enable=true
schema.compatibility=BACKWARD

If you want HDFS partitions based on a field rather than the literal "Kafka partition" number, refer to the configuration docs on the FieldPartitioner. If you want automatic Hive integration, see the docs on that as well.
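As a rough sketch of what field-based partitioning and Hive integration might look like in the connector config, the fragment below could be added to the properties file. The field name `country` and the metastore host are placeholders, not values from the question. The partitioner's package also differs between connector versions (newer releases ship it under `io.confluent.connect.storage.partitioner`), so check the class name against the version you run.

```properties
# Partition HDFS output by a record field instead of the Kafka partition number
partitioner.class=io.confluent.connect.hdfs.partitioner.FieldPartitioner
# "country" is a placeholder; use a field that exists in your Avro schema
partition.field.name=country

# Optional Hive integration; the metastore host below is a placeholder
hive.integration=true
hive.metastore.uris=thrift://hive-metastore.example.com:9083
hive.database=default
```

Hive integration also needs `schema.compatibility` set to something other than `NONE`, such as `BACKWARD`.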

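Let's say you did want to use Spark, though. You could try AbsaOSS/ABRiS to read the Avro data into a DataFrame, and then you should be able to write it out as Parquet, partitioned by a field and in append mode (no exact code here, because I have not tried it).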
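On the "append to the same file" part: a Parquet file stores its metadata in a footer written when the file is closed, so a finished file cannot be extended in place. Both Spark's append mode and the HDFS connector instead add new files to a directory, and partitioning is expressed as `field=value` subdirectories. The sketch below illustrates that layout only. `PartitionLayout` and its methods are hypothetical names invented for this example, records are modelled as plain maps rather than Avro objects, and no Parquet data is actually written.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper for illustration; not part of Parquet, Spark, or Kafka Connect. */
final class PartitionLayout {

    private PartitionLayout() {}

    /** Builds a Hive-style partition directory such as /data/events/country=US. */
    static String partitionDir(String baseDir, String field, String value) {
        // Note: values are used verbatim; a real writer would escape characters like '/'.
        return baseDir + "/" + field + "=" + value;
    }

    /** Names the next file in a partition; "appending" means adding a new file. */
    static String nextFile(String partitionDir, int sequence) {
        return String.format(Locale.ROOT, "%s/part-%05d.parquet", partitionDir, sequence);
    }

    /** Groups records by the value of one field, keeping first-seen order of the groups. */
    static Map<String, List<Map<String, String>>> groupByField(
            List<Map<String, String>> records, String field) {
        Map<String, List<Map<String, String>>> groups = new LinkedHashMap<>();
        for (Map<String, String> rec : records) {
            groups.computeIfAbsent(rec.get(field), k -> new ArrayList<>()).add(rec);
        }
        return groups;
    }
}
```

With an AvroParquetWriter, one way to apply this would be to open one writer per group at the path returned by `nextFile`, incrementing the sequence number for every new batch instead of reopening an existing file.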

How do I write Avro objects to Parquet with partitions in Java? How do I append data to the same Parquet file?
