Avro schema evolution

Time: 2013-03-11 23:20:22

Tags: avro

I have two questions:

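  1. Is it possible to use the same reader to parse records that were written with two compatible schemas? For example, Schema V2 has only one additional optional field compared with Schema V1, and I want the reader to understand both. I think the answer here is no, but if it is yes, how do I do that?

  2. I tried writing a record with Schema V1 and reading it with Schema V2, but I get the following error:

    org.apache.avro.AvroTypeException: Found foo, expecting foo

    I used avro-1.7.3 and: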

       writer = new GenericDatumWriter<GenericData.Record>(SchemaV1);
       reader = new GenericDatumReader<GenericData.Record>(SchemaV2, SchemaV1);
    

    Here are examples of the two schemas (I also tried adding a namespace, but with no luck).

    Schema V1:

    {
    "name": "foo",
    "type": "record",
    "fields": [{
        "name": "products",
        "type": {
            "type": "array",
            "items": {
                "name": "product",
                "type": "record",
                "fields": [{
                    "name": "a1",
                    "type": "string"
                }, {
                    "name": "a2",
                    "type": {"type": "fixed", "name": "a3", "size": 1}
                }, {
                    "name": "a4",
                    "type": "int"
                }, {
                    "name": "a5",
                    "type": "int"
                }]
            }
        }
    }]
    }
    

    Schema V2:

    {
    "name": "foo",
    "type": "record",
    "fields": [{
        "name": "products",
        "type": {
            "type": "array",
            "items": {
                "name": "product",
                "type": "record",
                "fields": [{
                    "name": "a1",
                    "type": "string"
                }, {
                    "name": "a2",
                    "type": {"type": "fixed", "name": "a3", "size": 1}
                }, {
                    "name": "a4",
                    "type": "int"
                }, {
                    "name": "a5",
                    "type": "int"
                }]
            }
        }
    },
    {
                "name": "purchases",
                "type": ["null",{
                        "type": "array",
                        "items": {
                                "name": "purchase",
                                "type": "record",
                                "fields": [{
                                        "name": "a1",
                                        "type": "int"
                                }, {
                                        "name": "a2",
                                        "type": "int"
                                }]
                        }
                }]
    }]
    } 
    

    Thanks in advance.

3 Answers:

Answer 0: (score: 10)

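I ran into the same problem. It looks like a bug in Avro, although it follows from the schema resolution rules in the Avro specification: a field that exists only in the reader's schema needs a default value, otherwise resolution fails. You can most likely fix it by adding "default": null to the "purchases" field in Schema V2.

See my blog for details: http://ben-tech.blogspot.com/2013/05/avro-schema-evolution.html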
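
As a sketch of that workaround, this is how the "purchases" field of Schema V2 from the question might look with the default added. The field layout is copied from the question; only the last line is new. Because the union lists "null" first, null is a valid default for it. Note also that the two-argument GenericDatumReader constructor takes the writer's schema first and the reader's schema second, so reading V1 data as V2 would presumably be `new GenericDatumReader<GenericData.Record>(SchemaV1, SchemaV2)`.

```json
{
    "name": "purchases",
    "type": ["null", {
        "type": "array",
        "items": {
            "name": "purchase",
            "type": "record",
            "fields": [
                {"name": "a1", "type": "int"},
                {"name": "a2", "type": "int"}
            ]
        }
    }],
    "default": null
}
```

With this in place, a record written with Schema V1 should resolve against Schema V2 with "purchases" set to null.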

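Answer 1: (score: 0)

You can do the opposite of this, meaning you can write data with schema 2 and parse it with schema 1. The writer puts all of its fields into the file, and if the reader does not ask for one of them, that is fine. But if the writer wrote fewer fields than the reader expects, the reader cannot find the extra fields in the data, and that produces an error (unless the reader's schema declares default values for them).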

Answer 2: (score: 0)

The best approach is to keep a mapping of schema versions in something like the Confluent Avro Schema Registry.

Key takeaways:

1.  Unlike Thrift, Avro serialized objects do not hold any schema.
2.  As there is no schema stored in the serialized byte array, one has to provide the schema with which it was written.
3.  Confluent Schema Registry provides a service to maintain schema versions.
4.  Confluent provides a cached schema client, which checks its cache first before sending a request over the network.
5.  The JSON schema in the “avsc” file is different from the schema present in the Avro object.
6.  All Avro objects extend GenericRecord.
7.  During serialization: based on the schema of the Avro object, a schema ID is requested from the Confluent Schema Registry.
8.  The schema ID, which is an integer, is converted to bytes and prepended to the serialized Avro object (in Confluent's wire format, after a single magic byte).
9.  During deserialization: the header is removed from the front of the byte array, and its 4 ID bytes are converted back to an integer (the schema ID).
10. The schema is requested from the Confluent Schema Registry, and the byte array is deserialized using this schema.

http://bytepadding.com/big-data/spark/avro/avro-serialization-de-serialization-using-confluent-schema-registry/
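
The framing in takeaways 8 and 9 can be sketched with nothing but the Java standard library. This is a minimal illustration, not Confluent's serializer: the class and method names (`WireFormatSketch`, `frame`, `schemaId`, `payload`) are hypothetical, the registry lookup is left out, and the payload is a placeholder rather than real Avro output. It assumes Confluent's documented wire format, which puts one magic byte (0) before the 4-byte big-endian schema ID, so the header is 5 bytes rather than 4.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical helper illustrating the "schema ID prefix" framing; not a Confluent API.
final class WireFormatSketch {
    // Assumption: magic byte 0, then a 4-byte big-endian schema ID, then the Avro bytes.
    static final byte MAGIC_BYTE = 0x0;
    static final int HEADER_SIZE = 1 + Integer.BYTES;

    // Serialization side: prepend the header to an already serialized Avro payload.
    static byte[] frame(int schemaId, byte[] avroPayload) {
        return ByteBuffer.allocate(HEADER_SIZE + avroPayload.length)
                .put(MAGIC_BYTE)
                .putInt(schemaId) // ByteBuffer is big-endian by default
                .put(avroPayload)
                .array();
    }

    // Deserialization side: validate the header and read the schema ID back.
    static int schemaId(byte[] message) {
        ByteBuffer buffer = ByteBuffer.wrap(message);
        if (buffer.remaining() < HEADER_SIZE || buffer.get() != MAGIC_BYTE) {
            throw new IllegalArgumentException("Not a framed message");
        }
        return buffer.getInt();
    }

    // Deserialization side: strip the header, leaving the bytes to hand to an Avro reader.
    static byte[] payload(byte[] message) {
        schemaId(message); // reuse the header validation
        return Arrays.copyOfRange(message, HEADER_SIZE, message.length);
    }

    public static void main(String[] args) {
        byte[] placeholderAvroBytes = {1, 2, 3}; // stands in for real Avro output
        byte[] message = frame(42, placeholderAvroBytes);
        System.out.println(schemaId(message));
        System.out.println(Arrays.toString(payload(message)));
    }
}
```

In a real consumer, the ID returned by `schemaId` would be used to fetch the writer's schema from the registry (takeaway 10), and that schema would be passed to the Avro reader together with the reader's own schema.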