
avro tricks and pitfalls

2017-07-16 13:29
Use Avro reflection to serialize/deserialize an object (as of version 1.8.1):
Schema schema = ReflectData.AllowNull.get().getSchema(obj.getClass());
byte[] arr = null;
final DatumWriter<Object> writer = new ReflectDatumWriter<>(schema);
final ByteArrayOutputStream out = new ByteArrayOutputStream(10 * 1024);
final BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
writer.write(obj, encoder);
encoder.flush();
arr = out.toByteArray();

Schema schema = ReflectData.AllowNull.get().getSchema(targetClass);
final DatumReader<Object> reader = new ReflectDatumReader<>(schema);
final Decoder decoder = DecoderFactory.get().binaryDecoder(arr, null);
Object readObj = reader.read(null, decoder);

By default, ReflectData.get().getSchema is unable to handle null values for attributes that are of type Object or a collection of objects; a NullPointerException will be thrown. Note: ReflectDatumWriter uses reflection on fields directly. Use ReflectData.AllowNull.get() instead.
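As a minimal sketch of the difference (the Person class below is hypothetical, not from the original text):

```java
import org.apache.avro.Schema;
import org.apache.avro.reflect.ReflectData;

public class AllowNullDemo {
    // Hypothetical POJO whose fields may legitimately be null.
    static class Person {
        String name;
        Integer age;
    }

    public static void main(String[] args) {
        // ReflectData.get(): fields map to non-nullable schema types,
        // so writing a Person whose name is null fails with NullPointerException.
        Schema strict = ReflectData.get().getSchema(Person.class);

        // ReflectData.AllowNull.get(): every field becomes a union with "null",
        // so null attribute values serialize cleanly.
        Schema nullable = ReflectData.AllowNull.get().getSchema(Person.class);

        System.out.println(strict.toString(true));
        System.out.println(nullable.toString(true));
    }
}
```

Comparing the two printed schemas shows the extra `"null"` branch in every field's type under AllowNull.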

By default ReflectDatumWriter does not handle a cyclic object graph, i.e. class A contains an attribute of class B and class B contains an attribute of class A. A StackOverflowError will be thrown.
See: https://issues.apache.org/jira/browse/AVRO-695

For collections, List and Map are fully supported. However, Set attributes are only partially supported with ReflectDatumWriter: you need to explicitly declare the concrete type of the Set in the class field declaration.
Eg.
private Set<String> components
Error: java.lang.RuntimeException: java.lang.NoSuchMethodException: java.util.Set.<init>()
    at org.apache.avro.specific.SpecificData.newInstance(SpecificData.java:344)
    at org.apache.avro.reflect.ReflectDatumReader.newArray(ReflectDatumReader.java:100)
    at org.apache.avro.reflect.ReflectDatumReader.readArray(ReflectDatumReader.java:133)

private HashSet<String> components
Works fine.

Some native Java types like Date, BigDecimal etc. were not supported until recent versions of Avro. Avro has introduced LogicalType, which enhances primitive types with additional information. E.g. date is a logical type carried as an int, and time-micros is a logical type carried as a long.
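A short sketch of how a logical type annotates its underlying primitive schema:

```java
import org.apache.avro.LogicalTypes;
import org.apache.avro.Schema;

public class LogicalTypeDemo {
    public static void main(String[] args) {
        // "date" rides on an int (days since the Unix epoch).
        Schema dateSchema = LogicalTypes.date().addToSchema(Schema.create(Schema.Type.INT));
        // "time-micros" rides on a long (microseconds after midnight).
        Schema timeSchema = LogicalTypes.timeMicros().addToSchema(Schema.create(Schema.Type.LONG));

        // The printed schemas carry the extra "logicalType" attribute.
        System.out.println(dateSchema);
        System.out.println(timeSchema);
    }
}
```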

To write to a ByteArrayOutputStream, BinaryEncoder.flush() must be called after the write operation is performed; otherwise you are likely to get an empty byte array.

ReflectDatumWriter accepts two kinds of constructors: one takes a schema as parameter, the other a class. The former is more flexible, as you can customize the schema building yourself. ReflectData.getSchema() already uses an internal schema cache to boost performance. From analysis, we can see that building a schema is quite expensive, so it is worth considering building the schema at system start rather than inside the serialization operation.
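One way to follow that advice is to build the schema once at class-load time and pass it to the schema-taking constructor; a minimal sketch (the Person class is hypothetical):

```java
import org.apache.avro.Schema;
import org.apache.avro.reflect.ReflectData;
import org.apache.avro.reflect.ReflectDatumWriter;

public class SchemaCacheDemo {
    // Hypothetical POJO used for illustration.
    static class Person {
        String name;
    }

    // Build the (expensive) schema once, at class-load time.
    private static final Schema PERSON_SCHEMA =
            ReflectData.AllowNull.get().getSchema(Person.class);

    static ReflectDatumWriter<Person> newWriter() {
        // Pass the prebuilt schema instead of the class,
        // so no schema building happens on the serialization path.
        return new ReflectDatumWriter<>(PERSON_SCHEMA);
    }
}
```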

EncoderFactory has two configuration parameters: bufferSize and blockSize. A large buffer size can improve performance when serializing large objects.
DirectBinaryEncoder: no write buffering; not recommended for writing large data.
BinaryEncoder (the buffered variant returned by EncoderFactory.get().binaryEncoder()): buffers its output for better performance.
BlockingBinaryEncoder: writes arrays and maps in blocks, so large collections need not be held in memory in their entirety.

Thread safety: Encoder and Decoder instances are not thread-safe, but DatumReader and DatumWriter are, so a single DatumReader or DatumWriter instance may be shared across multiple threads.
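A common pattern following from this rule: share one writer, but keep the non-thread-safe encoder per thread. A sketch, assuming a hypothetical Event class:

```java
import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.reflect.ReflectData;
import org.apache.avro.reflect.ReflectDatumWriter;

public class ThreadSafetyDemo {
    // Hypothetical payload class.
    static class Event {
        String id;
    }

    private static final Schema SCHEMA = ReflectData.AllowNull.get().getSchema(Event.class);

    // DatumWriter is thread-safe: one shared instance is fine.
    private static final ReflectDatumWriter<Event> WRITER = new ReflectDatumWriter<>(SCHEMA);

    // Encoders are NOT thread-safe: keep one per thread and reuse it.
    private static final ThreadLocal<BinaryEncoder> ENCODER = new ThreadLocal<>();

    static byte[] serialize(Event e) throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // Reuse this thread's encoder if present; binaryEncoder(out, reuse)
        // reinitializes it against the new stream.
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, ENCODER.get());
        ENCODER.set(encoder);
        WRITER.write(e, encoder);
        encoder.flush();
        return out.toByteArray();
    }
}
```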

Advanced Avro techniques such as schema reuse, inheritance etc.:
https://www.infoq.com/articles/ApacheAvro

Customizing serialization/deserialization for a special Java class that is not natively supported by Avro (e.g. Date) requires a special conversion class.
Eg
GenericData genericData = new GenericData();
genericData.addLogicalTypeConversion(new DateConversion());
DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<>(schema, genericData);
DatumReader<GenericRecord> datumReader = new GenericDatumReader<>(schema, schema, genericData);

Multiple schemas:
By default, ReflectData creates nested schemas, which are very lengthy and hard to maintain.

Avro supports multiple schema definitions in one schema file, provided that earlier type definitions in the file do not depend on later ones. E.g.
{ "type" : "record",
  "name" : "TestObject3",
  "namespace" : "de.hybris.core.network.serialization",
  "fields" : [
    { "name" : "components",
      "type" : [ "null",
        { "type" : "array",
          "items" : "string",
          "java-class" : "java.util.HashSet"
        }
      ],
      "default" : null
    },
    { "name" : "parent",
      "type" : [ "null", "de.hybris.core.network.serialization.TestObject1" ],
      "default" : null
    }
  ]
},
{ "type" : "record",
  "name" : "TestObject1",
  "namespace" : "de.hybris.core.network.serialization",

}
will throw an exception when parsing the schema file. Another major limitation with a single schema file is that only the fields of the first schema are accessible.
An alternative is to create multiple schema definition files and write a utility class to auto-expand them into the nested form, as explained in https://www.infoq.com/articles/ApacheAvro
Still, cyclic schema definition dependencies are not allowed.
Another thing worth noting is that Avro does not allow enclosing quotes around the type reference in the "items" attribute of an array type or the "values" attribute of a map type.

Performance:
                                          Serialization time   Deserialization time   Binary data size
Java serialization                                13                    42                  647
Avro reflection datum serialization               25                   221                  158
Avro generic record datum serialization            2                    31                  230
As you can see, Avro has a great advantage in terms of data size over Java serialization. However, Avro reflection serialization/deserialization is even slower than Java's. Avro generic record serialization/deserialization yields the best performance, but a substantial amount of coding effort is needed, especially when the object structure is complex.
Tags: avro