I am hitting a bug when using java protocol buffer classes as the object model for RDDs in Spark jobs,
For my application, my ,proto file has properties that are repeated string. For example
message OntologyHumanName
{
repeated string family = 1;
}
From this, the 2.5.0 protoc compiler generates Java code like
private com.google.protobuf.LazyStringList family_ = com.google.protobuf.LazyStringArrayList.EMPTY;
If I run a Scala Spark job that uses the Kryo serializer I get the following error
Caused by: java.lang.NullPointerException
at com.google.protobuf.UnmodifiableLazyStringList.size(UnmodifiableLazyStringList.java:61)
at java.util.AbstractList.add(AbstractList.java:108)
at com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:134)
at com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:40)
at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:708)
at com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:125)
... 40 more
The same code works fine with spark.serializer=org.apache.spark.serializer.JavaSerializer.
My environment is CDH QuickStart 5.5 with JDK 1.8.0_60