Saturday, February 20, 2016

Spark: Read/Write Sequence files

    Recently, I ran into some unusual requirements for a Spark RDD: I needed an RDD keyed by multiple fields, followed by IO operations. As usual, I started from Python, and I found that saveAsSequenceFile does not always work there. After some searching, I assume it comes down to the types written into the sequence file, which are not demanded when writing but are demanded when reading (sequenceFile). After all, if I did not specify the output types, how would the reader know what types to expect?

    After that, I had to give up my obsession with driving Spark from Python and switched to Scala, which Spark supports fully. Nonetheless, the story did not end there as happily as the fairy tale of Prince Charming and Snow White. I tried to write the keys with classes implementing Writable, which triggers the implicit conversion of the RDD to SequenceFileRDDFunctions. Unfortunately, I then noticed that both the key and the value also have to implement Serializable, which is reasonable: if the key and value are not serializable, how would Spark pass them between workers?
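
    For illustration, here is a minimal sketch (not from the original post) of what such a key class could look like, implementing both Writable and Serializable. The class name PairKey and its fields are hypothetical:

import java.io.{DataInput, DataOutput}
import org.apache.hadoop.io.Writable

class PairKey(var first: String, var second: String)
    extends Writable with Serializable {

  // Hadoop instantiates Writables reflectively, so a no-arg constructor is needed.
  def this() = this("", "")

  override def write(out: DataOutput): Unit = {
    out.writeUTF(first)
    out.writeUTF(second)
  }

  override def readFields(in: DataInput): Unit = {
    first = in.readUTF()
    second = in.readUTF()
  }
}

    If such a key is also used in shuffles (reduceByKey, groupByKey and friends), it should additionally override equals and hashCode.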

    In the end, to keep things simple, I decided to concatenate my keys into a single String. All of my key parts were Strings to begin with anyway.
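
    For example, a two-part key can be flattened into one String before saving. The separator "|", the sample data, and the output path below are just assumptions for illustration:

val multiKeyRdd = sc.parallelize(
  List((("Raccoon", "forest"), 1), (("Squirrel", "park"), 2))
)
// Collapse the (String, String) key into a single delimited String.
val flatKeyRdd = multiKeyRdd.map { case ((name, place), count) =>
  (name + "|" + place, count)
}
flatKeyRdd.saveAsSequenceFile("your/path/for/flattened/rdd")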

TL;DR - The key and value of an RDD have to implement Serializable. If you want to save the RDD as a sequence file, the key and value also have to implement Writable (or be types Spark can implicitly convert to Writable, such as String and Int).

    Well, I guess you are here to learn how to read and write a sequence file. If you meet the above requirements, you can just do:
import org.apache.hadoop.io.{IntWritable, Text}

val path: String = "your/path/for/rdd/here"
val rdd = sc.parallelize(
  List(("Raccoon", 1), ("Squirrel", 2), ("Ferret", 3))
)
// String and Int are implicitly converted to Text and IntWritable on write.
rdd.saveAsSequenceFile(path)

val readRdd = sc.sequenceFile(path, classOf[Text], classOf[IntWritable])
// Here, the element type you read back is (Text, IntWritable).
// Do not forget to transform it into (String, Int).
val usableRdd = readRdd.map { case (key, value) =>
  (key.toString, value.get)
}
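
    To sanity-check the round trip on such a small RDD, you can simply collect and print it (fine for toy data, not for anything large):

usableRdd.collect().foreach(println)
// expect (Raccoon,1), (Squirrel,2), (Ferret,3), possibly in a different order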
