In this example, we will detail a very simple implementation of the page rank algorithm and how input/output works in Giraph. At the end of this short tutorial, you should have a simple working piece of code that will run on a real cluster.
Giraph implements bulk synchronous parallel computing model (http://en.wikipedia.org/wiki/Bulk_synchronous_parallel) specifically for graph processing. Graphs are composed of vertices and edges. We have four types that need to be defined by the user. Note that if you intend to use primitive based objects for any of the types, they are all available (i.e. ShortWritable, IntWritable, LongWritable, FloatWritable, DoubleWritable, BooleanWritable). They are lots of other types available as well (i.e. TextWritable, BytesWritable, MapWritable, etc.). Note that if you are not using a type (i.e. vertex value or edge value), you can set them to NullWritable.
For the vertex id (I), we'll use LongWritable, which allows us to load user ids from the facebook graph. For the vertex value (V), we'll use DoubleWritable, which will be the page rank value. For the edge value (E), we'll use DoubleWritable to represent the edge weights. For the message value (M), we'll use DoubleWritable, which represents the page rank value that a vertex propagates to my neighbors.
There are two types of input vertex-centric and edge-centric. Some algorithms need vertex centric data, such as pairs of vertex ids and initial page ranks. Some algorithms only look at edges. For example, connected components can be run without any vertex values. Some have input from multiple sources (i.e. vertex values from vertex-centric input and edges from edge-centric input). You need to think about your input sources and what makes sense for you.
Giraph can convert an input source into vertex-centric or edge-centric output (i.e. HDFS files, HBase, mySQL, etc.) as long as someone writes the code to convert the source data into a graph. In Facebook, we typically load from /store to Hive.
A vertex input format can create vertex ids, vertex values, and edges. It can also create a subset of that data, such as simply vertex ids and vertex values (in that case, you'll likely want an edge input Format to load the edges).
This vertex input format loads a simple format where a line is a vertex, begining with the vertex id as a long and than destination edge ids. The initial vertex values are 0 and the edges do not have any weights.
After the Giraph computation, you need to store your data back to persistent storage (i.e. HDFS, HBase, or a Hive table).
A vertex output format can dump anything to persistent storage in a vertex-centric way. You can still dump only edges if you choose, or any other type of data. This is pretty flexible. For our example aplication, we simply want to dump the vertex id and the associated final page rank.