Giraph - Input/Output in Giraph

Overview

Any real-world Giraph application reads input data from some sort of (usually distributed) storage, runs several computation supersteps, and then writes back the output. The input data contains a representation of the graph and, often, some metadata on the vertices or edges. The output data often consists of final vertex values, but can also contain the graph itself, possibly modified.

For example, in a standard PageRank implementation, the input will contain the edges in the graph, and the output will consist of the final PageRank values for all vertices.

Giraph itself doesn't dictate a specific form of storage. Instead, the user implements a generic API which converts the preferred type of data to and from Giraph's main classes (Vertex and Edge). For example, one can build on top of Hadoop's TextInputFormat in order to read the graph from plain text files stored in HDFS. Giraph provides several examples of text-based formats (see the org.apache.giraph.io.formats package in giraph-core) and libraries to read from other types of storage (giraph-hive, giraph-hbase, etc.).

Main interfaces

Giraph's I/O builds on top of Hadoop's input/output format API. This allows us to easily incorporate existing Hadoop formats.
There are two main ways the input graph may be laid out: the directed edges may be grouped by source vertex (i.e., represented as an adjacency list), or they may appear in arbitrary order (as is often the case with relational storage, where each record corresponds to an edge). In the first case, any metadata for the vertex can be read together with its out-edges. This is achieved by implementing VertexInputFormat. In the second case, edges will be read by means of an EdgeInputFormat. If there is additional data for the vertices, it will be read separately by a VertexValueInputFormat.

To summarize, VertexInputFormat is usually used by itself, whereas EdgeInputFormat may be used in combination with VertexValueInputFormat.
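
To make the two layouts concrete, the following sketch parses the same small graph once from an adjacency-list form and once from an edge-list form, and ends up with the same in-memory structure. It is plain Java and deliberately does not use Giraph's classes; the whitespace-separated line formats and the names InputLayouts, fromAdjacencyLines and fromEdgeLines are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: plain Java stand-in, no Giraph dependency.
final class InputLayouts {

  // Adjacency-list layout (assumed format): one line per source vertex,
  // "source target1 target2 ...".
  static Map<Long, List<Long>> fromAdjacencyLines(List<String> lines) {
    Map<Long, List<Long>> graph = new TreeMap<>();
    for (String line : lines) {
      String[] tokens = line.trim().split("\\s+");
      long source = Long.parseLong(tokens[0]);
      List<Long> targets = graph.computeIfAbsent(source, k -> new ArrayList<>());
      for (int i = 1; i < tokens.length; i++) {
        targets.add(Long.parseLong(tokens[i]));
      }
    }
    return graph;
  }

  // Edge-list layout (assumed format): one line per edge, "source target",
  // with the lines in arbitrary order.
  static Map<Long, List<Long>> fromEdgeLines(List<String> lines) {
    Map<Long, List<Long>> graph = new TreeMap<>();
    for (String line : lines) {
      String[] tokens = line.trim().split("\\s+");
      long source = Long.parseLong(tokens[0]);
      long target = Long.parseLong(tokens[1]);
      graph.computeIfAbsent(source, k -> new ArrayList<>()).add(target);
    }
    return graph;
  }

  public static void main(String[] args) {
    List<String> adjacency = List.of("1 2 3", "2 3");
    List<String> edges = List.of("2 3", "1 2", "1 3");
    System.out.println(fromAdjacencyLines(adjacency));
    System.out.println(fromEdgeLines(edges));
  }
}
```

In this sketch a vertex with no out-edges can only appear in the adjacency-list form, since the edge-list form has no record for it. This is one reason vertex data may need to be read separately when the input is edge-based.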

Output can be done both on a per-vertex and a per-edge basis: a VertexOutputFormat will specify what data to write for each vertex while EdgeOutputFormat will specify what data to write for each edge. Vertex output usually consists of (some function of) the vertex value, but nothing prevents us from writing back the edges instead.
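
As a rough illustration of the difference between per-vertex and per-edge output, the sketch below renders one text line per vertex and one text line per edge. It is plain Java rather than Giraph code; the tab-separated line formats and the names OutputSketch, perVertex and perEdge are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: plain Java stand-in, no Giraph dependency.
final class OutputSketch {

  // Per-vertex output (assumed format): "id<TAB>value", one line per vertex.
  static List<String> perVertex(Map<Long, Double> vertexValues) {
    List<String> lines = new ArrayList<>();
    for (Map.Entry<Long, Double> entry : vertexValues.entrySet()) {
      lines.add(entry.getKey() + "\t" + entry.getValue());
    }
    return lines;
  }

  // Per-edge output (assumed format): "source<TAB>target<TAB>edgeValue",
  // one line per edge.
  static List<String> perEdge(Map<Long, Map<Long, Double>> outEdges) {
    List<String> lines = new ArrayList<>();
    for (Map.Entry<Long, Map<Long, Double>> source : outEdges.entrySet()) {
      for (Map.Entry<Long, Double> edge : source.getValue().entrySet()) {
        lines.add(source.getKey() + "\t" + edge.getKey() + "\t" + edge.getValue());
      }
    }
    return lines;
  }

  public static void main(String[] args) {
    // TreeMap keeps ids sorted so the output order is deterministic.
    Map<Long, Double> values = new TreeMap<>();
    values.put(1L, 0.5);
    values.put(2L, 0.25);

    Map<Long, Double> edgesOfOne = new TreeMap<>();
    edgesOfOne.put(2L, 1.0);
    Map<Long, Map<Long, Double>> outEdges = new TreeMap<>();
    outEdges.put(1L, edgesOfOne);

    perVertex(values).forEach(System.out::println);
    perEdge(outEdges).forEach(System.out::println);
  }
}
```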

Let's have a quick look at the base classes:

VertexInputFormat<I, V, E>: the getSplits() method returns a list of logical splits of the input data, given a hint provided by the infrastructure (usually the total number of input threads across all workers). The createVertexReader() method returns a VertexReader for the given split.
VertexReader<I, V, E>: this is where the user defines how to create vertices from an input split. The interface is similar to Hadoop's RecordReader: the infrastructure repeatedly calls nextVertex() and, as long as it returns true, reads the current vertex via getCurrentVertex(). When nextVertex() returns false, the input split is finished.
VertexValueInputFormat<I, V>: similar to the above, but creates a VertexValueReader instead.
VertexValueReader<I, V>: here we are not reading edges, so the methods to define are getCurrentVertexId() and getCurrentVertexValue().
EdgeInputFormat<I, E>: as usual, splits the input via getSplits() and creates an EdgeReader via createEdgeReader().
EdgeReader<I, E>: the main methods are getCurrentSourceId(), which returns the source vertex id, and getCurrentEdge(), which returns an Edge<I, E> (i.e., the target vertex id, possibly with an edge value).
VertexOutputFormat<I, V, E>: modeled on Hadoop's OutputFormat class, this class is intended for writing vertices and their edges after the computation. The createVertexWriter() method returns a VertexWriter to save the vertices. Additionally, getOutputCommitter() returns an OutputCommitter used to guarantee that the output process is correctly committed, and checkOutputSpecs() is used to check that the output setup is correct before running the computation.
VertexWriter<I, V, E>: this is where the user defines how to write vertices and possibly edges. The infrastructure just provides an initialize and a close method to deal with the initial and final part of the output. It also inherits SimpleVertexWriter#writeVertex which is the main function used to actually save the vertices.
EdgeOutputFormat<I, V, E>: modeled similarly to VertexOutputFormat, this class is intended for writing edges after the computation. The createEdgeWriter() method returns an EdgeWriter to save the edges. Additionally, getOutputCommitter() returns an OutputCommitter used to guarantee that the output process is correctly committed, and checkOutputSpecs() is used to check that the output setup is correct before running the computation.
EdgeWriter<I, V, E>: this class is similar to VertexWriter, providing initialization and closing facilities. It is intended to save edges, and the main method the user needs to implement for this purpose is writeEdge().
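
The reader and writer protocols described above can be sketched in plain Java. The classes below are simplified stand-ins, not Giraph's actual VertexReader or VertexWriter: they mirror only the method names nextVertex(), getCurrentVertex(), initialize(), writeVertex() and close(), and leave out the Hadoop context objects and generic type parameters. The line format ("id value target1 target2 ...") and the names ReaderWriterSketch, SketchVertex, LineVertexReader, LineVertexWriter and copySplit are hypothetical, and close() returns the buffered text here purely so the result is easy to inspect.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Illustrative stand-ins only: these are NOT Giraph's classes.
final class ReaderWriterSketch {

  // Hypothetical, simplified vertex: an id, a value and the target ids.
  static final class SketchVertex {
    final long id;
    final double value;
    final List<Long> targets;

    SketchVertex(long id, double value, List<Long> targets) {
      this.id = id;
      this.value = value;
      this.targets = targets;
    }
  }

  // Reads vertices from the lines of one "split".
  static final class LineVertexReader {
    private final Iterator<String> lines;
    private SketchVertex current;

    LineVertexReader(List<String> splitLines) {
      this.lines = splitLines.iterator();
    }

    // Advances to the next record; returns false once the split is exhausted.
    boolean nextVertex() {
      if (!lines.hasNext()) {
        current = null;
        return false;
      }
      String[] tokens = lines.next().trim().split("\\s+");
      List<Long> targets = new ArrayList<>();
      for (int i = 2; i < tokens.length; i++) {
        targets.add(Long.parseLong(tokens[i]));
      }
      current = new SketchVertex(
          Long.parseLong(tokens[0]), Double.parseDouble(tokens[1]), targets);
      return true;
    }

    // Returns the vertex produced by the last successful nextVertex() call.
    SketchVertex getCurrentVertex() {
      return current;
    }
  }

  // Writer with an initialize / writeVertex / close lifecycle.
  static final class LineVertexWriter {
    private final StringBuilder out = new StringBuilder();
    private boolean open;

    void initialize() {
      open = true;
    }

    void writeVertex(SketchVertex vertex) {
      if (!open) {
        throw new IllegalStateException("writer not initialized");
      }
      out.append(vertex.id).append('\t').append(vertex.value).append('\n');
    }

    // Assumption for this sketch: close() hands back the buffered output.
    String close() {
      open = false;
      return out.toString();
    }
  }

  // Driver loop in the style of the infrastructure: advance, then read.
  static String copySplit(List<String> splitLines) {
    LineVertexReader reader = new LineVertexReader(splitLines);
    LineVertexWriter writer = new LineVertexWriter();
    writer.initialize();
    while (reader.nextVertex()) {
      writer.writeVertex(reader.getCurrentVertex());
    }
    return writer.close();
  }

  public static void main(String[] args) {
    System.out.print(copySplit(List.of("1 0.5 2 3", "2 0.25 3")));
  }
}
```

The point of the driver loop is the calling order: the current vertex is only valid after nextVertex() has returned true, and the writer is initialized once before the first vertex and closed once after the last.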