Hive Input/Output with Giraph

Overview

Giraph uses the HiveIO library written by Facebook to read and write Hive data.

Make sure you read the general IO guide and the HiveIO README before continuing here; both provide helpful background.

Input

To read a Vertex-oriented table you will use HiveVertexInputFormat as your VertexInputFormat. To read an Edge-oriented table you will use HiveEdgeInputFormat as your EdgeInputFormat. You should not need to extend these classes, but rather just set them in the GiraphConfiguration and configure them as this guide describes. Under the hood these classes use HiveVertexReader and HiveEdgeReader.
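
For example, wiring the input format into a job configuration might look like the following minimal sketch (the package names are from the Giraph 1.1 era and may differ in your version):

    import org.apache.giraph.conf.GiraphConfiguration;
    import org.apache.giraph.hive.input.edge.HiveEdgeInputFormat;
    import org.apache.giraph.hive.input.vertex.HiveVertexInputFormat;

    GiraphConfiguration conf = new GiraphConfiguration();
    // Use the stock classes directly; there is no need to subclass them.
    conf.setVertexInputFormatClass(HiveVertexInputFormat.class);
    // Or, when reading an edge-oriented table:
    conf.setEdgeInputFormatClass(HiveEdgeInputFormat.class);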

You will need to implement the HiveToVertex and/or HiveToEdge interfaces to read vertices or edges, respectively. These interfaces extend Java's Iterator interface, specialized to Vertex or EdgeWithSource respectively, and they are initialized with an Iterator of HiveReadableRecords. You can only go through the records once (it is not an Iterable). How you map records to Vertices and/or Edges is up to you. If your data fits the common use case of a single row mapping directly to a single vertex/edge, using SimpleHiveToVertex or SimpleHiveToEdge as a base class will make things easier. We chose Java's Iterator interface because it is the most generic choice while still allowing existing algorithms (for example Guava's functional idioms) to be plugged in easily.
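
For the one-row-per-vertex case, a sketch built on SimpleHiveToVertex might look like this. The table layout (column 0 a BIGINT vertex id, column 1 a DOUBLE value, no edge data) is hypothetical, and the HiveReadableRecord accessors should be checked against your HiveIO version:

    import java.util.Collections;

    import org.apache.giraph.edge.Edge;
    import org.apache.giraph.hive.input.vertex.SimpleHiveToVertex;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;

    import com.facebook.hiveio.record.HiveReadableRecord;

    /** Maps one Hive row to one vertex with no edges. */
    public class RowToVertex
        extends SimpleHiveToVertex<LongWritable, DoubleWritable, NullWritable> {
      @Override
      public LongWritable getVertexId(HiveReadableRecord record) {
        return new LongWritable(record.getLong(0));  // column 0: BIGINT id
      }

      @Override
      public DoubleWritable getVertexValue(HiveReadableRecord record) {
        return new DoubleWritable(record.getDouble(1));  // column 1: DOUBLE value
      }

      @Override
      public Iterable<Edge<LongWritable, NullWritable>> getEdges(
          HiveReadableRecord record) {
        return Collections.emptyList();  // this table carries no edge data
      }
    }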

Giraph uses HiveIO's HiveInputDescription to specify which Hive table to read from. This class holds the database name, the table name, an optional partition filter, and an optional list of column names to fetch. The database name defaults to "default" if you don't set it. The partition filter can be an arbitrary boolean expression, such as date="2013-02-04" AND type="awesome". The list of columns defaults to an empty list, which fetches all columns; naming only the columns you need limits fetching and improves performance.
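
Setting one up might look like this (the table, filter, and column names are hypothetical, and the setter names should be verified against your HiveIO version):

    import com.google.common.collect.Lists;

    import com.facebook.hiveio.input.HiveInputDescription;

    HiveInputDescription inputDesc = new HiveInputDescription();
    inputDesc.setDbName("default");        // optional; "default" if unset
    inputDesc.setTableName("user_graph");  // hypothetical table name
    // Optional: restrict which partitions are read.
    inputDesc.setPartitionFilter("date=\"2013-02-04\" AND type=\"awesome\"");
    // Optional: fetch only the columns you need; an empty list grabs all.
    inputDesc.setColumns(Lists.newArrayList("source_id", "target_id"));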

Output

To write to a Hive table you will use HiveVertexOutputFormat as your VertexOutputFormat. Similar to the input case above, you should not need to extend this class, but rather just use it directly and configure it. Under the hood it uses a HiveVertexWriter.

You need to implement a VertexToHive so Giraph knows how to convert vertices to Hive records. This interface takes a Vertex to write, a HiveRecord that you should fill, and a HiveRecordSaver. You must call HiveRecordSaver.save() after filling in the record data or no writing will occur. Once you call save() the record gets serialized, so you are free to reuse the record object to write multiple rows. You can write as many records as you want for each Vertex; how vertices map to Hive rows is up to you. If your data fits the common use case of a single vertex mapping directly to a single row, using SimpleVertexToHive as a base class will make things easier.
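
For the one-row-per-vertex case, a sketch built on SimpleVertexToHive might look like this; in the Giraph source of this era the base class takes care of calling save() after fillRecord returns. The two-column table layout is hypothetical, the type parameters follow the Giraph 1.1 API, and the HiveWritableRecord accessor should be checked against your HiveIO version:

    import org.apache.giraph.graph.Vertex;
    import org.apache.giraph.hive.output.SimpleVertexToHive;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;

    import com.facebook.hiveio.record.HiveWritableRecord;

    /** Writes one row per vertex: column 0 = id, column 1 = value. */
    public class VertexToRow
        extends SimpleVertexToHive<LongWritable, DoubleWritable, NullWritable> {
      @Override
      public void fillRecord(
          Vertex<LongWritable, DoubleWritable, NullWritable> vertex,
          HiveWritableRecord record) {
        record.set(0, vertex.getId().get());     // BIGINT column
        record.set(1, vertex.getValue().get());  // DOUBLE column
      }
    }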

Giraph uses HiveIO's HiveOutputDescription to specify which Hive table to write to. This class holds the database name, the table name, and a map of partition values. The database and table names behave as described in the input section above. The map of partition values is required when writing to a partitioned table.
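
Filling one in might look like this (the table name and partition values are hypothetical; verify the setter names against your HiveIO version):

    import com.google.common.collect.ImmutableMap;

    import com.facebook.hiveio.output.HiveOutputDescription;

    HiveOutputDescription outputDesc = new HiveOutputDescription();
    outputDesc.setDbName("default");
    outputDesc.setTableName("pagerank_results");  // hypothetical table name
    // Required when the target table is partitioned:
    outputDesc.setPartitionValues(ImmutableMap.of("date", "2013-02-04"));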

Configuring your job

Giraph comes with a HiveGiraphRunner that makes using all of the classes mentioned above easier. We recommend using it instead of the regular GiraphRunner. Run it with -h or -help to see all of the options.

Hive locates the metastore using information in the HiveConf, which reads from the environment. In our experience it works best to launch your job with $HIVE_HOME/bin/hive --service jar [your jar] org.apache.giraph.hive.HiveGiraphRunner [options] rather than with $HADOOP_HOME/bin/hadoop jar ..., as the former sets up all of the necessary variables.

If you choose to wire things up yourself, you need to create and initialize the HiveInputDescription and HiveOutputDescription mentioned above. HiveIO has a notion of profiles, akin to namespaces, which allow reading from and writing to multiple tables in the same process. Giraph uses vertex_input_profile, edge_input_profile, and vertex_output_profile as its profiles. You need to register your descriptions with HiveIO using HiveApiInputFormat.setProfileInputDesc and HiveApiOutputFormat.initProfile. Additionally, you will need to set the HiveToVertex, HiveToEdge, and/or VertexToHive configuration parameters to the classes you implemented, as in the sketch below.
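
A rough sketch of that wiring follows. RowToVertex and VertexToRow are the hypothetical converters from the earlier sketches, and both the GiraphHiveConstants option names and the exact setProfileInputDesc/initProfile signatures are assumptions; HiveGiraphRunner shows the authoritative wiring:

    import org.apache.giraph.conf.GiraphConfiguration;
    import org.apache.giraph.hive.common.GiraphHiveConstants;
    import org.apache.giraph.hive.input.vertex.HiveVertexInputFormat;
    import org.apache.giraph.hive.output.HiveVertexOutputFormat;

    import com.facebook.hiveio.input.HiveApiInputFormat;
    import com.facebook.hiveio.output.HiveApiOutputFormat;

    GiraphConfiguration conf = new GiraphConfiguration();
    conf.setVertexInputFormatClass(HiveVertexInputFormat.class);
    conf.setVertexOutputFormatClass(HiveVertexOutputFormat.class);

    // Register the table descriptions under Giraph's profile names.
    // (Some HiveIO versions expect a HiveConf here rather than a plain conf.)
    HiveApiInputFormat.setProfileInputDesc(conf, inputDesc, "vertex_input_profile");
    HiveApiOutputFormat.initProfile(conf, outputDesc, "vertex_output_profile");

    // Point Giraph at the converter implementations.
    GiraphHiveConstants.HIVE_TO_VERTEX_CLASS.set(conf, RowToVertex.class);
    GiraphHiveConstants.VERTEX_TO_HIVE_CLASS.set(conf, VertexToRow.class);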

Make sure to take a look at the HiveGiraphRunner code if you are going down the DIY route.