vadasg edited this page Nov 17, 2012 · 8 revisions

The following must be installed before use:

groovy 1.8.8 or newer (Groovy 1.8.6 has a known bug in JSON parsing)
ruby-yajl

Parsing into graph representation

The Python script AutomatedParallelParser.py automates downloading and parsing the data in parallel. Review the settings at the top of this script, paying particular attention to:

# system-specific settings
sortMem = '2G'   # memory for sort; set as high as possible
threads = 2

#start and end hours to fetch from GitHubArchive
startHour = '2012-03-12-01'  #set to 'beginning' for earliest possible
endHour = '2012-11-09-23'    #set to 'now' for last possible
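The startHour/endHour settings above expand to one GitHub Archive file per hour in the range. A minimal sketch of that expansion (hour_range is a hypothetical helper for illustration, not a function from the script):

```python
from datetime import datetime, timedelta

def hour_range(start, end, fmt='%Y-%m-%d-%H'):
    """Yield every hourly timestamp from start to end, inclusive,
    in the same 'YYYY-MM-DD-HH' format used by the settings above."""
    t = datetime.strptime(start, fmt)
    stop = datetime.strptime(end, fmt)
    while t <= stop:
        yield t.strftime(fmt)
        t += timedelta(hours=1)

hours = list(hour_range('2012-03-12-01', '2012-03-12-05'))
# hours runs from '2012-03-12-01' through '2012-03-12-05' (5 entries)
```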

It will take a few hours to parse the entire archive. Start the script with

$ export LC_ALL="C"
$ python AutomatedParallelParser.py batch

Setting LC_ALL="C" makes sort use plain byte-order collation, which is faster and gives consistent results across machines.

This script downloads the GitHub Archive files in the specified time range, uncompresses them, and preformats the records. The downloaded and preformatted files are placed in the scratchDir specified in the script. If parsing is interrupted and restarted, the locally cached files are reused, saving time. To delete these files, use
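The caching behavior amounts to a download-on-miss check against scratchDir. A minimal sketch (fetch_cached and its arguments are hypothetical; the real script's internals may differ):

```python
import os
import tempfile

def fetch_cached(hour, scratch_dir, download):
    """Return the local path of the archive file for `hour`, calling
    `download` only when the file is not already cached."""
    path = os.path.join(scratch_dir, hour + '.json.gz')
    if not os.path.exists(path):
        download(hour, path)  # only fetch files we don't already have
    return path

# demo with a stub downloader that records how often it is invoked
calls = []
def fake_download(hour, path):
    calls.append(hour)
    open(path, 'wb').close()

scratch = tempfile.mkdtemp()
fetch_cached('2012-03-12-01', scratch, fake_download)
fetch_cached('2012-03-12-01', scratch, fake_download)  # served from cache
# len(calls) == 1: the second call never hit the downloader
```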

$ python AutomatedParallelParser.py clean

Importing graph into Titan

To start loading, first review the options at the beginning of ImportGitHubArchive.groovy, then run

$ gremlin -e ImportGitHubArchive.groovy <path to vertex file> <path to edge file>

Loading the data on an m1.xlarge Titan/HBase instance on Amazon EC2 takes about 10 hours. When it finishes you should see a summary like this:

Done. Summary:
18482214 events
17009381 vertices
47704461 edges
32357.591 seconds elapsed
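As a rough throughput check, the summary numbers above work out to a few hundred events and well over a thousand edges loaded per second:

```python
# figures taken from the summary above
events, vertices, edges = 18482214, 17009381, 47704461
seconds = 32357.591

print(round(events / seconds))  # ~571 events/s
print(round(edges / seconds))   # ~1474 edges/s
```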
