Sample Software Project
Software projects involve running an analysis on a data set. You will be given access to a Hadoop cluster running MapReduce version 1 (see HadoopClusterAccess.html), and your goal is to use that cluster to run a “Big Data” computation.
One possible approach is the Terabyte Sort procedure. The components are:
- TeraGen: generate the input data
- TeraSort: sort the data using MapReduce
- TeraValidate: verify that the output is sorted
Invocation
The teragen command accepts two parameters:
- number of 100-byte rows
- the output directory
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar teragen $COUNT /user/$USER/tera-gen
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar terasort /user/$USER/tera-gen /user/$USER/tera-sort
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar teravalidate /user/$USER/tera-sort /user/$USER/tera-validate
Exercise
Run the Terabyte Sort procedure for various sizes of data:
- 1 GB
- 10 GB
- 100 GB
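Since teragen takes a row count rather than a byte size, each target size has to be converted to a number of 100-byte rows. The following is a minimal sketch of that arithmetic, assuming decimal gigabytes (1 GB = 10^9 bytes); the helper name rows_for_gb is hypothetical and not part of Hadoop.

```shell
# Hypothetical helper: convert a size in GB to the number of 100-byte rows
# that teragen expects. Assumes 1 GB = 10^9 bytes.
rows_for_gb() {
  echo $(( $1 * 1000000000 / 100 ))
}

# Print the row count for each size in the exercise.
for gb in 1 10 100; do
  echo "$gb GB -> $(rows_for_gb "$gb") rows"
done
```

The resulting value can be assigned to COUNT before running teragen, for example COUNT=$(rows_for_gb 10).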
For each component (tera{gen,sort,validate}), report the execution time and the data read and written (in GB), as well as the cumulative values.
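One way to collect the execution times is to wrap each stage in a small timing function. The sketch below is only an illustration: run_stage, DRY_RUN, and OUT_BASE are hypothetical names, the jar path is copied from the Invocation section, and the row count assumes 1 GB = 10^9 bytes. By default it only prints the commands it would run; set DRY_RUN=0 to execute them on the cluster.

```shell
#!/bin/sh
# Sketch: time each Terabyte Sort stage for a single data size.
# Jar path taken from the Invocation section; everything else is an assumption.
EXAMPLES_JAR="${HADOOP_HOME:-}/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar"
OUT_BASE="/user/${USER:-$(id -un)}"   # HDFS home directory, as in the commands above
DRY_RUN="${DRY_RUN:-1}"               # 1 = print the command only, 0 = run it

# run_stage STAGE ARGS...: run one example program and report wall-clock seconds.
run_stage() {
  stage="$1"; shift
  start=$(date +%s)
  if [ "$DRY_RUN" = "1" ]; then
    echo "hadoop jar $EXAMPLES_JAR $stage $*"
  else
    hadoop jar "$EXAMPLES_JAR" "$stage" "$@"
  fi
  end=$(date +%s)
  echo "$stage: $((end - start)) s"
}

COUNT=10000000   # 1 GB of 100-byte rows (assuming 1 GB = 10^9 bytes)
run_stage teragen "$COUNT" "$OUT_BASE/tera-gen"
run_stage terasort "$OUT_BASE/tera-gen" "$OUT_BASE/tera-sort"
run_stage teravalidate "$OUT_BASE/tera-sort" "$OUT_BASE/tera-validate"
```

The data read and written is not measured by this wrapper. It can usually be taken from the counters each job prints on completion (for instance the HDFS bytes read and bytes written entries), though the exact counter names may differ on your cluster. A MapReduce job will not overwrite an existing output directory, so the directories from a previous run must be removed (for example with hadoop fs -rm -r) before repeating the procedure at a different size.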
Other Software Projects
The following are projects from previous classes:
- Use R to analyze a particular dataset (business or sports)