Hadoop/Mahout: Installation and first steps

 

Installing Mahout

All files are installed and run from your local host and file system. To get started, follow these basic steps:

  • Download the latest stable Mahout version and unzip it (e.g. to ~/RAID/mahout)
    Note: the Mahout book uses version 0.5
  • Create the MAHOUT_HOME and HADOOP_CONF_DIR environment variables by adding this to your .bashrc file (then reload it using "source ~/.bashrc")
      export MAHOUT_HOME=~/RAID/mahout
      export HADOOP_CONF_DIR=$HADOOP_INSTALL/conf
  • Add the Mahout bin directory to your path by adding this to your .bashrc:
     export PATH=$PATH:$MAHOUT_HOME/bin
  • Add the Mahout lib directory to your Java classpath by adding this to your .bashrc:
     export CLASSPATH=$MAHOUT_HOME/lib:$CLASSPATH
  • Add the Mahout JARs to the Hadoop classpath. In "$HADOOP_HOME/conf/hadoop-env.sh" uncomment the line starting with "export HADOOP_CLASSPATH=" and add (after the "="):
     $MAHOUT_HOME/xxx.jar 
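
After reloading .bashrc it can help to confirm that the variables are actually in place. The following is a minimal sketch, not part of Mahout or Hadoop: check_mahout_env is a hypothetical helper name, and it only checks that MAHOUT_HOME and HADOOP_CONF_DIR are set to existing directories and that $MAHOUT_HOME/bin is on the PATH. It does not inspect CLASSPATH or hadoop-env.sh.

```shell
# check_mahout_env is a hypothetical helper, not part of Mahout or Hadoop.
# It reports whether the variables set in .bashrc point at real directories.
check_mahout_env() {
    status=0
    for name in MAHOUT_HOME HADOOP_CONF_DIR; do
        # Look up the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "$name is not set"
            status=1
        elif [ ! -d "$value" ]; then
            echo "$name points to a missing directory: $value"
            status=1
        else
            echo "$name ok: $value"
        fi
    done
    # The Mahout bin directory should be one of the PATH entries.
    case ":$PATH:" in
        *":${MAHOUT_HOME:-unset}/bin:"*) echo "PATH ok" ;;
        *) echo "PATH does not contain \$MAHOUT_HOME/bin"; status=1 ;;
    esac
    return "$status"
}
```

Running check_mahout_env in a new shell prints one line per check and returns a non-zero status if any of them fails.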

 

Testing Mahout

These steps are based on the "Quick Start" tutorial on the Mahout web page and on the Mahout clustering example.

  • Start Hadoop if it is not already running
  • Download some data to ~/RAID/test_data, e.g. the synthetic_control.data set from the UCI Machine Learning Repository
  • Copy the data to HDFS
      hadoop fs -mkdir testdata
      hadoop fs -put synthetic_control.data testdata
  • Perform clustering
      $MAHOUT_HOME/bin/mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
  • The clustering output is in SequenceFile format, which is not human-readable, so use the clusterdump utility provided by Mahout to convert it to text. (Hint: the utility reads data from HDFS but writes to the local file system.)
      $MAHOUT_HOME/bin/mahout clusterdump --input output/clusters-10-final --pointsDir output/clusteredPoints --output clusteranalyze.txt
  • Optional: copy the output directory from HDFS back to your local file system
      hadoop fs -get output ~/RAID/mahout.results
  • Test Mahout's visualization example (the Java sources are in the Mahout examples package org.apache.mahout.clustering.display)
      For the final result: $MAHOUT_HOME/bin/mahout org.apache.mahout.clustering.display.DisplayClustering
      For intermediate steps: $MAHOUT_HOME/bin/mahout org.apache.mahout.clustering.display.DisplayKMeans
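
The copy, clustering and clusterdump steps above can be collected into one small script. This is only a sketch: the function names run and kmeans_smoke_test and the DRY_RUN switch are assumptions made for illustration, while the hadoop and mahout command lines are the ones listed above. With DRY_RUN=1 the commands are printed instead of executed, which allows a check of the sequence before touching HDFS.

```shell
# run: hypothetical wrapper; prints the command when DRY_RUN=1,
# otherwise executes it unchanged.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# kmeans_smoke_test: hypothetical name; chains the steps from this page
# and stops at the first command that fails.
kmeans_smoke_test() {
    data_file="${1:-synthetic_control.data}"
    run hadoop fs -mkdir testdata &&
    run hadoop fs -put "$data_file" testdata &&
    run "$MAHOUT_HOME/bin/mahout" org.apache.mahout.clustering.syntheticcontrol.kmeans.Job &&
    run "$MAHOUT_HOME/bin/mahout" clusterdump --input output/clusters-10-final --pointsDir output/clusteredPoints --output clusteranalyze.txt
}
```

For example, "DRY_RUN=1 kmeans_smoke_test" lists the four commands, and "kmeans_smoke_test ~/RAID/test_data/synthetic_control.data" runs them against a running Hadoop instance.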

Last updated

Wednesday, 29 May 2013