Hadoop Build Install and Run Guide

Referenced documents

http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
The above document is a good guide to setting up a single-node cluster. Below, I have summarized the build and setup process for ARM64 systems on Red Hat based distributions.

Single-node Cluster Setup

  1. Packages required
    • # yum -y install git autoconf automake libtool gcc-c++ cmake ant vim zlib-devel openssl-devel svn cpan libssh2-devel iptables-services tree bzip2 perl-devel
  2. Install AArch64 OpenJDK 8 (latest)
    • Go to the unpacked JDK 8 folder

$ export JAVA_HOME=$PWD

$ cd jre/lib/security/

$ rm cacerts

$ ln --symbolic /etc/pki/java/cacerts .    # for CentOS 7

$ ln --symbolic /etc/ssl/certs/java/cacerts .    # for Debian Jessie
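
For reference, obtaining the unpacked JDK 8 folder used above might look like this (a sketch; the tarball name is a placeholder for whatever AArch64 OpenJDK 8 build you downloaded):

$ tar xzf openjdk8-aarch64.tar.gz    # placeholder tarball name
$ cd openjdk8-aarch64                # the unpacked JDK folder referred to above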

  3. Install Protobuf 2.6.0
    • Clone the source and install Protocol Buffers 2.6.0 using the commands below (see protobuf/README.txt for more details):

      $ git clone https://github.com/google/protobuf.git
      $ cd protobuf

      $ git checkout v2.6.0
      $ ./autogen.sh
      $ ./configure
      $ make
      $ make check
      $ make install

      Set the path in ~/.bashrc or ~/.bash_profile:

      • export LD_LIBRARY_PATH=/usr/local/lib/

      Source the env files:
      $ . ~/.bashrc
      OR
      $ . ~/.bash_profile

      Check installation
      $ which protoc

      • /usr/local/bin/protoc

      $ protoc --version

      • libprotoc 2.6.0

  4. Install Maven
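    • One way is to install from the Apache Maven binary tarball (a sketch; version 3.3.9 and the paths are assumptions, adjust to the release you download from https://maven.apache.org/download.cgi):

      $ tar xzf apache-maven-3.3.9-bin.tar.gz -C /usr/local    # assumed version; use the tarball you downloaded
      $ export PATH=$PATH:/usr/local/apache-maven-3.3.9/bin
      $ mvn -version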
  5. Download the source and build Hadoop 3.0.0
    1. Check out Hadoop 3.0.0 from https://github.com/apache/hadoop

    2. Change hadoop-project/pom.xml to reflect the protobuf version you are using
      • $ vi hadoop-project/pom.xml

    3. Search for the <protobuf.version> tag using '/' and change the version to 2.6.0
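
      Alternatively, a one-line edit (a sketch; it assumes the <protobuf.version> tag occurs only once in hadoop-project/pom.xml):
      • $ sed -i 's|<protobuf.version>.*</protobuf.version>|<protobuf.version>2.6.0</protobuf.version>|' hadoop-project/pom.xml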

    4. Build:
      • $ mvn clean package -Pdist,native -DskipTests -Dtar -Dmaven.javadoc.skip=true -X -e

    5. Binaries will be in hadoop-dist/target
      • $ cp hadoop-dist/target/hadoop-*.tar.gz /usr/local/
        $ cd /usr/local
        $ tar xzf hadoop-*.tar.gz
        $ mv hadoop-3.0.0-SNAPSHOT hadoop
        $ cd hadoop

    6. Add environment variables and source the bash_profile or bashrc file:
      • export HADOOP_HOME=/usr/local/hadoop
        export HADOOP_PREFIX=$HADOOP_HOME
        export PATH=$PATH:$HADOOP_HOME/bin/
        export HADOOP_MAPRED_HOME=$HADOOP_HOME
        export YARN_HOME=$HADOOP_HOME
        # Some convenient aliases and functions for running Hadoop-related commands

        unalias fs &> /dev/null
        alias fs="bin/hadoop fs"
        unalias hls &> /dev/null
        alias hls="fs -ls"

  6. Create partitions on disks to be used for HDFS
    1. # fdisk /dev/sdb

      Press n to create a new partition (/dev/sdb1 in this case)
      Press w to save changes
      To print the partitions, run 'fdisk -l'

    2. Then make the filesystem:
      • # mkfs.ext4 /dev/sdb1

    3. Then mount this new partition to temp location
      • # mkdir -p /var/local/dev/sdb1
        # mount /dev/sdb1 /var/local/dev/sdb1
        # df -h

    4. To unmount the partition, use the following command. This is not needed in setup.
      • # umount /dev/sdb1

    5. Make sure the new partition gets mounted at boot time
      • Have a look at /etc/fstab:
        • Create an entry for sdb1 in /etc/fstab that mounts it to /var/local/dev/sdb1 (an example entry is sketched below)

          Reboot the machine to verify that the mount comes up
          Check with df -h
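
          An example entry might look like the following (a sketch; device, mount point, and filesystem are taken from the steps above, the options are typical defaults):
          # assumed device and mount point from the steps above
          /dev/sdb1    /var/local/dev/sdb1    ext4    defaults    0 2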

    6. The same process can be used to add additional disks
  7. Add Hadoop User and Configure SSH
    1. Add the user hdadmin and the group hadoop to your local machine.
      • # groupadd hadoop
        # adduser -g hadoop hdadmin

    2. Change ownership of hadoop dir to the newly created user:group
      • # cd /usr/local
        # chown -R hdadmin:hadoop hadoop

    3. Change hdadmin password
      • # passwd hdadmin

    4. Log on to hdadmin user and configure SSH access
      • # su - hdadmin
        # ssh-keygen -t rsa -P ""
        # cat /home/hdadmin/.ssh/id_rsa.pub >> /home/hdadmin/.ssh/authorized_keys
        # ssh localhost    (make sure you get ssh access without entering a password)
        # exit

      If that didn't work, run the following and then retry the steps above:
      • # chmod -R 700 /home/hdadmin/.ssh/
    5. Make a symlink for easy access to the examples jar file
      • # ln -s share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.0-SNAPSHOT.jar hadoop-examples.jar
  8. Configuration files
    1. HDFS Site: $HADOOP_HOME/etc/hadoop/hdfs-site.xml
      Note: You may create directories in your disk partitions as follows (log in as hdadmin before you create the dirs; the ownership should be hdadmin:hadoop). A command sketch follows the tree below.
      • /mnt/disks/01/
        ├── hadoop
        │   └── hdfs
        │       ├── data
        │       └── namenode
        ├── local
        └── tmpdir
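
      A minimal sketch for creating this layout with the required ownership (paths taken from the tree above):
      • # mkdir -p /mnt/disks/01/hadoop/hdfs/data /mnt/disks/01/hadoop/hdfs/namenode /mnt/disks/01/local /mnt/disks/01/tmpdir
        # chown -R hdadmin:hadoop /mnt/disks/01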

      Configuration:
      • <configuration>
          <property>
            <name>dfs.replication</name>
            <value>1</value>
          </property>
          <property>
            <name>dfs.name.dir</name>
            <value>/mnt/disks/01/hadoop/hdfs/namenode</value>
          </property>
          <property>
            <description>Disk paths given in comma separated values format</description>
            <name>dfs.data.dir</name>
            <value>/mnt/disks/01/hadoop/hdfs/data</value>
          </property>
        </configuration>

    2. Core site: $HADOOP_HOME/etc/hadoop/core-site.xml
      • <configuration>
          <property>
            <name>fs.defaultFS</name>
            <value>hdfs://localhost:9000</value>
            <description>can specify the full domain name if required</description>
          </property>
        </configuration>

    3. Yarn site: $HADOOP_HOME/etc/hadoop/yarn-site.xml
      • <configuration>
          <!-- Site specific YARN configuration properties -->
          <property>
            <name>yarn.nodemanager.aux-services</name>
            <value>mapreduce_shuffle</value>
          </property>
          <property>
            <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
            <value>org.apache.hadoop.mapred.ShuffleHandler</value>
          </property>
          <property>
            <name>yarn.nodemanager.local-dirs</name>
            <value></value>
          </property>
        </configuration>

    4. Mapred site: copy $HADOOP_HOME/etc/hadoop/mapred-site.xml.template to mapred-site.xml in the same directory (a copy command is sketched after the configuration below)
      • <configuration>
          <property>
            <name>mapreduce.framework.name</name>
            <value>yarn</value>
          </property>
          <property>
            <name>mapreduce.map.memory.mb</name>
            <value>1024</value>
          </property>
          <property>
            <name>mapreduce.reduce.memory.mb</name>
            <value>2048</value>
          </property>
          <property>
            <name>mapreduce.map.java.opts</name>
            <value>-Xmx768m</value>
          </property>
          <property>
            <name>mapreduce.reduce.java.opts</name>
            <value>-Xmx1536m</value>
          </property>
        </configuration>
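
      The copy step referenced above might look like this (a sketch; skip it if your build already ships a mapred-site.xml):
      • $ cp $HADOOP_HOME/etc/hadoop/mapred-site.xml.template $HADOOP_HOME/etc/hadoop/mapred-site.xml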

    5. JAVA_HOME: Edit the $HADOOP_HOME/etc/hadoop/hadoop-env.sh file to set JAVA_HOME
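      A sketch of the relevant line in hadoop-env.sh (the JDK path here is an assumption; point it at your unpacked JDK):
      • export JAVA_HOME=/usr/local/jdk8    # assumed path to the unpacked JDK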
  9. Start Hadoop
      a. Format the Namenode and start DFS (Namenode, Secondary Namenode and Datanodes)
      • # $HADOOP_HOME/bin/hadoop namenode -format
        # $HADOOP_HOME/sbin/start-dfs.sh
      b. Start YARN (Resource Manager and Node Manager)
      • # $HADOOP_HOME/sbin/start-yarn.sh
      c. Check the logs of each started process for errors: $HADOOP_HOME/logs
      d. Check that all processes are running using the 'jps' command (a sample is shown at the end of this section)
      e. Check the File System:
      • # $HADOOP_HOME/bin/hadoop fs -ls /
        Found 3 items
        drwxrwxrwx   - root supergroup          0 2015-05-15 12:55 /mapred
        drwxrwxrwx   - root supergroup          0 2015-05-15 12:56 /tmp
        drwxrwxrwx   - root supergroup          0 2015-05-15 12:56 /user
      f. Check the HDFS status
      • sudo -u root bin/hdfs dfsadmin -report

      g. Check the web interface at: http://<ip-of-hadoop-system>:50070

      • Check Disk space of the datanode on the datanodes tab
      h. Leave safemode (if dfs is in safe mode)
      • # bin/hadoop dfsadmin -safemode leave
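
      If DFS and YARN started cleanly, running 'jps' as hdadmin should list daemons like the following (the PIDs are illustrative placeholders, not captured output):
      • $ jps
        2401 NameNode
        2534 DataNode
        2718 SecondaryNameNode
        2903 ResourceManager
        3012 NodeManager
        3240 Jps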
  10. Running Jobs
      a. Run a sample job first to calculate the value of Pi:
      • # $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.0-SNAPSHOT.jar pi -Dmapreduce.clientfactory.class.name=org.apache.hadoop.mapred.YarnClientFactory -libjars $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.0.0-SNAPSHOT.jar 16 10000

      b. Run TestDFSIO
      • Write:
        # bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.0.0-SNAPSHOT-tests.jar TestDFSIO -write -nrFiles 1 -fileSize 10GB
        Read:
        # bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.0.0-SNAPSHOT-tests.jar TestDFSIO -read -nrFiles 1 -fileSize 10GB

        Check that the file was created and that its size is correct (see the listing sketch below):

        Verify Logs:

        • $HADOOP_HOME/TestDFSIO_result.log
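
        To list the files TestDFSIO wrote, one can check the benchmark's default output directory on HDFS (a sketch):

        • # bin/hadoop fs -ls -h /benchmarks/TestDFSIO/io_data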

      c. Run TeraSort:

        i. TeraGen

        • # $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.0-SNAPSHOT.jar teragen -Ddfs.blocksize=512M -Dio.file.buffer.size=131072 -Dmapreduce.map.java.opts=-Xmx1536m -Dmapreduce.map.memory.mb=4096 -Dmapreduce.task.io.sort.mb=256 -Dyarn.app.mapreduce.am.resource.mb=4096 -Dmapreduce.job.maps=64 1000000000 teragen-flags-100GB-input

        ii. TeraSort

        • # $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.0-SNAPSHOT.jar terasort -Ddfs.blocksize=512M -Dio.file.buffer.size=131072 -Dmapreduce.map.java.opts=-Xmx1536m -Dmapreduce.map.memory.mb=4096 -Dmapreduce.map.output.compress=true -Dmapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.Lz4Codec -Dmapreduce.reduce.java.opts=-Xmx1536m -Dmapreduce.reduce.memory.mb=2048 -Dmapreduce.task.io.sort.factor=100 -Dmapreduce.task.io.sort.mb=768 -Dyarn.app.mapreduce.am.resource.mb=4096 -Dmapred.reduce.tasks=100 -Dmapreduce.terasort.output.replication=1 teragen-flags-100GB-input teragen-flags-100GB-sorted

        iii. TeraValidate

        • # $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.0-SNAPSHOT.jar teravalidate -Ddfs.blocksize=512M -Dio.file.buffer.size=131072 -Dmapreduce.map.java.opts=-Xmx1536m -Dmapreduce.map.memory.mb=4096 -Dmapreduce.reduce.java.opts=-Xmx1536m -Dmapreduce.reduce.memory.mb=2048 -Dmapreduce.task.io.sort.mb=256 -Dyarn.app.mapreduce.am.resource.mb=4096 -Dmapred.reduce.tasks=1 teragen-flags-100GB-sorted teragen-flags-100GB-validated

Multi-node Cluster Setup

  1. On all nodes, make sure Hadoop is set up to run in the single-node config. Then make the following changes.
  2. Update Hosts file on each machine of the cluster. Edit /etc/hosts and add entries like the following as per your cluster config:
    • 10.0.2.123 master
      10.0.2.1 slave-1
      10.0.2.2 slave-2
      10.0.2.4 slave-3
  3. From the Master machine, copy the public id to all slave nodes:

    ssh-copy-id -i /home/hdadmin/.ssh/id_rsa.pub hdadmin@slave-1
    ssh-copy-id -i /home/hdadmin/.ssh/id_rsa.pub hdadmin@slave-2
    ssh-copy-id -i /home/hdadmin/.ssh/id_rsa.pub hdadmin@slave-3

  4. On all machines, add the following line to $HADOOP_HOME/etc/hadoop/masters:
    • master
    If the file is not present, create it.
  5. On all machines, add the following line to $HADOOP_HOME/etc/hadoop/slaves:
    • master
      slave-1
      slave-2
      slave-3

  6. On all machines, change configuration as follows:
      a. Change the following property in $HADOOP_HOME/etc/hadoop/core-site.xml
      • <property>
          <name>fs.defaultFS</name>
          <value>hdfs://master:54310</value>
          <description>can specify the full domain name if required</description>
        </property>

      b. Add the following to $HADOOP_HOME/etc/hadoop/mapred-site.xml
      • <property>
          <name>mapred.job.tracker</name>
          <value>master:54311</value>
        </property>

      c. Add the following to $HADOOP_HOME/etc/hadoop/yarn-site.xml
      • <property>
          <name>yarn.resourcemanager.scheduler.address</name>
          <value>master:8030</value>
        </property>
        <property>
          <name>yarn.resourcemanager.address</name>
          <value>master:8032</value>
        </property>
        <property>
          <name>yarn.resourcemanager.webapp.address</name>
          <value>master:8088</value>
        </property>
        <property>
          <name>yarn.resourcemanager.resource-tracker.address</name>
          <value>master:8031</value>
        </property>
        <property>
          <name>yarn.resourcemanager.admin.address</name>
          <value>master:8033</value>
        </property>

      d. In $HADOOP_HOME/etc/hadoop/hdfs-site.xml, set dfs.replication to the number of nodes or 3, whichever is smaller (3 is the default)
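      For the four-node cluster described above (master plus three slaves), the property might look like:
      • <property>
          <name>dfs.replication</name>
          <value>3</value>
        </property>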
  7. Format namenode on master and clean all datanodes
      a. On the Master,
      • # $HADOOP_HOME/bin/hadoop namenode -format
      b. On all nodes,
      • # rm -rf /var/local/dev/sd*/hadoop/hdfs/data/*
  8. On MASTER,
    • # $HADOOP_HOME/sbin/start-dfs.sh
      # $HADOOP_HOME/sbin/start-yarn.sh

  9. Confirm all nodes are up and running
    a. Check the web UI at http://master:50070 and see whether all the datanodes are shown
    b. Check whether there are any unhealthy nodes at http://master:8042/
    c. Make sure the logs under $HADOOP_HOME/logs/ have no errors
    d. Try out the sample workloads given in the single-node setup notes above
