ODPi Hadoop Cluster Setup Guide


To build and set up ODPi Hadoop, follow the instructions on the following pages: Build Guide and Setup Guide

Setup steps are summarized below:

## Copy all RPMs from the ODPi build directory to a common directory, say ~/odpi_rpms
cd ~/odpi_rpms
yum install -y createrepo
createrepo --database .
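# note: the split globs below skip names beginning with 'r', presumably to avoid the repodata/ directory created above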
yum install -y [a-q]*
yum install -y [s-y]*

## Install the Linaro OpenJDK 1.7 1510 release
cd ~
wget http://openjdk.linaro.org/releases/jdk7-server-release-1510.tar.xz
tar -xf jdk7-server-release-1510.tar.xz
mv jdk7-server-release-1510 openjdk1.7_1510

vim ~/.bashrc
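# note: adjust the JAVA_HOME path below to wherever you extracted the JDK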
export JAVA_HOME=/home/nbhoyar/openjdk1.7_1510
export PATH=${JAVA_HOME}/bin:${PATH}

export HADOOP_HOME=/usr/lib/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export HADOOP_PREFIX=$HADOOP_HOME
export PATH=/usr/lib/hadoop/libexec:/etc/hadoop/conf:$HADOOP_HOME/bin/:$PATH
export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=/usr/lib/hadoop-hdfs
export YARN_HOME=/usr/lib/hadoop-yarn
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_PREFIX/lib/native"
. ~/.bashrc

## Check whether Hadoop and Java were installed successfully
hadoop version
java -version

See this document for help on creating partitions and filesystems and mounting them: Hadoop Install Guide

In this example, I have mounted all my drives at /var/local/dev/sd*2, where sd*2 will be sdb2, sdc2, sdd2 and so on.
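
As a minimal sketch (assuming ext4 and the device names used in this example), each data partition can be formatted and mounted like this:

sudo mkfs.ext4 /dev/sdb2
sudo mkdir -p /var/local/dev/sdb2
sudo mount /dev/sdb2 /var/local/dev/sdb2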

— Create this tree on all data drives after mounting:
├── hadoop
│   └── hdfs
│       ├── data
│       └── namenode
├── lost+found
├── root.root-local
└── root.root-tmpdir
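
One way to create this tree on every mounted drive (a sketch assuming the mount points above; lost+found is created by mkfs itself):

for d in /var/local/dev/sd?2; do
    sudo mkdir -p $d/hadoop/hdfs/data $d/hadoop/hdfs/namenode $d/root.root-local $d/root.root-tmpdir
done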


## Add hduser and change permissions and ownership of hadoop directories
groupadd hadoop
adduser -g hadoop hduser
groupadd supergroup
usermod -a -G supergroup hduser
usermod -a -G wheel hduser
passwd hduser 

(set the password to 'hduser')


rm -rf /var/local/dev/sd*/hadoop/hdfs/data/*
chown -R hduser:hadoop /var/local/dev/sd*/
chown -R hdfs:hdfs /var/local/dev/sd*/hadoop/hdfs
mkdir -p /var/local/dev/sdb2/yarn.yarn-local
chown -R yarn:yarn /var/local/dev/sdb2/yarn.yarn-local

mkdir -p /var/local/dev/sdc2/hduser.hadoop-tmpdir
chown -R hduser:hadoop /var/local/dev/sdc2/hduser.hadoop-tmpdir

chown hduser:hadoop /usr/lib/hadoop
chmod 750 /usr/lib/hadoop
mkdir -p /home/hduser/hadoop/tmp
chown -R hduser:hadoop /home/hduser/hadoop
chmod 750 /home/hduser/hadoop
sudo mkdir -p /app/hadoop/tmp
sudo chown hduser:hadoop /app/hadoop/tmp
sudo chmod 750 /app/hadoop/tmp


## Log in to hduser and configure password-less SSH access
su - hduser 
ssh-keygen -t rsa -P ""

cat /home/hduser/.ssh/id_rsa.pub >> /home/hduser/.ssh/authorized_keys
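
If sshd rejects the key, directory permissions are a common cause; tightening them usually resolves it:

chmod 700 /home/hduser/.ssh
chmod 600 /home/hduser/.ssh/authorized_keys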

## Test SSH access
ssh localhost
## Exit 
exit

## Change system config and Hadoop configuration

sudo vim /etc/sysctl.conf
— add the lines below and save
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
— now restart the system
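— to apply the settings without a full reboot, they can also be loaded directly:
sudo sysctl -p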

sudo vim /etc/hadoop/conf/hadoop-env.sh
— uncomment the line: export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true

(The ~/.bashrc environment variables shown earlier apply unchanged here; re-source the file with . ~/.bashrc if needed.)

cd /etc/hadoop/conf

sudo vim core-site.xml
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/var/local/dev/sdc2/hduser.hadoop-tmpdir</value>
    <description>A base for other temporary directories.</description>
  </property>

  <property>
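    <!-- fs.default.name is the old Hadoop 1 key; Hadoop 2 prefers fs.defaultFS, though the deprecated name still works -->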
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
    <description>The name of the default file system.  A URI whose
     scheme and authority determine the FileSystem implementation.  The
     uri's scheme determines the config property (fs.SCHEME.impl) naming
     the FileSystem implementation class.  The uri's authority is used to
     determine the host, port, etc. for a filesystem.</description>
  </property>

sudo vim mapred-site.xml
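  <!-- note: each -Xmx heap below is kept smaller than its container size so the JVM fits within its YARN allocation;
       the containers must also fit under yarn.scheduler.maximum-allocation-mb set in yarn-site.xml -->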
  <property>
    <name>mapreduce.map.memory.mb</name>
    <value>7168</value>
  </property>
  <property>
    <name>mapreduce.reduce.memory.mb</name>
    <value>9216</value>
  </property>
  <property>
    <name>mapreduce.map.java.opts</name>
    <value>-Xmx4g</value>
  </property>
  <property>
    <name>mapreduce.reduce.java.opts</name>
    <value>-Xmx8g</value>
  </property>
  <property>
    <name>mapreduce.map.cpu.vcores</name>
    <value>1</value>
  </property>
  <property>
    <name>mapreduce.task.io.sort.mb</name>
    <value>2000</value>
  </property>
  <property>
    <name>mapreduce.job.reduces</name>
    <value>4</value>
  </property>

sudo vim hdfs-site.xml
— Change the value of the following property
  <property>
     <name>dfs.datanode.data.dir</name>
     <value>file:///var/local/dev/sdb2/hadoop/hdfs/data,file:///var/local/dev/sdc2/hadoop/hdfs/data,file:///var/local/dev/sdd2/hadoop/hdfs/data,file:///var/local/dev/sde2/hadoop/hdfs/data,file:///var/local/dev/sdf2/hadoop/hdfs/data</value>
  </property>
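  <!-- the hadoop/hdfs/namenode directories created earlier can be wired up the same way via dfs.namenode.name.dir -->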

sudo vim yarn-site.xml
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>2048</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>65000</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>65000</value>
  </property>

— Change the value of the following property
  <property>
    <description>List of directories to store localized files in.</description>
    <name>yarn.nodemanager.local-dirs</name>
    <value>/var/local/dev/sdb2/yarn.yarn-local</value>
  </property>
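  <!-- a comma-separated list of directories can be given here to spread NodeManager local files across several drives -->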

## Format the Hadoop namenode and start all processes to check that everything works
sudo /etc/init.d/hadoop-hdfs-namenode init
for i in hadoop-hdfs-namenode hadoop-hdfs-datanode ; do sudo service $i start ; done
sudo /usr/lib/hadoop/libexec/init-hdfs.sh
sudo /etc/init.d/hadoop-yarn-resourcemanager start
sudo /etc/init.d/hadoop-yarn-nodemanager start
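
To confirm the daemons came up (assuming jps from the JDK is on the PATH):

jps
hdfs dfsadmin -report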

## Multi node setup

Follow the guide for multi node setup given here: Multi Node Setup

After the master node is started, go to each slave node and start the datanode and nodemanager there, as sketched below.
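
A sketch of the per-slave start commands, assuming the same service names as on the master:

for i in hadoop-hdfs-datanode hadoop-yarn-nodemanager; do sudo service $i start; done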

## Run TeraSort

Note: for TeraGen, -Dmapred.map.tasks should equal the number of cores in the cluster. Also, change the mapreduce.job.reduces property in mapred-site.xml to suit your cluster. (TeraGen records are 100 bytes each, so the 1024000000 rows below amount to roughly 102 GB despite the 1TB labels in the dataset names; 10000000000 rows would give a full 1 TB.)

/usr/bin/time hadoop jar $HADOOP_MAPRED_HOME/hadoop-mapreduce-examples-2.6.0.jar teragen -Ddfs.blocksize=512M -Dmapred.map.tasks=32 1024000000 teragen-flags-1TB-input

/usr/bin/time hadoop jar $HADOOP_MAPRED_HOME/hadoop-mapreduce-examples-2.6.0.jar terasort -Ddfs.blocksize=512M -Dio.file.buffer.size=131072 -Dmapreduce.map.output.compress=true -Dmapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.Lz4Codec -Dmapreduce.terasort.output.replication=1 teragen-flags-1TB-input teragen-flags-1TB-sorted

/usr/bin/time hadoop jar $HADOOP_MAPRED_HOME/hadoop-mapreduce-examples-2.6.0.jar teravalidate -Ddfs.blocksize=512M -Dio.file.buffer.size=131072  -Dmapred.reduce.tasks=1 teragen-flags-1TB-sorted teragen-flags-1TB-validated
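
Any errors TeraValidate finds are written to its output directory, which can be inspected with:

hadoop fs -cat teragen-flags-1TB-validated/*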

Extra commands if needed:

Restart all processes (can also be used to start/stop all):
for i in hadoop-hdfs-namenode hadoop-hdfs-datanode hadoop-yarn-resourcemanager hadoop-yarn-nodemanager; do sudo service $i restart; done

Turn safe mode off:

hdfs dfsadmin -safemode leave
