Apache Hadoop is an open-source framework for the distributed storage and processing of large-scale data across clusters of computers using simple programming models. Hadoop is designed to scale from a single server or node to thousands of nodes, with each node offering its own local computation and storage. Because it is designed to detect and handle failures at the application layer, it delivers highly reliable performance. Hadoop provides services for data storage, access and processing, along with access control and security features.
The Hadoop framework includes the following core modules.
- Hadoop Common – This module contains the libraries and utilities needed by the other Hadoop modules.
- Hadoop YARN – This module provides a framework for scheduling user application jobs and a platform for cluster resource management.
- Hadoop Distributed File System (HDFS) – This module provides a distributed file system that stores data across the nodes of the cluster and gives high-throughput access to application data.
- Hadoop MapReduce – This is a YARN-based system for the parallel processing of large datasets using the MapReduce programming model.
Although Hadoop has many features, a few notable ones are –
- Scalability and Performance – Hadoop is highly scalable. Its distributed approach of processing data in parallel on each local node in the cluster lets it store and process very large datasets quickly.
- Minimum Cost – Hadoop is open-source and free to use, and it runs on whatever infrastructure fits your requirements, from a single server to many servers.
- Reliability – Hadoop does not rely on hardware for fault tolerance; it has its own fault detection and handling mechanism. If a node in the cluster fails, Hadoop automatically redistributes the remaining data to be processed among the available nodes.
- Flexibility – Nodes can be added to or removed from the cluster dynamically without stopping Hadoop.
Requirements
You will only need a VPS or server running a minimal installation of CentOS 7.x with root access to install Apache Hadoop, as there are no specific minimum hardware requirements. In this tutorial we will use the root user account to execute the commands. If you are not logged in as the root user, prefix each command with sudo, or use the su command to switch to the root account, as illustrated below.
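As a quick illustration, here is how the same administrative command looks under each approach (yum update is used here purely as an example command):

# As root, the form used throughout this tutorial:
yum update -y

# As a regular user with sudo privileges, prefix the command instead:
sudo yum update -y

# Or switch to the root account for the whole session:
su -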
Installation
Hadoop is written in Java, so before installing Apache Hadoop we need to install Java. First, download the JDK RPM file using the following command.
wget --no-cookies --no-check-certificate --header "Cookie:oraclelicense=accept-securebackup-cookie" "http://download.oracle.com/otn-pub/java/jdk/8u91-b14/jdk-8u91-linux-x64.rpm"
Once the file is downloaded you can run the following command to install it.
yum localinstall jdk-8u91-linux-x64.rpm
Once Java is installed on your system, you can check its version using the following command.
java -version
It should show you the following output.
java version "1.8.0_91"
Java(TM) SE Runtime Environment (build 1.8.0_91-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.91-b14, mixed mode)
It shows that Java is successfully installed and working on your computer.
Now we will need to add a dedicated non-root user for Hadoop, which will be used to configure and run Hadoop. To add the user, run the following command.
useradd hadoop
Then you will need to change the password of the user using the following command.
passwd hadoop
You will be asked to enter the new password for your Hadoop user twice. Now you will need to configure SSH keys for the new user so that Hadoop can log in over SSH without being prompted for a password. Use the following commands to log in as the newly created user and generate an SSH key pair for it.
su - hadoop
ssh-keygen -t rsa
The first command logs you in as the newly created user, and the second generates an RSA key pair (a private and a public key). You will be asked for a filename in which to save the key; simply press Enter to use the default. You will then be asked for a passphrase; press Enter again to leave it empty. Once done, you will see output like the following.
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop/.ssh/id_rsa):
Created directory '/home/hadoop/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/hadoop/.ssh/id_rsa.
Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub.
The key fingerprint is:
22:93:c4:fd:f0:e2:81:c3:6b:7c:a0:1c:18:e6:53:34 hadoop@vps.liptanbiswas.com
The key's randomart image is:
+--[ RSA 2048]----+
| E |
| .... |
|.. .o o |
|oo.o o + |
|.o. O + S |
| ..+ B + |
| o + o |
| . . |
| |
+-----------------+
Now run the following commands to add the generated public key to the authorized keys, so that the user can log in using the private key.
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
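Hadoop's start-up scripts connect to localhost over SSH, so it is worth confirming that key-based login works before continuing. A quick way to test it (the first connection may ask you to confirm the host key):

# Should print the message without prompting for a password
ssh localhost 'echo SSH key authentication is working'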
Run the exit command to get back to your root user account.
Now download the latest binary release of Apache Hadoop from the official Apache mirrors; at the time of writing, version 2.7.2 can be fetched with the following command.
wget http://mirror.rise.ph/apache/hadoop/common/hadoop-2.7.2/hadoop-2.7.2.tar.gz
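Before extracting the archive, it is a good idea to verify that the download is intact. You can compute its checksum locally and compare it with the value published for this release on the official Apache Hadoop download page (the expected digest is not reproduced here):

# Compare this digest against the one published by Apache for hadoop-2.7.2.tar.gz
sha256sum hadoop-2.7.2.tar.gz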
Now extract the tar ball using the following command.
tar xfz hadoop-2.7.2.tar.gz
Now create a new directory to store the Hadoop files, and move all the extracted files into it using the following commands.
mkdir /opt/hadoop
mv hadoop-2.7.2/* /opt/hadoop
Now we will need to give our Hadoop user ownership of all the Hadoop files. Use the following command to do so.
chown -R hadoop:hadoop /opt/hadoop/
We can see all the Hadoop files using the following command.
ls /opt/hadoop
You will see the following output.
LICENSE.txt README.txt etc lib sbin
NOTICE.txt bin include libexec share
Now we will need to set up environment variables for both Java and Apache Hadoop. Log in to the dedicated hadoop user account using the following command.
su - hadoop
Now edit the .bash_profile file using your favorite editor; in this tutorial we are using nano, but you can use whichever you prefer. If you don't have nano installed, you can install it by running yum install nano. Append the following lines at the end of the file and save it once done.
export JAVA_HOME=/usr/java/default
export PATH=$PATH:$JAVA_HOME/bin
export CLASSPATH=.:$JAVA_HOME/jre/lib:$JAVA_HOME/lib:$JAVA_HOME/lib/tools.jar
export HADOOP_HOME=/opt/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
Initialize the variables using the following command.
source .bash_profile
Once done, you can check whether the environment variables are set correctly. Run the following command.
echo $JAVA_HOME
It should provide the following output.
/usr/java/default
Also run the following command.
echo $HADOOP_HOME
It should show you the following output.
/opt/hadoop
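Because $HADOOP_HOME/bin is now on your PATH, you can also confirm that the Hadoop binaries themselves are reachable:

hadoop version

The first line of the output should report version 2.7.2.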
Configuring Hadoop
Hadoop has many configuration files, which are located in the $HADOOP_HOME/etc/hadoop/ directory. You can view the list of configuration files using the following command.
ls $HADOOP_HOME/etc/hadoop/
You will see the following output.
capacity-scheduler.xml httpfs-env.sh mapred-env.sh
configuration.xsl httpfs-log4j.properties mapred-queues.xml.template
container-executor.cfg httpfs-signature.secret mapred-site.xml.template
core-site.xml httpfs-site.xml slaves
hadoop-env.cmd kms-acls.xml ssl-client.xml.example
hadoop-env.sh kms-env.sh ssl-server.xml.example
hadoop-metrics.properties kms-log4j.properties yarn-env.cmd
hadoop-metrics2.properties kms-site.xml yarn-env.sh
hadoop-policy.xml log4j.properties yarn-site.xml
hdfs-site.xml mapred-env.cmd
As we are installing Hadoop on a single node in pseudo-distributed mode, we will need to edit a few configuration files for Hadoop to work.
Use the following command to navigate to your Hadoop configuration directory.
cd $HADOOP_HOME/etc/hadoop/
Edit hadoop-env.sh using the following command.
nano hadoop-env.sh
Scroll down below to find these lines.
# The only required environment variable is JAVA_HOME. All others are
# optional. When running a distributed configuration it is best to
# set JAVA_HOME in this file, so that it is correctly defined on
# remote nodes.
# The java implementation to use.
export JAVA_HOME=${JAVA_HOME}
Edit these lines to set the Java path in the Hadoop configuration. It is important to set the Java path here; otherwise, Hadoop will not be able to find Java.
# The java implementation to use.
export JAVA_HOME=/usr/java/default/
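If you prefer to make this change non-interactively, a sed one-liner like the following should also work, assuming the file still contains the default export JAVA_HOME=${JAVA_HOME} line:

# Replace the JAVA_HOME line in hadoop-env.sh with the actual JDK path
sed -i 's|^export JAVA_HOME=.*|export JAVA_HOME=/usr/java/default/|' hadoop-env.sh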
The next file we will edit is core-site.xml, which specifies the host and port used by HDFS. Use your favorite editor to open the file.
nano core-site.xml
Scroll down to find these lines in the configuration.
<!-- Put site-specific property overrides in this file. -->
Now change the configuration as shown below.
<!-- Put site-specific property overrides in this file. -->

<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>
Save the file and exit the editor. Now we need to edit the second file, hdfs-site.xml, using the following command.
nano hdfs-site.xml
Scroll down to find these lines in the configuration file.
<!-- Put site-specific property overrides in this file. -->
Insert the following configuration in this file.
<!-- Put site-specific property overrides in this file. -->

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.name.dir</name>
        <value>file:///opt/hadoop/hadoopdata/namenode</value>
    </property>
    <property>
        <name>dfs.data.dir</name>
        <value>file:///opt/hadoop/hadoopdata/datanode</value>
    </property>
</configuration>
Now we will need to create the two directories that will hold the NameNode and DataNode data, using the following commands.
mkdir /opt/hadoop/hadoopdata
mkdir /opt/hadoop/hadoopdata/namenode
mkdir /opt/hadoop/hadoopdata/datanode
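If you prefer, the same directories can be created in a single step with mkdir -p; and if you run this as root rather than as the hadoop user, remember to hand ownership back afterwards:

# Create both data directories in one command
mkdir -p /opt/hadoop/hadoopdata/{namenode,datanode}
# Only needed if the directories were created as root
chown -R hadoop:hadoop /opt/hadoop/hadoopdata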
Now copy mapred-site.xml.template to mapred-site.xml using the following command.
cp mapred-site.xml.template mapred-site.xml
And then edit the file to insert the following configuration.
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
Finally, edit yarn-site.xml and find these lines.
<!-- Site specific YARN configuration properties -->
Change the configuration so that it looks as shown below.
<!-- Site specific YARN configuration properties -->

<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>
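Optionally, you can catch XML typos in the edited files before starting Hadoop. One way is xmllint, which is provided by the libxml2 package on CentOS (install it with yum install libxml2 if it is missing):

# Prints nothing if the files are well-formed, otherwise reports the parse error
xmllint --noout core-site.xml hdfs-site.xml mapred-site.xml yarn-site.xml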
We have now configured Hadoop to run on a single-node cluster. We can initialize the HDFS file system by formatting the namenode directory with the following command.
hdfs namenode -format
You will see the following output which is trimmed down to show only the important messages.
16/06/30 15:29:14 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = rackvoucher.com/198.50.190.11
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 2.7.2
STARTUP_MSG: classpath = ...
...
16/06/30 15:29:15 INFO namenode.FSImage: Allocated new BlockPoolId: BP-635153041-198.50.190.11-1467314955591
16/06/30 15:29:15 INFO common.Storage: Storage directory /opt/hadoop/hadoopdata/namenode has been successfully formatted.
16/06/30 15:29:15 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
16/06/30 15:29:15 INFO util.ExitUtil: Exiting with status 0
16/06/30 15:29:15 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at vps.liptanbiswas.com/198.50.190.11
************************************************************/
Now we can start the Hadoop cluster. Navigate to the $HADOOP_HOME/sbin directory using the following command.
cd $HADOOP_HOME/sbin
If you run the ls command here, you will see the list of all the executable scripts used to run Hadoop.
distribute-exclude.sh start-all.cmd stop-balancer.sh
hadoop-daemon.sh start-all.sh stop-dfs.cmd
hadoop-daemons.sh start-balancer.sh stop-dfs.sh
hdfs-config.cmd start-dfs.cmd stop-secure-dns.sh
hdfs-config.sh start-dfs.sh stop-yarn.cmd
httpfs.sh start-secure-dns.sh stop-yarn.sh
kms.sh start-yarn.cmd yarn-daemon.sh
mr-jobhistory-daemon.sh start-yarn.sh yarn-daemons.sh
refresh-namenodes.sh stop-all.cmd
slaves.sh stop-all.sh
You can start the Hadoop DFS services by executing the following command.
start-dfs.sh
You will see output similar to this.
16/06/30 16:06:42 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting namenodes on [localhost]
localhost: starting namenode, logging to /opt/hadoop/logs/hadoop-hadoop-namenode-vps.out
localhost: starting datanode, logging to /opt/hadoop/logs/hadoop-hadoop-datanode-vps.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /opt/hadoop/logs/hadoop-hadoop-secondarynamenode-vps.out
16/06/30 16:07:19 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Now start YARN using the following command.
start-yarn.sh
You will be shown an output similar to this.
starting yarn daemons
starting resourcemanager, logging to /opt/hadoop/logs/yarn-hadoop-resourcemanager-vps.out
localhost: starting nodemanager, logging to /opt/hadoop/logs/yarn-hadoop-nodemanager-vps.out
You can check the status of the services using the following command.
jps
You will see output similar to this.
2209 ResourceManager
2498 NodeManager
1737 NameNode
1866 DataNode
2603 Jps
2031 SecondaryNameNode
This shows that Hadoop is successfully running on the server.
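To verify that YARN and MapReduce can actually run jobs, you can submit one of the example jobs bundled with the Hadoop 2.7.2 tarball (the examples jar ships under share/hadoop/mapreduce; adjust the jar name if your version differs):

# Estimate the value of Pi using 2 map tasks and 10 samples each
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar pi 2 10

The job should complete and print an estimated value of Pi at the end.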
You can now browse the Apache Hadoop web interfaces from your browser. By default, the Hadoop NameNode web UI listens on port 50070. Go to the following address using your favorite browser.
http://Your-Server-IP:50070
For example, if the IP address of your server is 198.50.190.11, then you would browse to
http://198.50.190.11:50070
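If you want a quick check from the command line before opening a browser, you can query the NameNode web UI locally with curl (assuming curl is available, as it is on most CentOS 7 installations); any HTTP response confirms that the interface is listening:

curl -I http://localhost:50070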
You will see the NameNode status page with an overview of the cluster.
To view the Hadoop cluster and all applications, browse to the following address in your browser.
http://Your-Server-IP:8088
You will see the ResourceManager interface, where you can get information about all applications.
You can view NodeManager information at
http://Your-Server-IP:8042
You will see information about the NodeManager.
You can get information about the Secondary NameNode by browsing to the following web address.
http://Your-Server-IP:50090
You will see the Secondary NameNode status page.
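When you want to shut the cluster down, use the companion stop scripts from the same sbin directory, in the reverse order of start-up:

stop-yarn.sh
stop-dfs.sh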
Conclusion
In this tutorial we have learned how to install Apache Hadoop on a single node in pseudo-distributed mode. You can now easily install Apache Hadoop on your cloud server or dedicated server running CentOS 7.x, for both development and production purposes.