Apache Hadoop is an open-source framework for the distributed storage and processing of large-scale data across clusters of computers using simple programming models. Hadoop is designed to scale from a single server or node to thousands of nodes, with each node offering its own local computation and storage. Because it is designed to detect and handle failures at the application layer, it delivers highly reliable performance. Hadoop provides services for data storage, access and processing, along with access control and security features.
The Hadoop framework includes the following core modules.
- Hadoop Common – This module contains the libraries and utilities needed by the other Hadoop modules.
- Hadoop YARN – This module provides a framework for scheduling user application jobs and a platform for cluster resource management.
- Hadoop Distributed File System (HDFS) – This module provides a distributed file system that stores data across the nodes of the cluster and gives high-throughput access to application data.
- Hadoop MapReduce – This is a YARN-based system for the parallel processing of large datasets using the MapReduce programming model.
Although Hadoop has many features, a few notable ones are –
- Scalability and Performance – Hadoop is highly scalable. Its distributed approach of processing data in parallel on each local node in the cluster lets it store and process very large datasets quickly.
- Minimum Cost – Hadoop is open-source and free to use, and it runs on whatever infrastructure fits your requirements, from a single server to many servers.
- Reliability – Hadoop does not rely on hardware for fault tolerance; it has its own fault detection and handling mechanism. If a node in the cluster fails, Hadoop automatically redistributes the remaining data to be processed among the available nodes.
- Flexibility – Nodes can be added to or removed from the cluster dynamically without stopping Hadoop.
Requirements
You will only need a VPS or server running a minimal installation of CentOS 7.x with root access to install Apache Hadoop, as there are no specific minimum hardware requirements. In this tutorial we will use the root user account to execute the commands. If you are not logged in as the root user, prefix each command with sudo, or use the su command to switch to the root account, as illustrated below.
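As a quick illustration, here is how the same administrative command looks under each approach (yum update is used here purely as an example command):

# As root, the form used throughout this tutorial:
yum update -y

# As a regular user with sudo privileges, prefix the command instead:
sudo yum update -y

# Or switch to the root account for the whole session:
su -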
Installation
Hadoop is written in Java, so before installing Apache Hadoop we need to install Java. First, download the JDK RPM file using the following command.
wget --no-cookies --no-check-certificate --header "Cookie:oraclelicense=accept-securebackup-cookie" "http://download.oracle.com/otn-pub/java/jdk/8u91-b14/jdk-8u91-linux-x64.rpm"
Once the file is downloaded you can run the following command to install it.
yum localinstall jdk-8u91-linux-x64.rpm
Once Java is installed on your system, you can check its version using the following command.
java -version
It should show you the following output.
java version "1.8.0_91"
Java(TM) SE Runtime Environment (build 1.8.0_91-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.91-b14, mixed mode)
It shows that Java is successfully installed and working on your computer.
Now we will need to add a dedicated non-root user for Hadoop, which will be used to configure and run Hadoop. To add the user, run the following command.
useradd hadoop
Then you will need to change the password of the user using the following command.
passwd hadoop
You will be asked to enter the new password for your Hadoop user twice. Now you will need to configure SSH keys for the new user so that Hadoop can log in over SSH without being prompted for a password. Use the following commands to log in as the newly created user and generate an SSH key pair for it.
su - hadoop
ssh-keygen -t rsa
The first command logs you in as the newly created user, and the second generates an RSA key pair (a private and a public key). You will be asked for a filename in which to save the key; simply press Enter to use the default. You will then be asked for a passphrase; press Enter again to leave it empty. Once done, you will see output like the following.
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop/.ssh/id_rsa):
Created directory '/home/hadoop/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/hadoop/.ssh/id_rsa.
Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub.
The key fingerprint is:
22:93:c4:fd:f0:e2:81:c3:6b:7c:a0:1c:18:e6:53:34 hadoop@vps.liptanbiswas.com
The key's randomart image is:
+--[ RSA 2048]----+
| E |
| .... |
|.. .o o |
|oo.o o + |
|.o. O + S |
| ..+ B + |
| o + o |
| . . |
| |
+-----------------+
Now run the following commands to add the generated public key to the authorized keys, so that the user can log in using the private key.
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
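Hadoop's start-up scripts connect to localhost over SSH, so it is worth confirming that key-based login works before continuing. A quick way to test it (the first connection may ask you to confirm the host key):

# Should print the message without prompting for a password
ssh localhost 'echo SSH key authentication is working'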
Run the exit command to get back to your root user account.
Now download the latest binary release of Apache Hadoop from the official Apache mirrors; at the time of writing, version 2.7.2 can be fetched with the following command.
wget http://mirror.rise.ph/apache/hadoop/common/hadoop-2.7.2/hadoop-2.7.2.tar.gz
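Before extracting the archive, it is a good idea to verify that the download is intact. You can compute its checksum locally and compare it with the value published for this release on the official Apache Hadoop download page (the expected digest is not reproduced here):

# Compare this digest against the one published by Apache for hadoop-2.7.2.tar.gz
sha256sum hadoop-2.7.2.tar.gz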
Now extract the tar ball using the following command.
tar xfz hadoop-2.7.2.tar.gz
Now create a new directory to store the Hadoop files, and move all the extracted files into it using the following commands.
mkdir /opt/hadoop
mv hadoop-2.7.2/* /opt/hadoop
Now we will need to give our Hadoop user ownership of all the Hadoop files. Use the following command to do so.
chown -R hadoop:hadoop /opt/hadoop/
We can see all the Hadoop files using the following command.
ls /opt/hadoop
You will see the following output.
LICENSE.txt README.txt etc lib sbin
NOTICE.txt bin include libexec share
Now we will need to set up environment variables for both Java and Apache Hadoop. Log in to the dedicated hadoop user account using the following command.
su - hadoop
Now edit the .bash_profile file using your favorite editor; in this tutorial we are using nano, but you can use whichever you prefer. If you don't have nano installed, you can install it by running yum install nano. Append the following lines at the end of the file and save it once done.
export JAVA_HOME=/usr/java/default
export PATH=$PATH:$JAVA_HOME/bin
export CLASSPATH=.:$JAVA_HOME/jre/lib:$JAVA_HOME/lib:$JAVA_HOME/lib/tools.jar
export HADOOP_HOME=/opt/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
Initialize the variables using the following command.
source .bash_profile
Once done, you can check whether the environment variables are set correctly. Run the following command.
echo $JAVA_HOME
It should provide the following output.
/usr/java/default
Also run the following command.
echo $HADOOP_HOME
It should show you the following output.
/opt/hadoop
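Because $HADOOP_HOME/bin is now on your PATH, you can also confirm that the Hadoop binaries themselves are reachable:

hadoop version

The first line of the output should report version 2.7.2.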
Configuring Hadoop
Hadoop has many configuration files, which are located in the $HADOOP_HOME/etc/hadoop/ directory. You can view the list of configuration files using the following command.
ls $HADOOP_HOME/etc/hadoop/
You will see the following output.
capacity-scheduler.xml httpfs-env.sh mapred-env.sh
configuration.xsl httpfs-log4j.properties mapred-queues.xml.template
container-executor.cfg httpfs-signature.secret mapred-site.xml.template
core-site.xml httpfs-site.xml slaves
hadoop-env.cmd kms-acls.xml ssl-client.xml.example
hadoop-env.sh kms-env.sh ssl-server.xml.example
hadoop-metrics.properties kms-log4j.properties yarn-env.cmd
hadoop-metrics2.properties kms-site.xml yarn-env.sh
hadoop-policy.xml log4j.properties yarn-site.xml
hdfs-site.xml mapred-env.cmd
As we are installing Hadoop on a single node in pseudo-distributed mode, we will need to edit a few configuration files for Hadoop to work.
Use the following command to navigate to your Hadoop configuration directory.
cd $HADOOP_HOME/etc/hadoop/
Edit hadoop-env.sh using the following command.
nano hadoop-env.sh
Scroll down below to find these lines.
# The only required environment variable is JAVA_HOME. All others are
# optional. When running a distributed configuration it is best to
# set JAVA_HOME in this file, so that it is correctly defined on
# remote nodes.
# The java implementation to use.
export JAVA_HOME=${JAVA_HOME}
Edit these lines to set the Java path in the Hadoop configuration. It is important to set the Java path here; otherwise, Hadoop will not be able to find Java.
# The java implementation to use.
export JAVA_HOME=/usr/java/default/
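If you prefer to make this change non-interactively, a sed one-liner like the following should also work, assuming the file still contains the default export JAVA_HOME=${JAVA_HOME} line:

# Replace the JAVA_HOME line in hadoop-env.sh with the actual JDK path
sed -i 's|^export JAVA_HOME=.*|export JAVA_HOME=/usr/java/default/|' hadoop-env.sh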
The next file we will edit is core-site.xml, which specifies the host and port used by HDFS. Use your favorite editor to open the file.
nano core-site.xml
Scroll down to find these lines in the configuration.
<!-- Put site-specific property overrides in this file. -->
Now change the configuration as shown below.
<!-- Put site-specific property overrides in this file. -->

<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>
Save the file and exit the editor. Now we need to edit the second file, hdfs-site.xml, using the following command.
nano hdfs-site.xml
Scroll down to find these lines in the configuration file.
<!-- Put site-specific property overrides in this file. -->
Insert the following configuration in this file.
<!-- Put site-specific property overrides in this file. -->

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.name.dir</name>
        <value>file:///opt/hadoop/hadoopdata/namenode</value>
    </property>
    <property>
        <name>dfs.data.dir</name>
        <value>file:///opt/hadoop/hadoopdata/datanode</value>
    </property>
</configuration>
Now we will need to create the two directories that will hold the NameNode and DataNode data, using the following commands.
mkdir /opt/hadoop/hadoopdata
mkdir /opt/hadoop/hadoopdata/namenode
mkdir /opt/hadoop/hadoopdata/datanode
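If you prefer, the same directories can be created in a single step with mkdir -p; and if you run this as root rather than as the hadoop user, remember to hand ownership back afterwards:

# Create both data directories in one command
mkdir -p /opt/hadoop/hadoopdata/{namenode,datanode}
# Only needed if the directories were created as root
chown -R hadoop:hadoop /opt/hadoop/hadoopdata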
Now copy mapred-site.xml.template to mapred-site.xml using the following command.
cp mapred-site.xml.template mapred-site.xml
And then edit the file to insert the following configuration.
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
Finally, edit yarn-site.xml and find these lines.
<!-- Site specific YARN configuration properties -->
Change the configuration so that it looks as shown below.
<!-- Site specific YARN configuration properties -->

<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>
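Optionally, you can catch XML typos in the edited files before starting Hadoop. One way is xmllint, which is provided by the libxml2 package on CentOS (install it with yum install libxml2 if it is missing):

# Prints nothing if the files are well-formed, otherwise reports the parse error
xmllint --noout core-site.xml hdfs-site.xml mapred-site.xml yarn-site.xml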
We have now configured Hadoop to run on a single-node cluster. We can initialize the HDFS file system by formatting the namenode directory with the following command.
hdfs namenode -format
You will see the following output which is trimmed down to show only the important messages.
16/06/30 15:29:14 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = rackvoucher.com/198.50.190.11
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 2.7.2
STARTUP_MSG: classpath = ...
...
16/06/30 15:29:15 INFO namenode.FSImage: Allocated new BlockPoolId: BP-635153041-198.50.190.11-1467314955591
16/06/30 15:29:15 INFO common.Storage: Storage directory /opt/hadoop/hadoopdata/namenode has been successfully formatted.
16/06/30 15:29:15 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
16/06/30 15:29:15 INFO util.ExitUtil: Exiting with status 0
16/06/30 15:29:15 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at vps.liptanbiswas.com/198.50.190.11
************************************************************/
Now we can start the Hadoop cluster. Navigate to the $HADOOP_HOME/sbin directory using the following command.
cd $HADOOP_HOME/sbin
If you run the ls command here, you will see the list of all the executable scripts used to run Hadoop.
distribute-exclude.sh start-all.cmd stop-balancer.sh
hadoop-daemon.sh start-all.sh stop-dfs.cmd
hadoop-daemons.sh start-balancer.sh stop-dfs.sh
hdfs-config.cmd start-dfs.cmd stop-secure-dns.sh
hdfs-config.sh start-dfs.sh stop-yarn.cmd
httpfs.sh start-secure-dns.sh stop-yarn.sh
kms.sh start-yarn.cmd yarn-daemon.sh
mr-jobhistory-daemon.sh start-yarn.sh yarn-daemons.sh
refresh-namenodes.sh stop-all.cmd
slaves.sh stop-all.sh
You can start the Hadoop DFS services by executing the following command.
start-dfs.sh
You will see output similar to this.
16/06/30 16:06:42 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting namenodes on [localhost]
localhost: starting namenode, logging to /opt/hadoop/logs/hadoop-hadoop-namenode-vps.out
localhost: starting datanode, logging to /opt/hadoop/logs/hadoop-hadoop-datanode-vps.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /opt/hadoop/logs/hadoop-hadoop-secondarynamenode-vps.out
16/06/30 16:07:19 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Now start YARN using the following command.
start-yarn.sh
You will be shown an output similar to this.
starting yarn daemons
starting resourcemanager, logging to /opt/hadoop/logs/yarn-hadoop-resourcemanager-vps.out
localhost: starting nodemanager, logging to /opt/hadoop/logs/yarn-hadoop-nodemanager-vps.out
You can check the status of the services using the following command.
jps
You will see output similar to this.
2209 ResourceManager
2498 NodeManager
1737 NameNode
1866 DataNode
2603 Jps
2031 SecondaryNameNode
This shows that Hadoop is successfully running on the server.
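To verify that YARN and MapReduce can actually run jobs, you can submit one of the example jobs bundled with the Hadoop 2.7.2 tarball (the examples jar ships under share/hadoop/mapreduce; adjust the jar name if your version differs):

# Estimate the value of Pi using 2 map tasks and 10 samples each
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar pi 2 10

The job should complete and print an estimated value of Pi at the end.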
You can now browse the Apache Hadoop web interfaces from your browser. By default, the Hadoop NameNode web UI listens on port 50070. Go to the following address using your favorite browser.
http://Your-Server-IP:50070
For example, if the IP address of your server is 198.50.190.11, then you would browse to
http://198.50.190.11:50070
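If you want a quick check from the command line before opening a browser, you can query the NameNode web UI locally with curl (assuming curl is available, as it is on most CentOS 7 installations); any HTTP response confirms that the interface is listening:

curl -I http://localhost:50070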
You will see the NameNode status page with an overview of the cluster.
To view the Hadoop cluster and all applications, browse to the following address in your browser.
http://Your-Server-IP:8088
You will see the ResourceManager interface, where you can get information about all applications.
You can view NodeManager information at
http://Your-Server-IP:8042
You will see information about the NodeManager.
You can get information about the Secondary NameNode by browsing to the following web address.
http://Your-Server-IP:50090
You will see the Secondary NameNode status page.
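When you want to shut the cluster down, use the companion stop scripts from the same sbin directory, in the reverse order of start-up:

stop-yarn.sh
stop-dfs.sh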
Conclusion
In this tutorial we have learned how to install Apache Hadoop on a single node in pseudo-distributed mode. You can now easily install Apache Hadoop on your cloud server or dedicated server running CentOS 7.x, for both development and production purposes.