A Hadoop Cluster @Home using CentOS 6.5, Cloudera Manager 4.8.1 and Cloudera Parcels : EPISODE #2

 STEP 2 : Hadoop Cluster Installation : Getting your Virtual Machines  READY

Cloudera Manager Supports :

  • Red Hat Enterprise Linux 5.7 and CentOS 5.7, 64-bit
  • Red Hat Enterprise Linux 6.2 and 6.4, and CentOS 6.2 and 6.4, 64-bit
  • Firefox 11 or later, or
  • Google Chrome, or
  • Internet Explorer 9

Supported Databases for Cloudera Manager

  • Cloudera Manager requires several databases.
  • The Cloudera Manager Server stores information about configured services, role assignments, configuration history, commands, users, and running processes in a database of its own.
  • The Activity Monitor, Service Monitor, Report Manager, and Host Monitor also each use a database to store information.
  • The embedded PostgreSQL database
  • MySQL 5.0, 5.1, or 5.5
  • Oracle 10g Release 2 or 11g Release 2
  • PostgreSQL 8.1, 8.3, 8.4, or 9.1

CDH Version Support :

  • Cloudera Manager supports CDH3 Update 1 (cdh3u1) or later, and CDH4.0 or later


  • Upgrade to the latest CDH minor release with just a few mouse clicks, even without taking any downtime on your cluster
  • Requires Cloudera Manager 4.5 or later

Why Cloudera Manager :

The Cloudera Manager Installer enables you to install Cloudera Manager and bootstrap an entire CDH cluster, requiring only that you have SSH access to your cluster’s machines and that those machines have Internet access.

The Cloudera Manager Installer will automatically:

  • Detect the operating system on the Cloudera Manager host
  • Install the package repository for Cloudera Manager and the Java Runtime Environment (JRE)
  • Install the JRE if it’s not already installed
  • Install and configure an embedded PostgreSQL database
  • Install and run the Cloudera Manager Server

Once server installation is complete, you can browse to Cloudera Manager’s web interface and use the cluster installation wizard to set up your CDH cluster.

Getting your Virtual Machines  READY:

  • Decide the number of nodes required to form your cluster.
  • Create that many virtual machines, as described in my previous blog post.

1. Edit /etc/resolv.conf

  • resolv.conf is the name of the file used in various operating systems to configure the Domain Name System (DNS) resolver library.
  • It is the resolver configuration file.
  • It contains information that determines the operational parameters of the DNS resolver.

The DNS resolver allows applications running in the operating system to translate human-friendly domain names into the numeric IP addresses that are required for access to resources on the local area network or the Internet.

  • search example.com
  • nameserver

A name server is a computer server that hosts a network service for providing responses to queries against a directory service.

  • It maps a human-recognizable identifier to a system-internal, often numeric identification or addressing component.
  • This service is performed by the server in response to a network service protocol request.
  • An example of a name server is the server component of the Domain Name System (DNS), one of the two principal name spaces of the Internet.
  • The most important function of these DNS servers is the translation (resolution) of human-memorable domain names and hostnames into the corresponding numeric Internet Protocol (IP) addresses
  • A domain name (for instance, “example.com”) is an identification string that defines a realm of administrative autonomy, authority or control on the Internet.
  • Domain names are formed by the rules and procedures of the Domain Name System (DNS)
  • Any name registered in the DNS is a domain name
  • A hostname is a label that is assigned to a device connected to a computer network and that is used to identify the device in various forms of electronic communication such as the World Wide Web, e-mail or Usenet. Hostnames may be simple names consisting of a single word or phrase, or they may be structured.

Domain Name System (DNS) is a hierarchical distributed naming system for computers, services, or any resource connected to the Internet or a private network. It associates various information with domain names assigned to each of the participating entities.

A common way to describe the Domain Name System is that it serves as the phone book for the Internet, translating human-friendly computer hostnames into IP addresses. For example, the domain name www.example.com translates to an IPv4 address and to the IPv6 address 2606:2800:220:6d:26bf:1447:1097:aa7. Unlike a phone book, though, the DNS can be quickly updated.

File name to edit is /etc/resolv.conf and not /etc/resolve.conf

By default this file should already be populated; if it is not, edit it with appropriate values such as:

  • domain localdomain
  • search localdomain
  • nameserver
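Putting the entries above together, a minimal /etc/resolv.conf might look like the sketch below. The nameserver IP 192.168.1.1 is a placeholder I've chosen for illustration; substitute your own DNS server or gateway address. The file is written to /tmp here so the commands are safe to try:

```shell
# Sketch of a minimal resolv.conf (192.168.1.1 is a placeholder IP).
# On a real node the target file is /etc/resolv.conf.
cat > /tmp/resolv.conf.sample <<'EOF'
domain localdomain
search localdomain
nameserver 192.168.1.1
EOF
grep '^nameserver' /tmp/resolv.conf.sample
```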

2.  Edit  /etc/sysconfig/network

  • HOSTNAME=www.ab.com

(The hostname is reflected in the command prompt, e.g. [root@www adminuser]# )

About sysconfig-network files : http://www.centos.org/docs/5/html/5.2/Deployment_Guide/s2-sysconfig-network.html
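For reference, a complete /etc/sysconfig/network on CentOS typically also carries a NETWORKING=yes line alongside HOSTNAME. The sketch below uses the example hostname from this post and writes to /tmp so it can be tried safely; on a real node, edit /etc/sysconfig/network itself with each node's own FQDN:

```shell
# Sample /etc/sysconfig/network contents (hostname is the post's example;
# substitute each node's own fully qualified name).
cat > /tmp/network.sample <<'EOF'
NETWORKING=yes
HOSTNAME=www.ab.com
EOF
grep '^HOSTNAME' /tmp/network.sample
```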

3. Disable SELinux on all nodes

Edit /etc/selinux/config

  • SELINUX=disabled   # change the value from enforcing to disabled

Check the current status with the command: sestatus

You can also change the enforcement mode on a live system:

  • setenforce 0   # enforcement off (permissive)
  • setenforce 1   # enforcement on

Reboot the VM for the change in /etc/selinux/config to take effect.
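When you have many nodes, the config-file edit above is easy to script with sed. The sketch below demonstrates the substitution on a copy in /tmp; on a real node the file is /etc/selinux/config and the command needs root:

```shell
# Flip SELINUX=enforcing to SELINUX=disabled with sed.
# Demonstrated on a sample copy; on a node, target /etc/selinux/config.
cat > /tmp/selinux.config <<'EOF'
SELINUX=enforcing
SELINUXTYPE=targeted
EOF
sed -i 's/^SELINUX=enforcing/SELINUX=disabled/' /tmp/selinux.config
grep '^SELINUX=' /tmp/selinux.config
```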


4. Disable the firewall on all nodes

  • A Linux firewall is a software-based firewall that provides protection between your server (or workstation) and damaging content on the Internet or network.
  • It tries to guard your computer against both malicious users and malicious software such as viruses and worms.

Turn off the firewall on boot:

  • chkconfig iptables off

 5. Edit your hosts file

vi /etc/hosts   # add your hosts to the file
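For a small cluster, the hosts file ends up looking something like the sketch below. All IPs and hostnames here are placeholders I've invented for a hypothetical 3-node setup; use your own VMs' addresses and names. It is written to /tmp so the commands are safe to try:

```shell
# Example /etc/hosts entries for a 3-node cluster (all values are
# placeholders; on a real node, edit /etc/hosts itself).
cat > /tmp/hosts.sample <<'EOF'
127.0.0.1       localhost
192.168.1.101   master.localdomain   master
192.168.1.102   slave1.localdomain   slave1
192.168.1.103   slave2.localdomain   slave2
EOF
grep ' master$' /tmp/hosts.sample
```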

6. Generate an SSH key pair

Verify SSH installation

The first step is to check whether SSH is installed on your nodes. We can easily do this with the “which” UNIX command:

[hadoop-user@master]$ which ssh
/usr/bin/ssh

[hadoop-user@master]$ which sshd

[hadoop-user@master]$ which ssh-keygen
/usr/bin/ssh-keygen

If you instead receive an error message such as this,

/usr/bin/which: no ssh in (/usr/bin:/bin:/usr/sbin…

install OpenSSH (www.openssh.com) via a Linux package manager or by downloading the source directly. (Better yet, have your system administrator do it for you.)

Generate SSH key pair

Having verified that SSH is correctly installed on all nodes of the cluster, we use ssh-keygen on the master node to generate an RSA key pair. Be certain to avoid entering a passphrase, or you’ll have to manually enter that phrase every time the master node attempts to access another node.

[hadoop-user@master]$ ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop-user/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/hadoop-user/.ssh/id_rsa.
Your public key has been saved in /home/hadoop-user/.ssh/id_rsa.pub.

After creating your key pair, your public key will be of the form

[hadoop-user@master]$ more /home/hadoop-user/.ssh/id_rsa.pub
ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEA1WS3RG8LrZH4zL2/1oYgkV1OmVclQ2OO5vRi0NdK51Sy3wWpBVHx82F3x3ddoZQjBK3uvLMaDhXvncJG31JPfU7CTAfmtgINYv0kdUbDJq4TKG/fuO5qJ9CqHV71thN2M310gcJ0Y9YCN6grmsiWb2iMcXpy2pqg8UM3ZKApyIPx99O1vREWm+4moFTgYwIl5be23ZCyxNjgZFWk5MRlT1p1TxB68jqNbPQtU7fIafS7Sasy7h4eyIy7cbLh8x0/V4/mcQsY5dvReitNvFVte6onl8YdmnMpAh6nwCvog3UeWWJjVZTEBFkTZuV1i9HeYHxpm1wAzcnf7az78jTIRQ== hadoop-user@master

and we next need to distribute this public key across your cluster.

Distribute public key and validate logins

Albeit a bit tedious, you’ll next need to copy the public key to every slave node as well as the master node:

[hadoop-user@master]$ scp ~/.ssh/id_rsa.pub hadoop-user@target:~/master_key

Manually log in to the target node and set the master key as an authorized key (or append to the list of authorized keys if you have others defined).

[hadoop-user@target]$ mkdir ~/.ssh

[hadoop-user@target]$ chmod 700 ~/.ssh

[hadoop-user@target]$ mv ~/master_key ~/.ssh/authorized_keys

[hadoop-user@target]$ chmod 600 ~/.ssh/authorized_keys

After installing the key on the target, you can verify it’s correctly set up by attempting to log in to the target node from the master:

[hadoop-user@master]$ ssh target

The authenticity of host ‘target (xxx.xxx.xxx.xxx)’ can’t be established.
RSA key fingerprint is 72:31:d8:1b:11:36:43:52:56:11:77:a4:ec:82:03:1d.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added ‘target’ (RSA) to the list of known hosts.
Last login: Sun Jan 4 15:32:22 2009 from master

After confirming the authenticity of a target node to the master node, you won’t be prompted upon subsequent login attempts.

[hadoop-user@master]$ ssh target
Last login: Sun Jan 4 15:32:49 2009 from master

7. Edit /etc/ssh/ssh_config

  • Change the StrictHostKeyChecking option from yes to no.

8. Update the packages of the system

  • yum -y update

9. Apply all the previous steps on all nodes, changing IP addresses and hostnames accordingly in the /etc/hosts file
10. Download cloudera-manager-installer.bin on the machine that will act as the master server

CM Installer Download : http://archive.cloudera.com/cm4/installer/
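The installer is a self-extracting binary, so the usual pattern is: download it, mark it executable, and run it as root. In the sketch below the download and run lines are commented out because they need network access and root, and the exact file path under the archive URL above may differ (browse the directory to find the .bin); a stand-in file demonstrates the chmod step:

```shell
# Download-and-run pattern for the Cloudera Manager installer (sketch).
# wget http://archive.cloudera.com/cm4/installer/latest/cloudera-manager-installer.bin
touch /tmp/cloudera-manager-installer.bin        # stand-in for the downloaded file
chmod u+x /tmp/cloudera-manager-installer.bin    # the .bin must be executable
test -x /tmp/cloudera-manager-installer.bin && echo executable
# sudo ./cloudera-manager-installer.bin          # then run it on the master
```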

To be continued ..Stay tuned


About shalishvj : My Experience with BigData

6+ years of experience using Bigdata technologies in Architect, Developer and Administrator roles for various clients. • Experience using Hortonworks, Cloudera, AWS distributions. • Cloudera Certified Developer for Hadoop. • Cloudera Certified Administrator for Hadoop. • Spark Certification from Big Data Spark Foundations. • SCJP, OCWCD. • Experience in setting up Hadoop clusters in PROD, DR, UAT , DEV environments.
This entry was posted in Hadoop Cluster Installation.

One Response to A Hadoop Cluster @Home using CentOS 6.5, Cloudera Manager 4.8.1 and Cloudera Parcels : EPISODE #2

  1. Michael Hoffmann says:

    Hello! Interesting posts. A few things from my own experience:
    – I would set SELinux to permissive and not turn it off completely. That way you can still see what it may complain about and take note for environments where it’s mandatory.
    – Similar for the firewall/iptables: leave it on and selectively open ports as needed. NOT a good idea to turn it off in a commercial production environment, so you might as well learn in your own test systems what’s needed to get it to work while iptables is active.
    – Remember that by default a RHEL/CentOS machine will use LVM, which is definitely not advised for HDFS. So remember to plan to add some additional disks that will be JBOD and formatted directly with ext3/ext4/xfs. Of course, virtualisation isn’t all that advisable for datanodes anyway (regardless of what VMware says).

    Looking forward to how you configure the namenodes/jobtrackers and datanodes/tasktrackers. So far, I’ve found I have more control doing installations “by hand” than using Cloudera Manager. Would like to hear your experiences!
