In our previous post, we saw how to set up a single-node cluster. Here, we look at turning that existing single-node setup into a multi-node cluster. The steps are very similar, with slight modifications: references to localhost need to be replaced by the appropriate hostname, and a few additional setup instructions are required.
From now on, we call the single node the master and the other nodes slaves (one master, many slaves). We will continue by adding one datanode to the master.
- The namenode (master) and the datanodes (slaves) should run the same Linux distribution with the same user name.
- The Hadoop framework installed on the master and the slaves should be the same version.
- Hadoop should be installed at the same path on every node (/user/nameofmasterorslave/Hadoop).
- Java is required on the master and the slaves, and the same version of Java (>= 6) is recommended — I have never experimented with mixing Java versions in a Hadoop multi-node cluster.
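As a quick sanity check, the sketch below prints the local Java and Hadoop versions; run it on the master and on every slave and confirm the output matches. It assumes both commands are on the PATH — adjust paths if your install differs.

```shell
# Print the local Java and Hadoop versions; run on every node and compare.
# Assumes java and hadoop are on the PATH (an assumption, not from the post).
if command -v java >/dev/null; then java -version 2>&1 | head -n 1; else echo "java: not installed"; fi
if command -v hadoop >/dev/null; then hadoop version | head -n 1; else echo "hadoop: not installed"; fi
```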
The Core steps:
- Ensure the Hadoop daemons on the master node are stopped (bin/stop-all.sh)
- Identify the hostname of each slave (in a terminal: $ hostname -f — this gives the hostname of any machine)
- Update the Hadoop configuration files: wherever localhost is referenced, replace it with the result of "hostname -f" on the MASTER NODE (core-site.xml, mapred-site.xml, masters, slaves)
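One way to do the replacement is with sed. The sketch below demonstrates it on a throwaway copy of a config file — the hostname and the sample property value are made up, so substitute your master's `hostname -f` output and run it against the real conf/core-site.xml and conf/mapred-site.xml.

```shell
# Demo of the localhost -> master-hostname substitution on a throwaway file.
# MASTER and the sample value are hypothetical; use `hostname -f` from your master.
MASTER="master.example.internal"
mkdir -p /tmp/conf-demo
printf '<value>hdfs://localhost:54310</value>\n' > /tmp/conf-demo/core-site.xml
sed -i "s/localhost/${MASTER}/g" /tmp/conf-demo/core-site.xml
cat /tmp/conf-demo/core-site.xml
```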
- Copy the configuration files from the master machine to the slave machines:
cd hadoop
scp conf/* <slave-hostname>:/data/hadoop-1.0.1/conf/
If you are using Amazon AWS: scp conf/* ip-10-239-40-120.ec2.internal:/data/hadoop/conf
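With several slaves, the copy can be looped over a list of hostnames. The sketch below is a dry run — the scp commands are only echoed, and the hostnames are hypothetical; drop the echo to actually copy. It assumes passwordless SSH and an identical Hadoop path on every node.

```shell
# Dry run: print one scp command per slave listed in a slaves file.
# Hostnames are hypothetical; remove the leading echo to really copy.
printf 'slave1.example.internal\nslave2.example.internal\n' > /tmp/slaves-demo
for host in $(cat /tmp/slaves-demo); do
  echo scp conf/* "${host}:/data/hadoop-1.0.1/conf/"
done
```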
- Always clear the HDFS temporary directory on each slave machine (stale data here can cause "Incompatible namespaceIDs" errors that prevent the datanode from joining):
rm -rf /data/hdfstmp
- On the master, edit the slaves file (conf/slaves) and add one more line containing the new slave's hostname
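Appending the new hostname can be done from the shell. The sketch below uses a stand-in file and a hypothetical hostname rather than the real conf/slaves; on the new machine, `hostname -f` gives the value to append.

```shell
# Append a new datanode's hostname to a stand-in slaves file.
# NEW_SLAVE is hypothetical; use `hostname -f` on the new machine.
NEW_SLAVE="slave2.example.internal"
printf 'slave1.example.internal\n' > /tmp/slaves-file-demo
echo "${NEW_SLAVE}" >> /tmp/slaves-file-demo
cat /tmp/slaves-file-demo
```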
- Start all the processes: bin/start-all.sh
- If it waits for SSH host-key confirmation the first time while executing start-all.sh, just say 'yes'
- To view all the processes running on the master node, run jps there
- To see the processes running on a slave node, run jps on that slave
Note: A multi-node cluster can have multiple secondary namenodes. When setting one up, change only core-site.xml and format the namenode.