How to Hadoop at home with Raspberry Pi - Part 3

In the home stretch now to completing my Raspberry Pi Hadoop cluster. But first, a quick summary. I started this personal project about 3 weeks ago because of my interest in “Big Data”, data analytics and data engineering. I’m also taking Udacity’s Nanodegree in Data Analytics, so I figured getting my hands dirty with Hadoop was a great way to dive in. This is Part 3 of 3: I set up the Raspberry Pis in part 1, then installed and configured Hadoop in a single node configuration in part 2. Now it’s time to get the cluster working.

This is not a tutorial. Think of it more as a journey: there’s no nice step-by-step process here. I’m going to make mistakes, get errors, fix them and try to move on.

  1. Part 1: Setting up Raspberry Pi and network configurations
  2. Part 2: Hadoop single node setup, testing and prepping the cluster
  3. Part 3: Hadoop cluster setup, testing and final thoughts

Hadoop and distributed data processing

Now that we’re talking…

over SSH, it’s time to start things up and make sure all of our services are running properly on all 3 nodes. I log into node1 and start the services.

1. $su hduser
2. $cd $HADOOP_HOME/sbin
3. $./start-dfs.sh
4. $./start-yarn.sh

From node1, the services are up and running, but let’s see what’s happening on nodes 2 and 3.

NodeManager and DataNode should be running on both nodes
$ssh node2 (node3)
$jps
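
On a healthy worker node, jps should list a DataNode and a NodeManager, something roughly like this (the PIDs are placeholders and will differ on your Pis):
1234 DataNode
1456 NodeManager
1602 Jps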

On node 2, services are running but not on node 3. I could start things directly on node 3, but I shouldn’t have to, so it’s time to stop and re-start both dfs and YARN (executed from node1, as shown below).
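
For reference, the stop/re-start cycle is just the companion stop scripts followed by the start scripts, all run from the sbin directory on node1:
1. $cd $HADOOP_HOME/sbin
2. $./stop-yarn.sh
3. $./stop-dfs.sh
4. $./start-dfs.sh
5. $./start-yarn.sh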

Still nothing. Repeated it a few times; need to do some research.

The problem: the dataNode service is only running on one or two nodes, and sometimes the service starts up and then shuts down automatically.

The Solution: Strangely, it’s pretty much the same as what I encountered before when formatting the nameNode. I believe copying the SD card, combined with the nameNode fiasco, resulted in my cloned cards having out-of-sync ClusterIDs. So again, I just followed the steps below, then restarted everything. Once done, all services across each node were running and stable. Yay!
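
A way to confirm the mismatch before wiping anything: each node records the clusterID it belongs to in a VERSION file under its HDFS storage directory (paths below assume the layout from part 2), so you can compare them by hand.
$cat /opt/hadoop/hadoop_data/hdfs/namenode/current/VERSION (on node1)
$cat /opt/hadoop/hadoop_data/hdfs/datanode/current/VERSION (on each node)
The clusterID line should be identical everywhere; if a dataNode's doesn't match the nameNode's, it will start, fail the ID check and shut itself down.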

Delete hdfs storage, recreate it with the right permissions, and repeat for all nodes
1. $sudo rm -rf /opt/hadoop/hadoop_data
2. $sudo mkdir -p /opt/hadoop/hadoop_data/hdfs/namenode (not required for nodes 2 and 3)
3. $sudo mkdir -p /opt/hadoop/hadoop_data/hdfs/datanode
4. $sudo chown hduser:hadoop /opt/hadoop/hadoop_data/hdfs -R
5. $sudo chmod 750 /opt/hadoop/hadoop_data/hdfs
Format node1 namenode
$hdfs namenode -format
Start up hadoop on node 1 and run jps to validate things are running
1. $cd $HADOOP_HOME/sbin
2. $./start-dfs.sh
3. $./start-yarn.sh
4. $jps (test on all nodes)
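
Another quick sanity check, beyond jps, is to ask the nameNode which dataNodes have actually registered; each live node should show up in the report.
$hdfs dfsadmin -report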

Testing…

your creation is probably the most exciting and frustrating part of any personal project. Now that I have my cluster actually running I need to do a few tests to make sure things are working as expected.

Similar to the single node test, I’m going to try the wordCount and Pi examples.

Copy the file, check HDFS for the file, then run wordCount on the file
1. $hdfs dfs -copyFromLocal /opt/hadoop/LICENSE.txt /license.txt
2. $hdfs dfs -ls /
3. $./hadoop jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar wordcount /license.txt /license
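
Once the job finishes, the counts land in the /license output directory I passed in; assuming the default single reducer (which writes a part-r-00000 file), the results can be peeked at straight from HDFS:
$hdfs dfs -ls /license
$hdfs dfs -cat /license/part-r-00000 | head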

The job is running, but it seems like everything is running on node1. To be 100% sure about this, I can always do some network and CPU monitoring. The Raspberry Pi comes with such a tool, top, pre-installed. However, I’ve heard nmon is pretty good, if not better, so time to test it out.

Install nmon and run it on each node
$sudo apt-get install nmon
$nmon
I have the c (CPU), m (memory) and n (network) options open across terminals

Now time to rerun and see what’s happening… Yup, only node1 is doing work. I also figured out how to view the individual tasks in the YARN web UI (default port 8088).

So with things only running on node1, there are a couple of reasons, and solutions, for this that I know of.

  1. All blocks of the file are on one dataNode. This happened because I uploaded the file locally to HDFS, but the local node I’m on (node1) is also acting/configured as a dataNode. With replication set to 1, and Hadoop running jobs close to the data, everything gets processed on node1 (the fsck check after this list shows where the blocks actually live).
  2. Block size is too big. When I uploaded the test file into HDFS, Hadoop breaks it into blocks, which it then replicates across servers. However, if the block size is 10MB but the file is 1MB, there’s no need to break up the file: only 1 block is required and hence no distributed processing.
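
A quick way to see this for yourself: hdfs fsck will print each block of the file and which dataNode(s) hold a replica, so you can check whether everything really does sit on node1.
$hdfs fsck /license.txt -files -blocks -locations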

Both of these are easy to fix. First, I need a bigger test file; second, I need to change some configurations across the nodes.

Get test file
Downloaded a 35MB text file for testing
Remove node1 as a dataNode
$sudo nano /opt/hadoop/etc/hadoop/slaves
remove node1 from the list, leaving just node2 and node3
Update block size (hdfs-site.xml)
The 35MB test file isn't that big, so add a block size parameter of 5MB, forcing Hadoop to break up the file (quick math on this after the settings below).
<property>
<name>dfs.block.size</name>
<value>5242880</value>
</property>
Update replication (hdfs-site.xml)
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
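
A quick sanity check on those numbers: with a 5MB (5,242,880 byte) block size, the 35MB test file should split into roughly 35 / 5 = 7 blocks, and with replication set to 2 each block gets a copy on both node2 and node3. That gives the wordCount job up to 7 map tasks to spread across the workers, instead of the single block (and single map task) I had before.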

Now in my system I have a nameNode (node1) and 2 dataNodes (node2 and node3). Time to re-test… And re-test again and again. Data is being replicated across the 2 dataNodes but now the job isn’t running at all. The CLI is stuck on “INFO mapreduce.Job: Running job: job_145…”

Time for more investigation. And when reviewing the logs, or the YARN web UI, I keep seeing:

Accepted: waiting for AM container to be allocated...

The job is basically stuck in the “Accepted” state. On SO there seem to be a lot of different “solutions” to this problem. The solution that worked for me now seems so obvious… I’m using YARN as the resource manager; its configuration is local to, and used by, each node in the cluster, but each node also needs to know where within the cluster the resource manager actually lives.

Therefore, time for another configuration update of the yarn-site.xml file.

Update yarn-site.xml on all nodes with the resource manager hostname
<property>
<name>yarn.resourcemanager.hostname</name>
<value>node1</value>
</property>

Now YARN knows that node1 is the resource manager, but more importantly, node2 and node3 also know who the resource manager is. Start things up again, and this time go to node1:8088 in the browser and you should see both node2 and node3 listed as active and running.

Testing the redux…

will be an error-free event, fingers crossed. With what appears to be a working distributed system, it’s time to test Hadoop with the wordCount example one last time.

$./yarn jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar wordcount /testFile.txt /testFile
Note: notice I used "yarn" instead of "hadoop" to run the job; this is another test. I noticed this change in a few articles, and as far as I can tell "yarn jar" is simply the YARN-era way of submitting the same jar, so for these MapReduce examples both behave the same.

SUCCESS!

With nmon monitoring all 3 nodes, and also reviewing the tasks via the browser, you can see how the map tasks were shared between node2 and node3. In part 2, I recorded a few timed runs of Pi on the single node, and now with a distributed cluster I should see some better times… Sounds like another test is coming.

$./hadoop jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar pi 10 10
(Estimated value of Pi is 3.200)
Original Pi tests:
Job Finished in 346.618 seconds
Job Finished in 392.751 seconds
Job Finished in 370.214 seconds
Job Finished in 361.656 seconds
Distributed Pi tests:
Job Finished in 142.409 seconds
Job Finished in 164.503 seconds
Job Finished in 122.628 seconds
Job Finished in 133.072 seconds
Job Finished in 116.997 seconds
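
Averaging those runs: the single node Pi jobs took roughly 368 seconds, while the distributed runs average about 136 seconds, which works out to roughly a 2.7x speedup from adding the two worker nodes.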

Is it just me, or is 117 seconds a beautiful thing to see? With the cluster working, data processing is distributed across the nodes in parallel and runs more efficiently. Yay! Finally, I have a Raspberry Pi Hadoop cluster.

Final Thoughts

Damn…

that was involved. I didn’t think it would be easy, but with so many articles and tutorials out there, I definitely didn’t think I’d run into so many issues. Part of that, I believe, was due to the new version of Hadoop. With Hadoop 2.x, a number of things changed, such as system/folder paths, the introduction of YARN and MapReduce 2, and MapReduce configuration changes to integrate with YARN. Not to mention this is all running on Raspberry Pis, so memory and CPU resources were a problem as well.

For my final thoughts, I won’t get into a bunch of technical “gotcha” moments. Instead, since this was a personal journey, here are my personal thoughts…

  • Building a Hadoop cluster on Raspberry Pi was an amazing and deeply satisfying learning experience
  • So much more to learn. Hadoop configuration will make or break your cluster’s performance, or even prevent it from working in the first place
  • I would love to bring 117 seconds down to 100 by optimizing the Hadoop configuration, so we’ll see how that goes
  • I hope my experience here will add some benefit to the library of online Hadoop 2 + Raspberry Pi articles.
  • Only the beginning. Going to set up Hive next! And then Pig.

Update: I’ve had a few requests about providing a step-by-step setup and final working config files. And after reading over my 3 articles, I think that’s definitely a reasonable request, so look out for it in the near future.

Raspberry Pi Hadoop cluster. Big Data on a small scale.