How to Hadoop at home with Raspberry Pi - Part 2

To summarize what I’m doing and how I got here: A couple of weeks ago I decided to dive into the world of Hadoop from my interest in data engineering and analysis. And what’s the best way to do that? Build a Raspberry Pi Hadoop Cluster, of course!

This is not a tutorial. Think of it more as a journey: there's no nice step-by-step process here; I'm going to make mistakes, get errors, fix them and try to move on.

If you want to follow along, you should probably start with Part 1, which covers setting up the Raspberry Pi and some basic network configuration. In this part of the series, I'll be installing and configuring Hadoop for a single-node installation.

  1. Part 1: Setting up Raspberry Pi and network configurations
  2. Part 2: Hadoop single node setup, testing and prepping the cluster
  3. Part 3: Hadoop cluster setup, testing and final thoughts

Before I forget and we get too far ahead of ourselves, this is all Hadoop 2 with the YARN implementation. Hadoop 2 comes with some significant changes, and apparently you can also run Hadoop 2 without YARN (or something like that), which caused me some headaches. But enough of that, let us begin.

Hadoop group, users and SSH…

is pretty straightforward to set up. I'm simplifying things and only creating one group and one user, instead of the separate users for HDFS, MapReduce and YARN that seem to be recommended.

Add a group, a user and then add the user to the group
1. $sudo addgroup hadoop
2. $sudo adduser --ingroup hadoop hduser
3. $sudo adduser hduser sudo
You'll need to enter a password and other typical user info; I entered a password but just used blanks/default values for everything else

Hadoop requires SSH access to communicate with and manage its nodes, i.e. nameNodes, secondaryNameNodes, dataNodes, etc. We'll be using an empty passphrase because we don't want to have to enter it every time the nodes talk to each other.

Switch users and create SSH key with no passphrase
1. $su hduser
2. $mkdir ~/.ssh
3. $ssh-keygen -t rsa -P ""
4. $cat ~/.ssh/id_rsa.pub > ~/.ssh/authorized_keys
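
One gotcha worth mentioning: sshd can be picky about permissions on the .ssh directory and the authorized_keys file. If the passwordless login below still prompts for a password, tightening those permissions is a reasonable first thing to try (this wasn't part of my original steps, just a common fix):

Tighten SSH permissions if key-based login doesn't work
1. $chmod 700 ~/.ssh
2. $chmod 600 ~/.ssh/authorized_keys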

Verify that you can make an SSH connection to node1. I'm not using any special ports, so this was pretty straightforward for me.

If you are prompted to trust node1 / unknown_host, type yes
1. $su hduser
2. $ssh node1
3. $exit

Hands-on with Hadoop

Hadoop 2.7.2 with…

YARN is not the same as just running Hadoop 1.x. I learnt that the hard way, realizing (for the 2nd time actually) that you shouldn't blindly follow other tutorials out there. The problem is, 95% of them reference v1 and not v2. (MapReduce 2 is also a big change in Hadoop 2.x, which caused a few problems for me that you'll see later.) Anyways, a couple of days of research later:

YARN is…

Yet Another Resource Negotiator. It's a layer between HDFS (the file system) and everything else, such as MapReduce and Hive (don't ask me what Hive is, I haven't gotten there yet). But needless to say, MapReduce and Hive are essentially applications that plug into the YARN framework. Plug any app into YARN and it will take care of the resourcing. YARN also allows non-batch-processing apps to be used, which is a big deal because Hadoop has traditionally been all about batch processing.
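
Two YARN commands I found handy for poking at what the ResourceManager thinks is going on are below. They only work once the services are actually running (we get to that further down), so treat this as a preview rather than a step:

Ask YARN about the cluster (only after the services are started)
1. $yarn node -list
2. $yarn application -list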

Download and install Hadoop with…

these steps below. I went through them pretty easily and quickly, although running into issues wasn't as rare as I thought it would be…

Download and install via terminal
1. $cd ~/
2. $wget http://apachemirror.ovidiudan.com/hadoop/common/hadoop-2.7.2/hadoop-2.7.2.tar.gz
3. $sudo mkdir /opt
4. $sudo tar -xvzf hadoop-2.7.2.tar.gz -C /opt/
5. $cd /opt
6. $sudo mv hadoop-2.7.2 hadoop
7. $sudo chown -R hduser:hadoop hadoop

In the above, I'm downloading Hadoop, making an /opt directory (which is apparently a typical folder to have on the Raspberry Pi), expanding the tar into /opt, renaming the extracted folder to something friendlier and then giving hduser permission to this new folder. The only issue I came across with these steps was No space left on device, which was due to using the re-formatted SD Card. You'll need to use the "Expand rootfs / root partition to fill SD Card" option available in raspi-config.
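
If you want to check whether you're about to hit the same wall, a quick look at the free space on the root partition before expanding the tar will tell you. The exact wording of the menu option varies between raspi-config versions, so take the second line as a rough pointer:

Check free space and expand the root partition if needed
1. $df -h /
2. $sudo raspi-config (choose the Expand Filesystem / expand rootfs option, then reboot)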

With that done, it's time to set up some system environment variables. I went minimal on this, but you can obviously go all out here; I'm pretty sure half of this isn't needed anyway.

Add to the end of /etc/bash.bashrc the following export lines
$sudo nano /etc/bash.bashrc
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
export HADOOP_HOME=/opt/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export PATH=$PATH:$HADOOP_INSTALL/bin
Apply those changes: $source /etc/bash.bashrc (or just open a new terminal)

Now time for a test: I'm going to exit the terminal (or you can just open a new one), switch users and then type "hadoop version".

Test installation, you should get version Hadoop 2.7.2
1. $exit
2. $su hduser
3. $hadoop version

Time to configure Hadoop…

which is both straightforward and complicated at the same time. In the end, it really was just a matter of understanding Hadoop, and also reminding myself that this is only a single-node cluster at the moment with a minimal config setup.

Hadoop environment variables, uncomment/update the two export lines
$sudo nano /opt/hadoop/etc/hadoop/hadoop-env.sh
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
export HADOOP_HEAPSIZE=250
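
The 250 MB heap is about leaving room for everything else the Pi has to run; there isn't much memory to go around once several JVMs are up. If you want to see how much headroom you actually have before picking a number, a quick check is:

Check available memory before settling on a heap size
$free -m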

Next up, more configuration files. Paths here are a bit different than in Hadoop 1.x; in v2 you can find these files under /opt/hadoop/etc/hadoop/

I went through the config files and updated/added to each one as follows…

core-site.xml
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://node1:9000</value>
  </property>
</configuration>

mapred-site.xml
Note: This is where we'll tell MapReduce to use the YARN framework. The file doesn't exist by default, so you'll need to make a copy of mapred-site.xml.template and edit it
$cp mapred-site.xml.template mapred-site.xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

hdfs-site.xml
Note: The replication property is set to 1; by default it's 3, but with a single node we don't need blocks replicated 3 times. The nameNode and dataNode paths are self-explanatory.
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/opt/hadoop/hadoop_data/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/opt/hadoop/hadoop_data/hdfs/datanode</value>
  </property>
</configuration>

yarn-site.xml
Note: Tell the NodeManagers that there's an auxiliary service called mapreduce_shuffle, and provide the class that implements that service.
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>
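
Before moving on, a quick sanity check I'd suggest (not something from my original run) is asking Hadoop to echo a couple of these values back, which confirms the config files are actually being picked up. Note that fs.defaultFS is simply the newer Hadoop 2 name for fs.default.name, so it should return the same value we set above:

Confirm the configuration is being read
1. $hdfs getconf -confKey fs.defaultFS
2. $hdfs getconf -confKey dfs.replication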

Next, I need to create the storage directories named above. And remember to give hduser access, otherwise you'll end up with a few errors when trying to write (which is exactly what happened to me).

Create folders and permissions
1. $sudo mkdir -p /opt/hadoop/hadoop_data/hdfs/namenode
2. $sudo mkdir -p /opt/hadoop/hadoop_data/hdfs/datanode
3. $sudo chown hduser:hadoop /opt/hadoop/hadoop_data/hdfs -R
4. $sudo chmod 750 /opt/hadoop/hadoop_data/hdfs
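
A quick look at the ownership afterwards is cheap insurance against those write errors (just a check, not a required step):

Verify hduser owns the storage directories
1. $ls -ld /opt/hadoop/hadoop_data/hdfs
2. $ls -l /opt/hadoop/hadoop_data/hdfs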

Formatting and starting up HDFS is…

the last step in the process, prior to actually testing out my creation.

$cd $HADOOP_INSTALL
$hdfs namenode -format
Output near the bottom: /opt/hadoop/hadoop_data/hdfs/namenode has been successfully formatted 

Formatting the nameNode gets it prepped for use, just like if you were to format a hard drive or SD Card. Now it’s time to start up the services…

Start, stop and list running services
1. $cd $HADOOP_HOME/sbin
2. $./start-dfs.sh
3. $./start-yarn.sh
4. $jps (to view all running services)
(stop-dfs.sh and stop-yarn.sh do what you imagine they would)

Running the jps command should list the Java processes (and therefore the Hadoop services) running on the machine. This command was very helpful, as sometimes services just didn't start up (you'll see why later) and it's good to know sooner rather than later.

Run the command and verify you have the below services running
$jps

2169 Jps
1232 NameNode
1310 SecondaryNameNode
1350 DataNode
1863 NodeManager
1750 ResourceManager

Because you’re running on a single node installation, my node1 should act as the nameNode and dataNode (plus with YARN, NodeManager and ResourceManager). You can ignore SecondaryNameNode at this time, it’s only just to take snapshots of the nameNode.

Run a test and…

cross your fingers. I now have everything done: the Raspberry Pi is set up, the environment configured, Hadoop installed, the user and group created, the Hadoop environment variables set, and the Hadoop user has access to all the files and folders. The only thing left now is to test my single-node installation. And of course Hadoop comes with a bunch of example code I can use as a test. I noticed a "pi" example and figured that should be a simple enough test for me to dive into.

Run a Hadoop-provided example, pi, which estimates the value of pi (the two arguments are the number of map tasks and the number of samples per map)
1. $cd $HADOOP_INSTALL/bin
2. $./hadoop jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar pi 16 1000

As output I should get something like "Estimated value of Pi is 3.1425…". Things started off so well, but I'm writing this as it runs, and while I wait for my pi outcome I'm starting to see a number of failures:

16/03/12 20:40:12 INFO mapred.JobClient: map 44% reduce 15%
...
INFO mapreduce.Job: Task Id : blah blah, Status : Failed

But there is progress — map 44% reduce 15% — which is pretty exciting and that has to count for something, right? Okay, now the whole thing has failed. Time to take a few screen shots and Google some answers, right? Well, I’d like to say that’s what I did, but it’s not. I had a mini frustration meltdown (when did I convince myself this would be easy???) and took a 3 day timeout from Hadoop.

Back. Wiped the SD Card. Started over.

I followed my own article notes, making updates as I researched what went wrong. What you see above has already been updated with my notes.

Time to try a word count; after all, it is the Hello World of Hadoop, isn't it? (I picked that up from my research.) The plan: copy a file to HDFS and run the example MapReduce wordCount on it.

Copy the file, check HDFS for the file, then run wordCount on it
1. $hdfs dfs -copyFromLocal /opt/hadoop/LICENSE.txt /license.txt
2. $hdfs dfs -ls /
3. $cd /opt/hadoop/bin
4. $./hadoop jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar wordcount /license.txt /license-out.txt

WordCount of the license.txt file worked! No errors and a nice “Job completed successfully” was part of the output near the bottom. Now time to check the actual results.

I first opened the .txt file like a normal person and it was blank (still need to look into why that is) but if you open /part-r-00000, it should have all words in the license.txt file and their number of occurrences — success!

hdfs dfs -copyToLocal /license-out.txt ~/
nano ~/license-out.txt/part-r-00000

Update: that “file” I opened was actually a directory. Hadoop creates a license-out.txt directory with _SUCCESS and part-x-yyyyy files within it.

Update: What are Success and part-r-xxxxx files in Hadoop
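
You can also poke at the output directly in HDFS without copying it down first, something along these lines:

Inspect the job output in HDFS
1. $hdfs dfs -ls /license-out.txt
2. $hdfs dfs -cat /license-out.txt/part-r-00000 | head -n 20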

I’m now more intrigued than before as to why the first job failed, 2 out of 3?

To delete a non-empty directory in HDFS, try
$hadoop fs -rm -r /license-out.txt

Basically, I’m going to run another wordCount (but on another file), wordMean and rerun the pi test. Moments later…both wordCount and wordMean were successful. However, the first try at pi failed again.

Error trying to run hadoop-mapreduce-examples-2.7.2.jar pi 10 10
16/03/17 22:03:58 INFO mapreduce.Job: Task Id: ..., Status: Failed
Exception from container-launch

I did a little more research and kept seeing a recurring theme around setting the classpath via HADOOP_HOME and related variables. An example of my problem is here on SO: container-launch error.

Update: I’m not pretty sure this was not my problem

Time for a third attempt: I made a few changes (you already have them above if you followed along) and then re-ran the job. I still received some failed tasks, but overall this time it all worked.

3rd attempt at calculating Pi
Job job_1458..._001 completed successfully
...
Job Finished in 346.618 seconds
Estimated value of Pi is 3.200

Now I’m a bit sceptical about the results, time for a undo and then a rerun to see if I can replicate the error and/or successful results.

4th, 5th and 6th attempt - approx. 15MB test file
test 4 - with $YARN_HOME, etc setup
Job Finished in 392.751 seconds
Estimated value of Pi is 3.200
test 5 - without
Job Finished in 361.656 seconds
File does not exist: hdfs://localhost:9000/user/hduser/QuasiMonteCarlo_.../out/reduce-out
test 6 - with $YARN_HOME, etc setup
Job Finished in 370.214 seconds
Estimated value of Pi is 3.200
Notes from two more tests (7 and 8)
It seems that Hadoop is always looking for the /reduce-out directory. The 2nd attempt after a "file does not exist" always results in a successful job...of note, wordCount and wordMean always work on the 1st attempt.
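
My working theory (and it's only that) is that failed pi runs leave their temporary QuasiMonteCarlo_... working directories behind in HDFS, which then trips up the next attempt. Clearing them out between runs seemed like a sensible habit; something like this should do it (the rm will simply complain if there's nothing to delete):

Clean up leftover pi job directories between attempts
1. $hdfs dfs -ls /user/hduser
2. $hdfs dfs -rm -r /user/hduser/QuasiMonteCarlo*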

I’m now somewhat satisfied with things, particularly because wordCount and wordMean are always successful. Calculating pi seems to be a hit and miss but I think things are “good enough” for now. Time to move on.

Hadoop on the browser…

can be found on port 50070 by default (the nameNode / HDFS web UI), and all applications running on the cluster can be found on port 8088 (the YARN ResourceManager UI). I found these two pages extremely helpful; browsing the logs and the file system is quite simple from the browser.

Default ports for Hadoop and its application cluster
http://node1:50070
http://node1:8088

That’s it! I put together a single node Hadoop installation on a Raspberry Pi. For the most part it all works, but it’s not a cluster just yet and having a single node version of Hadoop defeats the purpose of using Hadoop. So time to start a cluster.

Backup and reuse…

your SD Card to make things easier when setting up the next two Raspberry Pis. I jumped straight into this and ended up with a little extra manual work, so you should do some prep work before cloning / backing up your card.

Shit happens — I formatted the nameNode and dataNode trying to start fresh, since I didn't want the clones to have test data on them. Problem is, when you format the nameNode, things stop working. To the best of my knowledge, there's a clusterID that needs to be kept in sync between the nameNode and the dataNodes; what you'll notice is that after you format the nameNode and try to restart the services, the dataNode service will never come up. Here is a thread on SO talking about the error message I received: "There are 0 datanode(s) running and no node(s) are excluded in this operation".

The solution I used, because I couldn't figure out how to update the dataNode clusterID, was to delete the hadoop_data folder, re-create it, add back the permissions and start up the services again. A quick and easy solution actually — though it took me at least 6hrs before deciding to do it, or rather deciding that I didn't know how to fix the clusterID.
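
From what I've read since, the clusterIDs live in a VERSION file inside each storage directory's current folder. If you'd rather try syncing them instead of wiping the data folder, comparing these two files is where I'd start (I haven't gone back and tested that route myself):

Where the clusterIDs live
1. $cat /opt/hadoop/hadoop_data/hdfs/namenode/current/VERSION
2. $cat /opt/hadoop/hadoop_data/hdfs/datanode/current/VERSION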

Back to cloning — Delete the hadoop storage directory and then update the hosts file as shown below.

Delete hdfs storage, add permissions and repeat for all nodes
1. $sudo rm -rf /opt/hadoop/hadoop_data
2. $sudo mkdir -p /opt/hadoop/hadoop_data/hdfs/namenode (not required for nodes 2 and 3)
3. $sudo mkdir -p /opt/hadoop/hadoop_data/hdfs/datanode
4. $sudo chown hduser:hadoop /opt/hadoop/hadoop_data/hdfs -R
5. $sudo chmod 750 /opt/hadoop/hadoop_data/hdfs
Edit the etc/hosts file
$sudo nano /etc/hosts
192.168.0.107 node1
192.168.0.108 node2
192.168.0.109 node3

Now it’s time to clone the cards. There’s plenty of good instructions out there on cloning SD Cards properly but here’s the one I used, it’s pretty straight forward: how to create a clone/backup of your Raspberry Pi SD Card from a Mac (via terminal).

Once I finished cloning my 2 SD Cards, I plugged everything in and had all 3 Pis on the network. Now it's time to configure things once again, so I logged into each one and went through the steps below…

Check /etc/hosts, which should already be set up with all 3 nodes from part 1 of this series
$sudo nano /etc/hosts
Update ip and hostname for the correct node you're on
$sudo nano /etc/network/interfaces
$sudo nano /etc/hostname
Update node1 slaves file
$sudo nano /opt/hadoop/etc/hadoop/slaves
add: node1, node2 and node3 each on a separate line
Test the SSH connection between nodes without requiring passwords. You may need to copy things around.
Example - from Node1
$su hduser
$ssh node2
If you need to enter a password, you'll need to copy the key over. To copy the key (there's an easier way, shown after these steps, but this was my process)...
$cat ~/.ssh/id_rsa.pub
$ssh node2
$nano ~/.ssh/authorized_keys
copy+paste from node1 key into node2 authorized_keys file
The authorized_keys file on each node should end up with all 3 keys in it. Make sure you can SSH into every node without entering a password:
$su hduser
$ssh node1
$exit
$ssh node2
$exit
$ssh node3
$exit
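
The easier way I mentioned above is ssh-copy-id, which appends your public key to the remote authorized_keys file for you. Something like this from each node should do the same job as my manual copy + paste (you'll be asked for the hduser password once per target):

Shortcut for copying SSH keys between nodes
$ssh-copy-id hduser@node2
$ssh-copy-id hduser@node3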

Finally, everything is working, a mere 1.5 hrs later (please don't ask why this took so long; nothing seems to work for me the first time). At this point, I think it's a good time to be thankful that all of my nodes are talking to each other and take a break before starting up the services. Next step (Part 3): getting the Hadoop services up and running across the cluster, testing to make sure things work, aka that data processing is actually distributed across the nodes, and then a little optimizing / tuning.


Questions? Comments? Let me know. Next up, Hadoop cluster