How to Hadoop at home with Raspberry Pi

How to Hadoop at home with Raspberry Pi - Part 1

25 Mar 2016

If you’re new to Raspberry PI and Hadoop and want to build a Hadoop cluster on Raspberry PI, this is definitely the place for you. Together we’re going to learn “Distributed Computing and Big Data Processing” and end up with our very own Raspberry Pi Hadoop Cluster.

This is not a tutorial. Think of it more as a journey, there’s no nice step-by-step process here, I’m going to make mistakes, get errors, fix them and try to move on.

Why…

build a Raspberry Pi Hadoop Cluster? A couple of reasons. I’m actually interested in it, surprise surprise. Plus, I’m a nerd and a techie with a software development background. I also think Data Engineer and Data Analysis are pretty cool and interesting careers, so how better to get to know them than by doing it???

In the end, I’m hoping to have some fun learning about Raspberry Pi, Hadoop and all…most…some of its components. I’m also taking Udacity’s Nanodegree in Data Analytics so I think it’ll be a pretty cool experience trying to learn Data Analytics along with Data Engineering my own Hadoop Cluster.

How…

will I get things done? From scratch and the internet, my dear friend. So far I’ve found some amazing articles and tutorials online for Hadoop, Raspberry Pi and Hadoop on Raspberry Pi. I wont be blindly following any one article but I will definitely not be doing it alone, so any material I reference, I’ll provide the link to inline.

What…

do I already know? Nothing. Well, a little more than nothing. I’m only a couple of weeks into my Udacity program, I know only the basics of what Hadoop is and I’ve never touched a Raspberry Pi before. So that’s where I’ll be starting from. And now it’s time to start…

Raspberry Pi and Hadoop

Let us first have a very brief paragraph on the main hardware and software I’ll be using. It’s going to be brief because 1) I don’t know what I’m talking about and 2) there are plenty of very long articles out there that go into great details, so why repeat them here. So here are the highlights:

Raspberry Pi…

is a very small, and inexpensive, pocket size computer
Model B has a quad core ARM running at 900MHz with 1GB of RAM

Hadoop…

is a framework that allows for distributed computing
main components are HDFS (data storage), YARN (resource management) and MapReduce (data processing)
redundantly stores blocks (pieces of data) across servers
copies the code/query to the data instead of moving the data to the code

Raspberry Pi + Hadoop Cluster…

will comprise of one NameNode and two DataNodes
is not production ready, in case you’re wondering

Setting the Stage

I need to buy some hardware and get some free software. My aim here is to just buy some stuff in as simple and painless a process as possible. You can probably get cheaper and cooler elsewhere.

Amazon+BestBuy:

3x Raspberry Pi 2 Model B(starter kit) = CAD$210
1x 5 port switch = CAD$34
1x mini power bar (already own)
1x wireless mouse and keyboard (already own)

Kit contained: Raspberry Pi, 8GB SD Class 10, Wi-Fi dongle, and a few other things we won’t need for this project.

Update: I decided not to actually use a switch because I already have a wireless network setup and all 3 of my pi’s have wifi dongles

Software:

Hadoop v2.7.2 (just plain vanilla from Apache)
Raspbian (from the NOOB SD Card included in the hardware kit)

Update: Raspbian Wheezy was on the NOOB but I ran into some newbie like issues and then discovered Raspbian Jessie was available so I made the switch. https://www.raspberrypi.org/documentation/installation/installing-images/mac.md

Installing Raspbian…

is ridiculously easy. To start off with I just wanted to get one Raspberry up and running before jumping into anything with Hadoop. So, my setup is:

Raspberry Pi
WiFi USB dongle
HDMI display
SD Card with NOOB (updated, reformatted card)
Wireless mouse and keyboard with USB dongle
Power supply

With all that done, I’ve plugged in Raspberry, it started up with OS install dialog, I selected Raspbian and now the long install process begins.

Update: With my reformatted Jessie SD Card, things are a bit different but you can follow along if you have the NOOB SD Card.

Raspi-config…

pops up once the install is completed. Here we would normally do some simple configuration stuff but because this will also be my nameNode server I’ll also do some more specific changes for that purpose.

Numbers don't map to the rasp-config, just the order
1. Expand Filesystem - No. Using NOOB 
2. Enable Boot to Desktop/Scratch/CLI - Going with Desktop and terminal
4. Internationalization Options - You'll want to change timezone
5. Overclock - After reading the warning I'll go with the Pi2 setting
6. Advanced: Hostname - node1 (will be my NameNode)
7. Advanced: Memory split: 32MB (memory made available to the GPU)
8. Advanced: SSH (will be used later)

Update: this was the original setup with Wheezy but Jessie does away with the Raspi-config, and replaces it with the GUI after you login.

Once that’s all done. Reboot! Raspberry Pi is now up and running. The next two steps are network configurations and Java environment, which look pretty simple so I’ll do that now before calling it a day aka Part 1.

Network configuring and Java…

gave me 10mins of confusion but in the end my /etc/network/interfaces file looks like this:

Use this to find your raspberry's ip address
$hostname -I

Use sudo or root to edit the interfaces file and save
$sudo nano /etc/network/interfaces

#these values are specific to me, use your own from above
iface eth0 inet static
address 192.168.0.107 
netmask 255.255.255.0
gateway 192.168.0.1

Some tutorials talked about checking my /etc/resolv.conf but this has something to do with a “nameserver / DNS” which I’m pretty sure doesn’t apply to my situation (Note-to-self: if something breaks come back here). Now, some tutorials also say reboot and then check for Java which should be pre-installed with NOOB. But why reboot first? So, last step, check for Java

From the terminal: $java -version

Thankfully, my system has Java installed (Java version 1.8.0 was printed on the screen). And I say thankfully because I have no idea what to do if it wasn’t — well, I’d have to install it, which isn’t a big deal but still.

Also, I’m going to update my hosts file to make things a little easier when looking up each machine (once we get the other two nodes up and running).

From the terminal: $sudo nano /etc/hosts
add: 192.168.0.107 node1

From the terminal: $sudo nano /etc/hostname
replace: raspberrypi with node1

Last step, reboot and take a break. I’m done with Raspberry. Pretty straight forward so far. Next step (Part 2) getting Hadoop installed.

Questions? Comments? Let me know. Next up, Hadoop single node installation