Hi Folks,
Sorry for not writing a blog post for a long time. I have been busy :).
To begin with, I would like to declare that I am a MOOC-aholic: I have completed over 10 MOOCs since 2013. I strongly feel that we have reached a point where we need to teach ourselves how to learn, and MOOCs are one of the best ways to do it.
There is a hunger... well, a different kind of hunger, or one might even say a thirst, for knowledge. My aims this year are:
a) Contribute to Open Source
b) Become an expert in at least one programming language
My rating: 4/5
The course was easy enough, and the concepts were explained well. What I really liked about this course was the pace: not too dull and not too fast or difficult. I would recommend this course to anyone who wants to get started with AI/ML.
It's a good overview; however, the mathematical depth was missing.
I am looking forward to the advanced version of this course.
After breaking my head on several videos and blogs, I finally got it right.
To install Hadoop 2.x on a Mac, I would recommend you have:
a) Java installed (if Java is not installed, install the Java JDK)
b) Password-less SSH
(Check by typing the command below.)
ssh localhost
If password-less SSH is not enabled, make sure Remote Login under System Preferences -> Sharing is checked to enable SSH.
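If you still get prompted for a password, generate a key pair with an empty passphrase and authorize it; a minimal sketch, assuming the default key location:

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys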
(Make sure you are able to SSH without a password, and that the passphrase is blank when you generate keys.) For this tutorial we will configure Hadoop in pseudo-distributed mode -- http://hadoop.apache.org/docs/r2.7.0/hadoop-project-dist/hadoop-common/SingleCluster.html#Pseudo-Distributed_Operation. Download Hadoop, then extract it:
tar -xvzf ~/Downloads/hadoop-2.7.0.tar.gz
Then edit the following in the configuration files:
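A minimal sketch of those edits, following the pseudo-distributed settings from the Apache docs linked above (paths are relative to the extracted Hadoop directory):

etc/hadoop/core-site.xml:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

etc/hadoop/hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

Also set JAVA_HOME in etc/hadoop/hadoop-env.sh if it is not picked up automatically. Then format the namenode and start HDFS:

bin/hdfs namenode -format
sbin/start-dfs.sh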
As a developer, it is a pain to configure Hadoop every time.
I wanted a process that is easily reproducible.
So, at a high level, the steps involved are: bring up the nodes with Vagrant, set up FQDNs, install NTP, and run a prepare-cluster script (each step is covered below).
Make sure you have Vagrant installed.
Config: 3 nodes, 3.4GB RAM and 1 core each; 1 master, 2 slaves.
Create a Vagrantfile (make sure your host machine has more than 10GB of RAM):
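A minimal sketch of such a Vagrantfile; the box name is an assumption, while the hostnames and IPs match the hosts entries used later in this post:

Vagrant.configure("2") do |config|
  nodes = {
    "hadoop1" => "10.10.0.53",   # master
    "hadoop2" => "10.10.0.54",   # slave
    "hadoop3" => "10.10.0.55",   # slave
  }
  nodes.each do |name, ip|
    config.vm.define name do |node|
      node.vm.box = "ubuntu/trusty64"      # assumption: any 64-bit Linux box works
      node.vm.hostname = "hdp.#{name}.com"
      node.vm.network "private_network", ip: ip
      node.vm.provider "virtualbox" do |vb|
        vb.memory = 3400                   # ~3.4GB per node
        vb.cpus = 1
      end
    end
  end
end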
After creating this file, all you need to do is run 'vagrant up'.
Step 1: FQDN
http://en.wikipedia.org/wiki/Fully_qualified_domain_name
Edit /etc/hosts (vi /etc/hosts) and prepend the following entries:
10.10.0.53 hdp.hadoop1.com hadoop1
10.10.0.54 hdp.hadoop2.com hadoop2
10.10.0.55 hdp.hadoop3.com hadoop3
(Do this on all three nodes.)
You should now be able to SSH between nodes using these domain names, and 'hostname -f' should return the FQDN of each machine.
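For example, from any node:

ssh hadoop2 'hostname -f'   # should print hdp.hadoop2.com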
Step 2: Install NTP (the ntpd daemon) on all nodes so their clocks stay in sync.
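Assuming Ubuntu guests, a minimal sketch:

sudo apt-get update
sudo apt-get install -y ntp
sudo service ntp start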
Use the script below to prepare your cluster:
./prepare-cluster.sh hosts.txt
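A hypothetical sketch of what such a prepare-cluster.sh can look like, assuming hosts.txt lists one hostname per line and SSH access as the vagrant user:

#!/usr/bin/env bash
# Push the shared /etc/hosts to every node and install NTP there.
# Usage: ./prepare-cluster.sh hosts.txt
while read -r host; do
  scp /etc/hosts "vagrant@${host}:/tmp/hosts"
  ssh "vagrant@${host}" "sudo mv /tmp/hosts /etc/hosts && sudo apt-get update && sudo apt-get install -y ntp"
done < "$1"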
Creating a single-node instance of MapR on Ubuntu.
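As a hedged sketch only (check the repository URL and the package list against MapR's install docs for your version):

# Add MapR's apt repository per their docs, then install the core
# services for a single node:
sudo apt-get install mapr-cldb mapr-fileserver mapr-zookeeper mapr-webserver
# Point the node at itself for both CLDB and ZooKeeper:
sudo /opt/mapr/server/configure.sh -C $(hostname -f) -Z $(hostname -f)
# Start the services:
sudo service mapr-zookeeper start
sudo service mapr-warden start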
Let's first store data from HDFS into our HBase table. For this we will be using org.apache.pig.backend.hadoop.hbase.HBaseStorage:
public HBaseStorage(String columnList) throws org.apache.commons.cli.ParseException, IOException
Warning: make sure that your PIG_CLASSPATH refers to all the library files in HBase, Hadoop, and ZooKeeper. Doing this will save you countless hours of debugging.
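A sketch, assuming the usual *_HOME environment variables point at your installs:

export PIG_CLASSPATH="$HBASE_HOME/lib/*:$HADOOP_HOME/etc/hadoop:$ZOOKEEPER_HOME/*"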
Let's create an HBase table named testtable for the data given below.
Make sure that your first column is the ROWKEY when doing an insert into an HBase table.
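The exact rows don't matter; a hypothetical sample (first column is the row key), stored on HDFS as, say, /data/testtable.csv:

row1,alice,10
row2,bob,20
row3,carol,30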
>> create 'testtable','cf'
>> list 'testtable'
>> scan 'testtable'
Now let's fire up the grunt shell and type in the following commands:
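A sketch against the hypothetical file above; note that the first field becomes the row key and is not listed in the column list:

raw = LOAD '/data/testtable.csv' USING PigStorage(',') AS (rowkey:chararray, name:chararray, score:chararray);
STORE raw INTO 'hbase://testtable' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf:name cf:score');

Run scan 'testtable' in the HBase shell again to verify the rows landed.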
Pig is quite flexible when a schema needs to be manipulated.
Consider this data set:
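A hypothetical sample (comma-separated, no header):

alice,23,london
bob,31,paris
carol,27,berlin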
Suppose we needed to define the schema only after some processing; we could load without one and cast the columns to their data types later:
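A minimal sketch of the pattern, using the made-up fields above:

data = LOAD '/data/people.csv' USING PigStorage(',');  -- no schema declared yet
typed = FOREACH data GENERATE (chararray)$0 AS name, (int)$1 AS age, (chararray)$2 AS city;
DESCRIBE typed;  -- typed: {name: chararray, age: int, city: chararray}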
This is a simple Pig script I wrote to understand the concepts of Pig.
Well, here it goes:
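A representative sketch of that kind of starter script (made-up data and fields):

users = LOAD '/data/users.csv' USING PigStorage(',') AS (name:chararray, age:int);
adults = FILTER users BY age >= 18;
by_name = GROUP adults BY name;
counts = FOREACH by_name GENERATE group AS name, COUNT(adults) AS n;
DUMP counts;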
Today we will see how to do a simple sum operation in Pig.
Consider this as my input data:
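A hypothetical sample, as name,amount pairs (say, /data/amounts.csv):

alice,10
bob,20
alice,30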
The first example: sum.
The second example: group and sum.
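A sketch over the sample above:

data = LOAD '/data/amounts.csv' USING PigStorage(',') AS (name:chararray, amount:int);

-- Example 1: sum over the whole relation
all_rows = GROUP data ALL;
total = FOREACH all_rows GENERATE SUM(data.amount);
DUMP total;      -- (60)

-- Example 2: group by name, then sum per group
by_name = GROUP data BY name;
per_name = FOREACH by_name GENERATE group AS name, SUM(data.amount) AS total;
DUMP per_name;   -- (alice,40) (bob,20)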
Hello people, I am increasingly spending more time working on Pig (thank god).
This experience has been very valuable, as it has increased my knowledge of data.
Pig is an open source project.
The language for developing Pig scripts is called Pig Latin.
It is easier to write a Pig script than MapReduce code (it saves the developer time).
Pig is a dataflow language, which means it lets users describe how data is read, transformed, and stored.
Pig Latin scripts describe a DAG (Directed Acyclic Graph) of operations.
It was developed at Yahoo!.
The aim of the language is to find a sweet spot between SQL and shell scripts.
To sum it up, Pig lets you build data pipelines and run ETL workloads.
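For example, a tiny pipeline with made-up log fields; each statement adds a node to the DAG, and nothing executes until the STORE:

logs = LOAD '/logs/access.log' USING PigStorage(' ') AS (ip:chararray, url:chararray);
hits = FILTER logs BY url MATCHES '/products.*';
by_ip = GROUP hits BY ip;
counts = FOREACH by_ip GENERATE group AS ip, COUNT(hits) AS n;
STORE counts INTO '/output/ip_counts';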
These are in order of importance...
a) Data Structures and Algorithms
b) Machine Learning
c) DevOps (Linux)
d) Start a YouTube channel
e) Learn the Spring Framework
I guess this will take a lifetime to master.
I am currently preparing for the Hadoop Admin Certification.