Wednesday, December 10, 2014

Installing Ambari from scratch using Vagrant. Step by Step


Being a developer, it is a pain to configure Hadoop every time.
I wanted a process that is easily reproducible.
So, at a high level, the steps involved are:
# FQDN
# Passwordless SSH, disable SELinux, turn iptables off
# NTP
# yum -y install ntp
# wget http://public-repo-1.hortonworks.com/ambari/centos6/1.x/updates/1.5.1/ambari.repo
# cp ambari.repo /etc/yum.repos.d
# yum install ambari-server
# ambari-server setup
# ambari-server start
# Open Ambari in the browser (username: admin, password: admin)
# If something goes wrong:
# 'yum erase ambari-agent' and 'yum erase ambari-server'

Make sure you have Vagrant installed.
Config: 3 nodes, 3.4 GB RAM, 1 core each
1 master, 2 slaves
Create a Vagrantfile (make sure you have more than 10 GB of RAM on the host):
Vagrant::Config.run do |config|
  config.vm.box = "centos65"
  config.vm.customize [
    "modifyvm", :id,
    "--memory", "3427"
  ]
  config.vm.define :hadoop1 do |hadoop1_config|
    hadoop1_config.vm.network :hostonly, "10.10.0.53"
    hadoop1_config.vm.host_name = "hdp.hadoop1.com"
  end
  config.vm.define :hadoop2 do |hadoop2_config|
    hadoop2_config.vm.network :hostonly, "10.10.0.54"
    hadoop2_config.vm.host_name = "hdp.hadoop2.com"
  end
  config.vm.define :hadoop3 do |hadoop3_config|
    hadoop3_config.vm.network :hostonly, "10.10.0.55"
    hadoop3_config.vm.host_name = "hdp.hadoop3.com"
  end
end

After creating this file, all you need to do is run 'vagrant up'.
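For example (up, status and ssh are standard Vagrant commands):

vagrant up            # boots all three VMs
vagrant status        # hadoop1, hadoop2 and hadoop3 should show as 'running'
vagrant ssh hadoop1   # log in to the node that will become the master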

Step 1: FQDN
http://en.wikipedia.org/wiki/Fully_qualified_domain_name
Edit the hosts file (vi /etc/hosts)

and prepend the following entries:
10.10.0.53   hdp.hadoop1.com hadoop1
10.10.0.54   hdp.hadoop2.com hadoop2
10.10.0.55   hdp.hadoop3.com hadoop3

(on all three nodes)
You should be able to SSH using the domain names, and 'hostname -f' should return the FQDN for that machine.
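A quick sanity check, for example from hadoop1 (hostnames as configured above):

hostname -f                           # should print hdp.hadoop1.com
ping -c 1 hdp.hadoop2.com             # should resolve to 10.10.0.54
ssh root@hdp.hadoop2.com hostname -f  # should print hdp.hadoop2.com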

Install NTP (ntpd) on all nodes.
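On CentOS 6 this is roughly (a sketch, assuming the stock ntp package and init scripts):

yum -y install ntp
chkconfig ntpd on
service ntpd start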

Use the script below to prepare your cluster. It reads the node list from hosts.txt:
./prepare-cluster.sh hosts.txt
hdp.hadoop1.com
hdp.hadoop2.com
hdp.hadoop3.com

#!/bin/bash
set -x
# Generate SSH keys
ssh-keygen -t rsa
cd ~/.ssh
cat id_rsa.pub >> authorized_keys
cd ~
# Distribute SSH keys
for host in `cat hosts.txt`; do
  cat ~/.ssh/id_rsa.pub | ssh root@$host "mkdir -p ~/.ssh; cat >> ~/.ssh/authorized_keys"
  cat ~/.ssh/id_rsa | ssh root@$host "cat > ~/.ssh/id_rsa; chmod 400 ~/.ssh/id_rsa"
  cat ~/.ssh/id_rsa.pub | ssh root@$host "cat > ~/.ssh/id_rsa.pub"
done
# Distribute hosts file
for host in `cat hosts.txt`; do
  scp /etc/hosts root@$host:/etc/hosts
done
# Prepare other basic things
for host in `cat hosts.txt`; do
  ssh root@$host "sed -i s/SELINUX=enforcing/SELINUX=disabled/g /etc/selinux/config"
  ssh root@$host "chkconfig iptables off"
  ssh root@$host "/etc/init.d/iptables stop"
  echo "enabled=0" | ssh root@$host "cat > /etc/yum/pluginconf.d/refresh-packagekit.conf"
done
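After the script finishes, every node should be reachable without a password prompt; a quick check reusing the same hosts.txt:

for host in `cat hosts.txt`; do ssh root@$host hostname -f; done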

And voila. Just run the commands below.
# wget http://public-repo-1.hortonworks.com/ambari/centos6/1.x/updates/1.5.1/ambari.repo
# cp ambari.repo /etc/yum.repos.d
# yum install ambari-server
# ambari-server setup
# ambari-server start
# Open Ambari in the browser (username: admin, password: admin)

PS: Make sure you have added entries for the hosts on the Windows side as well. Remember to check that the hostnames are pingable from Windows.
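On Windows the hosts file usually lives at C:\Windows\System32\drivers\etc\hosts (edit it as Administrator); the entries mirror the ones above:

10.10.0.53   hdp.hadoop1.com hadoop1
10.10.0.54   hdp.hadoop2.com hadoop2
10.10.0.55   hdp.hadoop3.com hadoop3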

ping hdp.hadoop1.com

URL: http://hdp.hadoop1.com:8080

Tuesday, November 11, 2014

MapR SingleNode in Ubuntu

Creating a single-node instance of MapR in Ubuntu.


# Navigate to this folder
cd /etc/apt
# Edit the sources.list file and add the MapR repositories to it
sudo vi sources.list
# Add the MapR repo entries below
deb http://package.mapr.com/releases/v2.1.2/ubuntu/ mapr optional
deb http://package.mapr.com/releases/ecosystem/ubuntu binary/
# Update the repo metadata
sudo apt-get update
# Install MapR Hadoop (single-node package)
sudo apt-get install mapr-single-node
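Once the install finishes, a quick sanity check; this is only a sketch, and the mapr-warden service name is an assumption based on a typical MapR layout:

dpkg -l | grep mapr              # lists the MapR packages that got installed
sudo service mapr-warden status  # warden manages the MapR services on the node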

Saturday, November 1, 2014

Using Pig to Load and Store data from HBase

Let's first store data from HDFS into our HBase table. For this we will be using:

org.apache.pig.backend.hadoop.hbase
Class HBaseStorage

public HBaseStorage(String columnList) throws org.apache.commons.cli.ParseException, IOException


Warning: Make sure that your PIG_CLASSPATH refers to all the library files of HBase, Hadoop and ZooKeeper. Doing this will save you countless hours of debugging.

Let's create an HBase table named testtable for the data given below.

Make sure that your first column is the ROWKEY when inserting into the HBase table.

1|Krishna|23
2|Madhuri|37
3|Kalyan|54
4|Shobhana|50


Let's create a table for this data in HBase.

>> cd $HBASE_HOME/bin
>> ./hbase shell

This will take you to your HBase shell

>> create 'testtable','cf'
>> list 'testtable'
>> scan 'testtable'
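You can also confirm the column family with describe (a standard HBase shell command):

>> describe 'testtable'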

Now let's fire up the Grunt shell.

Type the following commands into the Grunt shell:
-- Load the test data into a relation
A = LOAD '/home/biadmin/testdata' USING PigStorage('|') AS (id,name,age);
-- Cast everything to chararray before handing it to HBaseStorage
B = FOREACH A GENERATE (chararray)$0, (chararray)$1, (chararray)$2;
-- The column list can be delimited by either space or comma; the first field becomes the row key
STORE B INTO 'hbase://testtable' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf:name cf:age');
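To go the other way and load from HBase back into Pig, here is a minimal sketch using the same table and column family; the '-loadKey true' option asks HBaseStorage to return the row key as the first field:

D = LOAD 'hbase://testtable' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf:name cf:age', '-loadKey true') AS (id:chararray, name:chararray, age:chararray);
dump D;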

And ta-da!







Pig Casting and Schema Management

Pig is quite flexible when schemas need to be manipulated.

Consider this data set

a,1,55,M,IND
b,2,55,M,US
c,3,56,F,GER
d,4,57,F,AUS


Suppose we needed to define a schema after some processing; we could cast the columns to their data types.

-- Load
A = load 'input' using PigStorage(',');
-- This will generate all columns after the first one
B = foreach A generate $1..;
-- Cast each column to its data type
C = FOREACH A generate (chararray)$0,(int)$1,(int)$2,(chararray)$3,(chararray)$4;
dump C;
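To see what the casts produced, DESCRIBE prints the schema Pig now carries; naming the fields with AS (the field names here are just guesses at what the columns mean) makes later references cleaner:

C = FOREACH A GENERATE (chararray)$0 AS id, (int)$1 AS num, (int)$2 AS age, (chararray)$3 AS gender, (chararray)$4 AS country;
describe C;
-- should print something like: C: {id: chararray, num: int, age: int, gender: chararray, country: chararray}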


That's all for today, folks.

Cheers!

Friday, October 31, 2014

Sum in Pig

This is a simple Pig script I wrote to understand the concepts of Pig.
Well, here it goes.

Today we will see how to do a simple sum operation in Pig.

Consider this as my input data
1
2
3
4
1
1
1

The first example - sum
The second example - group and sum

-- Plain sum
A = load 'input' as (number:int);
B = group A all;
C = foreach B generate SUM($1);
dump C;
-- Group and sum
A = load 'input' as (number:int);
B = group A by $0;
C = foreach B generate SUM($1);
dump C;
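A small variant (just a sketch) that also emits the group key next to each sum, which makes the grouped output easier to read; for the sample input above it should give (1,4), (2,2), (3,3) and (4,4):

A = load 'input' as (number:int);
B = group A by number;
C = foreach B generate group, SUM(A.number);
dump C;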

Wednesday, October 29, 2014

Installing R with RStudio in Ubuntu

Hello people,
let's start the installation by opening your terminal first

and typing in the following commands.

sudo apt-get install r-base
sudo apt-get install gdebi-core
sudo apt-get install libapparmor1
wget http://download2.rstudio.org/rstudio-server-0.98.1085-amd64.deb
sudo gdebi rstudio-server-0.98.1085-amd64.deb

And here you go with a brand new RStudio.
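The .deb install starts the server for you; a quick check (verify-installation is part of the rstudio-server CLI, and 8787 is RStudio Server's default port):

sudo rstudio-server verify-installation
# then browse to http://localhost:8787 and log in with a local user account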


Apache Pig Day 1

Hello people,
I am increasingly spending more time working on Pig (thank god).
This experience has been very valuable as it has increased my knowledge of data.
Pig is an open source project.
The language for developing Pig scripts is called Pig Latin.
It's easier to write a Pig script than MapReduce code (it saves the developer time).
Pig is a dataflow language, which means it lets users describe how data flows through a series of transformations.
Pig Latin scripts describe a DAG (Directed Acyclic Graph).
It was developed at Yahoo!.
The aim of this language is to find a sweet spot between SQL and shell scripts.
To sum it up, Pig lets you build data pipelines and run ETL workloads; a tiny example is sketched below.
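As a small illustration of the dataflow idea, here is a minimal sketch (the file name 'users' and its columns are made up for the example) where each statement adds one node to the DAG:

users  = LOAD 'users' USING PigStorage(',') AS (name:chararray, age:int);
adults = FILTER users BY age >= 18;
byage  = GROUP adults BY age;
counts = FOREACH byage GENERATE group AS age, COUNT(adults) AS n;
STORE counts INTO 'age_counts';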
 

Friday, July 25, 2014

TO DO List

This list is in order of importance...
a) Data Structures and Algorithms
b) Machine Learning
c) Dev Ops (Linux)
d) Start a youtube channel
e) Spring Framework Learn

I guess this will take a lifetime to master.

Currently preparing for Hadoop Admin Certification.

Monday, June 16, 2014

Pig

I will be posting a series of posts on Pig.
I personally feel amazed at the simplicity and the power of this language.


Pig Cheat Sheet:

http://mortar-public-site-content.s3-website-us-east-1.amazonaws.com/Mortar-Pig-Cheat-Sheet.pdf








Cheers!
Krishna

Tuesday, April 29, 2014

Crawling - Scrapy

What is Scrapy?

Scrapy is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.


Installation

OS Ubuntu

Installing Dependencies 
sudo apt-get install build-essential libssl-dev libffi-dev python-dev

Install scrapy
sudo pip install Scrapy

The above commands will install Scrapy.
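A quick way to confirm the install worked (scrapy version is part of the standard CLI):

scrapy version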


Monday, April 28, 2014

Algorithms

I am taking the Algorithms course on Coursera.

Wednesday, April 23, 2014

Python week 4

Week 4 of the Coursera course.
Been busy with odd jobs and just not able to finish the assignments and be done with them.
I hope to complete all video lectures and assignments today.


Sunday, April 20, 2014

Julia Meetup

I had an amazing experience organizing the first Julia meetup at Inmobi with Abhijit and Kiran.
Gave my first formal open source talk and it felt great.
Link to my slides - http://www.slideshare.net/KrishnaKalyan3/julia-meetup-bangalore




Friday, April 18, 2014

Distributed Cache - Pig

I had been trying to use the Distributed Cache in Pig.
After a lot of trial and error, behold: SUCCESS!
Let's get to the meat.

package UDF;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map.Entry;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.fs.Path;
import org.apache.pig.EvalFunc;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.util.WrappedIOException;

public class Regex extends EvalFunc<String> {

    static HashMap<String, String> map = new HashMap<String, String>();

    // Tell Pig which HDFS file to ship via the distributed cache; it becomes
    // available locally under the symlink name given after the '#'
    public List<String> getCacheFiles() {
        Path lookup_file = new Path(
                "hdfs://localhost.localdomain:8020/user/cloudera/top");
        List<String> list = new ArrayList<String>(1);
        list.add(lookup_file + "#id_lookup");
        return list;
    }

    // Read the cached lookup file into the in-memory map
    public void VectorizeData() throws IOException {
        FileReader fr = new FileReader("./id_lookup");
        BufferedReader brd = new BufferedReader(fr);
        String line;
        while ((line = brd.readLine()) != null) {
            String str[] = line.split("#");
            map.put(str[0], str[1]);
        }
        fr.close();
    }

    // Return the first lookup value whose pattern matches the tweet
    private String Regex(String tweet) throws ExecException {
        for (Entry<String, String> entry : map.entrySet()) {
            Pattern r = Pattern.compile(map.get(entry.getKey()));
            Matcher m = r.matcher(tweet);
            if (m.find()) {
                return entry.getValue();
            }
        }
        return null;
    }

    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() < 1 || input.get(0) == null)
            return null;
        try {
            VectorizeData();
            String str = (String) input.get(0);
            return Regex(str);
        } catch (Exception e) {
            throw WrappedIOException.wrap(
                    "Caught exception processing input row ", e);
        }
    }
}
Let's go through the steps:
a) Create an Eval UDF.
b) Register the lookup file with the distributed cache by overriding getCacheFiles().
c) Initialize the data structure from the cached file (step b).
d) Finally, apply your logic to the data; a usage sketch from the Pig side follows below.
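A sketch of how this UDF might be called from a Pig script; the jar name and input path are assumptions:

REGISTER myudfs.jar;
tweets = LOAD 'tweets' AS (text:chararray);
tagged = FOREACH tweets GENERATE text, UDF.Regex(text) AS matched_id;
dump tagged;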

Saturday, April 12, 2014

Python Week 3

Week 3 was easy. I also managed to score a whopping 92% in the test.
I am enjoying the mini assignments. Hope to complete everything.

Cheers!
Krishna

Tuesday, April 8, 2014

Python Week 2

I completed the mini project; however, I forgot to take my weekly quiz :( . I was mad at myself for this. After some long research I found that I would be losing around ~2% from my final score.

:(

Wednesday, March 26, 2014

Learning Python



Just for the record, I am a big fan of Python. The main reason is my frustration with Java.
I am taking the Interactive Python course on Coursera.
Just finished week 0.
Wish me luck. I want to complete at least one MOOC fully.
Will post weekly updates on how it goes.


HBase

I have always been excited by the NoSQL hype. End result: an HBase certification.
I took up the Cloudera certification.

My thoughts:
It was a good investment of time and money.
It really exposes you to the BigData/NoSQL space.
It improved my overall understanding of the NoSQL BigData ecosystem.

Now off to prepare for the Cloudera Admin Program.

Cheers!! 

Wednesday, February 26, 2014

Using Sublime Text - JULIA (Ubuntu)

Installing Sublime Text 3

sudo add-apt-repository ppa:webupd8team/sublime-text-3
sudo apt-get update
sudo apt-get install sublime-text-installer


Run Julia
julia
Pkg.add("ZMQ")
Pkg.add("IJulia")


And then follow the steps in this Site:
https://github.com/karbarcca/Sublime-IJulia