Monday, May 2, 2016

Word2Vec with Apache Spark

I tried out Word2Vec with Apache Spark, using the first Harry Potter book as the corpus.

Some interesting results

Similarities for "Ron"
Hermione 0.8892348408699036
watch 0.8258942365646362
"and 0.7972607016563416

Similarities for "Hermione"
Ron 0.9096277952194214
"and 0.8301450610160828
Hooch 0.829563319683075

Ron and Hermione end up getting married in the last book.  

Similarities for "Voldemort"
Quirrell's 0.9311719536781311
laughing 0.9307642579078674

Similarities for "Harry"
Hermione 0.7319877743721008
George 0.7252205014228821
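
Assuming the scores above are cosine similarities between word vectors (which is what Spark's findSynonyms reports), here is a minimal Python sketch of that measure. The three toy vectors are made up for illustration and are not taken from the trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors, NOT taken from the trained model.
toy_vectors = {
    "Ron": [1.0, 2.0, 0.5],
    "Hermione": [0.9, 2.1, 0.4],
    "broom": [-1.0, 0.2, 3.0],
}

def most_similar(word, vectors):
    """Rank every other word by cosine similarity to `word`, best first."""
    target = vectors[word]
    scores = [(other, cosine_similarity(target, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

print(most_similar("Ron", toy_vectors))
```

A score close to 1 means the two vectors point in nearly the same direction, so the words tend to appear in similar contexts.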

Harry Potter Corpus

API Reference

So what is word2vec? It is a shallow neural network model. In short, it tries to predict a word from its surrounding context words, and vice versa.
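
To make the idea of predicting between a word and its surroundings more concrete, here is a minimal Python sketch of how training pairs can be generated with a sliding window. The function name, window size and sample sentence are assumptions for illustration; this is not how Spark implements it internally.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every token and its neighbours."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

# Hypothetical sample sentence, already lower-cased and tokenised.
sentence = "harry caught the golden snitch".split()
for center, context in skipgram_pairs(sentence, window=1):
    print(center, "->", context)
```

Reading each pair as "given the center word, predict the context word" gives one direction (skip-gram). Collecting all context words of a position to predict the center word gives the other direction (CBOW).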

Friday, March 18, 2016

Life and Beyond

Hi Folks,
Sorry for not writing a blog post for a long time. I have been busy :).
To begin with, I would like to declare that I am a MOOC-aholic. I have completed over 10 MOOCs since 2013. I strongly feel that we have reached a point where we need to teach ourselves how to learn. MOOCs are one of the best ways to learn.

Currently I am following 2 MOOCs:
Machine Learning: Classification
Analytics Edge

There is a hunger .... well, a different kind of hunger, or one might even say a thirst for knowledge. My aims this year are to:
a) Contribute to open source
b) Become an expert in at least one programming language

- Adios Amigo

Thursday, June 11, 2015

Data Mining with Weka

My Rating 4/5
The course was easy enough. Concepts were explained well. What I really liked about this course was that the pace was good: not too dull and not too fast or difficult. I would recommend this course to anyone who wants to get started with AI/ML.

It's a good overview; however, the mathematical depth was missing.

I am looking forward to the advanced version of this course.

Sunday, May 17, 2015

Installing Apache Zeppelin (Local Mode)

Hi Folks,
I am playing around with Apache Zeppelin today.

Installation Steps Involved
a) Install Maven
b) Install Git
c) Clone the Zeppelin Repository
d) mvn install -DskipTests

And voilà, you have Apache Zeppelin installed.

Start/Stop Zeppelin
bin/ start
bin/ stop

Wednesday, May 6, 2015

Installing Hadoop 2.x on MAC (Yosemite)

After breaking my head on several videos and blogs, I finally got it right.
To install Hadoop 2.x on a Mac, I would recommend you have
a) Java installed (if Java is not installed, install the Java JDK)
b) Password-less ssh
(check by typing the command below)

ssh localhost

If password-less ssh is not enabled, make sure Remote Login is checked under System Preferences -> Sharing to enable SSH, then run:

ssh-keygen -t rsa -P ""

cat $HOME/.ssh/ >> $HOME/.ssh/authorized_keys

ssh localhost

(Make sure you are able to ssh without a password, and leave the passphrase blank while generating the keys.)
(For this tutorial we will configure Hadoop in pseudo-distributed mode.)

Download Hadoop
Extract Hadoop

tar -xvzf ~/Downloads/hadoop-2.7.0.tar.gz

And edit the following in the configuration files

And voilà, you have a petit Hadoop cluster for yourself.

Wednesday, December 10, 2014

Installing Ambari from scratch using Vagrant. Step by Step

As a developer, it is a pain to configure Hadoop every time.
I wanted a process that is easily reproducible.
At a high level, the steps involved are:

Make sure you have vagrant installed
Config: 3 nodes, 3.4GB and 1 core each
1 master, 2 slaves
Create a Vagrantfile (make sure you have more than 10GB of RAM)

After creating this file, all you need to do is run 'vagrant up'.

Step 1: FQDN
Edit /etc/hosts (for example with vi /etc/hosts)

Prepend entries for the following hosts: hadoop1 hadoop2 hadoop3

(on all three nodes)
You should now be able to ssh using the domain names, and 'hostname -f' should return the FQDN of that machine.

Install NTP and NTPD on all nodes

Use the script below to prepare your cluster
./ hosts.txt

And voilà. Just run the commands below.

PS: Make sure you add entries for the hosts to the hosts file in Windows. Remember to check that each hostname is pingable from Windows.



Tuesday, November 11, 2014

MapR SingleNode in Ubuntu

Creating a single-node instance of MapR in Ubuntu.

Saturday, November 1, 2014

Using Pig to Load and Store data from HBase

Let's first store data from HDFS to our HBase table. For this we will be using

Class HBaseStorage

public HBaseStorage(String columnList) throws org.apache.commons.cli.ParseException, IOException

Warning: Make sure that your PIG_CLASSPATH refers to all the library files in HBase, Hadoop and ZooKeeper. Doing this will save you countless hours of debugging.

Let's create an HBase table named testtable for the data given below.

Make sure that your first column is the ROWKEY while doing an insert to HBase table.
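
A rough Python sketch of the rule above, assuming the store treats the first field of each tuple as the row key and pairs the remaining fields, in order, with the columns named in columnList. The record and the column names (cf:name, cf:age) are hypothetical, and this is an illustration of the mapping, not HBaseStorage's actual code.

```python
def to_hbase_row(record, column_list):
    """Split a tuple into (row_key, {column: value})."""
    columns = column_list.split()  # space-separated "family:qualifier" names
    row_key, values = record[0], record[1:]
    if len(values) != len(columns):
        raise ValueError("expected %d values after the row key, got %d"
                         % (len(columns), len(values)))
    return row_key, dict(zip(columns, values))

# Hypothetical record and column names, for illustration only.
row_key, cells = to_hbase_row(("row1", "Alice", 30), "cf:name cf:age")
print(row_key, cells)
```

If the row key were not the first field, every value would shift by one column, which is why the ordering matters.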

Let's create a table for this data in HBase.

>> cd $HBASE_HOME/bin
>> ./hbase shell

This will take you to your HBase shell

>> create 'testtable','cf'
>> list 'testtable'
>> scan 'testtable'

Now let's fire up the grunt shell

Type in the following commands in the grunt shell

and TaDa....

Pig Casting and Schema Management

Pig is quite flexible when a schema needs to be manipulated.

Consider this data set

Suppose we need to define the schema only after some processing. In that case we can cast the columns to their data types.
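
The original data set and Pig script are not shown here, so the following is only a Python analogue of the idea, with hypothetical rows and a hypothetical schema: fields arrive untyped, and names and types are attached afterwards. In Pig itself this is typically done with casts inside a FOREACH ... GENERATE, such as (int)$0.

```python
def cast_columns(rows, schema):
    """Apply (name, type) pairs positionally to rows of raw strings."""
    typed = []
    for row in rows:
        typed.append({name: cast(value)
                      for (name, cast), value in zip(schema, row)})
    return typed

# Hypothetical raw rows, loaded without a schema so every field is a string.
raw_rows = [["1", "apple", "0.5"], ["2", "pear", "0.75"]]
# Hypothetical schema, declared only after the processing step.
schema = [("id", int), ("name", str), ("price", float)]

print(cast_columns(raw_rows, schema))
```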

That's all for today, folks.


Friday, October 31, 2014

Sum in Pig

This is a simple Pig script I wrote to understand the concepts of Pig.
Well, here it goes.

Today we will see how to do a simple sum operation in Pig.

Consider this as my input data

The first example - sum
The second example - group and sum
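
The Pig scripts and the input data are not reproduced here, so as a stand-in, this is a minimal Python sketch of what the two operations compute. The (category, amount) records are hypothetical. In Pig, the first corresponds roughly to GROUP ALL followed by SUM, and the second to GROUP BY a key followed by SUM.

```python
from collections import defaultdict

# Hypothetical input: (category, amount) pairs, not the post's data.
records = [("fruit", 10), ("veg", 4), ("fruit", 5), ("veg", 6)]

def total(rows):
    """Sum the amount column over all rows."""
    return sum(amount for _, amount in rows)

def total_by_key(rows):
    """Sum the amount column separately for each category."""
    sums = defaultdict(int)
    for key, amount in rows:
        sums[key] += amount
    return dict(sums)

print(total(records))
print(total_by_key(records))
```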

Wednesday, October 29, 2014

Installing R with RStudio in Ubuntu

Hello people,
let's start the installation by opening your terminal first.

Then type in the following commands.

sudo apt-get install r-base
sudo apt-get install gdebi-core
sudo apt-get install libapparmor1
sudo gdebi rstudio-server-0.98.1085-amd64.deb

And here you go, with a brand new RStudio.

Apache Pig Day 1

Hello People,
I am increasingly spending more time working on Pig (thank God).
This experience has been very valuable as it has increased my knowledge of data.
Pig is an open source project.
The language for developing Pig scripts is called Pig Latin.
It's easier to write a Pig script than MapReduce code (it saves the developer time).
Pig Latin is a dataflow language, which means it lets users describe how data should be read, processed and stored.
Pig Latin scripts describe a DAG (Directed Acyclic Graph).
Pig was developed by Yahoo!.
The aim of this language is to find a sweet spot between SQL and shell scripts.
To sum it up, Pig lets you build data pipelines and run ETL workloads.
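
To illustrate what a dataflow described as a DAG means, here is a small Python sketch. The step names are hypothetical and this is not how Pig plans a job internally; it only shows that when each step lists the steps it depends on, a valid execution order follows from the graph.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each step maps to the set of steps it depends on.
pipeline = {
    "load_users": set(),
    "load_clicks": set(),
    "filter_clicks": {"load_clicks"},
    "join": {"load_users", "filter_clicks"},
    "store": {"join"},
}

# static_order() yields every step after all of its dependencies.
order = list(TopologicalSorter(pipeline).static_order())
print(order)
```

Because the graph has no cycles, such an order always exists, and independent branches (the two load steps here) could run in parallel.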

Friday, July 25, 2014

TO DO List

In order of importance...
a) Data Structures and Algorithms
b) Machine Learning
c) Dev Ops (Linux)
d) Start a youtube channel
e) Learn the Spring Framework

I guess this will take a lifetime to master.

Currently preparing for Hadoop Admin Certification.

Monday, June 16, 2014


I will be posting a series of posts on Pig.
I personally feel amazed at the simplicity and the power of this language.

Pig Cheat Sheet: