Monday, May 2, 2016

Word2Vec with Apache Spark

I tried out Word2Vec with Apache Spark, using the first Harry Potter book as the corpus.

Some interesting results:

Similarities for "Ron"
Hermione 0.8892348408699036
watch 0.8258942365646362
"and 0.7972607016563416

Similarities for "Hermione"
Ron 0.9096277952194214
"and 0.8301450610160828
Hooch 0.829563319683075

Ron and Hermione end up getting married in the last book.  

Similarities for "Voldemort"
Quirrell's 0.9311719536781311
laughing 0.9307642579078674

Similarities for "Harry"
Hermione 0.7319877743721008
George 0.7252205014228821



Harry Potter Corpus

API Reference


So what is word2vec? It is a shallow neural network model. In short, it tries to predict a word from its surrounding context and vice versa (the CBOW and skip-gram variants of the model).
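For anyone who wants to try this themselves, below is a minimal sketch using Spark MLlib's Word2Vec. The corpus file name, vector size and minCount are illustrative placeholders, not necessarily the values I used.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.feature.Word2Vec

// A minimal sketch: train Word2Vec on a plain-text corpus and query similar words.
// The file name and parameter values below are placeholders.
object Word2VecExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Word2VecHarryPotter").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Word2Vec expects an RDD of token sequences; here each line is split on whitespace.
    val corpus = sc.textFile("harry_potter_book1.txt").map(_.split("\\s+").toSeq)

    val model = new Word2Vec()
      .setVectorSize(100)
      .setMinCount(5)
      .fit(corpus)

    // Print the words closest to "Ron" along with their cosine similarities.
    model.findSynonyms("Ron", 3).foreach { case (word, similarity) =>
      println(s"$word $similarity")
    }

    sc.stop()
  }
}

Note that a crude whitespace split leaves punctuation attached to tokens, which is likely why entries such as "and (with the leading quote) show up in the similarity lists above.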


Friday, March 18, 2016

Life and Beyond

Hi Folks,
Sorry for not writing a blog post for a long time. I have been busy :).
To begin with, I would like to declare that I am a MOOC-aholic: I have completed over 10 MOOCs since 2013. I strongly feel that we have reached a point where we need to teach ourselves how to learn, and MOOCs are one of the best ways to do that.

Currently I am following two MOOCs:
Machine Learning: Classification
Analytics Edge

There is a hunger... well, a different kind of hunger, or one might even say a thirst, for knowledge. My aims this year are to:
a) Contribute to open source
b) Become an expert in at least one programming language


- Adios Amigo

Thursday, June 11, 2015

Data Mining with Weka

My Rating 4/5
The course was easy enough and the concepts were explained well. What I really liked about this course was the pace: not too dull, and not too fast or difficult. I would recommend this course to anyone who wants to get started with AI/ML.

It's a good overview; however, it lacks mathematical depth.


I am looking forward to the advanced version of this course.

Sunday, May 17, 2015

Installing Apache Zeppelin (Local Mode)

Hi Folks,
Playing around with Apache Zeppelin today.

Installation Steps Involved
a) Install Maven
b) Install Git
c) Clone the Zeppelin Repository (https://github.com/apache/incubator-zeppelin)
d) mvn install -DskipTests

And voilà, you have Apache Zeppelin installed.


Start/Stop Zeppelin
bin/zeppelin-daemon.sh start
bin/zeppelin-daemon.sh stop
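Once the daemon is up (http://localhost:8080 by default), a quick sanity check is to run a small Spark paragraph in a new notebook. This is just a sketch and assumes the default Spark interpreter is bound, in which case Zeppelin provides the SparkContext as sc:

// Paste into a Zeppelin notebook paragraph; sc is provided by the Spark interpreter.
val numbers = sc.parallelize(1 to 100)
println("Sum of 1..100 = " + numbers.reduce(_ + _))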

Wednesday, May 6, 2015

Installing Hadoop 2.x on Mac (Yosemite)

After breaking my head on several videos and blogs, I finally got it right.
To install Hadoop 2.x on Mac, I would recommend you have:
a) Java installed (if Java is not installed, install the Java JDK)
b) Password-less SSH
(Check by typing the command below)

ssh localhost


If password-less SSH is not enabled:
Ensure Remote Login under System Preferences -> Sharing is checked to enable SSH, then run the following.


ssh-keygen -t rsa -P ""


cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys


ssh localhost

(Make sure you are able to SSH without a password, and leave the passphrase blank while generating the keys.)
(For this tutorial we will configure Hadoop in pseudo-distributed mode -- http://hadoop.apache.org/docs/r2.7.0/hadoop-project-dist/hadoop-common/SingleCluster.html#Pseudo-Distributed_Operation)

Download Hadoop
Extract Hadoop

tar -xvzf ~/Downloads/hadoop-2.7.0.tar.gz


Then edit the following configuration files:

edit: etc/hadoop/core-site.xml:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

edit: etc/hadoop/hdfs-site.xml:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

format namenode:
bin/hdfs namenode -format
start dfs:
sbin/start-dfs.sh
Create user directories
bin/hdfs dfs -mkdir /user
bin/hdfs dfs -mkdir /user/<username>
Testing MapReduce
bin/hdfs dfs -put etc/hadoop input
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0.jar grep input output 'dfs[a-z.]+'
Examine the output:
bin/hdfs dfs -cat output/*
Namenode UI: http://localhost:50070/
RM UI : http://localhost:8088


And voilà, you have a petit Hadoop cluster for yourself.


Saturday, March 21, 2015

Building Pig from Source

Also refer to https://cwiki.apache.org/confluence/display/PIG/How+to+set+up+Eclipse+environment
git clone -b spark https://github.com/apache/pig
ant -Dhadoopversion=23 jar
ant -Dhadoopversion=23 eclipse-files

Wednesday, December 10, 2014

Installing Ambari from scratch using Vagrant. Step by Step


Being a developer, it is a pain to configure Hadoop every time.
I wanted a process that is easily reproducible.
So, at a high level, the steps involved are:
# FQDN
# Password Less SSH, Disable SELINUX , iptables Off
# NTP
# yum -y install ntp
# wget http://public-repo-1.hortonworks.com/ambari/centos6/1.x/updates/1.5.1/ambari.repo
# cp ambari.repo /etc/yum.repos.d
# yum install ambari-server
# ambari-server setup
# ambari-server start
# Start Ambari in browser u/n admin , pwd admin
# If something goes wrong:
# 'yum erase ambari-agent' and 'yum erase ambari-server'

Make sure you have Vagrant installed.
Config: 3 nodes, 3.4 GB RAM and 1 core each
1 master, 2 slaves
Create a Vagrantfile (make sure the host has more than 10 GB of RAM):
Vagrant::Config.run do |config|
  config.vm.box = "centos65"
  config.vm.customize [
    "modifyvm", :id,
    "--memory", "3427"
  ]
  config.vm.define :hadoop1 do |hadoop1_config|
    hadoop1_config.vm.network :hostonly, "10.10.0.53"
    hadoop1_config.vm.host_name = "hdp.hadoop1.com"
  end
  config.vm.define :hadoop2 do |hadoop2_config|
    hadoop2_config.vm.network :hostonly, "10.10.0.54"
    hadoop2_config.vm.host_name = "hdp.hadoop2.com"
  end
  config.vm.define :hadoop3 do |hadoop3_config|
    hadoop3_config.vm.network :hostonly, "10.10.0.55"
    hadoop3_config.vm.host_name = "hdp.hadoop3.com"
  end
end

After creating this file, all you need to do is run 'vagrant up'.

Step 1: FQDN
http://en.wikipedia.org/wiki/Fully_qualified_domain_name
Edit /etc/hosts (vi /etc/hosts) and prepend the following entries:
10.10.0.53   hdp.hadoop1.com hadoop1
10.10.0.54   hdp.hadoop2.com hadoop2
10.10.0.55   hdp.hadoop3.com hadoop3

(On all three nodes.)
You should be able to SSH using the domain names, and 'hostname -f' should return the FQDN for that machine.

Install NTP (ntpd) on all nodes.

Use the script below to prepare your cluster:

./prepare-cluster.sh hosts.txt

where hosts.txt contains one hostname per line:
hdp.hadoop1.com
hdp.hadoop2.com
hdp.hadoop3.com

#!/bin/bash
set -x

# Generate SSH keys
ssh-keygen -t rsa
cd ~/.ssh
cat id_rsa.pub >> authorized_keys
cd ~

# Distribute SSH keys
for host in `cat hosts.txt`; do
  cat ~/.ssh/id_rsa.pub | ssh root@$host "mkdir -p ~/.ssh; cat >> ~/.ssh/authorized_keys"
  cat ~/.ssh/id_rsa | ssh root@$host "cat > ~/.ssh/id_rsa; chmod 400 ~/.ssh/id_rsa"
  cat ~/.ssh/id_rsa.pub | ssh root@$host "cat > ~/.ssh/id_rsa.pub"
done

# Distribute hosts file
for host in `cat hosts.txt`; do
  scp /etc/hosts root@$host:/etc/hosts
done

# Prepare other basic things (disable SELinux, turn off iptables, disable PackageKit refresh)
for host in `cat hosts.txt`; do
  ssh root@$host "sed -i s/SELINUX=enforcing/SELINUX=disabled/g /etc/selinux/config"
  ssh root@$host "chkconfig iptables off"
  ssh root@$host "/etc/init.d/iptables stop"
  echo "enabled=0" | ssh root@$host "cat > /etc/yum/pluginconf.d/refresh-packagekit.conf"
done

And voilà. Just run the commands below.
# wget http://public-repo-1.hortonworks.com/ambari/centos6/1.x/updates/1.5.1/ambari.repo
# cp ambari.repo /etc/yum.repos.d
# yum install ambari-server
# ambari-server setup
# ambari-server start
# Start Ambari in browser u/n admin , pwd admin

PS: Make sure you have added entries for the hosts in your Windows hosts file, and check that the hostnames are pingable from Windows.

ping hdp.hadoop1.com

URL : hdp.hadoop1.com:8080

Tuesday, November 11, 2014

MapR SingleNode in Ubuntu

Creating a single-node instance of MapR on Ubuntu.


# Navigate to this Folder
cd /etc/apt
# Edit sources.list file and add the MapR repositories into it.
vi sources.list
# Add MapR repo entries below
deb http://package.mapr.com/releases/v2.1.2/ubuntu/ mapr optional
deb http://package.mapr.com/releases/ecosystem/ubuntu binary/
# Update repo
sudo apt-get update
# Install MapR Hadoop
sudo apt-get install mapr-single-node

Saturday, November 1, 2014

Using Pig to Load and Store data from HBase

Let's first store data from HDFS into our HBase table. For this we will be using

org.apache.pig.backend.hadoop.hbase
Class HBaseStorage

public HBaseStorage(String columnList) throws org.apache.commons.cli.ParseException, IOException


Warning: Make sure that your PIG_CLASSPATH refers to all the library files in HBase, Hadoop and ZooKeeper. Doing this will save you countless hours of debugging.

Let's create an HBase table named testtable for the data given below.

Make sure that the first column is the row key when inserting into an HBase table.

1|Krishna|23
2|Madhuri|37
3|Kalyan|54
4|Shobhana|50


Let's create a table for this data in HBase.

>> cd $HBASE_HOME/bin
>> ./hbase shell

This will take you to your HBase shell

>> create 'testtable','cf'
>> list 'testtable'
>> scan 'testtable'

Now let's fire up the Grunt shell.

Type the following commands in the Grunt shell:
-- Load the test data into a relation
A = LOAD '/home/biadmin/testdata' using PigStorage('|') as (id,name,age);
-- Cast the fields to chararray before storing
B = foreach A generate (chararray)$0,(chararray)$1,(chararray)$2;
-- The column list passed to HBaseStorage can be delimited by either space or comma
STORE B INTO 'hbase://testtable' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf:name cf:age');

And ta-da...







Pig Casting and Schema Management

Pig is quite flexible when schemas need to be manipulated.

Consider this data set

a,1,55,M,IND
b,2,55,M,US
c,3,56,F,GER
d,4,57,F,AUS


Suppose we need to define a schema after some processing; we can cast the columns to their data types.

-- Load
A = load 'input' using PigStorage(',');
-- This will generate all columns after the first one
B = foreach A generate $1..;
-- Cast each column to its data type
C = FOREACH A generate (chararray)$0,(int)$1,(int)$2,(chararray)$3,(chararray)$4;
dump C;


That's all for today, folks.

Cheers!

Friday, October 31, 2014

Sum in Pig

This is a simple Pig script I wrote to understand the concepts of Pig. Well, here it goes.

Today we will see how to do a simple sum operation in Pig.

Consider this as my input data
1
2
3
4
1
1
1

The first example does a plain sum; the second groups the values and sums each group.

-- Plain Sum
A = load 'input' as (number:int);
B = group A All;
C = foreach B generate SUM($1);
dump C;
-- Group And Sum
A = load 'input' as (number:int);
B = group A by $0;
C = foreach B generate SUM($1);
dump C;

Wednesday, October 29, 2014

Installing R with RStudio in Ubuntu

Hello people,
let's start the installation by opening your terminal first

and typing in the following commands.

sudo apt-get install r-base
sudo apt-get install gdebi-core
sudo apt-get install libapparmor1
wget http://download2.rstudio.org/rstudio-server-0.98.1085-amd64.deb
sudo gdebi rstudio-server-0.98.1085-amd64.deb

And here you go with a brand new RStudio.


Apache Pig Day 1

Hello People,
I am increasingly spending more time working on Pig (thank god).
This experience has been very valuable, as it has increased my knowledge of data.
Pig is an open source project.
The language for developing Pig scripts is called Pig Latin.
It is easier to write a Pig script than MapReduce code (it saves the developer time).
Pig is a dataflow language, which means it lets users describe how data is loaded, transformed and stored.
Pig Latin scripts describe a DAG (directed acyclic graph) of operations.
It was developed by Yahoo!.
The aim of the language is to find a sweet spot between SQL and shell scripts.
To sum it up, Pig lets you build data pipelines and run ETL workloads.
 

Friday, July 25, 2014

TO DO List

This list is in order of importance...
a) Data Structures and Algorithms
b) Machine Learning
c) DevOps (Linux)
d) Start a YouTube channel
e) Learn the Spring Framework

I guess these will take a lifetime to master.

Currently preparing for Hadoop Admin Certification.

Monday, June 16, 2014

Pig

I will be posting a series of posts on Pig.
I am personally amazed at the simplicity and power of this language.


Pig Cheat Sheet:

http://mortar-public-site-content.s3-website-us-east-1.amazonaws.com/Mortar-Pig-Cheat-Sheet.pdf








Cheers!
Krishna