Wednesday, October 29, 2014

Installing R with RStudio in Ubuntu

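Hello people,
let's start the installation by opening your terminal.

Then type in the following commands. The last one assumes the RStudio Server .deb file has already been downloaded into the current directory.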

sudo apt-get install r-base
sudo apt-get install gdebi-core
sudo apt-get install libapparmor1
sudo gdebi rstudio-server-0.98.1085-amd64.deb

And there you go, a brand new RStudio.

Apache Pig Day 1

Hello People,
I am increasingly spending more time working on Pig (thank God).
This experience has been very valuable as it has increased my knowledge of data.
Pig is an open source project.
The language for developing Pig scripts is called Pig Latin.
It's easier to write a Pig script than MapReduce code (it saves the developer's time).
Pig Latin is a dataflow language, which means it allows users to describe how data should be read, processed and stored.
A Pig Latin script describes a DAG (directed acyclic graph).
It was originally developed at Yahoo!.
The aim of the language is to find a sweet spot between SQL and shell scripts.
To sum it up, Pig lets you build data pipelines and run ETL workloads.
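
To make the dataflow idea concrete, here is a minimal sketch in plain Python (not Pig Latin, and not how Pig actually runs on Hadoop). Each step consumes the output of the previous one, roughly the way the relations in a Pig script chain into a DAG. The sample records and the function names are made up for illustration.

```python
from collections import defaultdict

# Hypothetical input: (user, url, seconds_spent) records.
records = [
    ("alice", "/home", 12),
    ("bob", "/home", 3),
    ("alice", "/cart", 40),
    ("bob", "/cart", 7),
    ("carol", "/home", 25),
]

def filter_rows(rows, min_seconds):
    """Keep rows at or above a threshold (roughly Pig's FILTER)."""
    return [r for r in rows if r[2] >= min_seconds]

def group_by_url(rows):
    """Bucket rows by url (roughly Pig's GROUP ... BY)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[1]].append(row)
    return dict(groups)

def count_per_group(groups):
    """Count rows per group (roughly FOREACH ... GENERATE COUNT)."""
    return {key: len(rows) for key, rows in groups.items()}

# The pipeline: load -> filter -> group -> count.
long_visits = filter_rows(records, min_seconds=10)
by_url = group_by_url(long_visits)
visit_counts = count_per_group(by_url)
print(visit_counts)
```

Every intermediate name (long_visits, by_url, visit_counts) plays the role of a relation in a Pig script: a node in the graph that later steps can build on.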

Friday, July 25, 2014

TO DO List

In order of importance...
a) Data Structures and Algorithms
b) Machine Learning
c) Dev Ops (Linux)
d) Start a YouTube channel
e) Learn the Spring Framework

I guess this will take a lifetime to master.

Currently preparing for Hadoop Admin Certification.

Monday, June 16, 2014


I will be posting a series of posts on Pig.
I personally feel amazed at the simplicity and the power of this language.

Pig Cheat Sheet:


Tuesday, April 29, 2014

Crawling - Scrapy

What is Scrapy?

Scrapy is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.


OS: Ubuntu

Installing Dependencies 
sudo apt-get install build-essential libssl-dev libffi-dev python-dev

Install Scrapy
sudo pip install Scrapy

The above commands will install Scrapy.
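
A quick way to confirm the install worked is to ask Python whether it can find the module. This is a small sketch using only the standard library; the helper name is_installed is made up, importlib.util.find_spec needs Python 3.4 or newer, and the check has to be run with the same Python interpreter that pip installed into.

```python
import importlib.util

def is_installed(module_name):
    """Return True if Python can locate the named top-level module."""
    return importlib.util.find_spec(module_name) is not None

if is_installed("scrapy"):
    print("Scrapy is importable")
else:
    print("Scrapy was not found; check the output of the pip command")
```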

Monday, April 28, 2014


I am taking an algorithms course on Coursera.
I have given up on Python for personal reasons.

Wednesday, April 23, 2014

Python week 4

Week 4 of Coursera.
Been busy with odd jobs, so I have just not been able to finish the assignments and be done with it.
I hope to complete all video lectures and assignments today.

Sunday, April 20, 2014

Julia Meetup

I had an amazing experience organizing the first Julia meetup at InMobi with Abhijit and Kiran.
I gave my first formal open source talk and it felt great.
Link to my slides -

Friday, April 18, 2014

Distributed Cache - Pig

I had been trying to use the distributed cache in Pig.
After a lot of trial and error, behold: SUCCESS!
Let's get to the meat.

Let's go through the steps.
a) Create an Eval UDF.
b) Initialize the distributed cache using getCacheFiles().
c) Initialize the data structure using the files from step b.
d) Finally, apply your logic to the data.
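
The steps above follow a load-once, look-up-many pattern. The sketch below shows that pattern in plain Python rather than as a real Pig UDF (which would be a Java class extending EvalFunc and overriding getCacheFiles()). The class name, the tab-separated file format and the temp file standing in for the cached file are all assumptions for illustration.

```python
import os
import tempfile

class LookupUdf:
    """Stand-in for an Eval UDF that reads a side file once and reuses it."""

    def __init__(self, cache_path):
        # Step b: the path of the file that the framework would ship
        # to every task through the distributed cache.
        self.cache_path = cache_path
        self.lookup = None  # filled lazily on the first call

    def _load(self):
        # Step c: build the in-memory data structure from the cached file.
        table = {}
        with open(self.cache_path) as handle:
            for line in handle:
                line = line.rstrip("\n")
                if not line:
                    continue
                key, value = line.split("\t", 1)
                table[key] = value
        return table

    def exec(self, key):
        # Step d: apply the logic to each input record.
        if self.lookup is None:
            self.lookup = self._load()
        return self.lookup.get(key, "UNKNOWN")

# Usage, with a temporary file playing the role of the cached file.
with tempfile.NamedTemporaryFile("w", suffix=".tsv", delete=False) as tmp:
    tmp.write("IN\tIndia\nUS\tUnited States\n")
    cache_file = tmp.name

udf = LookupUdf(cache_file)
results = [udf.exec(code) for code in ["IN", "US", "FR"]]
print(results)
os.remove(cache_file)
```

Loading lazily inside exec, instead of on every call, is the point of the pattern: the file is read once per task and every record after that is a dictionary lookup.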

Saturday, April 12, 2014

Python Week 3

Week 3 was easy. I also managed to score a whopping 92% in the test.
I am enjoying the mini assignments. Hope to complete everything.


Tuesday, April 8, 2014

Python Week 2

I completed the mini project, however I forgot to take my weekly quiz :( . I was mad at myself for doing this. After long research I found that I would be losing around 2% of my final score.


Wednesday, March 26, 2014

Learning Python

Just for the record, I am a big fan of Python. Main reason: my frustration with Java.
I am taking the interactive Python course on Coursera.
Just finished week 0.
Wish me luck. I want to complete at least 1 MOOC fully.
I will post weekly updates on how it goes.


I have always been excited by the NoSQL hype. End result: HBase certification.
I took up the Cloudera certification.

My thoughts:
It was a good investment of time and money.
It really exposes you to the Big Data / NoSQL space.
It improved my overall understanding of the NoSQL / Big Data ecosystem.

Now off to prepare for the Cloudera Admin Program.


Wednesday, February 26, 2014

Using Sublime Text - JULIA (Ubuntu)

Installing Sublime Text 3

sudo add-apt-repository ppa:webupd8team/sublime-text-3
sudo apt-get update
sudo apt-get install sublime-text-installer

Run Julia

And then follow the steps in this Site:

Tuesday, November 26, 2013

HBase : Filters

All filters are applied on the server side.
This is called predicate pushdown.

You can define a new instance of the filter by using

a) Comparison Filters

Row Filter:
Gives the ability to filter data based on rowkeys.

Family Filter:
Used to filter on column families. Data is retrieved at the column family level.

Qualifier Filter:
Used to filter on specific column qualifiers.

Value Filter:
Used to return only the cells that match a specific value.

Dependent Column Filter:
It lets you specify a dependent (reference) column. The filter uses that column's timestamp as the reference and includes all other columns that have the same timestamp.
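
As a rough mental model of the first four comparison filters, here is a sketch in plain Python over an in-memory table. It is not the HBase client API (the real filters are Java classes evaluated on the server); the table layout, the data and the function names are assumptions for illustration.

```python
# Toy table: row key -> {(family, qualifier): value}.
table = {
    "user-001": {("info", "name"): "asha", ("info", "city"): "pune",
                 ("stats", "visits"): "7"},
    "user-002": {("info", "name"): "ravi", ("stats", "visits"): "42"},
    "order-001": {("info", "name"): "book"},
}

def scan(table, keep_cell):
    """Return rows with only the cells for which keep_cell is true.

    Rows that end up with no cells are dropped from the result.
    """
    result = {}
    for row_key, cells in sorted(table.items()):
        kept = {col: val for col, val in cells.items()
                if keep_cell(row_key, col[0], col[1], val)}
        if kept:
            result[row_key] = kept
    return result

def row_filter(predicate):
    """Decide on the row key."""
    return lambda row, fam, qual, val: predicate(row)

def family_filter(predicate):
    """Decide on the column family."""
    return lambda row, fam, qual, val: predicate(fam)

def qualifier_filter(predicate):
    """Decide on the column qualifier."""
    return lambda row, fam, qual, val: predicate(qual)

def value_filter(predicate):
    """Decide on the cell value."""
    return lambda row, fam, qual, val: predicate(val)

by_row = scan(table, row_filter(lambda r: r <= "user-001"))
by_family = scan(table, family_filter(lambda f: f == "stats"))
by_qualifier = scan(table, qualifier_filter(lambda q: q == "city"))
by_value = scan(table, value_filter(lambda v: v == "ravi"))

print(sorted(by_row))
print(by_family)
print(by_qualifier)
print(by_value)
```

All four share one shape: a comparison applied to a different part of the cell (row key, family, qualifier or value).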

b) Dedicated Filters

SingleColumnValue Filter:
This filter is used when you have exactly one column that decides if an entire row should be returned or not.

SingleColumnValueExclude Filter:
This is the exclude variant of the filter above. The column that is tested is not returned as part of your result.

Prefix Filter:
All rows whose row key starts with the given prefix are returned to the client.

Page Filter:
You specify the page size for the filter. This controls how many rows are returned per page.
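
The dedicated filters can be pictured the same way. The sketch below is a simplified model in plain Python, not the HBase client API, and it ignores options the real filters have. The table contents and function names are assumptions for illustration. It covers the prefix check, paging, and the single-column check with and without the exclude behaviour.

```python
# Toy table: row key -> {qualifier: value}; rows are scanned in key order.
rows = {
    "user-001": {"name": "asha", "status": "active"},
    "user-002": {"name": "ravi", "status": "blocked"},
    "user-003": {"name": "mina", "status": "active"},
    "order-001": {"item": "book", "status": "active"},
}

def prefix_filter(rows, prefix):
    """Keep rows whose key starts with the prefix."""
    return {k: v for k, v in sorted(rows.items()) if k.startswith(prefix)}

def page_filter(rows, page_size, start_after=None):
    """Return at most page_size rows, starting after the given row key."""
    keys = [k for k in sorted(rows) if start_after is None or k > start_after]
    return {k: rows[k] for k in keys[:page_size]}

def single_column_value_filter(rows, column, value, exclude_column=False):
    """Keep a row only if its `column` equals `value`.

    With exclude_column=True the tested column is left out of the
    returned row, mimicking the "exclude" variant.
    """
    result = {}
    for key, cells in sorted(rows.items()):
        if cells.get(column) != value:
            continue
        if exclude_column:
            cells = {c: v for c, v in cells.items() if c != column}
        result[key] = cells
    return result

users = prefix_filter(rows, "user-")
first_page = page_filter(users, page_size=2)
second_page = page_filter(users, page_size=2, start_after=max(first_page))
active = single_column_value_filter(rows, "status", "active")
active_no_status = single_column_value_filter(
    rows, "status", "active", exclude_column=True)

print(list(users))
print(list(first_page), list(second_page))
print(list(active))
print(active_no_status["user-001"])
```

Paging here works by remembering the last row key of one page and starting the next page after it.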

More later...