Month: October 2013

The Art of Mathematical Programming

Mathematical Programming, also known as Mathematical Optimization or simply Optimization, is, in my view, one of the most underrated topics.

A Mathematical Program has three parts: an objective function of the decision variables, which we would like to maximize or minimize; a set of constraints, which limit the values the decision variables can take while the objective is being maximized or minimized; and bounds, which define the range of each decision variable.
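
As a small sketch, here is what a linear program with all three parts looks like; the coefficients and limits are made up purely for illustration:

\[
\begin{aligned}
\text{maximize}\quad & 3x_1 + 5x_2 && \text{(objective function)}\\
\text{subject to}\quad & x_1 + 2x_2 \le 14 && \text{(constraints)}\\
 & 3x_1 - x_2 \ge 0 && \\
\text{with}\quad & 0 \le x_1 \le 10,\ 0 \le x_2 \le 10 && \text{(bounds)}
\end{aligned}
\]

Here x_1 and x_2 are the decision variables; the objective, the two constraints, and the bounds together make up the complete program.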

There can be an unconstrained optimization program; that possibility exists. There should never be an unbounded optimization program: if a math program is unbounded, the optimal solution is only attained at +/- infinity, and that is a sign the problem has been modeled the wrong way. The word modeling is used here for the process of converting a given situation into a mathematical program; in a broader context this is called “Mathematical Modeling”. There is also “Statistical Modeling”: the process of extracting data from a given situation and fitting it to a statistical equation or model in order to derive results.

Before computer programming arrived, a program, in the context of mathematics, was simply a set of equations.

With the invention of the word “analytics”, these two phrases, mathematical and statistical modeling, have been looked at with so much hubris by many. Modeling is truly an art. And it is beautiful.

Studying linear algebra, calculus, and real analysis helps one understand mathematical programming in an academic context. But one has to develop a way of thinking and a mathematical intuition in these topics to truly see through a mathematical program and learn the art of mathematical programming.

Data Science 101

The term “data science” was first used by Peter Naur in 1974. Let’s come back to today’s definition.
The following is how one company advertised to recruit data scientists.

-> experience in big data, hadoop, mapreduce, pig, hive, sqoop, etc…
-> experience of machine learning using Knime, Weka, RapidMiner, Pentaho, Scipy/Numpy algorithms, data structures, statistics, mathematics, optimization modeling
-> experience in Data visualization of all forms
-> experience in SQL, MS-SQL, MySQL, PostgreSQL, Oracle SQL, HBase, cassandra, mongodb, couchdb and other nosql technologies
-> proficient in Fortran, C, C++, C#, Java, J2EE, Python, Ruby, Perl, R, SAS, SPSS, Minitab, AMPL, LINGO, OPL, CPLEX, MOSEK, XPRESS, Matlab, Octave, in Eclipse, Netbeans, Maven, Ant, etc…
-> business intelligence tools, OLAP, IBM Cognos, Lavastorm, etc…
-> proficient in parallel and distributed computing
-> from IITs/MS from US
-> blah, blah, blah . . .

The one candidate I can find who can match all these requirements is . . .

wait for it . . .

Google!

Of course Google is not human, but it has knowledge of all of the above and plenty more. I’d like to get into the minds of these HR professionals and ask them if they truly found a candidate for this job.

Going by what is on the internet, a Data Scientist is a blend of Economist, Statistician, Operational Researcher, Mathematician, Software Developer, System Administrator, Computer Scientist, and Project Manager. Twenty years of grad school and half a million US dollars of debt would make the ideal data scientist, if we go by the internet.

So who is a Data Scientist? What is Data Science?

A Data Scientist can discover what we don’t know from data, obtain predictive, actionable insight from data, develop optimal solutions directly aligned with minimizing costs and maximizing profits, and communicate relevant business stories from data.

The following are some things a Data Scientist should be familiar with:

  • Ask questions, and create and test hypotheses using data
  • Data Munging, Data Wrestling, Data Taming
  • Tell the Machine to learn from Data
  • Use modeling to extract optimal results from data to support business decisions

Data Munging is the process of manually converting or mapping data from one “raw” form into another format that allows for more convenient consumption of the data, with the help of semi-automated tools.
Data Wrestling is the ability to turn big data into data products that generate immediate business value.
Data Taming is finding opportunities in huge data sets (most likely from the web).
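
As a tiny, hedged sketch of what such a semi-automated munging step might look like (the file name and column numbers here are assumptions, not from any real data set):

$ awk -F',' '{ print $2 "\t" $5 }' raw_export.csv > clean_subset.tsv    # keep only columns 2 and 5, converting CSV to TSV

(Real CSVs with quoted commas need more care than this one-liner gives them.)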

In order to become a Data Scientist, you need to learn to understand the business better. How the results are achieved (and with what software) is, of course, left completely to the Data Scientist himself; it cannot be put in as a list of requirements.

All the successful entrepreneurs of our 21st century are Data Scientists!

Apache Pig: for ETL functions, data cleansing and munging

Pig is one of the add-ons to the Hadoop framework that supports and simplifies MapReduce data processing in Hadoop. Yes, Hadoop supports MapReduce natively, but Pig makes it easier: you don’t have to write a complex Java program defining mapper and reducer classes.

Pig includes a scripting language called Pig Latin that provides many of the standard database operations we normally perform on data, as well as UDFs (user-defined functions, so the user can write his own query methods to extract relevant data or munge it). Data munging is often done using Pig, but is not limited to Pig (look up sed and awk on Google).

A common example of a Pig Latin script is web companies bringing in logs from their web servers, cleansing the data, and precomputing common aggregates before loading it into their data warehouse. In this case, the data is loaded onto the grid, and then Pig is used to clean out records from bots and records with corrupt data. It is also used to join web event data against user databases so that user cookies can be connected with known user information.
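
As a rough, hedged sketch of such a script (the file paths, field names, and delimiters below are assumptions for illustration, not from a real deployment), the Pig Latin could look like this:

-- load the raw web server logs (hypothetical path and schema)
logs = LOAD '/data/weblogs' USING PigStorage('\t')
       AS (cookie:chararray, url:chararray, agent:chararray, status:int);

-- drop corrupt records and obvious bot traffic
clean = FILTER logs BY status IS NOT NULL AND NOT (agent MATCHES '.*[Bb]ot.*');

-- join web events against the user database on the cookie
users = LOAD '/data/users' USING PigStorage('\t')
        AS (cookie:chararray, user_id:chararray, country:chararray);
joined = JOIN clean BY cookie, users BY cookie;

-- precompute a common aggregate: hits per known user
grouped = GROUP joined BY users::user_id;
hits = FOREACH grouped GENERATE group AS user_id, COUNT(joined) AS hit_count;

STORE hits INTO '/data/warehouse/hits_per_user';

Each statement only builds up a logical plan; nothing actually runs on the cluster until the STORE (or a DUMP) triggers the underlying MapReduce jobs.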

Installation:

Pig does not need to be installed on every node of our multi-node cluster, only on the machine from which you run your Hadoop jobs, the master. Remember that there can be more than one master in a multi-node cluster; Pig can be installed on those machines too. Also, you can run Pig in local mode if you just want to use it locally on your Linux operating system.

1. Download Apache Pig

$ wget http://www.eu.apache.org/dist/pig/pig-latest-version/pig-latest-version.tar.gz

2. Extract the Tarball

$ tar -zxvf pig-latest-version.tar.gz

3. Add the following to your previously created hadoop configuration bash script:

export PIG_HOME=/data/pig-latest-version
export JAVA_HOME=/usr/java/jdk1.7.0_05
export HADOOP_HOME=/data/hadoop

export PATH=$PIG_HOME/bin:$JAVA_HOME/bin:$HADOOP_HOME/bin:$PATH
export CLASSPATH=$JAVA_HOME:/data/hadoop/hadoop-core-1.0.1.jar:$PIG_HOME/pig-latest-version.jar

4. Source the configuration file

$ . hadoopconfiguration.sh

5. Now launch Pig:

$ pig

You will see a set of startup messages followed by the grunt shell, like this:

grunt>

Remember, this whole setup needs to be done on the master (or masters) of a multi-node cluster, or on a single-node cluster where the Hadoop framework is up and running.
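
If you only want to experiment against the local filesystem instead of HDFS (the local mode mentioned earlier), Pig can also be launched like this:

$ pig -x local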

Multi-level classification using stochastic gradient descent

Classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known. An algorithm that implements classification, especially in a concrete implementation, is known as a classifier.
There are two types of Classification: Binary classification and Multi-level classification

  • Binary classification is the task of classifying the members of a given set of objects into two groups on the basis of whether they have some property or not.
  • Multi-level classification is the problem of classifying instances into more than two classes (for example, assigning a posting to one of the 20 newsgroups used below).

The input here is transcripts of months of postings made in 20 Usenet newsgroups from the early 1990s. Usenet is a worldwide Internet discussion system. Newsgroups are typically accessed with newsreaders: applications that allow users to read and reply to postings in newsgroups. These applications act as clients to one or more news servers.

The major set of worldwide newsgroups is contained within nine hierarchies, eight of which are operated under consensual guidelines that govern their administration and naming. The current Big Eight are:

  • comp.* – computer-related discussions (comp.software, comp.sys.amiga)
  • humanities.* – fine arts, literature, and philosophy (humanities.classics, humanities.design.misc)
  • misc.* – miscellaneous topics (misc.education, misc.forsale, misc.kids)
  • news.* – discussions and announcements about news (meaning Usenet, not current events) (news.groups, news.admin)
  • rec.* – recreation and entertainment (rec.music, rec.arts.movies)
  • sci.* – science related discussions (sci.psychology, sci.research)
  • soc.* – social discussions (soc.college.org, soc.culture.african)
  • talk.* – talk about various controversial topics (talk.religion, talk.politics, talk.origins)

The input data set consists of 20 such newsgroups, and under each newsgroup we have postings. We divide the data set into training data and test data (as this is a supervised learning algorithm). The objective is to first learn the model parameters from the training data set and then test the model’s accuracy by running it against the test data set.

Listed below is the “format” of the data set:
From: noname@noname.com (anonymous)
Subject: Re: about the bible quiz answers
Organization: AT&T
Distribution: na
Lines: 18
In article <noname1@noname1.com>, noname1@noname1.com (Anonymous Person1) writes:
>
>
> #12) the quick fox jumped over a lazy frog
> .
the quick fox jumped over a lazy frog the quick fox jumped over a lazy frog the quick fox jumped over a lazy frog the quick fox jumped over a lazy frog the quick fox jumped over a lazy frog the quick fox jumped over a lazy frog the quick fox jumped over a lazy frog
Anonymous

The following steps are required to perform a multi-class classification task:
#1 Create a new directory named 20news-all and copy the complete data set into the directory

$ mkdir 20news-all
$ cp -R 20news-bydate/*/* 20news-all

#2 Create sequence files from the 20 Newsgroups data set. The sequence file format is the intermediate data format used by Mahout:

$ mahout seqdirectory \
> -i 20news-all \
> -o 20news-seq

#3 Convert the sequence files into sparse TF-IDF vectors:

$ mahout seq2sparse \
> -i 20news-seq \
> -o 20news-vectors -lnorm -nv -wt tfidf
-i → input directory
-o → output directory
-lnorm → log-normalize the output vectors
-nv → create named vectors
-wt → the weighting scheme to use in the algorithm (TF-IDF here)

#4 Split the generated vector data set to create two sets:
Training set: data used to train the classification algorithm and produce the model.
Test set: data used to test the model. In this case we take 20% of the documents in each category and write them to the test directory; the rest of the documents are written to the training data directory.

mahout split \
-i 20news-vectors/tfidf-vectors \
-tr 20news-train-vectors \
-te 20news-test-vectors \
-rp 20 \
-ow \
-seq -xm sequential
-tr → training vectors
-te → test vectors
-rp → random selection percentage (here, 20% goes to the test set)
-ow → overwrite
-seq → the input vectors are stored as sequence files
-xm → execution method (sequential here)

#5 Train the model on the training vectors using Mahout’s naïve Bayes classifier:

mahout trainnb \
-i hdfs://localhost:54310/user/hduser/20news-train-vectors -el \
-o hdfs://localhost:54310/user/hduser/model \
-li hdfs://localhost:54310/user/hduser/labelindex \
-ow -c
-i → input directory
-el → extract the label index (to prepare named vectors)
-o → output directory
-li → the path to store the extracted label index
-ow → overwrite
-c → train complementary

#6 Test the model obtained, using Mahout’s naïve Bayes test tool:

$ mahout testnb \
> -i hdfs://localhost:54310/user/hduser/20news-test-vectors \
> -m hdfs://localhost:54310/user/hduser/model \
> -l hdfs://localhost:54310/user/hduser/labelindex \
> -ow -o hdfs://localhost:54310/user/hduser/20newstesting2
-i → input directory
-m → model used to test the dataset
-l → the label index (required in case of named vectors)
-ow → overwrite
-o → output directory

Upon successful execution we can observe the following output, the confusion matrix. (Confusion Matrix: a confusion matrix contains information about the actual and predicted classifications produced by a classification system. The performance of such a system is commonly evaluated using the data in the matrix.)

It’s the confusion matrix! (aptly named, isn’t it?)
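
The matrix from this run is not reproduced here, but as a purely illustrative, made-up toy example of the format, a two-class confusion matrix looks like this, with rows giving the actual class and columns the predicted class:

                Predicted A   Predicted B
Actual A             45             5
Actual B              3            47

The diagonal counts are the correctly classified instances; everything off the diagonal is a misclassification, so a good classifier shows a heavy diagonal.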