Algorithms

Market Basket Analysis with Mahout

Also known as Affinity Analysis/Frequent Pattern Mining.
Finding purchasing patterns in large amounts of customer transactional data is called market basket analysis. It is useful wherever a store’s transactional data is readily available. Market basket analysis is also known as affinity analysis or frequent pattern mining, and is closely related to association rule mining (strictly speaking, association rule mining builds on it). This technique is behind the customer promotions we see in department stores and supermarket chains: buy one get one free, discounts, complimentary products, and so on.

MBA is one way of recommending products to customers. If we have customer transaction data, that is, records of the items bought by each customer (essentially the receipt you get at the till), and imagine we have a million such transaction records, then we can find buying patterns. For example, let’s assume that in a certain locality people tend to eat chips (crisps if you are from the UK) while drinking beer. Then whoever comes into the supermarket with the sole purpose of buying beer is quite likely to also pick up a packet of chips — this is a pattern we know from local knowledge.

But what about the patterns we don’t know about or don’t talk about? MBA helps us find such patterns. Why do we need this pattern information? We can use it for purchase planning and for introducing new products (not based on MBA results alone, but also with the help of statistical hypothesis testing and inference).

Let’s perform market basket analysis using Apache Mahout. You should have Apache Mahout and Hadoop installed, up and running.

For the input dataset, I found an anonymized Belgian supermarket transaction dataset, available courtesy of Dr Tom Brijs of Hasselt University, Belgium. Please visit http://fimi.ua.ac.be/data/retail.pdf and http://www.luc.ac.be/~brijs/. We will use retail.dat as the input dataset. Each record in the data set contains information about the date of purchase, the receipt number, the article number, the number of items purchased, the article price in Belgian francs and the customer number.

The data were collected over three non-consecutive periods. The first period runs from mid-December 1999 to mid-January 2000. The second period runs from 2000 to the beginning of June 2000. The third and final period runs from the end of August 2000 to the end of November 2000. In between these periods, unfortunately, no data are available. This results in approximately 5 months of data. The total number of receipts collected is 88,163, and 5,133 customers purchased at least one product in the supermarket during the data collection period.

The following steps are required to perform the market basket analysis with Apache Mahout:

1. Copy the data from local disk to HDFS:
hadoop fs -copyFromLocal /home/path to your file/retail.dat retail.dat

2. Execute Mahout’s FPG procedure
mahout fpg -i retail.dat -o patternone -method mapreduce

Note: by default, frequent patterns are mined for the top 50 items, meaning the output will give you the top 50 items.
You can also run this procedure in sequential mode; just make sure the data file is in the directory from which you execute the command.

3. Check the output after processing. It will be in the folder ‘patternone’ in your HDFS.

hadoop fs -ls /user/hduser/patternone

4. Identify the output, which is in sequence file format, under the frequentpatterns folder. Use the sequence dumper utility to extract it.
mahout seqdumper -i hdfs://localhost:54310/user/hduser/patternone/frequentpatterns/part-r-00000

5. Learn to interpret the output. It will be in key/value pairs like the following:

[Screenshot: FPM output]

The item is the key here. ([99],32) means that item 99 appears in 32 transactions. ([142, 99],22) means that items 142 and 99 appear together in 22 transactions. We don’t know for sure what those items are, but in real-life situations the item IDs will be indexed against item names, so you can get the output with the names of the items.

For example, item 99 could be beer, and 32 means beer was bought in 32 transactions. Items 142 and 99 appearing in 22 transactions means people bought beer and chips together in those 22 instances. Maybe the remaining people already had chips at home :).

This data can be used for further analysis to determine the nature of promotion that could be offered to customers.
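The usual next step is to turn these co-occurrence counts into support and confidence figures for candidate rules. Below is a minimal illustrative sketch in Java, using the example counts quoted above and the 88,163 receipts in the dataset; the item IDs and the beer/chips reading are hypothetical, as noted earlier.

public class RuleMetricsSketch {
  public static void main(String[] args) {
    double totalTransactions = 88163;   // receipts in the retail dataset
    double countBeer = 32;              // transactions containing item 99 ("beer" in our example)
    double countBeerAndChips = 22;      // transactions containing items 99 and 142 together

    // support: how often the itemset appears among all transactions
    double support = countBeerAndChips / totalTransactions;
    // confidence of the rule 99 -> 142: of the transactions with beer, how many also contain chips
    double confidence = countBeerAndChips / countBeer;

    System.out.printf("support = %.5f, confidence = %.2f%n", support, confidence);
  }
}

A rule with high enough support and confidence (here, roughly 69% confidence for 99 -> 142) is the kind of pattern that can drive a promotion such as a discount on chips when beer is bought.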

Building Recommender Engines with Apache Mahout: Part II

In my previous post we looked at recommendations based on user ratings. Here, we are going to build a recommender engine based on items.

Item-based Recommendation:
In item-based analysis, items related to a particular item are determined. When a user likes a particular item, items related to that item are recommended. As shown in the figure, if items A and C are highly similar and a user likes item A, then item C is recommended to the user.

Item-based recommendation calculates recommendations based on items, not users.

The objective is the same (recommend items to users), but the approach is different.

High level logic:
– for every item i that u has no preference for yet
– for every item j that u has a preference for
– compute a similarity s between i and j
– add u’s preference for j, weighted by s, to a running average
– return the top items, ranked by weighted average
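For small datasets the same logic can be run without Hadoop, using Mahout’s Taste (Java) API in a single JVM. The sketch below is a minimal, non-distributed illustration of the pseudocode above; it assumes a ratings.csv file in userID,itemID,rating format (like the one produced in step #1 below) and a hypothetical user ID of 1. The MapReduce job shown afterwards is the distributed equivalent.

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public class ItemBasedSketch {
  public static void main(String[] args) throws Exception {
    // ratings.csv: userID,itemID,rating
    DataModel model = new FileDataModel(new File("ratings.csv"));
    // item-item similarity computed from the ratings (Pearson correlation, as in the job below)
    ItemSimilarity similarity = new PearsonCorrelationSimilarity(model);
    GenericItemBasedRecommender recommender =
        new GenericItemBasedRecommender(model, similarity);
    // top 5 items for user 1, scored by a similarity-weighted average of the user's own ratings
    List<RecommendedItem> recommendations = recommender.recommend(1L, 5);
    for (RecommendedItem item : recommendations) {
      System.out.println(item.getItemID() + " : " + item.getValue());
    }
  }
}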

We are going to use the MovieLens data set (similar to the IMDb database), which contains 1,000,209 ratings of approximately 3,900 movies by 6,040 users.
Format of the data set:
UserID::MovieID::Rating::Timestamp (fields delimited by ‘::’)

  • The fields are the same as in the 100,000-rating dataset used in Part I; here the approach focuses on the item (and not the user)
  • Ratings are anonymous.
  • The item-based recommender in Mahout supports a distributed execution model, so it can be computed using MapReduce

The following steps are required to build an item-based recommendation in Mahout:

#1 Pre-processing the data
awk -F"::" '{print $1","$2","$3}' ml-1m/ratings.dat > ratings.csv
We are using awk to convert the ‘::’-delimited input file into comma-separated values and write the result to a new CSV file.

#2 Copy the Files to HDFS
hadoop fs -copyFromLocal ratings.csv users.txt hdfs://localhost:54310/user/hduser/itemreco

a simple command to copy the files into the Hadoop distributed file system

#3 Use the following mahout command to compute the recommendations based on items
mahout recommenditembased \
-Dmapred.reduce.tasks=10 \
--similarityClassname SIMILARITY_PEARSON_CORRELATION \
--input ratings.csv \
--output item-rec-output \
--tempDir item-rec-tmp \
--usersFile user-ids.txt

Parameter detail:

-Dmapred.reduce.tasks → assigns the number of reduce tasks
--similarityClassname → specifies the similarity measure; in this case, Pearson correlation.
--input → path to the input file
--output → path to the output directory
--tempDir → path to a temporary directory (optional parameter, used to store intermediate data)
--usersFile → path to the file of user IDs to recommend for

The output can be seen like this:

[Screenshot: recommenditembased output]
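As a side note, SIMILARITY_PEARSON_CORRELATION used above scores a pair of items by the Pearson correlation of the ratings given to them by users who rated both. A minimal sketch of that computation, with made-up co-rating vectors purely for illustration:

public class PearsonSketch {
  public static void main(String[] args) {
    // hypothetical ratings given to two items by the same four users
    double[] a = {5, 3, 4, 4};
    double[] b = {4, 3, 5, 3};
    System.out.println("pearson = " + pearson(a, b));
  }

  static double pearson(double[] x, double[] y) {
    int n = x.length;
    double meanX = 0, meanY = 0;
    for (int i = 0; i < n; i++) { meanX += x[i]; meanY += y[i]; }
    meanX /= n; meanY /= n;
    double cov = 0, varX = 0, varY = 0;
    for (int i = 0; i < n; i++) {
      double dx = x[i] - meanX, dy = y[i] - meanY;
      cov += dx * dy;     // co-variation of the two rating vectors
      varX += dx * dx;
      varY += dy * dy;
    }
    return cov / Math.sqrt(varX * varY); // in [-1, 1]; higher means more similar
  }
}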

Building Recommender Engines with Apache Mahout: Part I

Introduction to recommendation:
Recommender systems provide personalized information by learning the user’s interests from traces of interaction with that user.
Two broad types of recommendation:
– User-based recommendation
– Item-based recommendation
Recommendation engines aim to show items of interest to a user. Recommendation engines in essence are matching engines that take into account the context of where the items are being shown and to whom they’re being shown.
Recommendation engines are one of the best ways of utilizing collective intelligence in your application.
Collaborative Filtering: The process of producing recommendations based on, and only based on, knowledge of users’ relationships to items is called collaborative filtering.

User-based recommendation:

In user-based analysis, users similar to a given user are first determined. As shown in the figure, if a user likes item A, then the same item can be recommended to other users who are similar to that user. Similar users can be found using profile-based information about the user, for example by clustering users based on attributes such as age, gender, geographic location, net worth, and so on. Alternatively, you can find similar users with a collaborative approach, by analyzing the users’ actions.

[Figure: user-based recommendation]

  • Recommendation based on historical ratings and reviews by users
  • Answers the question: how similar is a user to other users?
  • Mahout does not use MapReduce to compute user-based recommendations, because the user-based recommender is designed to work within a single JVM

High level logic

– for every item i that u has no preference for yet
– for every other user v that has a preference for i
– compute a similarity s between u and v
– incorporate v's preference for i, weighted by s, into a running average
– return the top items, ranked by weighted average
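Since the user-based recommender runs inside a single JVM (as noted above), it is typically driven through Mahout’s Taste (Java) API rather than a Hadoop job. The sketch below is a minimal illustration of the pseudocode above; it assumes the comma-separated final.csv produced in step #1 below, a hypothetical neighbourhood size of 50, and a hypothetical user ID of 1.

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class UserBasedSketch {
  public static void main(String[] args) throws Exception {
    // final.csv: userID,itemID,rating[,timestamp]; the optional 4th column is treated as a timestamp
    DataModel model = new FileDataModel(new File("final.csv"));
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    // neighbourhood of the 50 users most similar to the target user
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(50, similarity, model);
    GenericUserBasedRecommender recommender =
        new GenericUserBasedRecommender(model, neighborhood, similarity);
    // top 5 movies for user 1, ranked by the similarity-weighted average described above
    List<RecommendedItem> recommendations = recommender.recommend(1L, 5);
    for (RecommendedItem item : recommendations) {
      System.out.println(item.getItemID() + " : " + item.getValue());
    }
  }
}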

We are going to use the MovieLens data set (similar to the IMDb database), which contains 100,000 ratings of 1,682 movies by 943 users.
Format of the data set:
User ID tab Item ID tab Rating tab Time Stamp
– Not everyone has seen all the movies
– Each user has rated at least 20 movies, but no user has rated them all
– The objective is to recommend movies to each of the 943 users
The following steps are required to build a user-based recommendation in Mahout:

#1 Pre-processing the Data
sed 's/\t/,/g' ua.base > final.csv
→ using sed (GNU sed’s \t matches the tab delimiter) to convert the tab-separated fields into comma-separated values and export them to a CSV file.

#2 Split the input dataset into training data and test data
mahout splitDataset \
--input reco/final.csv \
--output reco/ \
--trainingPercentage 0.9 \
--probePercentage 0.1

--input → path to the input file
--output → path to the output directory
--trainingPercentage → fraction of the data used for training
--probePercentage → fraction of the data used for testing (the probe set)
Using Mahout’s splitDataset function, we split final.csv into two datasets with the predefined training and test percentages.

#3 Use Mahout’s parallelALS to factorize the ratings matrix
parallelALS uses Alternating Least Squares with Weighted Lambda-Regularization (ALS-WR) to find the decomposition; this method is faster than a singular value decomposition.
https://cwiki.apache.org/confluence/display/MAHOUT/Collaborative+Filtering+with+ALS-WR
mahout parallelALS \
--input hdfs://localhost:54310/user/hduser/reco/trainingSet/ \
--output hdfs://localhost:54310/user/hduser/reco/als/out \
--numFeatures 20 \
--numIterations 10 \
--lambda 0.065

--input → path to the input directory
--output → path to the output directory
--numFeatures → dimension of the feature space
--numIterations → number of iterations
--lambda → the regularization parameter
The above procedure computes the user (U) and item (M) feature matrices required for the recommendation computation.

#4 Use Mahout’s evaluateFactorization to compute the RMSE and MAE of the rating matrix factorization against the probe set
RMSE – Root Mean Square Error
MAE – Mean Absolute Error
These metrics help measure the accuracy of the factorization.
mahout evaluateFactorization \
--input hdfs://localhost:54310/user/hduser/reco/probeSet/ \
--output hdfs://localhost:54310/user/hduser/reco/als/rmse/ \
--userFeatures hdfs://localhost:54310/user/hduser/reco/als/out/U/ \
--itemFeatures hdfs://localhost:54310/user/hduser/reco/als/out/M/

--input → path to the input directory
--output → path to the output directory
--userFeatures → path to the user feature matrix
--itemFeatures → path to the item feature matrix
The output will be stored in RMSE.txt in the HDFS file system
– Output obtained was: 0.9760895406876606
– Lower is better for RMSE and MAE; with this level of error we can proceed to run the main recommendation algorithm
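For reference, here is a minimal sketch of how RMSE and MAE are computed from predicted versus held-out (probe) ratings; the rating pairs below are made up purely for illustration.

public class ErrorMetricsSketch {
  public static void main(String[] args) {
    // hypothetical (actual, predicted) rating pairs from a probe set
    double[] actual    = {4.0, 3.0, 5.0, 2.0};
    double[] predicted = {3.5, 3.0, 4.2, 2.8};
    double squaredSum = 0.0, absSum = 0.0;
    for (int i = 0; i < actual.length; i++) {
      double err = actual[i] - predicted[i];
      squaredSum += err * err;   // accumulate squared error for RMSE
      absSum += Math.abs(err);   // accumulate absolute error for MAE
    }
    double rmse = Math.sqrt(squaredSum / actual.length);
    double mae  = absSum / actual.length;
    System.out.println("RMSE = " + rmse + ", MAE = " + mae);
  }
}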

#5 Mahout’s recommendfactorized function will compute recommendations based on the factorized rating matrix
mahout recommendfactorized \
--input hdfs://localhost:54310/user/hduser/reco/als/out/userRatings/ \
--output hdfs://localhost:54310/user/hduser/reco/recommendations/ \
--userFeatures hdfs://localhost:54310/user/hduser/reco/als/out/U/ \
--itemFeatures hdfs://localhost:54310/user/hduser/reco/als/out/M/ \
--numRecommendations 6 --maxRating 5

--input → path to the input directory
--output → path to the output directory
--userFeatures → path to the user feature matrix
--itemFeatures → path to the item feature matrix
--numRecommendations → number of recommendations made to each user
--maxRating → maximum rating on the rating scale

This will start computing recommendations based on the factorized matrices, and the --numRecommendations option controls how many recommendations are produced for each user.
The output will be in the following format:
[userID]
{movieID1:rating, movieID2:rating,movieID3:rating,
movieID4:rating,movieID5:rating, movieID6:rating}

[Screenshot: recommendfactorized output]

Multi-level classification using naïve Bayes

Classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known. An algorithm that implements classification, especially in a concrete implementation, is known as a classifier.
There are two types of Classification: Binary classification and Multi-level classification

  • Binary classification is the task of classifying the members of a given set of objects into two groups on the basis of whether they have some property or not.
  • Multi-level classification is the problem of classifying instances into more than two classes.

The input here consists of transcripts of months of postings made to 20 Usenet newsgroups in the early 1990s. Usenet is a worldwide Internet discussion system. Newsgroups are typically accessed with newsreaders: applications that allow users to read and reply to postings in newsgroups. These applications act as clients to one or more news servers.

The major set of worldwide newsgroups is contained within nine hierarchies, eight of which are operated under consensual guidelines that govern their administration and naming. The current Big Eight are:

  • comp.* – computer-related discussions (comp.software, comp.sys.amiga)
  • humanities.* – fine arts, literature, and philosophy (humanities.classics, humanities.design.misc)
  • misc.* – miscellaneous topics (misc.education, misc.forsale, misc.kids)
  • news.* – discussions and announcements about news (meaning Usenet, not current events) (news.groups, news.admin)
  • rec.* – recreation and entertainment (rec.music, rec.arts.movies)
  • sci.* – science related discussions (sci.psychology, sci.research)
  • soc.* – social discussions (soc.college.org, soc.culture.african)
  • talk.* – talk about various controversial topics (talk.religion, talk.politics, talk.origins)

The input data set consists of 20 such newsgroups, and under each newsgroup we have the individual postings. We divide the data set into training data and test data (as this is a supervised learning algorithm). The objective is to first learn the model parameters from the training data set and then test the model’s accuracy by running it against the test data set.

Listed below is the “format” of the data set:
From: noname@noname.com (anonymous)
Subject: Re: about the bible quiz answers
Organization: AT&T
Distribution: na
Lines: 18
In article <noname1@noname1.com>, noname1@noname1.com (Anonymous Person1) writes:
>
>
> #12) the quick fox jumped over a lazy frog
> .
the quick fox jumped over a lazy frog the quick fox jumped over a lazy frog the quick fox jumped over a lazy frog the quick fox jumped over a lazy frog the quick fox jumped over a lazy frog the quick fox jumped over a lazy frog the quick fox jumped over a lazy frog
Anonymous

The following steps are required to perform a multi-class classification task:
#1 Create a new directory named 20news-all and copy the complete data set into the directory

$ mkdir 20news-all
$ cp -R 20news-bydate/*/* 20news-all

#2 Create sequence files from the 20 Newsgroups data set: the sequence file format is the intermediate data format used by Mahout

$ mahout seqdirectory \
> -i 20news-all \
> -o 20news-seq

#3 Generate TF-IDF vectors from the sequence files

$ mahout seq2sparse \
> -i 20news-seq \
> -o 20news-vectors -lnorm -nv -wt tfidf
-i → input directory
-o → output directory
-lnorm → lognormalize the output (method of standardization)
-nv → create named vectors
-wt → create TFIDF weights to use in the algorithm
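The -wt tfidf option weights each term by its term frequency multiplied by its inverse document frequency, so that terms common to every document count for little. A minimal sketch of the standard formulation (the counts are made up, and Mahout’s exact smoothing may differ):

public class TfIdfSketch {
  public static void main(String[] args) {
    double termFrequencyInDoc = 3;     // the term appears 3 times in this document
    double totalDocs = 20000;          // documents in the corpus (hypothetical)
    double docsContainingTerm = 150;   // document frequency of the term (hypothetical)

    double tf = termFrequencyInDoc;
    double idf = Math.log(totalDocs / docsContainingTerm); // rarer terms get a higher weight
    double tfidf = tf * idf;
    System.out.println("tf-idf weight = " + tfidf);
  }
}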

#4 Split the generated vector data set to create two sets:
Training set: data used to train the classification algorithm and produce the model.
Test set: data used to test the classification algorithm. In this case we take 20% of the documents in each category and write them to the test directory; the rest of the documents are written to the training data directory.

mahout split \
-i 20news-vectors/tfidf-vectors \
-tr 20news-train-vectors \
-te 20news-test-vectors \
-rp 20 \
-ow \
-seq -xm sequential
-tr → path for the training vectors
-te → path for the test vectors
-rp → random selection percentage (here, 20% held out for testing)
-ow → overwrite
-seq -xm → method of writing the vectors (sequence files, sequential execution)

#5 Train the classifier on the training vectors using Mahout’s naïve Bayes:

mahout trainnb \
-i hdfs://localhost:54310/user/hduser/20news-train-vectors -el \
-o hdfs://localhost:54310/user/hduser/model \
-li hdfs://localhost:54310/user/hduser/labelindex \
-ow -c
-i → input directory
-el → extract the label index (to prepare named vectors)
-o → output directory
-li → the path to store the extracted label index
-ow → overwrite
-c → train complementary

#6 Test the model obtained using mahout naïve bayes:

$ mahout testnb \
> -i hdfs://localhost:54310/user/hduser/20news-test-vectors \
> -m hdfs://localhost:54310/user/hduser/model \
> -l hdfs://localhost:54310/user/hduser/labelindex \
> -ow -o hdfs://localhost:54310/user/hduser/20newstesting2
-i → input directory
-m → model used to test the dataset
-l → using the label index (in case of named vectors, this option is required)
-ow → overwrite
-o → output directory

Upon successful execution we can observe the following output, the confusion matrix. (A confusion matrix contains information about the actual and predicted classifications done by a classification system; the performance of such a system is commonly evaluated using the data in the matrix.)

[Screenshot: the confusion matrix (aptly named, isn’t it?)]
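To make the idea concrete, here is a tiny sketch with made-up counts for three classes, showing how overall accuracy is read off a confusion matrix: correct classifications sit on the diagonal, and everything off the diagonal is a misclassification.

public class ConfusionMatrixSketch {
  public static void main(String[] args) {
    // rows = actual class, columns = predicted class (hypothetical counts, not from the run above)
    int[][] matrix = {
        {50,  3,  2},
        { 4, 45,  6},
        { 1,  5, 60}
    };
    int correct = 0, total = 0;
    for (int i = 0; i < matrix.length; i++) {
      for (int j = 0; j < matrix[i].length; j++) {
        total += matrix[i][j];
        if (i == j) {
          correct += matrix[i][j]; // diagonal entries are correctly classified instances
        }
      }
    }
    System.out.println("accuracy = " + (double) correct / total);
  }
}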

Performing document clustering using Apache Mahout k-means

Introduction to k-means clustering:
Clustering is all about organizing items from a given collection into groups of similar items. These clusters could be thought of as sets of items similar to each other in some ways but dissimilar from the items belonging to other clusters.

Clustering a collection involves three things:

  • An algorithm—This is the method used to group the items together.
  • A notion of both similarity and dissimilarity—which items belong in an existing stack and which should start a new one.
  • A stopping condition—this might be the point beyond which books can’t be stacked anymore, or when the stacks are already quite dissimilar.
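To make those three ingredients concrete before turning to Mahout, here is a toy, self-contained k-means sketch: the algorithm is the assign/recompute loop, the notion of similarity is squared Euclidean distance, and the stopping condition is a fixed number of iterations. It is purely illustrative and not how Mahout implements k-means internally.

import java.util.Arrays;
import java.util.Random;

public class ToyKMeans {
  public static void main(String[] args) {
    double[][] points = {{1, 1}, {1.5, 2}, {8, 8}, {8.5, 9}, {9, 8}, {0.5, 1.5}};
    int k = 2, iterations = 10;
    Random rnd = new Random(42);

    // start from k randomly chosen points as initial centroids
    double[][] centroids = new double[k][];
    for (int c = 0; c < k; c++) {
      centroids[c] = points[rnd.nextInt(points.length)].clone();
    }

    int[] assignment = new int[points.length];
    for (int iter = 0; iter < iterations; iter++) {      // stopping condition: fixed iteration count
      for (int p = 0; p < points.length; p++) {          // assignment step: nearest centroid
        int best = 0;
        for (int c = 1; c < k; c++) {
          if (squaredDistance(points[p], centroids[c]) < squaredDistance(points[p], centroids[best])) {
            best = c;
          }
        }
        assignment[p] = best;
      }
      for (int c = 0; c < k; c++) {                      // update step: mean of assigned points
        double[] sum = new double[points[0].length];
        int count = 0;
        for (int p = 0; p < points.length; p++) {
          if (assignment[p] == c) {
            for (int d = 0; d < sum.length; d++) sum[d] += points[p][d];
            count++;
          }
        }
        if (count > 0) {
          for (int d = 0; d < sum.length; d++) centroids[c][d] = sum[d] / count;
        }
      }
    }
    System.out.println("assignments: " + Arrays.toString(assignment));
    System.out.println("centroids: " + Arrays.deepToString(centroids));
  }

  static double squaredDistance(double[] a, double[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }
}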

Running k-means clustering in Apache Mahout:
The k-means implementation in Apache Mahout is mainly geared toward text processing; if you need to process numerical data, you will have to write some utility code to convert it into the sequence-file vector format. For the standard “Reuters” example, the first few Mahout steps are actually data preprocessing.
To be explicit, for the Reuters example the original downloaded files are in SGML format, which is similar to XML, so we first need to parse them into document-id and document-text pairs. A much quicker way to do this preprocessing is to use the Reuters parser provided in the Lucene benchmark JAR file. After that we can convert the result into sequence files. A sequence file is a file in key-value format; here, the key is the document id and the value is the document content. This step is done with ‘seqdirectory’, a built-in Mahout method for converting input files into sequence files. We then convert the sequence files into term-frequency vectors using Mahout’s ‘seq2sparse’ built-in method. The generated vectors directory should contain the following items:

  • reuters-vectors/df-count
  • reuters-vectors/dictionary.file-0
  • reuters-vectors/frequency.file-0
  • reuters-vectors/tf-vectors
  • reuters-vectors/tfidf-vectors
  • reuters-vectors/tokenized-documents
  • reuters-vectors/wordcount

[Screenshot: contents of the reuters-vectors directory]

We will then use the tfidf-vectors to run k-means. You can give a ‘dummy’ initial-centroids path: when the -k argument is given, Mahout will randomly select k points to initialize the clustering. Once the k-means algorithm has run successfully, we can use Mahout’s ‘clusterdump’ utility to export the results from HDFS to a text or CSV file.

The following are the actual commands used to perform this task.

Get the data from:
wget http://www.daviddlewis.com/resources/testcollections/reuters21578/reuters21578.tar.gz

Place it within the examples folder under the Mahout home directory:
mahout-0.7/examples/reuters
mkdir reuters
cd reuters
mkdir reuters-out
mv reuters21578.tar.gz reuters-out
cd reuters-out
tar -xzvf reuters21578.tar.gz
cd ..

Mahout Commands

#1 Run the org.apache.lucene.benchmark.utils.ExtractReuters class
${MAHOUT_HOME}/bin/mahout org.apache.lucene.benchmark.utils.ExtractReuters \
reuters-out reuters-text

#2 Copy the file to your HDFS
bin/hadoop fs -copyFromLocal \
/home/bigdata/mahout-distribution-0.7/examples/reuters-text \
hdfs://localhost:54310/user/bigdata/

#3 Generate sequence-file
mahout seqdirectory \
-i hdfs://localhost:54310/user/bigdata/reuters-text \
-o hdfs://localhost:54310/user/bigdata/reuters-seqfiles \
-c UTF-8 -chunk 5
-chunk → chunk size (in MB) for the generated sequence files
-c UTF-8 → character encoding of the input files

#4 Check the generated sequence-file
mahout-0.7$ ./bin/mahout seqdumper \
-i /your-hdfs-path-to/reuters-seqfiles/chunk-0 | less

#5 From sequence-file generate vector file
mahout seq2sparse \
-i hdfs://localhost:54310/user/bigdata/reuters-seqfiles \
-o hdfs://localhost:54310/user/bigdata/reuters-vectors -ow
-ow → overwrite

#6 Check that the vectors directory contains the following 7 items
bin/hadoop fs -ls reuters-vectors
reuters-vectors/df-count
reuters-vectors/dictionary.file-0
reuters-vectors/frequency.file-0
reuters-vectors/tf-vectors
reuters-vectors/tfidf-vectors
reuters-vectors/tokenized-documents
reuters-vectors/wordcount

#7 Check the vector: reuters-vectors/tf-vectors/part-r-00000
mahout-0.7$ hadoop fs -ls reuters-vectors/tf-vectors

#8 Run canopy clustering to get optimal initial centroids for k-means
mahout canopy \
-i hdfs://localhost:54310/user/bigdata/reuters-vectors/tf-vectors \
-o hdfs://localhost:54310/user/bigdata/reuters-canopy-centroids \
-dm org.apache.mahout.common.distance.CosineDistanceMeasure \
-t1 1500 -t2 2000

-dm → specifying the distance measure to be used while clustering (here it is cosine distance measure)
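For context, here is a minimal sketch of what a cosine-based distance computes for two term vectors: cosine similarity is the normalized dot product, and the corresponding distance is commonly taken as 1 minus that similarity. The TF-IDF weights below are made up purely for illustration.

public class CosineDistanceSketch {
  public static void main(String[] args) {
    // hypothetical TF-IDF weights for two documents over the same 4 terms
    double[] doc1 = {0.0, 1.2, 3.4, 0.5};
    double[] doc2 = {0.8, 1.0, 2.9, 0.0};

    double dot = 0, norm1 = 0, norm2 = 0;
    for (int i = 0; i < doc1.length; i++) {
      dot += doc1[i] * doc2[i];
      norm1 += doc1[i] * doc1[i];
      norm2 += doc2[i] * doc2[i];
    }
    double similarity = dot / (Math.sqrt(norm1) * Math.sqrt(norm2)); // cosine similarity
    double distance = 1.0 - similarity;                              // 0 means same direction
    System.out.println("cosine similarity = " + similarity + ", distance = " + distance);
  }
}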

#9 Run k-means clustering algorithm
mahout kmeans \
-i hdfs://localhost:54310/user/bigdata/reuters-vectors/tfidf-vectors \
-c hdfs://localhost:54310/user/bigdata/reuters-canopy-centroids \
-o hdfs://localhost:54310/user/bigdata/reuters-kmeans-clusters \
-cd 0.1 -ow -x 20 -k 10

-i → input
-o → output
-c → initial centroids for k-means (not defining this parameter will trigger k-means to generate random initial centroids)
-cd → convergence delta parameter
-ow → overwrite
-x → specifying number of k-means iterations
-k → specifying number of clusters

#10 Export k-means output using Cluster Dumper tool
mahout clusterdump \
-dt sequencefile \
-d hdfs://localhost:54310/user/bigdata/reuters-vectors/dictionary.file-* \
-i hdfs://localhost:54310/user/bigdata/reuters-kmeans-clusters/clusters-8-final \
-o clusters.txt -b 15

-dt → dictionary type
-b → specifying length of each word

The output would look like this (note: this screenshot is from a real-life document-clustering run on big data, not the Reuters example):

[Screenshot: clusterdump output]