Market Basket Analysis with Mahout

Also known as Affinity Analysis/Frequent Pattern Mining.
Finding patterns in huge amounts of customer transactional data is called market basket analysis. This is useful where store’s transactional data is readily available. Using market basket analysis, one can find purchasing patterns. Market basket analysis is also called associative rule mining (actually its otherway around) or affinity analysis or frequent pattern mining. This technique is behind all customer promotional offers like buy 1 get 1 free, discounts, complimentary products, etc… that we see in the deparmental stores/supermarket chains.

MBA is one of the ways of recommending products to the customer. If we have customer transaction data, data where the number of items bought by each customer is available (the receipt that we get for buying a product) and imagine we have a million transaction records like this, then we can find out buying patterns. For example, lets assume the lifestyle of people at a certain locality eat chips while drinking beer. Then whoever comes into supermarket with sole purpose of buying beer, could most likely to pick up a packet of chips (crisps if you are from the UK) — now this is a pattern we know from local knowledge.

But what about other patterns that we don’t know/we don’t speak about? MBA helps us to find out such patterns.  Why do we need such pattern information? We can use this information for purchase planning, introducing new products (not just on MBA results based also with the help of statistical hypothesis/inference)

Lets perform market basket analysis using Apache Mahout. You should have Apache Mahout and Hadoop installed, up and running.

For the input dataset, I found an anonymous Belgian supermarket transaction data. This data is available thanks to the courtesy of Dr Tom Briggs from University of Hasselt, Belgium. Please visit http://fimi.ua.ac.be/data/retail.pdf and http://www.luc.ac.be/~brijs/. We will use the retail.dat as input dataset. Each record in the data set contains information about the date of purchase, the receipt number, the article number, the number of items purchased, the article price in Belgian Francs and the customer number.

The data are collected over three non-consecutive periods. The first period runs from half December 1999 to half January 2000. The second period runs from 2000 to the beginning of June 2000. The third and final period runs from the end of August 2000 to the end of November 2000. In between these periods, no data is available, unfortunately. This results in approximately 5 months of data. The total amount of receipts being collected equals 88,163. 5,133 customers have purchased at least one product in the supermarket during the data collection period.

The following steps are required to perform the market basket analysis with Apache Mahout:

1. Copy the data from local disk to HDFS:
hadoop fs -copyFromLocal /home/path to your file/retail.dat retail.dat

2. Execute Mahout’s FPG procedure
mahout fpg -i retail.dat -o patternone -method mapreduce

Note: the number of top items with most frequent pattern is by default 50. Meaning, the output will provide you top 50 items.
you can also run this procedure in sequential mode, just make sure you have the data file in the directory where you execute this.

3. Check the output after processing. It will be in the folder ‘patternone’ in your HDFS.

hadoop fs -ls /user/hduser/patternone

4. Identify the output in form of sequence file. It will be under frequentpatterns folder. Use sequence dumper utility to extract the output.
mahout seqdumper -i hdfs://localhost:54310/user/hduser/patternone/frequentpatterns/part-r-00000

5. Learn to interpret the output. It will be likely to be in following key value pair:

FPM Output

Item is called the key here. ([99],32) means that item 99 seems to be appearing in 32 transactions. ([142, 99],22) means item 142 and 99 seems to be appearing in 22 transactions. We dont know for sure what those items are but in real life situations items will be indexed against its names so you can get output with name of the item.

For example item 99 could be beer. 32 means beer was bought by 32 customers. 142 and 99 appearing in 22 transactions meaning, people have bought beer and chips together in those 22 instances. May be the remaining people already have chips at home :).

This data can be used for further analysis to determine the nature of promotion that could be offered to customers.

Advertisements

13 comments

  1. Hi
    Thank you for your example. I have tried to get the retail.dat from the link provided in the pdf file but is not more availebel. Could you send me the file or post somewere in order to be able to downlow i
    Thank you

  2. I am enable to load the class fpg geeting the error like this..

    14/03/24 16:42:23 WARN driver.MahoutDriver: Unable to add class: fpg
    14/03/24 16:42:23 WARN driver.MahoutDriver: No fpg.props found on classpath, will use command-line arguments only
    Unknown program ‘fpg’ chosen.
    Valid program names are:
    arff.vector: : Generate Vectors from an ARFF file or directory
    baumwelch: : Baum-Welch algorithm for unsupervised HMM training
    canopy: : Canopy clustering
    cat: : Print a file or resource as the logistic regression models would see it
    cleansvd: : Cleanup and verification of SVD output
    clusterdump: : Dump cluster output to text
    clusterpp: : Groups Clustering Output In Clusters
    cmdump: : Dump confusion matrix in HTML or text formats
    concatmatrices: : Concatenates 2 matrices of same cardinality into a single matrix
    cvb: : LDA via Collapsed Variation Bayes (0th deriv. approx)
    cvb0_local: : LDA via Collapsed Variation Bayes, in memory locally.
    evaluateFactorization: : compute RMSE and MAE of a rating matrix factorization against probes

    1. frequent pattern mining has been removed since mahout 0.8 . though as a mahout dev i would suggest you use latest version of mahout, if this is something you need to use for large scale of data then install an 0.7 seperately to make use of fpm. also there are packages in R programming that can run affinity analysis/ frequent pattern mining do have a look

  3. hello…rejesh..

    i was geting error like this..while running…
    14/04/08 16:12:12 WARN mapred.JobClient: Error reading task outputhttp://localhost:50060/tasklog?plaintext=true&attemptid=attempt_201404081602_0001_m_000001_0&filter=stdout..

    can u help me

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s