Market Basket Analysis with Mahout

Also known as Affinity Analysis/Frequent Pattern Mining.
Market basket analysis means finding patterns in large volumes of customer transaction data. It is useful wherever a store’s transactional data is readily available, because it reveals purchasing patterns. It is also called affinity analysis or frequent pattern mining, and it is closely tied to association rule mining (strictly speaking, market basket analysis is an application of association rule mining, not the other way around). This technique is behind many of the promotional offers we see in department stores and supermarket chains: buy one get one free, discounts, complimentary products, and so on.

MBA is one way of recommending products to customers. Suppose we have customer transaction data, that is, the list of items bought in each transaction (the receipt we get when buying products), and imagine we have a million such records. From these we can find buying patterns. For example, let’s assume people in a certain locality like to eat chips while drinking beer. Then whoever comes into the supermarket solely to buy beer will most likely pick up a packet of chips (crisps if you are from the UK) as well. This is a pattern we know from local knowledge.

But what about patterns that we don’t know or don’t talk about? MBA helps us find them. Why do we need this information? We can use it for purchase planning and for introducing new products (not based on MBA results alone, but also with the help of statistical hypothesis testing and inference).
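To picture how unknown patterns can surface from raw receipts, here is a minimal sketch that counts how often each pair of items is bought together. The receipts and the helper name `pair_counts` are made up for illustration, and this brute-force counting is not the algorithm Mahout uses; it only shows the idea of co-occurrence on a toy scale.

```python
from collections import Counter
from itertools import combinations

# Hypothetical receipts: each set holds the items of one transaction.
transactions = [
    {"beer", "chips"},
    {"beer", "chips", "bread"},
    {"beer"},
    {"bread", "milk"},
    {"chips", "milk"},
]

def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        # Sorting gives every pair one canonical order, e.g. ("beer", "chips").
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts

counts = pair_counts(transactions)
# The most frequently co-purchased pair and its count.
print(counts.most_common(1))
```

Counting every pair this way becomes expensive as the number of distinct items grows, which is one reason to hand large data sets to a dedicated frequent pattern mining tool.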

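Let’s perform market basket analysis using Apache Mahout. You should have Apache Mahout and Hadoop installed and running. Note that the fpg program used below is not available in every Mahout release: Mahout 0.9 reports it as an unknown program, so use a release that still ships it, such as 0.7.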

For the input dataset, I found anonymized transaction data from a Belgian supermarket. This data is available courtesy of Dr Tom Brijs of the University of Hasselt, Belgium. Please visit http://fimi.ua.ac.be/data/retail.pdf and http://www.luc.ac.be/~brijs/. We will use retail.dat as the input dataset. Each record in the original data set contains the date of purchase, the receipt number, the article number, the number of items purchased, the article price in Belgian francs, and the customer number.

The data were collected over three non-consecutive periods. The first period runs from mid-December 1999 to mid-January 2000. The second period runs from 2000 to the beginning of June 2000. The third and final period runs from the end of August 2000 to the end of November 2000. Unfortunately, no data is available between these periods. This results in approximately 5 months of data. The total number of receipts collected is 88,163. In all, 5,133 customers purchased at least one product in the supermarket during the data collection period.
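Before running anything on Hadoop, it can help to sanity-check the input locally. The sketch below assumes that retail.dat stores one transaction per line as space-separated numeric item IDs; that layout is an assumption, so verify it against your copy of the file. The sample lines and the helper name `read_transactions` are invented for illustration.

```python
def read_transactions(lines):
    """Parse lines of space-separated item IDs into lists of ints, skipping blank lines."""
    return [[int(token) for token in line.split()] for line in lines if line.strip()]

# Hypothetical sample in the assumed layout; not taken from the real file.
sample = ["0 1 2", "30 31 32", "", "33 34 35 1"]

baskets = read_transactions(sample)
summary = {
    "transactions": len(baskets),
    "distinct_items": len({item for basket in baskets for item in basket}),
}
print(summary)
```

To check the real data, pass an open file instead of the sample, for example `read_transactions(open("retail.dat"))`, and compare the transaction count with the number of receipts reported above.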

The following steps are required to perform the market basket analysis with Apache Mahout:

1. Copy the data from local disk to HDFS:
hadoop fs -copyFromLocal /home/<path-to-your-file>/retail.dat retail.dat
2. Execute Mahout’s FPG procedure:
mahout fpg -i retail.dat -o patternone -method mapreduce
Note: by default, FPG mines the top 50 frequent patterns per item (this limit is set with the -k option), so the output is capped at the top 50.
You can also run this procedure in sequential mode (-method sequential); just make sure the data file is in the directory where you execute the command.

3. Check the output after processing. It will be in the folder ‘patternone’ in your HDFS.

hadoop fs -ls /user/hduser/patternone
4. The output is stored as a sequence file under the frequentpatterns folder. Use the sequence dumper utility to extract it:
mahout seqdumper -i hdfs://localhost:54310/user/hduser/patternone/frequentpatterns/part-r-00000
5. Learn to interpret the output. It consists of key-value pairs.

The item is the key here. ([99],32) means that item 99 appears in 32 transactions. ([142, 99],22) means that items 142 and 99 appear together in 22 transactions. We don’t know for sure what those items are, but in real-life situations item IDs are indexed against item names, so you can get output with the names of the items.

For example, item 99 could be beer; 32 then means beer was bought in 32 transactions. If item 142 is chips, then 142 and 99 appearing together in 22 transactions means people bought beer and chips together in those 22 instances. Maybe the remaining people already had chips at home :).
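The two counts above are enough to estimate how often a beer purchase comes with chips. The sketch below parses pattern strings written the way they are quoted in this post, ([items],count), and computes the confidence of the rule 99 -> 142 as 22/32. The regular expression and the helper name `parse_patterns` are assumptions for illustration; real seqdumper output may carry extra text around each pair.

```python
import re

# Matches pairs such as "([142, 99],22)": an item list followed by a count.
PATTERN = re.compile(r"\(\[([\d,\s]+)\],\s*(\d+)\)")

def parse_patterns(text):
    """Return {frozenset(items): support} for every '([i, j],n)' pair in text."""
    result = {}
    for items, support in PATTERN.findall(text):
        key = frozenset(int(item) for item in items.split(","))
        result[key] = int(support)
    return result

supports = parse_patterns("([99],32), ([142, 99],22)")

# Confidence of "99 -> 142": share of item-99 transactions that also contain 142.
confidence = supports[frozenset({142, 99})] / supports[frozenset({99})]
print(f"{confidence:.4f}")
```

With these numbers the confidence is 22/32, or about 69%: roughly two out of three transactions containing item 99 also contain item 142.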

This data can be used for further analysis to determine the nature of promotion that could be offered to customers.

13 thoughts on “Market Basket Analysis with Mahout”

Catalin Dinu February 21, 2014 4:38 pm

Hi
Thank you for your example. I tried to get retail.dat from the link provided in the PDF file, but it is no longer available. Could you send me the file or post it somewhere so that I can download it?
Thank you
1. Pavan Kumar February 22, 2014 4:57 am
  
  Hi Catalin, can you try this link: http://fimi.ua.ac.be/data/retail.dat
  
  Let me know if it works properly, or send me your email ID and I will send you this .dat file.
  1. Catalin Dinu February 24, 2014 1:19 pm
    
    Hi,
    Thank you, the link is working 🙂
    Regards
    Catalin
Harikrishna Reddy Revuri March 24, 2014 11:17 am

I am unable to load the class fpg; I am getting an error like this:

14/03/24 16:42:23 WARN driver.MahoutDriver: Unable to add class: fpg
14/03/24 16:42:23 WARN driver.MahoutDriver: No fpg.props found on classpath, will use command-line arguments only
Unknown program ‘fpg’ chosen.
Valid program names are:
arff.vector: : Generate Vectors from an ARFF file or directory
baumwelch: : Baum-Welch algorithm for unsupervised HMM training
canopy: : Canopy clustering
cat: : Print a file or resource as the logistic regression models would see it
cleansvd: : Cleanup and verification of SVD output
clusterdump: : Dump cluster output to text
clusterpp: : Groups Clustering Output In Clusters
cmdump: : Dump confusion matrix in HTML or text formats
concatmatrices: : Concatenates 2 matrices of same cardinality into a single matrix
cvb: : LDA via Collapsed Variation Bayes (0th deriv. approx)
cvb0_local: : LDA via Collapsed Variation Bayes, in memory locally.
evaluateFactorization: : compute RMSE and MAE of a rating matrix factorization against probes
1. Pavan Kumar March 24, 2014 1:42 pm
  
  Hi Harikrishna, which version of Mahout are you working with?
Harikrishna Reddy Revuri March 24, 2014 4:19 pm

It’s 0.9.
1. Pavan Kumar March 25, 2014 1:17 pm
  
  Frequent pattern mining has been removed since Mahout 0.8. Though as a Mahout dev I would suggest you use the latest version of Mahout, if this is something you need for large-scale data, then install 0.7 separately to make use of FPM. There are also R packages that can run affinity analysis/frequent pattern mining; do have a look.
Harikrishna Reddy Revuri March 24, 2014 6:34 pm

I use this command to run it. Is that right?

mahout fpg -i /home/mahout/retail.dat -o patternone -method mapreduce
Harikrishna Reddy Revuri March 26, 2014 2:44 pm

If you have any material on Mahout, can you send it to me? krishnareddy.revuri@gmail.com
Harikrishna Reddy Revuri April 8, 2014 10:51 am

Hello Rejesh,

I was getting an error like this while running:
14/04/08 16:12:12 WARN mapred.JobClient: Error reading task outputhttp://localhost:50060/tasklog?plaintext=true&attemptid=attempt_201404081602_0001_m_000001_0&filter=stdout..

Can you help me?
pe August 4, 2014 11:04 pm

This article is interesting and I must follow it.
alagukumar sanmugavel June 4, 2015 4:59 am

Hi, I am new to big data analysis.
How do I install Hadoop and Mahout on Ubuntu? Could you please point me to any tutorials for beginners?
1. Pavan June 10, 2015 12:49 pm
  
  Hi Sanmugavel, click on the Hadoop and Mahout tags and you will find what you are looking for.

Published by Pavan