Month: November 2013

Building Recommender Engines with Apache Mahout: Part II

In my previous post we looked at recommendations based on user ratings. Here, we are going to build a recommender engine based on items.

Item-based Recommendation:
In item-based analysis, the items related to a particular item are determined. When a user likes a particular item, items related to that item are recommended. As shown in the figure, if items A and C are highly similar and a user likes item A, then item C is recommended to the user.

Item-based recommendation calculates recommendations based on item similarity, not user similarity.

The objective is the same: recommend items to users. The approach, however, is different.

High level logic:
– for every item i that u has no preference for yet
– for every item j that u has a preference for
– compute a similarity s between i and j
– add u’s preference for j, weighted by s, to a running average
– return the top items, ranked by weighted average
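The loop above can be sketched in plain Python. This is an illustrative toy, not Mahout's implementation: the preference data is invented, and a simple cosine similarity over co-rated users stands in for Mahout's pluggable similarity classes.

```python
# Toy preference data: user -> {item: rating}. Invented for illustration.
prefs = {
    "u1": {"A": 5.0, "B": 3.0, "C": 4.0},
    "u2": {"A": 4.0, "B": 1.0},
    "u3": {"B": 2.0, "C": 5.0},
}

def item_similarity(i, j):
    """Cosine similarity between two items over users who rated both."""
    num = den_i = den_j = 0.0
    for ratings in prefs.values():
        if i in ratings and j in ratings:
            num += ratings[i] * ratings[j]
            den_i += ratings[i] ** 2
            den_j += ratings[j] ** 2
    return num / ((den_i ** 0.5) * (den_j ** 0.5)) if num else 0.0

def recommend(user, top_n=2):
    rated = prefs[user]
    all_items = {i for r in prefs.values() for i in r}
    scores = {}
    for i in all_items - rated.keys():      # every item u has no preference for
        num = den = 0.0
        for j, pref in rated.items():       # every item u has a preference for
            s = item_similarity(i, j)       # similarity between i and j
            num += s * pref                 # u's preference for j, weighted by s
            den += abs(s)
        if den > 0:
            scores[i] = num / den           # running weighted average
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

For user u2, who has rated only A and B, the unrated item C is scored by its similarity to A and B and then recommended.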

We are going to use the MovieLens dataset (similar to the IMDb database), which contains 1,000,209 ratings of 3,900 movies by 6,040 users
Format of the data set:
User ID tab Item ID tab Rating tab Time Stamp

  • Format is similar to the 100,000-rating dataset; here the approach focuses on the item (and not the user)
  • Ratings are anonymous.
  • The item-based recommender in Mahout supports a distributed execution model and can be computed using MapReduce

The following steps are required to build an item-based recommendation in Mahout:

#1 Pre-processing the data
awk -F"::" '{print $1","$2","$3}' ml-1m/ratings.dat > ratings.csv
We are using awk to convert the input data file into comma-separated values and write the result to a new CSV file.
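For readers who prefer Python, the same conversion can be done line by line. The sample line below is invented for illustration; it follows the "::"-delimited ratings.dat format implied by the awk field separator.

```python
def to_csv(line):
    # ratings.dat fields are "::"-delimited: userID::movieID::rating::timestamp;
    # keep the first three fields, drop the timestamp.
    user, item, rating, _timestamp = line.rstrip("\n").split("::")
    return f"{user},{item},{rating}"

row = to_csv("1::1193::5::978300760")  # sample line, invented for illustration
```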

#2 Copy the Files to HDFS
hadoop fs -copyFromLocal ratings.csv users.txt hdfs://localhost:54310/user/hduser/itemreco

A simple command to copy the files into the Hadoop Distributed File System.

#3 Use the following Mahout command to compute item-based recommendations
mahout recommenditembased \
-Dmapred.reduce.tasks=10 \
--similarityClassname SIMILARITY_PEARSON_CORRELATION \
--input ratings.csv \
--output item-rec-output \
--tempDir item-rec-tmp \
--usersFile users.txt

Parameter detail:

–Dmapred.reduce.tasks → assigning the number of map/reduce tasks
–similarityClassname → specifying the similarity distance metric. In this case, Pearson Correlation.
–input → path to input file
–output → path to output directory
–tempDir → path to temporary directory (optional parameter, can be used to store temp data)
–usersFile → path to user-ids

The output will look like this:

(Image: mahout5, sample output of the item-based recommendation job)


Computational Statistics setup

With the demand for analytics growing fast, many professionals are looking at leveraging open source systems for their analytics tasks. R is one of the most popular statistical programming languages used for scientific computation. R can also be linked with Hadoop to perform scalable/big-data analytics.

Here we are going to look at installing R and the RStudio IDE, specifically on Linux (Debian). If you follow the commands sequentially, you will have R installed in no time!

sudo apt-get update
sudo apt-get install r-base

If you want to compile R packages, then install the following package as well.

sudo apt-get install r-base-dev

A number of R packages are available for Debian; their names start with r-cran-****. These are usually kept up to date. A package may require some build dependencies in order to run smoothly. Users should be aware of this, and the following command helps:

sudo apt-get build-dep r-cran-&lt;package-name&gt;

Now we have R ready. Type ‘R’ in the terminal to get the command line interface.

Next, we would like to install RStudio. You can either download the latest Debian package from the RStudio site or, if you are on Ubuntu, open the Software Center and search for RStudio. If you choose to download the Debian package from the site, use the following commands:

wget http://download1.rstudio.org/rstudio-0.98.495-i386.deb
sudo dpkg -i rstudio-0.98.495-i386.deb
[or]

sudo dpkg --install package-name-here.deb

Building Recommender Engines with Apache Mahout: Part I

Introduction to recommendation:
Recommender systems provide personalized information by learning the user’s interests from traces of interaction with that user.
Two broad types of recommendation:
– User-based recommendation
– Item-based recommendation
Recommendation engines aim to show items of interest to a user. Recommendation engines in essence are matching engines that take into account the context of where the items are being shown and to whom they’re being shown.
Recommendation engines are one of the best ways of utilizing collective intelligence in your application.
Collaborative Filtering: The process of producing recommendations based on, and only based on, knowledge of users’ relationships to items is called collaborative filtering.

User-based recommendation:

In user-based analysis, users similar to the given user are first determined. As shown in the figure, if a user likes item A, then the same item can be recommended to other users who are similar to that user. Similar users can be obtained by using profile-based information about the user, for example by clustering users on attributes such as age, gender, geographic location, net worth, and so on. Alternatively, you can find similar users with a collaborative approach by analyzing the users’ actions.

(Image: mahout4, user-based recommendation figure)

  • Recommendation based on historical ratings and reviews by users
  • Answers the question: which users are similar to this user?
  • Mahout does not use MapReduce to compute user-based recommendation because the user-based recommender is only designed to work within a single JVM

High level logic

– for every item i that u has no preference for yet
– for every other user v that has a preference for i
– compute a similarity s between u and v
– incorporate v's preference for i, weighted by s, into a running average
– return the top items, ranked by weighted average
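As with the item-based variant, this loop can be sketched in plain Python. The toy preference data and the cosine similarity over commonly rated items are assumptions made for the example, not Mahout's actual similarity implementation.

```python
# Toy preference data: user -> {item: rating}. Invented for illustration.
prefs = {
    "u1": {"A": 5.0, "B": 3.0, "C": 4.0},
    "u2": {"A": 4.0, "B": 1.0},
    "u3": {"B": 2.0, "C": 5.0},
}

def user_similarity(u, v):
    """Cosine similarity between two users over their commonly rated items."""
    common = prefs[u].keys() & prefs[v].keys()
    if not common:
        return 0.0
    num = sum(prefs[u][i] * prefs[v][i] for i in common)
    den_u = sum(prefs[u][i] ** 2 for i in common) ** 0.5
    den_v = sum(prefs[v][i] ** 2 for i in common) ** 0.5
    return num / (den_u * den_v)

def recommend(user, top_n=2):
    all_items = {i for r in prefs.values() for i in r}
    scores = {}
    for i in all_items - prefs[user].keys():   # every item u has no preference for
        num = den = 0.0
        for v, ratings in prefs.items():
            if v != user and i in ratings:     # every other user v with a preference for i
                s = user_similarity(user, v)   # similarity between u and v
                num += s * ratings[i]          # v's preference for i, weighted by s
                den += abs(s)
        if den > 0:
            scores[i] = num / den              # running weighted average
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```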

We are going to use the MovieLens dataset (similar to the IMDb database), which contains 100,000 ratings of 1,682 movies by 943 users
Format of the data set:
User ID tab Item ID tab Rating tab Time Stamp
– Not everyone has seen all the movies
– Each user has rated at least 20 movies, but no user has seen them all
– The objective is to recommend movies to the 943 users
The following steps are required to build a user-based recommendation in Mahout:

#1 Pre-processing the Data
sed 's/\t/,/g' ua.base > final.csv
→ using sed to convert the tab separators into commas and export the result into a CSV file.

#2 Split the input dataset into training data and test data
mahout splitDataset \
--input reco/final.csv \
--output reco/ \
--trainingPercentage 0.9 \
--probePercentage 0.1

–input → path to the input file
–output → path to the output directory
–trainingPercentage → fraction of the data used for training
–probePercentage → fraction of the data used for testing
Using Mahout’s splitDataset function, we split final.csv into two datasets with the specified training and test percentages.
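What splitDataset does can be approximated in a few lines of Python. This is an illustrative sketch, not Mahout's implementation: each rating line is routed to the training or probe set at random, with an assumed seed for reproducibility.

```python
import random

def split_dataset(lines, training_percentage=0.9, seed=42):
    # Route each "userID,itemID,rating" line to training or probe at random.
    rng = random.Random(seed)
    training, probe = [], []
    for line in lines:
        (training if rng.random() < training_percentage else probe).append(line)
    return training, probe

ratings = [f"{u},{i},{r}" for u, i, r in [(1, 10, 4.0), (1, 11, 3.0), (2, 10, 5.0)]]
train, probe = split_dataset(ratings)
```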

#3 Use Mahout’s parallelALS to compute the ratings matrix
Use Mahout’s Alternating Least Squares with Weighted Lambda-Regularization to find the decomposition. This method is faster than singular value decomposition
https://cwiki.apache.org/confluence/display/MAHOUT/Collaborative+Filtering+with+ALS-WR
mahout parallelALS \
--input hdfs://localhost:54310/user/hduser/reco/trainingSet/ \
--output hdfs://localhost:54310/user/hduser/reco/als/out \
--numFeatures 20 \
--numIterations 10 \
--lambda 0.065

–input → path to input directory
–output → path to output directory
–numFeatures → dimensions of the feature space
–numIterations → number of iterations
–lambda → the regularization parameter
The above procedure computes the factored ratings matrix required for the recommendation computation.
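To make the factorization concrete, here is a minimal NumPy sketch of ALS with weighted lambda-regularization on a toy ratings matrix. The matrix, rank, and iteration count are invented for this example; Mahout's actual job distributes these updates over MapReduce.

```python
import numpy as np

# Toy ratings matrix (0 = unrated); rows are users, columns are items.
R = np.array([[5, 4, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [1, 0, 4, 4]], dtype=float)
mask = R > 0

k, lam, iters = 2, 0.065, 20          # numFeatures, lambda, numIterations
rng = np.random.default_rng(0)
U = rng.random((R.shape[0], k))       # user feature matrix ("U" in the output path)
M = rng.random((R.shape[1], k))       # item feature matrix ("M" in the output path)

for _ in range(iters):
    # Fix M and solve a regularized least-squares problem per user; the
    # penalty is weighted by how many ratings that user has (ALS-WR).
    for u in range(R.shape[0]):
        seen = mask[u]
        A = M[seen].T @ M[seen] + lam * seen.sum() * np.eye(k)
        U[u] = np.linalg.solve(A, M[seen].T @ R[u, seen])
    # Fix U and solve per item, symmetrically.
    for i in range(R.shape[1]):
        seen = mask[:, i]
        A = U[seen].T @ U[seen] + lam * seen.sum() * np.eye(k)
        M[i] = np.linalg.solve(A, U[seen].T @ R[seen, i])

pred = U @ M.T   # reconstructed ratings matrix
```

Every zero entry of `pred` is now filled with a predicted rating, which is what the later recommendation step ranks.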

#4 Use Mahout’s evaluateFactorization to compute the RMSE and MAE of a rating matrix factorization against the probe set
RMSE – Root Mean Square Error
MAE – Mean Absolute Error
These metrics are helpful for measuring the accuracy of the factorization.
mahout evaluateFactorization \
--input hdfs://localhost:54310/user/hduser/reco/probeSet/ \
--output hdfs://localhost:54310/user/hduser/reco/als/rmse/ \
--userFeatures hdfs://localhost:54310/user/hduser/reco/als/out/U/ \
--itemFeatures hdfs://localhost:54310/user/hduser/reco/als/out/M/

–input → path to the input directory
–output → path to the output directory
–userFeatures → path to the user feature matrix
–itemFeatures → path to item feature matrix
The output will be stored in RMSE.txt in the HDFS file system
– Output obtained was: 0.9760895406876606
– Lower is better; this value is acceptable, so we can proceed to run the main recommendation algorithm
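For reference, the two error metrics from step #4 reduce to a few lines of Python (the actual/predicted rating values here are invented for illustration):

```python
import math

actual    = [4.0, 3.0, 5.0, 2.0]   # probe-set ratings (invented)
predicted = [3.5, 3.0, 4.0, 2.5]   # model predictions (invented)

errors = [a - p for a, p in zip(actual, predicted)]
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))  # Root Mean Square Error
mae = sum(abs(e) for e in errors) / len(errors)             # Mean Absolute Error
```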

#5 Mahout’s recommendfactorized function will compute recommendations from the factored ratings matrix
mahout recommendfactorized \
--input hdfs://localhost:54310/user/hduser/reco/als/out/userRatings/ \
--output hdfs://localhost:54310/user/hduser/reco/recommendations/ \
--userFeatures hdfs://localhost:54310/user/hduser/reco/als/out/U/ \
--itemFeatures hdfs://localhost:54310/user/hduser/reco/als/out/M/ \
--numRecommendations 6 --maxRating 5

–input → path to the input directory
–output → path to the output directory
–userFeatures → path to the user feature matrix
–itemFeatures → path to item feature matrix
–numRecommendations → number of recommendations made to each user
–maxRating → maximum rating on the rating scale

This will start computing recommendations based on the factored matrix; the –numRecommendations option defines how many recommendations are generated per user.
The output will be in the following format:
[userID]
{movieID1:rating, movieID2:rating,movieID3:rating,
movieID4:rating,movieID5:rating, movieID6:rating}
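A few lines of Python can turn a record in that format back into a usable structure (the user and movie IDs in the sample record are invented for illustration):

```python
def parse_recommendation(record):
    """Parse one record of the form shown above:
    '[userID]' followed by '{movieID1:rating, movieID2:rating, ...}'."""
    user_part, items_part = record.split(maxsplit=1)
    user = int(user_part.strip("[]"))
    recs = {}
    for pair in items_part.strip().strip("{}").split(","):
        movie, rating = pair.strip().split(":")
        recs[int(movie)] = float(rating)
    return user, recs

user, recs = parse_recommendation("[196] {242:4.5, 51:4.2, 285:4.0}")
```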

(Image: mahout3, sample recommendation output)