Performing document clustering using Apache Mahout k-means

Introduction to k-means clustering:
Clustering is all about organizing items from a given collection into groups of similar items. These clusters could be thought of as sets of items similar to each other in some ways but dissimilar from the items belonging to other clusters.

Clustering a collection involves three things:

  • An algorithm—the method used to group the items together.
  • A notion of similarity and dissimilarity—a measure that decides whether an item belongs in an existing cluster or should start a new one.
  • A stopping condition—the point beyond which items can no longer be added to a cluster, or at which the clusters are already quite dissimilar.
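These three ingredients map directly onto the k-means loop. Below is a minimal single-machine sketch in plain Python (toy data; Mahout's implementation is a distributed MapReduce job, not this loop). Squared Euclidean distance plays the role of the dissimilarity measure, and the stopping condition combines a maximum iteration count with a convergence delta, mirroring the -x and -cd flags used later in this article:

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(cluster):
    """Component-wise mean of a non-empty list of points."""
    n = len(cluster)
    return tuple(sum(xs) / n for xs in zip(*cluster))

def kmeans(points, k, max_iter=20, delta=0.001):
    """Cluster points into k groups; returns the final centroids."""
    centroids = random.sample(points, k)   # random initial centers
    for _ in range(max_iter):              # stopping condition 1 (like -x)
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist2(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to its cluster's mean.
        new = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
        # Stopping condition 2 (like -cd): centroids barely moved.
        converged = all(dist2(a, b) < delta ** 2
                        for a, b in zip(centroids, new))
        centroids = new
        if converged:
            break
    return centroids
```

On two well-separated groups of points, the loop converges to the two group means after a couple of iterations, regardless of which points were sampled as initial centers.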

Running k-means clustering in Apache Mahout:
The k-means implementation in Apache Mahout is geared mainly toward text processing; if you need to cluster numerical data, you have to write utility functions that convert it into Mahout's sequence-file/vector format. In the standard “Reuters” example, the first few Mahout steps are in fact data preprocessing.
To be explicit: for the Reuters example, the original downloaded files are in SGML format, which is similar to XML, so we first need to parse them into document-id/document-text pairs. The quickest way to do this preprocessing is to use the Reuters parser shipped in the Lucene benchmark JAR file. After that we convert the extracted files into SequenceFiles. A SequenceFile is a flat file of key-value pairs; here the key is the document id and the value is the document content. This step is done with ‘seqdirectory’, a built-in Mahout command that converts a directory of input files into SequenceFiles. We then convert the SequenceFiles into term-frequency vectors using Mahout’s ‘seq2sparse‘ built-in command. The generated vectors directory should contain the following items:

  • reuters-vectors/df-count
  • reuters-vectors/dictionary.file-0
  • reuters-vectors/frequency.file-0
  • reuters-vectors/tf-vectors
  • reuters-vectors/tfidf-vectors
  • reuters-vectors/tokenized-documents
  • reuters-vectors/wordcount
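Conceptually, seq2sparse tokenizes each document, builds a dictionary mapping each term to an integer index, counts document frequencies, and emits per-document TF and TF-IDF vectors—which is exactly what the directories above hold. A rough single-machine sketch of that pipeline (a hypothetical helper with whitespace tokenization only; Mahout's actual weighting uses a smoothed IDF and optional normalization, so the numbers differ):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: {doc_id: text}. Returns (dictionary, tfidf), where dictionary
    maps term -> integer index (like dictionary.file-0) and tfidf maps
    doc_id -> {index: weight} (like the tfidf-vectors directory)."""
    tokenized = {d: text.lower().split() for d, text in docs.items()}
    terms = sorted({t for toks in tokenized.values() for t in toks})
    dictionary = {t: i for i, t in enumerate(terms)}
    n = len(docs)
    # Document frequency: in how many documents each term appears (df-count).
    df = Counter(t for toks in tokenized.values() for t in set(toks))
    tfidf = {}
    for d, toks in tokenized.items():
        tf = Counter(toks)  # raw term frequency (tf-vectors)
        tfidf[d] = {dictionary[t]: c * math.log(n / df[t])
                    for t, c in tf.items()}
    return dictionary, tfidf
```

Note how a term that appears in every document (df = N) gets weight log(1) = 0: TF-IDF suppresses terms that carry no discriminating power for clustering.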


We will then use the tfidf-vectors to run k-means. You can give a ‘fake’ initial-centroids path: when the argument k is supplied, Mahout will automatically select k random points itself to initialize the clustering. On successfully running the k-means algorithm, we can use Mahout’s ‘clusterdumper‘ utility to export the results from HDFS to a text or CSV file.
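When -k is supplied, Mahout's RandomSeedGenerator simply picks k of the input vectors as the initial centers and writes them to the path you gave. In spirit it is no more than the following sketch (a hypothetical helper name, shown only to demystify the 'fake' path):

```python
import random

def random_seed_centroids(vectors, k, seed=None):
    """Pick k distinct input vectors as initial cluster centers,
    in the spirit of Mahout's RandomSeedGenerator when -k is given."""
    rng = random.Random(seed)
    return rng.sample(vectors, k)  # k distinct elements, no replacement
```

Because the seeds are random, two runs of k-means on the same data can converge to different clusterings; canopy-generated centroids (step #8 below) make the result reproducible.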

The following are the actual commands used while performing this task.

Get the data from:
wget http://www.daviddlewis.com/resources/testcollections/reuters21578/reuters21578.tar.gz

Place it within the examples folder under the Mahout home directory:
mkdir reuters
cd reuters
mkdir reuters-out
mv reuters21578.tar.gz reuters-out
cd reuters-out
tar -xzvf reuters21578.tar.gz
cd ..

Mahout Commands

#1 Run the org.apache.lucene.benchmark.utils.ExtractReuters class
java -cp lucene-benchmark-*.jar org.apache.lucene.benchmark.utils.ExtractReuters reuters-out reuters-text
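For illustration, here is a rough Python equivalent of what ExtractReuters does: split each .sgm file into per-article (id, text) pairs. This is a simplified, hypothetical sketch using regexes; the real class also unescapes SGML entities and handles formatting edge cases:

```python
import re

def extract_reuters(sgml_text):
    """Split Reuters-21578 SGML into (doc_id, text) pairs, roughly what
    Lucene's ExtractReuters does (one output text file per article)."""
    docs = []
    for m in re.finditer(r'<REUTERS[^>]*NEWID="(\d+)"[^>]*>(.*?)</REUTERS>',
                         sgml_text, re.DOTALL):
        doc_id, article = m.group(1), m.group(2)
        title = re.search(r'<TITLE>(.*?)</TITLE>', article, re.DOTALL)
        body = re.search(r'<BODY>(.*?)</BODY>', article, re.DOTALL)
        parts = [g.group(1).strip() for g in (title, body) if g]
        docs.append((doc_id, "\n".join(parts)))
    return docs
```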

#2 Copy the extracted text files to your HDFS
bin/hadoop fs -copyFromLocal reuters-text hdfs://localhost:54310/user/bigdata/reuters-text

#3 Generate sequence-file
mahout seqdirectory -i hdfs://localhost:54310/user/bigdata/reuters-text
-o hdfs://localhost:54310/user/bigdata/reuters-seqfiles -c UTF-8 -chunk 5
-chunk → chunk size in MB (the input is split into SequenceFile chunks of roughly this size)
-c UTF-8 → character encoding of the input files

#4 Check the generated sequence-file
mahout-0.7$ ./bin/mahout seqdumper -i
/your-hdfs-path-to/reuters-seqfiles/chunk-0 | less

#5 From sequence-file generate vector file
mahout seq2sparse -i
hdfs://localhost:54310/user/bigdata/reuters-seqfiles -o
hdfs://localhost:54310/user/bigdata/reuters-vectors -ow
-ow → overwrite

#6 Take a look at the output; it should contain the 7 items listed above
bin/hadoop fs -ls
bin/hadoop fs -ls reuters-vectors

#7 Check the vector: reuters-vectors/tf-vectors/part-r-00000
mahout-0.7$ hadoop fs -ls reuters-vectors/tf-vectors

#8 Run canopy clustering to get optimal initial centroids for k-means
mahout canopy -i
hdfs://localhost:54310/user/bigdata/reuters-vectors/tf-vectors -o
hdfs://localhost:54310/user/bigdata/reuters-canopy-centroids -dm
org.apache.mahout.common.distance.CosineDistanceMeasure -t1 1500 -t2 2000

-dm → specifying the distance measure to be used while clustering (here it is cosine distance measure)
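CosineDistanceMeasure treats two documents as similar when their term vectors point in the same direction, regardless of document length; the distance is 1 minus the cosine of the angle between the vectors, so it ranges from 0 (identical direction) to 2 (opposite direction). A minimal sketch of that measure:

```python
import math

def cosine_distance(a, b):
    """CosineDistanceMeasure-style distance: 1 - cos(angle between a and b).
    Same direction -> 0.0; orthogonal (no shared terms) -> 1.0."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = (math.sqrt(sum(x * x for x in a))
            * math.sqrt(sum(y * y for y in b)))
    return 1.0 - dot / norm
```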

#9 Run k-means clustering algorithm
mahout kmeans -i
hdfs://localhost:54310/user/bigdata/reuters-vectors/tfidf-vectors -c
hdfs://localhost:54310/user/bigdata/reuters-canopy-centroids -o
hdfs://localhost:54310/user/bigdata/reuters-kmeans-clusters -cd 0.1 -ow
-x 20 -k 10

-i → input
-o → output
-c → initial centroids for k-means (not defining this parameter will
trigger k-means to generate random initial centroids)
-cd → convergence delta parameter
-ow → overwrite
-x → specifying number of k-means iterations
-k → specifying number of clusters

#10 Export k-means output using Cluster Dumper tool
mahout clusterdump -dt sequencefile -d hdfs://localhost:54310/user/bigdata/reuters-vectors/dictionary.file-*
-i hdfs://localhost:54310/user/bigdata/reuters-kmeans-clusters/clusters-8-final
-o clusters.txt -b 15

-dt → dictionary type
-b → specifying length of each word

The output would look like this (note: this screenshot was taken from a real-life document clustering job on big data, not from the Reuters example):