Final PPT


T. Karthikeyan

What in the World is Apache Mahout?

Architecture (layers, top to bottom):

Applications
Examples
Freq. Pattern Mining, Genetic, Classification, Clustering, Recommenders
Utilities (Lucene/Vectorizer), Math (Vectors/Matrices/SVD), Collections (primitives)
Apache Hadoop

Mahout Clustering
Algorithms: K-Means, Fuzzy K-Means, Mean shift, Canopy, Dirichlet, Spectral clustering based on eigenvalues, MinHash clustering, LDA-based clustering

Notion of similarity (distance measures): Euclidean, Cosine, Tanimoto, Manhattan
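The four distance measures named above can be sketched in a few lines of plain Python. This is only an illustration of the formulas on dense lists, not Mahout's own implementation; the function names are made up for this example, and the cosine and Tanimoto versions assume non-zero vectors.

```python
import math

def euclidean(a, b):
    # Straight-line distance between two points.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # Sum of absolute coordinate differences.
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    # 1 - cos(angle between a and b); assumes neither vector is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def tanimoto_distance(a, b):
    # 1 - (a.b) / (|a|^2 + |b|^2 - a.b); assumes a and b are not both zero.
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    return 1.0 - dot / denom

a, b = [1.0, 2.0], [2.0, 4.0]
print(euclidean(a, b), manhattan(a, b), cosine_distance(a, b), tanimoto_distance(a, b))
```

Cosine distance ignores vector length (the two example vectors point the same way, so the distance is about 0), which is one reason it is a common choice for text vectors.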

Clustering our own data

Dataset

1. Convert to Hadoop SequenceFile format: ./mahout seqdirectory <options>
2. Convert to sparse vector format: ./mahout seq2sparse <options>
3. Run the clustering driver class: ./mahout <kmeans/> <options>
4. Dump the cluster output: ./mahout clusterdump <options>

Clustering Examples
Using the Reuters dataset (SGML files):

$ bin/mahout seqdirectory -i reuters-ip -o reuters-seqdir \
    -c UTF-8 -chunk 1
$ bin/mahout seq2sparse -i reuters-seqdir -o reuters-sparse
$ bin/mahout kmeans -i reuters-sparse/tfidf-vectors/ \
    -c reuters-clusters \
    -o reuters-kmeans \
    -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
    -cd 0.1 -x 10 -k 20 -ow
$ bin/mahout clusterdump -d reuters-sparse/dictionary.file-0 \
    -s reuters-kmeans-clusters/clusters-19 -b 10 -n 10
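To make the k-means options in the command above more concrete, here is a minimal single-machine sketch of the iteration they control. It assumes that -x is the maximum number of iterations and -cd the convergence delta (the loop stops once no centroid moves further than that). The function names and the toy data are hypothetical, and the sketch uses Euclidean distance rather than the cosine measure from the slide.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(points):
    # Coordinate-wise average of a non-empty list of points.
    return [sum(coords) / len(points) for coords in zip(*points)]

def kmeans(points, centroids, max_iterations=10, convergence_delta=0.1):
    for _ in range(max_iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its old centroid).
        new_centroids = [mean(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        moved = max(euclidean(o, n) for o, n in zip(centroids, new_centroids))
        centroids = new_centroids
        if moved <= convergence_delta:
            break
    return centroids

points = [[1, 1], [1, 2], [8, 8], [9, 8]]
print(kmeans(points, centroids=[[0, 0], [10, 10]]))
```

Here the initial centroids are passed in explicitly; they play the role of the seed clusters supplied to the driver.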

Mahout Classification
Algorithms implemented: Naive Bayes, Complementary Naive Bayes, Random Forest, Logistic Regression (sequential algorithm), Hidden Markov Models
Upcoming algorithms: Support Vector Machines, classification based on Perceptron and Winnow

Bayes, CBayes Classifier


Preprocessing Raw data into classifiable data

Bayes, CBayes Classifier Example


Using the Newsgroup dataset:

$ ./mahout prepare20newsgroups -p 20news-bydate-train -o 20news-train \
    -a org.apache.lucene.analysis.standard.StandardAnalyzer \
    -c UTF-8
$ ./mahout trainclassifier -i 20news-train -o 20news-model \
    -type <cbayes,bayes> \
    -ng 1 -source hdfs
$ ./mahout testclassifier -d 20news-test -m 20news-model \
    -type <cbayes,bayes> \
    -ng 1 -source hdfs

Output : Confusion matrix

Logistic Regression
"x","y","shape","color","k","k0","xx","xy","yy","a","b","c","bias"
0.923307513352484,0.0135197141207755,21,2,4,8,0.852496764213146,...,1

$ ./mahout trainlogistic --input input.csv --output ./model \
    --target color --categories 2
$ ./mahout runlogistic --input test.csv --model ./model \
    --auc --confusion


Output:
AUC = 0.97
Confusion matrix:
        A       B
A     24.0     2.0
B      3.0    11.0
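As a quick reading aid, overall accuracy can be computed from the confusion matrix values shown above: it is the diagonal (correctly classified cases) divided by the total. The slide does not say which axis holds the actual classes, so this sketch only computes accuracy, which is the same either way. The helper name is hypothetical.

```python
# Confusion matrix values from the slide.
confusion = [[24.0, 2.0],
             [3.0, 11.0]]

def accuracy(matrix):
    # Diagonal entries are the correctly classified cases.
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

print(accuracy(confusion))  # 35 correct out of 40 -> 0.875
```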

Random Forest
Input: ARFF or CSV

Generate a file descriptor for the dataset:

$ $HADOOP_HOME/bin/hadoop jar \
    $MAHOUT_HOME/core/target/mahout-core-0.6-SNAPSHOT-job.jar \
    org.apache.mahout.df.tools.Describe -p KDDTrain.arff -f Train.info \
    -d N 3 C 2 N C 4 N C 8 N 2 C 19 N L

Run the example:

$ $HADOOP_HOME/bin/hadoop jar \
    $MAHOUT_HOME/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar \
    org.apache.mahout.df.mapreduce.BuildForest <options>

Use the decision forest to classify new data:

$ $HADOOP_HOME/bin/hadoop jar \
    $MAHOUT_HOME/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar \
    org.apache.mahout.df.mapreduce.TestForest -i Test.arff -ds Train.info <options>

Output: confusion matrix

Dimension reduction
Algorithms implemented: Singular Value Decomposition, Stochastic Singular Value Decomposition

Upcoming algorithms: Principal Components Analysis, Independent Component Analysis, Gaussian Discriminative Analysis
Input: real-valued matrix

0.12  0.8    0.123
0.89  2.33   1.445
4.12  2.123  3.12

./mahout <svd/ssvd> <options>

Output: eigenvectors
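To illustrate what "eigenvectors" means for a real-valued input matrix, the sketch below finds the largest singular value and its right singular vector with plain power iteration on A^T A. This is a deliberately simple stand-in for teaching purposes, not the method Mahout's svd or ssvd jobs use. It assumes the nine sample values on the slide form a 3x3 matrix read row by row; the function names are hypothetical.

```python
import math

def mat_vec(m, v):
    # Matrix-vector product for a list-of-rows matrix.
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def transpose(m):
    return [list(col) for col in zip(*m)]

def top_singular_pair(matrix, iterations=200):
    # Power iteration on A^T A: repeated multiplication pulls v toward
    # the eigenvector with the largest eigenvalue (sigma squared).
    at = transpose(matrix)
    v = [1.0] * len(matrix[0])
    for _ in range(iterations):
        w = mat_vec(at, mat_vec(matrix, v))
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    sigma = math.sqrt(sum(x * x for x in mat_vec(matrix, v)))
    return sigma, v

# Sample values from the slide, assumed to be a 3x3 matrix.
matrix = [[0.12, 0.8, 0.123],
          [0.89, 2.33, 1.445],
          [4.12, 2.123, 3.12]]
sigma, v = top_singular_pair(matrix)
print(sigma, v)
```

The returned vector v is an eigenvector of A^T A, and sigma squared is the matching eigenvalue.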

Frequent Pattern mining


Algorithm: Parallel FP-Growth
Input: dat or CSV

Running Parallel FPGrowth:

$ ./mahout fpg -i retail.dat -o patterns -k 50 -method mapreduce \
    -regex '[\ ]' -s 2

Viewing the results:

$ ./mahout seqdumper -s patterns/part-?-00000 -n 4
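The idea of a frequent pattern with a minimum support can be shown with a small brute-force count. This is not the FP-Growth algorithm (which avoids enumerating candidate itemsets); it only illustrates what the result means. It assumes each input line is one transaction with items separated by a single space, and that -s 2 above is a minimum support of 2. The function name and the toy transactions are hypothetical.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(transactions, min_support, max_size=2):
    # Count every itemset up to max_size items, then keep the ones that
    # appear in at least min_support transactions.
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))
        for size in range(1, max_size + 1):
            for itemset in combinations(items, size):
                counts[itemset] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_support}

lines = ["1 2 3", "1 2", "2 3", "4"]
transactions = [line.split(" ") for line in lines]
patterns = frequent_itemsets(transactions, min_support=2)
for itemset, support in sorted(patterns.items()):
    print(itemset, support)
```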

Recommenders / Collaborative Filtering


Algorithms: non-distributed recommenders ("Taste"), distributed item-based collaborative filtering, collaborative filtering using a parallel matrix factorization
Input is a text file: user,item,preference
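The user,item,preference input format can be illustrated with a short sketch that parses such lines and computes a cosine similarity between two items, the kind of quantity item-based collaborative filtering builds on. The sample rows, IDs, and function names are invented for this example, and the code does not use the Taste API.

```python
import math
from collections import defaultdict

def load_preferences(lines):
    # Build item -> {user: preference} from "user,item,preference" rows.
    prefs = defaultdict(dict)
    for line in lines:
        user, item, value = [field.strip() for field in line.split(",")]
        prefs[item][user] = float(value)
    return prefs

def item_similarity(prefs, item_a, item_b):
    # Cosine similarity between two items' preference vectors,
    # treating missing preferences as zero.
    a, b = prefs[item_a], prefs[item_b]
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[u] * b[u] for u in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

rows = ["1,101,5.0", "1,102,3.0", "2,101,4.0", "2,102,2.0", "3,101,1.0"]
prefs = load_preferences(rows)
print(item_similarity(prefs, "101", "102"))
```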

TASTE

Collaborative Filtering using a parallel matrix factorization


Input: rating matrix or CSV

Run distributed ALS-WR to factorize the rating matrix defined by the training set:

$MAHOUT parallelALS --input TrainingSet --output out \
    --tempDir tmp --numFeatures 20 --numIterations 10 --lambda 0.065

Compute predictions against the probe set and measure the error:

$MAHOUT evaluateFactorization --input TrainingSet --output op \
    --tempDir tmp1

Compute recommendations:

$MAHOUT recommendfactorized --input userRatings --output recommendations \
    --numRecommendations 6 --maxRating 5
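Once a rating matrix has been factorized, a predicted rating is the dot product of a user's feature vector and an item's feature vector, and the error on a probe set is commonly reported as RMSE. The sketch below shows that calculation on tiny hand-made factors. All values and names are hypothetical, and the use of RMSE as the error measure is an assumption here.

```python
import math

def predict(user_features, item_features):
    # Predicted rating = dot product of the two feature vectors.
    return sum(u * m for u, m in zip(user_features, item_features))

def rmse(probe, user_factors, item_factors):
    # Root mean squared error over (user, item, rating) triples.
    squared_errors = [
        (rating - predict(user_factors[user], item_factors[item])) ** 2
        for user, item, rating in probe
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical factors with 2 features each (the slide uses 20).
user_factors = {1: [1.0, 2.0], 2: [0.5, 1.0]}
item_factors = {101: [2.0, 1.0], 102: [1.0, 1.0]}
probe = [(1, 101, 5.0), (1, 102, 3.0), (2, 101, 2.0)]

print(rmse(probe, user_factors, item_factors))
```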

SUMMARY
ALGORITHMS                                             INPUT
All clustering algorithms, Bayes, CBayes classifier    Sparse vector
Logistic regression, Random forest, FP Growth          CSV
Taste, Collaborative filtering                         User, Item, Preference
SVD, SSVD                                              Matrix
