Final PPT
Karthikeyan
What in the World is
Apache MAHOUT
Applications
Genetic
Classification
Clustering
Recommenders
Utilities Lucene/Vectorizer
Collections (primitives)
Apache Hadoop
Mahout Clustering
Algorithms:
K-Means
Fuzzy K-Means
Mean Shift
Canopy
Dirichlet
Spectral clustering (based on eigenvalues)
MinHash clustering
LDA-based clustering
Dataset
Hadoop SequenceFile format:
./mahout seqdirectory <options>
Sparse vector format:
./mahout seq2sparse <options>
Clustering driver class:
./mahout <kmeans/> <options>
Dump cluster output
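The sparse vector step can be pictured with a small sketch. It builds a term dictionary and one sparse TF-IDF vector per document. The function name `tfidf_vectors`, the whitespace tokenizer, and the plain `tf * log(N / df)` weighting are assumptions made for illustration; Mahout's own tokenization and weighting may differ.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (dictionary, vectors): a term -> index map and one sparse
    {index: weight} dict per document. Illustrative only."""
    tokenized = [doc.lower().split() for doc in docs]
    # Dictionary: every distinct term gets an integer index (sorted for determinism).
    terms = sorted({t for toks in tokenized for t in toks})
    dictionary = {term: i for i, term in enumerate(terms)}
    # Document frequency: number of documents that contain each term.
    df = Counter(t for toks in tokenized for t in set(toks))
    n = len(docs)
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        # Assumed weighting: tf * log(N / df).
        weighted = {dictionary[t]: c * math.log(n / df[t]) for t, c in tf.items()}
        # Keep only non-zero entries so the vector stays sparse.
        vectors.append({i: w for i, w in weighted.items() if w != 0.0})
    return dictionary, vectors

docs = ["oil prices rise", "oil exports fall", "wheat prices fall"]
dictionary, vectors = tfidf_vectors(docs)
```

Each vector stores only the indices of terms that occur in the document, which is why a sparse representation suits text.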
Clustering Examples
Using the Reuters dataset (SGML files):
$ bin/mahout seqdirectory -i reuters-ip -o reuters-seqdir \
    -c UTF-8 -chunk 1
$ bin/mahout seq2sparse -i reuters-seqdir -o reuters-sparse
$ bin/mahout kmeans -i reuters-sparse/tfidf-vectors/ \
    -c reuters-clusters \
    -o reuters-kmeans \
    -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
    -cd 0.1 -x 10 -k 20 -ow
$ bin/mahout clusterdump -d reuters-sparse/dictionary.file-0 \
    -s reuters-kmeans-clusters/clusters-19 -b 10 -n 10
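To make the k-means options concrete, here is a minimal single-machine sketch of what the distance measure, convergence delta (-cd), and maximum iterations (-x) control. It is not Mahout's MapReduce implementation: the function names are hypothetical, the caller supplies the initial centroids (Mahout samples them when -k is given), and the convergence test (no centroid moves more than the delta) is an assumption.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; defined as 1.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

def kmeans(points, centroids, max_iter=10, convergence_delta=0.1):
    """Return (centroids, clusters), where clusters come from the last assignment step."""
    centroids = [list(c) for c in centroids]
    clusters = []
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda j: cosine_distance(p, centroids[j]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place
        moved = max(cosine_distance(o, n) for o, n in zip(centroids, new_centroids))
        centroids = new_centroids
        if moved <= convergence_delta:  # assumed stopping rule
            break
    return centroids, clusters

points = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
centroids, clusters = kmeans(points, [[1.0, 0.0], [0.0, 1.0]])
```

Cosine distance ignores vector length, which is a common reason to prefer it for TF-IDF document vectors.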
Mahout Classification
Algorithms implemented:
Naive Bayes
Complementary Naive Bayes
Random Forest
Logistic Regression (sequential algorithm)
Hidden Markov Models
Upcoming algorithms:
Support Vector Machines
Classification based on Perceptron and Winnow
Logistic Regression
"x","y","shape","color","k","k0","xx","xy","yy","a","b","c","bias"
0.923307513352484,0.0135197141207755,21,2,4,8,0.852496764213146,...,1
B [3.0, 11.0]]
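Mahout's logistic regression is trained sequentially with stochastic gradient descent (SGD). The sketch below shows that idea on a toy one-feature dataset. The function names, learning rate, and number of passes are illustrative assumptions and do not correspond to Mahout's options or defaults.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(rows, targets, passes=200, rate=0.5):
    """Plain SGD: one weight update per training example, one example at a time."""
    weights = [0.0] * (len(rows[0]) + 1)  # last weight is the bias term
    for _ in range(passes):
        for x, y in zip(rows, targets):
            features = list(x) + [1.0]  # constant 1.0 feeds the bias weight
            p = sigmoid(sum(w * f for w, f in zip(weights, features)))
            # Gradient step on the log-likelihood of this single example.
            weights = [w + rate * (y - p) * f for w, f in zip(weights, features)]
    return weights

def predict(weights, x):
    features = list(x) + [1.0]
    p = sigmoid(sum(w * f for w, f in zip(weights, features)))
    return 1 if p >= 0.5 else 0

# Toy data (made up): small values belong to class 0, large values to class 1.
rows = [[0.0], [1.0], [2.0], [3.0]]
targets = [0, 0, 1, 1]
weights = train_logistic(rows, targets)
```

Because each update touches one example only, the model never needs the whole dataset in memory, which is the appeal of the sequential approach.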
Random Forest
Input: ARFF or CSV
Generate a file descriptor for the dataset:
$ $HADOOP_HOME/bin/hadoop jar \
    $MAHOUT_HOME/core/target/mahout-core-0.6-SNAPSHOT-job.jar \
    org.apache.mahout.df.tools.Describe -p KDDTrain.arff -f Train.info \
    -d N 3 C 2 N C 4 N C 8 N 2 C 19 N L
Run the example:
$ $HADOOP_HOME/bin/hadoop jar \
    $MAHOUT_HOME/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar \
    org.apache.mahout.df.mapreduce.BuildForest <options>
Use the decision forest to classify new data:
$ $HADOOP_HOME/bin/hadoop jar \
    $MAHOUT_HOME/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar \
    org.apache.mahout.df.mapreduce.TestForest -i Test.arff -ds Train.info <options>
Output: confusion matrix
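The -d descriptor string is compact: N marks a numerical attribute, C a categorical one, L the label, and a number repeats the token that follows it. A small sketch that expands the string may help when checking it against the dataset's columns. The helper name `expand_descriptor` is hypothetical and is not part of Mahout.

```python
def expand_descriptor(descriptor):
    """Expand a descriptor such as 'N 3 C L' into one token per attribute."""
    expanded = []
    repeat = 1
    for token in descriptor.split():
        if token.isdigit():
            repeat = int(token)  # a number applies to the next type token
        else:
            expanded.extend([token] * repeat)
            repeat = 1
    return expanded

attrs = expand_descriptor("N 3 C 2 N C 4 N C 8 N 2 C 19 N L")
numerical = attrs.count("N")
categorical = attrs.count("C")
```

For the descriptor above this gives 42 entries: 34 numerical attributes, 7 categorical attributes, and the label in the last position.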
Dimension reduction
Algorithms implemented:
Singular Value Decomposition (SVD)
Stochastic Singular Value Decomposition (SSVD)
Upcoming algorithms:
Principal Component Analysis
Independent Component Analysis
Gaussian Discriminant Analysis
Input: real-valued matrix
Output: eigenvectors
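As a rough illustration of how a real-valued matrix leads to eigenvectors, the sketch below finds the dominant eigenvector of A^T A (the top right singular vector of A) by power iteration. Power iteration is a stand-in chosen for brevity and is not the solver Mahout uses. The function name and the fixed iteration count are assumptions.

```python
import math

def top_singular_pair(matrix, iterations=100):
    """Return (sigma, v): the largest singular value of the matrix and the
    matching right singular vector, found by power iteration on A^T A."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols  # arbitrary unit-length start vector
    for _ in range(iterations):
        # av = A v
        av = [sum(a * x for a, x in zip(row, v)) for row in matrix]
        # atav = A^T (A v)
        atav = [sum(matrix[r][c] * av[r] for r in range(n_rows)) for c in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in atav))
        v = [x / norm for x in atav]
    # The singular value is the length of A v for the converged unit vector v.
    av = [sum(a * x for a, x in zip(row, v)) for row in matrix]
    sigma = math.sqrt(sum(x * x for x in av))
    return sigma, v

A = [[3.0, 0.0],
     [0.0, 1.0]]
sigma, v = top_singular_pair(A)
```

Projecting the rows of the matrix onto the first few such vectors is what produces the reduced-dimension representation.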
TASTE
$MAHOUT parallelALS --input TrainingSet --output out \
    --tempDir tmp --numFeatures 20 --numIterations 10 --lambda 0.065
Compute predictions against the probe set, measure the error
$MAHOUT evaluateFactorization --input TrainingSet --output op \
    --tempDir tmp1
Compute recommendations
$MAHOUT recommendfactorized
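The evaluation step can be sketched in a few lines. Given user and item feature vectors from a factorization, a predicted rating is their dot product, and the error over the probe set is the root mean squared error (RMSE). The data structures, names, and numbers below are made up for illustration and do not mirror Mahout's file formats.

```python
import math

def predict_rating(user_features, item_features, user, item):
    """Predicted rating = dot product of the user's and the item's feature vectors."""
    return sum(u * m for u, m in zip(user_features[user], item_features[item]))

def rmse(probe, user_features, item_features):
    """Root mean squared error over (user, item, rating) triples."""
    squared = [
        (rating - predict_rating(user_features, item_features, user, item)) ** 2
        for user, item, rating in probe
    ]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical factorization with two features per user and per item.
U = {1: [1.0, 0.0], 2: [0.0, 1.0]}
M = {10: [4.0, 1.0], 20: [2.0, 5.0]}
probe = [(1, 10, 4.0), (2, 20, 4.0)]
error = rmse(probe, U, M)
```

Recommendations follow the same rule: score every unseen item for a user with the dot product and keep the highest-scoring ones.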
SUMMARY
ALGORITHMS
All clustering algorithms, Bayes, CBayes classifier
Logistic regression, Random forest, FP Growth
Taste, collaborative filtering
SVD, SSVD
INPUT
Sparse Vector
CSV
Matrix