International Journal of Advanced Computing, ISSN: 2051-0845, Vol. 36, Issue 1

Random Forest Classifiers: A Survey and Future Research Directions

Vrushali Y. Kulkarni
PhD Student, COEP, Pune, India
Email: [email protected]

Dr. Pradeep K. Sinha
Senior Director, HPC, CDAC, Pune, India

ABSTRACT

Random Forest is an ensemble supervised machine learning technique. Machine learning techniques have applications in the area of data mining. Random Forest has tremendous potential of becoming a popular technique for future classifiers because its performance has been found to be comparable with that of ensemble techniques such as bagging and boosting. Hence, an in-depth study of existing work related to Random Forest will help to accelerate research in the field of machine learning. This paper presents a systematic survey of the work done in the Random Forest area. In this process, we derived a taxonomy of Random Forest classifiers, which is presented in this paper. We also prepared a comparison chart of existing Random Forest classifiers on the basis of relevant parameters. The survey results show that there is scope for improvement in accuracy by using different split measures and combining functions, and for improvement in performance by dynamically pruning a forest and estimating an optimal subset of the forest. There is also scope for evolving novel ideas for stream data and imbalanced data classification, and for semi-supervised learning. Based on this survey, we finally present a few future research directions related to the Random Forest classifier.

Keywords - Data Mining, Ensemble, Classification, Random Forest, Supervised Machine Learning

1. INTRODUCTION

Random Forest is an ensemble supervised machine learning technique that has emerged recently. Machine learning techniques have applications in the area of data mining. Data mining is broadly classified as descriptive and predictive. Descriptive data mining concentrates on describing the data, grouping them into categories, and summarizing the data. Predictive data mining analyzes past data and generates trends or conclusions for future prediction; it has its roots in the classical model building process of statistics. Predictive model building works on the basis of feature analysis of predictor variables. One or more features are considered as predictors, and the output is some function of the predictors, called the hypothesis. The generated hypotheses are tested for acceptance or rejection, and the accuracy of the model is estimated using various error estimation techniques. Usually, descriptive data mining is implemented using unsupervised machine learning techniques, while predictive data mining is carried out using supervised machine learning techniques. Supervised machine learning uses labeled data samples; the labels are used to classify samples into different categories. A predictive model learns from a training dataset, and a test dataset is used to estimate the accuracy of the model.

The decision tree is a commonly used technique for supervised machine learning. Random Forest [11] uses the decision tree as its base classifier. Random Forest generates multiple decision trees; randomization is present in two ways: (1) random sampling of data for bootstrap samples, as is done in bagging, and (2) random selection of input features for generating the individual base decision trees. The strength of the individual decision tree classifiers and the correlation among the base trees are the key factors that decide the generalization error of a Random Forest classifier [11]. The accuracy of the Random Forest classifier has been found to be at par with existing ensemble techniques like bagging and boosting. As per Breiman [11], Random Forest runs efficiently on large databases, can handle thousands of input variables without variable deletion, gives estimates of important variables, generates an internal unbiased estimate of the generalization error as forest growing progresses, has an effective method for estimating missing data and maintains accuracy when a large proportion of the data are missing, and has methods for balancing class error in data sets with unbalanced class populations. The inherent parallel nature of Random Forest has led to parallel implementations using multithreading, multi-core, and parallel architectures. Random Forest is used in many recent classification and prediction applications due to the above-mentioned features. In this paper, we concentrate on the empirical research related to the Random Forest classifier rather than exploring and analyzing its theoretical background in detail.

This paper is organized as follows: Section 2 provides the theoretical foundations of ensembles and the Random Forest algorithm. Section 3 provides a survey of the current status of research on the Random Forest classifier; based on this survey, we have evolved a taxonomy of the Random Forest classifier, which is also presented in that section. Section 4 includes a discussion and a summary chart that summarizes key features of the surveyed Random Forest classifiers in tabular form. Section 5 presents a few future research directions in the area of Random Forest. Section 6 gives concluding remarks.

2. THEORETICAL FOUNDATIONS

2.1 Ensemble Classifiers

An ensemble consists of a set of individually trained classifiers (such as neural networks or decision trees)

whose predictions are combined when classifying new instances. Previous research has shown that an ensemble is often more accurate than any of the single classifiers in the ensemble [20], [22], [29]. Bagging [10] and boosting [32] are two popular methods for producing ensembles; both use re-sampling techniques to obtain different training sets for each of the classifiers. Bagging stands for bootstrap aggregating and works on the concept of bootstrap samples. If the original training dataset is of size N and m individual classifiers are to be generated as part of the ensemble, then m different training sets, each of size N, are generated from the original dataset by sampling with replacement. The multiple classifiers generated in bagging are independent of each other. In boosting, weights are assigned to each sample of the training dataset. If m classifiers are to be generated, they are generated sequentially, one classifier per iteration; for generating classifier Ci, the weights of the training samples are updated based on the classification results of classifier Ci-1. The classifiers generated by boosting are therefore dependent on each other.
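As an illustration of the bagging procedure described above, the following minimal Python sketch (our own illustrative example, not code from the surveyed papers) draws m bootstrap samples of size N, trains one independent decision tree per sample, and combines the votes by simple majority; scikit-learn's DecisionTreeClassifier is assumed as the base learner.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_ensemble(X, y, m=10, seed=0):
    """Train m independent trees, each on a bootstrap sample of size N."""
    rng = np.random.default_rng(seed)
    N = len(X)
    ensemble = []
    for _ in range(m):
        idx = rng.integers(0, N, size=N)              # sampling with replacement
        ensemble.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return ensemble

def predict_majority(ensemble, X_new):
    """Combine the independent predictions by majority vote (integer labels assumed)."""
    votes = np.array([clf.predict(X_new) for clf in ensemble]).astype(int)
    return np.array([np.bincount(col).argmax() for col in votes.T])
```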
Theoretical and empirical research on ensembles has shown that an ideal ensemble consists of highly correct classifiers that disagree as much as possible [18], [22], [26], [35]. Opitz and Shavlik [28] empirically verified that such ensembles generalize well. Breiman [10] showed that bagging is effective for unstable learning algorithms. In [23], Kuncheva presents four approaches for building ensembles of diverse classifiers:

1. Combination level: design different combiners.
2. Classifier level: use different base classifiers.
3. Feature level: use different feature subsets.
4. Data level: use different data subsets.
2.2 Random Forest

Definition: A Random Forest is a classifier consisting of a collection of tree-structured classifiers {h(x, Θk), k = 1, 2, …}, where the {Θk} are independent identically distributed random vectors and each tree casts a unit vote for the most popular class at input x [11].

Random Forest generates an ensemble of decision trees. To achieve diversity among the base decision trees, Breiman selected a randomization approach that works well with bagging or random subspace methods [10], [11], [29]. To generate each single tree in a Random Forest, Breiman followed these steps: If the number of records in the training set is N, then N records are sampled at random, with replacement, from the original data; this is the bootstrap sample, and it becomes the training set for growing the tree. If there are M input variables, a number m << M is chosen such that at each node, m variables are selected at random out of M and the best split on these m attributes is used to split the node. The value of m is held constant during forest growing. Each tree is grown to the largest extent possible; there is no pruning.

In this way, multiple trees are induced in the forest; the number of trees is pre-decided by the parameter Ntree. The number of variables (m) selected at each node is also referred to as mtry or k in the literature. The depth of each tree can be controlled by the parameter nodesize (the number of instances in a leaf node), which is usually set to one.

Once the forest is trained or built as explained above, a new instance to be classified is run across all the trees grown in the forest. Each tree gives a classification for the new instance, which is recorded as a vote. The votes from all the trees are combined, and the class that receives the maximum number of votes (majority voting) is declared as the classification of the new instance.

This process is referred to as Forest-RI in the literature [11]. From here onwards, Random Forest means the forest of decision trees generated using the Forest-RI process.

In the forest building process, when the bootstrap sample set is drawn by sampling with replacement for each tree, about one third of the original instances are left out. This set of instances is called the OOB (out-of-bag) data. Each tree has its own OOB data set, which is used for error estimation of the individual tree in the forest; this is called OOB error estimation. The Random Forest algorithm also has a built-in facility to compute variable importance and proximities [11]. The proximities are used for replacing missing values and for handling outliers.
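A hedged sketch of the Forest-RI procedure follows; it uses scikit-learn's decision tree as a stand-in for Breiman's original implementation, with the Ntree, mtry, and OOB notions from the text above (the function names and the sqrt(M) default for mtry are our assumptions for illustration).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def forest_ri(X, y, n_tree=100, mtry=None, seed=0):
    """Grow n_tree unpruned trees; each split examines mtry randomly chosen features."""
    rng = np.random.default_rng(seed)
    N, M = X.shape
    mtry = mtry or max(1, int(np.sqrt(M)))        # a common choice for m << M
    forest, oob_sets = [], []
    for _ in range(n_tree):
        idx = rng.integers(0, N, size=N)          # bootstrap sample of size N
        oob = np.setdiff1d(np.arange(N), idx)     # ~1/3 of instances are out-of-bag
        tree = DecisionTreeClassifier(max_features=mtry,   # m variables tried per node
                                      min_samples_leaf=1)  # grown fully, no pruning
        forest.append(tree.fit(X[idx], y[idx]))
        oob_sets.append(oob)
    return forest, oob_sets

def forest_predict(forest, X_new):
    """Majority vote over all trees (integer class labels assumed)."""
    votes = np.array([t.predict(X_new) for t in forest]).astype(int)
    return np.array([np.bincount(col).argmax() for col in votes.T])
```

The returned oob_sets can be used to estimate each tree's OOB error on the instances it never saw during training.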
Illustrating Accuracy of Random Forest:

The generalization error PE* of a Random Forest is given as

PE* = P_X,Y ( mg(X, Y) < 0 )

where mg(X, Y) is the margin function. The margin function measures the extent to which the average number of votes at (X, Y) for the right class exceeds the average vote for any other class; here X is the predictor vector and Y is the classification. The margin function is given as

mg(X, Y) = av_k I(h_k(X) = Y) − max_{j ≠ Y} av_k I(h_k(X) = j)

where I(·) is the indicator function. The margin is directly proportional to the confidence in the classification.

The strength of a Random Forest is given in terms of the expected value of the margin function as

s = E_X,Y [ mg(X, Y) ]

The generalization error of the ensemble classifier is bounded above by a function of the mean correlation between the base classifiers and their average strength s [33]. If ρ is the mean value of the correlation, an upper bound for the generalization error is given by

PE* ≤ ρ (1 − s²) / s²
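To connect these quantities to the voting procedure, here is a small numeric sketch (a hypothetical example of ours, not taken from the paper) that estimates the margin mg(X, Y), the strength s, and the empirical generalization error P(mg < 0) from a matrix of per-tree votes.

```python
import numpy as np

def empirical_margin(votes, y_true):
    """votes: (n_trees, n_samples) predicted labels. Returns mg(X, Y) per sample."""
    classes = np.unique(np.concatenate([votes.ravel(), y_true]))
    margins = np.empty(votes.shape[1])
    for i in range(votes.shape[1]):
        frac = np.array([(votes[:, i] == c).mean() for c in classes])  # av_k I(h_k = c)
        right = frac[classes == y_true[i]][0]        # average vote for the true class
        other = frac[classes != y_true[i]].max()     # best competing class
        margins[i] = right - other
    return margins

votes = np.array([[0, 1, 1, 0],
                  [0, 1, 0, 0],
                  [1, 1, 1, 0]])                     # 3 trees, 4 samples (hypothetical)
y_true = np.array([0, 1, 1, 0])
mg = empirical_margin(votes, y_true)
print(mg.mean())                                     # strength s = E[mg(X, Y)]
print((mg < 0).mean())                               # empirical estimate of PE*
```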


3. CURRENT ONGOING WORK ON RANDOM FOREST

Research work in the area of Random Forest aims at improving accuracy, improving performance (reducing the time required for learning and classification), or both. Some work aims at experimentation with Random Forest on online continuous stream data, which is essential today because data streams are generated by various applications. Random Forest being an ensemble technique, experiments are also done with its base classifier, e.g. using a fuzzy decision tree as the base classifier of Random Forest. We have carried out a systematic survey of the current ongoing research on Random Forest and developed a "Taxonomy of Random Forest Classifier". In this section, we first elaborate on the work done and then present the taxonomy.

3.1 Improvements in the Random Forest Algorithm Based on Accuracy

To obtain a good ensemble, the base classifiers need to be diverse (i.e. they predict differently) and accurate. Random selection of attributes makes individual trees weak. The suggested improvements therefore aim to make the individual base classifiers strong as well as diverse.
Meta Random Forest [7] is based on the concept of using random forests themselves as base classifiers for making ensembles, and the performance of this model is tested and compared with the existing Random Forest algorithm. Meta Random Forests are generated by both bagging and boosting approaches, i.e. an ensemble using Random Forest as the base classifier with the bagging approach, and an ensemble using Random Forest as the base classifier with the boosting approach. A comparative study of these two techniques and the original Random Forest technique has shown that the bagged Random Forest gives the best results among the three.

In the original Random Forest, the Gini index is used for the attribute split in the decision tree. The Gini index is not able to detect strong conditional dependencies among attributes [34]; the ReliefF measure for attribute split gives better results in this case. Robnik-Šikonja [34] experimented with Random Forest using five different attribute measures, where each fifth of the trees in the forest is generated using a different split measure (Gini index, Gain ratio, MDL, ReliefF). This helped in decreasing the correlation between the trees while retaining their strength, but the observed performance increase was not significant.
As suggested by Breiman, Random Forest uses majority voting as the voting mechanism for classification, and experiments have been carried out on this voting mechanism. For improving the voting scheme, internal estimates are used. The process is as follows: for classifying a new instance, training instances similar to this new instance are found, and then the individual trees are given weights based on the strength they demonstrate on these selected instances. This is a kind of weighted voting. Research work related to Dynamic Integration demonstrates that the performance of Random Forest is improved in some domains by replacing majority voting with Dynamic Integration, which is based on the local prediction performance of the base decision trees. Tsymbal, Pechenizkiy, and Cunningham [38] suggested three different techniques based on the performance of local predictors: Dynamic Selection (DS), Dynamic Voting (DV), and Dynamic Voting with Selection (DVS).
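The weighting step can be made concrete with the following sketch (our own simplified reading of the idea, not the exact procedure of [38]): the neighbours of the new instance are located in the training set, each tree is weighted by its accuracy on those neighbours, and the weighted votes are summed per class.

```python
import numpy as np

def dynamic_weighted_vote(forest, X_train, y_train, x_new, k=10):
    """Weight each tree by its local accuracy on the k training instances
    closest to x_new, then combine the weighted votes (integer labels assumed)."""
    dist = np.linalg.norm(X_train - x_new, axis=1)   # similarity via Euclidean distance
    nbr = np.argsort(dist)[:k]                       # indices of the k nearest neighbours
    n_classes = int(y_train.max()) + 1
    scores = np.zeros(n_classes)
    for tree in forest:
        local_acc = (tree.predict(X_train[nbr]) == y_train[nbr]).mean()
        vote = int(tree.predict(x_new.reshape(1, -1))[0])
        scores[vote] += local_acc                    # stronger local trees count more
    return scores.argmax()
```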
Simon Bernard, Laurent Heutte, and Sebastien Adam [4] proposed a new Random Forest algorithm called Forest-RK, in which k, the number of features, is randomly selected at each node during the tree induction process. The paper states that k is not a hyper-parameter, as it does not play a crucial role in generating an accurate Random Forest classifier. They used the McNemar statistical test of significance to compare the predictions generated by the original Random Forest and Forest-RK, and they claimed that the two algorithms are statistically equivalent.

3.2 Improvements in the Random Forest Algorithm Based on Performance

Theoretical and empirical results have shown that above a certain number of trees, adding more trees to the forest does not improve accuracy [5]. Specific methods have been suggested to find a sub-forest that can achieve the prediction accuracy of a large random forest, and researchers have made efforts to obtain smaller forests or to shrink the forest. Most of these efforts are based on the Overproduce-and-Choose strategy [36]. The approach taken to shrink the forest is as follows: first overproduce the forest to a fixed number decided a priori and calculate the prediction accuracy of the forest. For every tree T in the forest, calculate the prediction accuracy of the forest that excludes T, and find the difference (ΔT) between the prediction accuracies of the original forest and the forest without T. The tree with the minimum ΔT is the least important one and is removed from the forest [43].
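The ΔT-based removal step can be sketched as follows (a simplified illustration under our own assumptions: a forest_predict majority-vote helper as in Section 2.2 and an accuracy criterion on a held-out validation set; [43] may differ in details). Repeating this step until accuracy starts to degrade yields one possible Overproduce-and-Choose schedule.

```python
import numpy as np

def prune_least_important(forest, X_val, y_val, forest_predict):
    """Remove the tree whose exclusion changes forest accuracy the least (minimum ΔT)."""
    base_acc = (forest_predict(forest, X_val) == y_val).mean()
    deltas = []
    for i in range(len(forest)):
        reduced = forest[:i] + forest[i + 1:]          # forest without tree T_i
        acc = (forest_predict(reduced, X_val) == y_val).mean()
        deltas.append(base_acc - acc)                  # ΔT for tree T_i
    least = int(np.argmin(deltas))                     # least important tree
    return forest[:least] + forest[least + 1:]
```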
Another approach is based on the similarity between two trees: a tree can be removed if it is similar to other trees in the forest. Yet another approach to limiting the number of trees in the random forest works a priori and is based on applying the McNemar non-parametric test of significance between the predictions of two subsets of the original forest [24].

Work that adds to the existing "Overproduce and Choose" paradigm is suggested in [41]. Here a new algorithm called BAGA is proposed, which generates an ensemble using a combination of bagging and genetic algorithm techniques so that the individual classifiers are determined at execution time. As bagging can be treated as a special case of Random Forest, the BAGA approach is also applicable to Random Forest and is hence included here as a development related to Random Forest.

Researchers have also proposed a new concept called the dynamic ensemble. The dynamic induction of a Random Forest eliminates the overproduce phase. In their work, Tripoliti, Fotiadis, and Manis [44] determine the number of decision trees in the random forest dynamically during the


growing process of the forest. The method is based on on-line curve fitting. The forest is first built with 10 trees; at each subsequent step, a new tree is added and tested for best fit, using eight polynomials for the selection of the best fit. The termination of the iterative process is based on predefined thresholds for the fitted value and the accuracy curve; these threshold values are determined heuristically.

In Dynamic Random Forests [6], the individual base trees are added in a dependent manner rather than with the independent approach taken by Breiman. A new tree is added to the forest by taking into account the evaluation of the sub-forest already built, thus taking an adaptive approach. With this approach, an initial tree is generated in the traditional way, as in the original Random Forest. For generating every subsequent tree, the weights of the training instances are modified (as is done in boosting), so that the weights are increased for instances that are wrongly classified by the sub-forest built so far and decreased for correctly classified instances. This approach generates dependent trees and nullifies the inherent parallel nature of the original Random Forest. Here the base trees are Random Trees rather than the decision trees used by Breiman.
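A minimal sketch of this boosting-style, dependent induction follows (our own illustrative reading of the scheme in [6]; the weight-update factors and the use of a weighted bootstrap with scikit-learn trees are assumptions made purely for illustration).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def dynamic_forest(X, y, n_tree=50, seed=0):
    """Add trees sequentially, up-weighting instances the current sub-forest misclassifies."""
    rng = np.random.default_rng(seed)
    N = len(X)
    weights = np.ones(N) / N
    forest = []
    for _ in range(n_tree):
        idx = rng.choice(N, size=N, replace=True, p=weights)   # weighted bootstrap
        forest.append(DecisionTreeClassifier(max_features="sqrt").fit(X[idx], y[idx]))
        # evaluate the sub-forest built so far and adapt the instance weights
        votes = np.array([t.predict(X) for t in forest]).astype(int)
        pred = np.array([np.bincount(col).argmax() for col in votes.T])
        wrong = pred != y
        weights[wrong] *= 1.5                                   # assumed update factors
        weights[~wrong] *= 0.9
        weights /= weights.sum()
    return forest
```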
Many tasks in the data mining domain involve high-dimensional data; consequently, these tasks are often complex and computationally expensive. A GPU-based implementation of the Random Forest algorithm has been developed based on the Compute Unified Device Architecture (CUDA). The algorithm was experimentally evaluated on an NVIDIA GT 220 graphics card with 48 CUDA cores and 1 GB of memory. Both the training phase and the classification phase are parallelized in the CUDA implementation. Performance is compared with two state-of-the-art implementations of Random Forest: a sequential one (LibRF) and a parallel one (FastRF) in Weka [19]. CudaRF outperforms both LibRF and FastRF for the specified classification task [17].
3.3 Improvements in the Random Forest Algorithm for Online Data

The standard Random Forest algorithm works on off-line data, but many recent applications deal with data streams. Streams are conceptually endless sequences of data records, arriving in real time and often at high flow rates [1], [14]. The challenge with streaming data is that there cannot be multiple passes through the data for analysis. Streaming Random Forest is a classification algorithm that combines techniques used to build streaming decision trees with the attribute selection techniques of Random Forest. The streaming version of Random Forest achieves classification accuracy comparable to the standard version on artificial and real data sets using only a single pass through the data [1]. Its limitation is that the algorithm handles only numerical or nominal attributes for which the minimum and maximum values of each attribute are known. It also handles the multi-class classification problem.

The Online Random Forest algorithm [37] generates on-line decision trees based on concepts from on-line bagging [30] and extremely randomized trees [15]. It also uses a temporal weighting scheme to discard non-performing trees based on their out-of-bag error performance. The algorithm has been ported to an NVIDIA GPU, which has shown a ten times speed-up.
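On-line bagging [30], on which the Online Random Forest builds, replaces the explicit bootstrap with per-instance Poisson(1) weights so that each arriving record can be processed exactly once. The sketch below illustrates that idea with a hypothetical incremental base learner exposing a partial_fit-style update (an assumption of ours; [37] grows randomized trees online rather than using such models).

```python
import numpy as np

def online_bagging_update(learners, x, y, rng=np.random.default_rng(0)):
    """Process one streaming instance: each learner sees it k ~ Poisson(1) times."""
    for learner in learners:
        k = rng.poisson(1.0)               # replaces drawing the instance in a bootstrap
        for _ in range(k):
            learner.partial_fit([x], [y])  # assumed incremental update, single pass
    return learners
```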
The Incremental Extremely Random Forest algorithm is specially designed for small data streams [40]. The algorithm works by expanding the leaf nodes without reconstructing the whole trees. This approach avoids the use of Hoeffding bounds, which need a large number of samples.

3.4 Data-Specific Random Forest Algorithms

In many real-world applications, the data to be dealt with are imbalanced, and a classifier built using all the data has a tendency to ignore the minority class. There are two common approaches to dealing with imbalanced data: the first is based on cost-sensitive learning, and the second is based on a sampling technique, either down-sampling the majority class or over-sampling the minority class. Breiman has mentioned that Random Forest has methods for balancing error in data sets with unbalanced class populations [11]. Nikulin, McLachlan, and Ng proposed using a large number of relatively small and balanced subsets, where representatives from the larger pattern are selected randomly [27]. Another approach is ensemble learning based on repeated random sub-sampling [12]; this technique divides the training data into multiple sub-samples while ensuring that each sub-sample is fully balanced. The results have shown that the Random Forest ensemble outperformed SVM, bagging, and boosting in terms of the area under the receiver operating characteristic (ROC) curve (AUC) for imbalanced data. It is suggested that Random Forest can be used as a base learner of an ensemble for achieving better results with imbalanced data [27].
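A sketch of the repeated balanced sub-sampling idea follows (our own illustration for a binary problem; [12] and [27] differ in details such as how many subsets are drawn and how the sub-models are combined).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def balanced_subsample_ensemble(X, y, n_subsets=20, seed=0):
    """Train one forest per balanced subset: all minority instances plus an
    equally sized random draw from the majority class."""
    rng = np.random.default_rng(seed)
    minority = np.flatnonzero(y == 1)          # assume class 1 is the minority
    majority = np.flatnonzero(y == 0)
    models = []
    for _ in range(n_subsets):
        maj = rng.choice(majority, size=len(minority), replace=False)  # down-sample
        idx = np.concatenate([minority, maj])
        models.append(RandomForestClassifier(n_estimators=50).fit(X[idx], y[idx]))
    return models

def score_minority(models, X_new):
    """Average the predicted minority-class probability across the sub-models."""
    return np.mean([m.predict_proba(X_new)[:, 1] for m in models], axis=0)
```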
One of the features of Random Forest is that it can handle thousands of input variables without variable deletion. Studies of gene data use tens of thousands of gene expressions to predict an outcome using several tens or hundreds of subjects. This is commonly referred to as the "large p (number of predictors), small n (number of samples)" problem. The "large p, small n" paradigm arises in microarray studies, where the expression levels of thousands of genes are monitored for a small number of subjects [20]. Random Forest works well for this paradigm.

The original Random Forest algorithm, or a version modified to suit the application, is used to solve classification problems in various areas. Some areas where the Random Forest classifier is used are handwritten digit recognition [2], detection of hidden web search interfaces [42], land cover classification [31], prediction of fault-prone modules in the software development process for effective detection and identification of defects [16], multi-label classification [21], and analysis of hyper-spectral data [13]. A survey of various application areas using Random Forest is given in summarized form in [39].


3.5 Naïve Implementations of the Random Forest Algorithm

Research has been carried out on generating a multi-class classifier using fuzzy decision trees, i.e. a Fuzzy Random Forest. Fuzzy Random Forests try to combine the robustness of a tree ensemble, the power of randomization to increase the diversity of the trees in the forest, and the flexibility of fuzzy logic and fuzzy sets for data management [8], [9].

Random Forests suffer from the same disadvantage as other popular discriminative learning methods: they need a huge amount of labeled data to achieve good performance. Leistner, Saffari, Santner, Godec, and Bischof [25] address this particular weakness by proposing a semi-supervised learning (SSL) approach for Random Forest, allowing the algorithm to make use of both labeled and unlabeled training data. A problem with SSL methods is that they often focus only on binary classification problems; multi-class problems are then decomposed into a set of binary tasks with one-vs-all or one-vs-one strategies. Considering that most state-of-the-art SSL methods have high computational complexity, such a strategy can become a problem when dealing with a large number of samples and classes. Therefore, the ability of the Random Forest algorithm to handle multi-class tasks makes it very attractive for the SSL problem.

3.6 Taxonomy of Random Forest

Based on the above survey, we have developed a taxonomy of the Random Forest classifier, which is presented in figure 1.

4. DISCUSSION

Ensemble methods aim at improving classification accuracy by aggregating the predictions from multiple classifiers. The more diverse the base classifiers and the less they are correlated, the higher the accuracy of the ensemble. The Random Forest algorithm uses (1) sub-sampling of the examples/cases, as in bagging, and (2) sub-sampling of the features, known as feature selection. Both strategies are used in Random Forest to introduce randomization and achieve diversity. Also, there is no pruning of the base decision trees, which helps to ensure diversity among them.

Using the strong law of large numbers, Breiman has demonstrated that Random Forest always converges, so that over-fitting is not a problem [11]. The survey of various papers shows that there is scope for work using important features of Random Forest, i.e. proximity-based computation and variable importance [39].

For accuracy improvement, research has been done using different attribute split measures and combining functions. The survey has shown that experiments with attribute split measures have not yielded significant improvement, and further work in this direction needs to be carried out. Weighted voting with Random Forest has shown significant improvements in accuracy. Compared to improvement in accuracy, less work has been done on improvement in performance. Performance improvement mainly concerns reducing the number of base decision trees in a Random Forest so that learning, and in turn classification, is faster. The survey shows that efforts have been made to suggest different ways of finding subsets of a Random Forest, but no concrete work has been done to find the optimal subset. Additionally, all efforts to find subsets of a Random Forest that work with the same accuracy as the original forest take a static approach, i.e. the entire forest is grown first and then the base decision trees are verified step by step for membership in the subset. Most work of this kind uses the "Overproduce and Choose" approach, which is not cost effective. Though efforts have been made to generate dynamic ensembles, many of them do not eliminate the overproduce phase, i.e. the generation of N classifiers at the start. Eliminating the overproduce phase would truly yield a dynamic ensemble. The Dynamic Random Forest eliminates the overproduce phase and generates only the trees that contribute to better accuracy, but due to the dependent way of tree generation it loses the inherent parallel nature of Random Forest induction. Reviewing all the work related to performance improvement of Random Forest, an important issue remains unresolved: what is the optimal number of base classifiers in a Random Forest, and how can the optimal subset be selected without growing the entire forest?

There are existing parallel implementations of Random Forest: PARF is a parallel implementation of Random Forest in Fortran 90; FastRF is a parallel implementation of Random Forest in Weka which uses multithreading; there is a GPU-based parallel implementation of Random Forest on the CUDA platform, based on a multi-core architecture; and R contains a parallel, cluster-based Random Forest. Each implementation is specific to some language or platform.

There is a lot of scope for experimentation with Random Forest on streaming data. Many recent applications, like Internet traffic monitoring and telecommunications billing, produce huge amounts of data; it is practically impossible to store such a real-time stream, and it is not possible to make multiple passes through the data. Many algorithms for stream data have problems handling multi-class classification, which is not an issue in Random Forest due to its inherent multi-class capability. A good amount of base research work has been done on the classification of stream data using Random Forest, which can be used as a foundation for further enhancement in this field.

Research is also going on for classifying imbalanced data using Random Forest. Results have shown that Random Forest outperforms other classification techniques for imbalanced data, and hence there is great scope for developing improved Random Forest algorithms for imbalanced data.

The use of fuzzy decision trees and semi-supervised learning with Random Forest is a recent development. There is future scope for semi-supervised learning with Random Forest due to its capability of handling both labeled and


unlabelled data, especially for scenarios where getting labeled data is a problem.

Most of the work related to Random Forest follows the parameter settings mentioned by Breiman. The forest size is decided a priori, and the default value used for the number of trees is 100 in many cases. Weka and R are the commonly used tools for research using Random Forest. The applications implemented using Random Forest algorithms are compared with bagging and boosting; almost all results have shown that Random Forest does either better than or at least equivalent to these two techniques.

Fig. 1 Taxonomy of Random Forest Classifier

Commonly used datasets for research work related to the Random Forest classifier are from the UCI Machine Learning Repository; a few datasets are from Semi-supervised Benchmarks and the LibSVM repository, and two synthetic datasets (Twonorm and Ringnorm) designed by Breiman are also used.

Classification accuracy, defined as the percentage of correctly classified samples out of the total number of samples, is an important measure for the evaluation of a classifier. The AUC (area under the receiver operating characteristic curve, ROC) is also used as a measure of performance with Random Forest, and the F-measure is used to evaluate the performance of a classification as well [41]. Random Forest being an ensemble classifier, different techniques are used to compare the performances of individual base classifiers. Statistical tests, especially the Wilcoxon signed-rank test [34] and the McNemar non-parametric test [3], [24], are commonly used with Random Forest.
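For readers reproducing such comparisons, the following sketch (our own example, using scikit-learn metrics and a textbook form of the McNemar statistic rather than any specific setup from the surveyed papers) computes accuracy, AUC, F-measure, and a McNemar comparison of two classifiers' predictions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score, f1_score
from scipy.stats import chi2

def evaluate(y_true, y_pred, y_score):
    """Standard evaluation measures for a (binary) classifier."""
    return {"accuracy": accuracy_score(y_true, y_pred),
            "auc": roc_auc_score(y_true, y_score),   # y_score: positive-class probability
            "f_measure": f1_score(y_true, y_pred)}

def mcnemar(y_true, pred_a, pred_b):
    """McNemar test on the disagreement counts of two classifiers (continuity corrected)."""
    n01 = np.sum((pred_a == y_true) & (pred_b != y_true))   # A right, B wrong
    n10 = np.sum((pred_a != y_true) & (pred_b == y_true))   # A wrong, B right
    stat = (abs(n01 - n10) - 1) ** 2 / (n01 + n10)          # chi-squared with 1 dof
    return stat, chi2.sf(stat, df=1)                        # statistic and p-value
```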
We have systematically analyzed the research efforts related to Random Forest and compiled the comparison chart given in figure 2. The parameters used for comparison are as follows:

1. Base Classifier: the base classifier used in the Random Forest ensemble.
2. Split Measure: if the base classifier of the Random Forest is a decision tree, the split measure used at each node of the tree to perform the splitting.
3. Number of Passes: whether a single pass or multiple passes through the data are needed to build the Random Forest classifier.
4. Combine Strategy: in a Random Forest ensemble, all the generated base classifiers are used for classification; how the results of the individual base classifiers are combined at classification time is decided by the combine strategy.
5. Number of attributes used for base classifier generation (Mtry): the number of attributes (randomly selected from the original set of attributes) used at each node of the base decision tree.
6. Stopping Criterion: the number of base classifiers generated is usually pre-decided or based on some estimate (usually accuracy); this is described by the stopping criterion.
7. Pruning of Forest: whether steps/measures are taken to prune the Random Forest, i.e. to reduce the number of base classifiers in the forest.


8. Parallel Extension: whether a parallel extension exists for the associated approach related to the Random Forest classifier.
9. Datasets Used: the number of datasets used for testing.

| Category | Approach | Base Classifier | Split Measure | No. of Passes* | Combine Strategy | Mtry | Stopping Criterion | Pruning of Forest | Parallel Extension | Datasets Tested | Key Features |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Accuracy Improvement | Meta Random Forest | Random Forest | Gini Index | MP | Majority Voting | √M | Fixed a priori | No | No | 10 | Random Forest as base classifier; bagging and boosting techniques for ensemble |
| Accuracy Improvement | Improved Random Forest | Decision Tree | Gini, Info Gain, MDL, ReliefF | MP | Weighted Voting | √M | Fixed a priori | No | No | 17 | Multiple split measures; weighted voting |
| Accuracy Improvement | Forest RK | Decision Tree | Gini Index | MP | Majority Voting | K ∈ [1, M] | Fixed a priori | No | No | 10 | Features randomly selected at each node during tree induction; McNemar test for comparison |
| Accuracy Improvement | Dynamic Integration Random Forest | Decision Tree | Info Gain | MP | Dynamic Integrated Voting | Log2M + 1 | Fixed a priori | Yes | No | 27 | Use of distance measures HEOM and intrinsic similarity |
| Performance Improvement | BAGA | Decision Tree | Info Gain | MP | Majority / probabilistic voting | M | Fixed a priori | Yes | No | 2 | Overproduce and genetic algorithm strategy |
| Performance Improvement | Selection of DTs in RF | Decision Tree | Gini Index | MP | Majority Voting | √M | Fixed a priori | No | No | 10 | Overproduce & Choose; SFS, SBS approach; McNemar nonparametric test for classifier comparison |
| Performance Improvement | Dynamic Random Forests | Random Tree | Info Gain | SP | Majority Voting | K ∈ [1, M] | Fixed a priori | No | No | 20 | Weighted training samples as in boosting; sequential process |
| Naïve Versions | Fuzzy Random Forest | Fuzzy Decision Tree | Info Gain | MP | Majority Voting | √M | Fixed a priori | No | No | 2 | Use of fuzzy partition to generate fuzzy decision tree |
| Naïve Versions | Semi-supervised Random Forest | Decision Tree | Info Gain / Gini Index | MP | Majority Voting | √M | Fixed | No | Yes (GPU) | 3 | Use of labeled and unlabelled data; maximum margin approach using deterministic annealing |
| Online Versions | Online Random Forest | Extremely Randomized Tree | Info Gain | SP | Majority Voting | √M | Fixed a priori | Yes | Yes (GPU) | 7 | Online bagging; temporal weighting for forest pruning |
| Online Versions | Streaming Random Forest | Decision Tree | Gini Index | SP | Majority Voting | Log2M + 1 | Fixed a priori | No | No | 2 | Use of Hoeffding bound; limited pruning of base tree |
| Online Versions | Incremental Extremely Random Forest | Extremely Randomized Tree | Gini Index | SP | Majority Voting | √M | Fixed a priori | No | No | 7 | Small number of labeled examples; used for video tracking |
| Data Specific | Random Forest for Imbalanced Data | Decision Tree | Gini Index | MP | Majority Voting / Weighted voting | √M | Fixed a priori | No | No | 5 | Weights on minority class; down-sampling of majority class |

Fig. 2 Comparison Chart (* MP - Multiple Passes, SP - Single Pass)


10. Key Features: the core ideas/concepts used in the approach related to the Random Forest classifier.

This comparison chart will be of help to those who are aspiring to take up research related to the Random Forest classifier.

5. FUTURE RESEARCH DIRECTIONS

5.1 Based on Accuracy Improvement

Accuracy improvements in Random Forest are possible using different attribute split measures, different combining functions, or both. Achieving diversity in the base classifiers is an ongoing quality improvement process that will improve accuracy; hence, finding ways to achieve diversity definitely has future scope for research. It is also possible to use OOB estimates, proximity computation, and the variable importance features more prominently for improving the accuracy of Random Forest classifiers.

5.2 Based on Performance Improvement

The Random Forest algorithm generates many classification trees, and the generation of each tree is independent of the others; thus, Random Forest is by nature a suitable candidate for parallel processing. Additionally, data mining is usually performed on very large datasets, and Random Forest can work well on datasets with a large number of predictors. As mentioned in Section 4, each existing parallel implementation of Random Forest is specific to some platform or language, so there is scope for a generalized parallel algorithm for Random Forest. With the geographical spread of business and the world getting connected through the Internet, business data is distributed across different locations; hence, the design of a distributed Random Forest algorithm is another important future research direction.
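Because the trees are independent, the training loop parallelizes directly. The sketch below is a minimal illustration using Python's standard process-pool executor and per-tree bootstrap training in the style of Section 2.2; it is not any of the PARF, FastRF, or CudaRF implementations mentioned above.

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor
from sklearn.tree import DecisionTreeClassifier

def train_one_tree(args):
    """Grow a single unpruned tree on its own bootstrap sample (independent of all others)."""
    X, y, seed = args
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(X), size=len(X))
    return DecisionTreeClassifier(max_features="sqrt").fit(X[idx], y[idx])

def parallel_forest(X, y, n_tree=100, workers=4):
    """Train the trees of a Random Forest in parallel worker processes."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(train_one_tree, [(X, y, s) for s in range(n_tree)]))
```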
Theoretical and empirical results have proved that beyond a certain number, increasing the number of trees in the forest does not yield an increase in accuracy. Previous research work in this direction takes a static approach, i.e. first build a forest to its full extent and then shrink/prune it by deleting some of the trees that do not contribute to an increase in accuracy. This approach is not cost effective from the viewpoint of time and memory; moreover, it reduces only the time taken by classification and not by learning. There is scope to develop dynamic techniques that prune the forest size on the fly. Also, no research work has yet shown what the optimal subset of a forest is that will work with the accuracy of the original forest.

5.3 Data-Specific Improvements

Almost all classifiers have problems in classifying imbalanced data: they have a tendency to ignore the minority classes. As there are many real-life problems that deal with imbalanced data, such as fraud detection, network intrusion, and rare disease diagnosis, classifiers for imbalanced data are in demand. Earlier results have shown that Random Forest with suitable modifications gives better results than other classifiers for imbalanced data sets. Hence, there is scope to propose a new modified Random Forest algorithm for imbalanced data, and using Random Forest as a base learner can achieve good results in this domain.

As per Breiman, Random Forest can handle thousands of input variables without variable deletion. For applications where the nature of the data is such that the number of available samples is less than the number of predictors, i.e. n << p, Random Forest can work very well, and there is scope for research in this direction.

5.4 Online Versions of Random Forest

Online, continuous, and endless data stream processing is a challenge for the machine learning community, and improvement in accuracy and performance for Random Forest with stream data is a prominent field for research. Experimenting with different attribute split measures and combining functions, pruning of the forest based on tree performance, proper handling of concept drift, and parallel algorithms for Random Forest on stream data are some of the directions for future research in this area.

5.5 Naïve Approach

Random Forest using a semi-supervised learning (SSL) approach is an open field for research. With the SSL approach, it is possible to construct a classifier using a combination of labeled and unlabeled data, and this approach is useful for both offline and online data problems. Especially with stream data, where decision tree construction is based on Hoeffding bound statistics, the number of data samples needed at each node for splitting is huge, and in this case the SSL approach can be effective.

6. CONCLUDING REMARKS

The intention of this paper was to present a review of current work related to the Random Forest classifier and to identify future research directions in this field. The Random Forest classifier is an ensemble technique and hence is more accurate, but it is time consuming compared to other individual classification techniques. We mainly tried to review the work done for accuracy improvement and performance improvement of Random Forest. As a result of our survey, we have presented a taxonomy of the Random Forest algorithm and performed an analysis of various algorithms/techniques based on it. This analysis, which is presented as a comparison chart, will serve as a guideline for pursuing future research related to the Random Forest classifier.

REFERENCES

[1] Abdulsalam H, Skillicorn B, Martin P, Streaming Random Forests, Proceedings of the 11th International Database Engineering and Applications Symposium, Banff, Alta, pp 225-232, (2007)


[2] Bernard S, Heutte L, Adam S, Using Random Forest for Handwritten Digit Recognition, International Conference on Document Analysis and Recognition, 1043-1047, (2007)

[3] Bernard S, Heutte L, Adam S, Towards a Better Understanding of Random Forests Through the Study of Strength and Correlation, ICIC: Proceedings of the 5th International Conference on Emerging Intelligent Computing Technology and Applications, (2009)

[4] Bernard S, Heutte L, Adam S, Forest-RK: A New Random Forest Induction Method, Proceedings of the 4th International Conference on Intelligent Computing: Advanced Intelligent Computing Theories and Applications – with Aspects of Artificial Intelligence, Springer-Verlag, (2008)

[5] Bernard S, Heutte L, Adam S, On the Selection of Decision Trees in Random Forest, Proceedings of the International Joint Conference on Neural Networks, Atlanta, Georgia, USA, June 14-19, 302-307, (2009)

[6] Bernard S, Heutte L, Adam S, Dynamic Random Forests, Pattern Recognition Letters, 33, 1580-1586, (2012)

[7] Boinee P, Angelis A, Foresti G, Meta Random Forest, International Journal of Computational Intelligence, 2, (2006)

[8] Bonissone P, Cadenas J, Garrido M, Diaz R, A Fuzzy Random Forest: Fundamental for Design and Construction, Studies in Fuzziness and Soft Computing, Vol 249, 23-42, (2010)

[9] Bonissone P, Cadenas J, Garrido M, Diaz-Valladares R, A Fuzzy Random Forest, International Journal of Approximate Reasoning, 51, 729-747, (2010)

[10] Breiman L, Bagging Predictors, Technical Report No 421, (1994)

[11] Breiman L, Random Forests, Machine Learning, 45, 5-32, (2001)

[12] Chen C, Liaw A, Breiman L, Using Random Forest to Learn Imbalanced Data, Technical Report, Department of Statistics, U.C. Berkeley, (2004)

[13] Crawford M, Ham J, Chen Y, Ghosh J, Random Forests of Binary Hierarchical Classifiers for Analysis of Hyper-spectral Data, Advances in Techniques for Analysis of Remotely Sensed Data, 337-345, IEEE, (2003)

[14] Gaber M, Zaslavsky A, Krishnaswamy S, Mining Data Streams: A Review, SIGMOD Record, Vol 34, No 2, (2005)

[15] Geurts P, Ernst D, Wehenkel L, Extremely Randomized Trees, Machine Learning, Vol 63, 3-42, (2006)

[16] Guo L, Ma Y, Cukic B, Singh H, Robust Prediction of Fault-Proneness by Random Forests, Proceedings of the 15th International Symposium on Software Reliability Engineering, IEEE, (2004)

[17] Grahn H, Lavesson N, Lapajne M, Slat D, A CUDA Implementation of Random Forest – Early Results, Master Thesis, Software Engineering, School of Computing, Blekinge Institute of Technology, Sweden

[18] Hansen L, Salamon P, Neural Network Ensembles, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol 12, No 10, (1990)

[19] Witten I H, Frank E, Weka: Practical Machine Learning Tools and Techniques, Morgan Kaufmann, (2005)

[20] Kosorok M, Ma S, Marginal Asymptotics for the Large p, Small n Paradigm: With Applications to Microarray Data, Annals of Statistics, 35, 1456-1486, (2007)

[21] Kouzani A, Nasireding G, Multilabel Classification by BCH Code and Random Forests, International Journal of Recent Trends in Engineering, Vol 2, No 1, (2009)

[22] Krogh A, Vedelsby J, Neural Network Ensembles, Cross Validation, and Active Learning, Advances in Neural Information Processing Systems, Vol 7, MIT Press, 231-238, (1995)

[23] Kuncheva L, Diversity in Multiple Classifier Systems, Information Fusion, Vol 6, Issue 1, 3-4, (2005)

[24] Latinne P, Debeir O, Decaestecker C, Limiting the Number of Trees in Random Forests, MCS, UK, (2001)

[25] Leistner C, Saffari A, Santner J, Godec M, Bischof H, Semi-Supervised Random Forests, ICCV, IEEE Conference Proceedings, 506-513, (2009)

[26] Maudes J, Rodriguez J, Garcia-Osorio C, Disturbing Neighbors Diversity for Decision Forests, Studies in Computational Intelligence, Vol 245, 113-133, (2009)

[27] Nikulin V, McLachlan G, Ng S, Ensemble Approach for Classification of Imbalanced Data, Proceedings of the 22nd Australian Joint Conference on Advances in Artificial Intelligence, Springer-Verlag, (2009)

[28] Opitz D, Shavlik J, Generating Accurate and Diverse Members of a Neural-Network Ensemble, Advances in Neural Information Processing Systems, Vol 8, MIT Press, (1996)

[29] Opitz D, Maclin R, Popular Ensemble Methods: An Empirical Study, Journal of Artificial Intelligence Research, 11, 169-198, (1999)


[30] Oza N, Russell S, Online Bagging and Boosting, Proceedings of Artificial Intelligence and Statistics, 105-112, (2001)

[31] Pal M, Random Forests for Land Cover Classification, Proceedings of the Geoscience and Remote Sensing Symposium, IEEE, 3510-3512, (2003)

[32] Schapire R E, The Boosting Approach to Machine Learning: An Overview, Nonlinear Estimation and Classification, Springer, (2003)

[33] Prenger R, Lemmond T, Varshney K, Chen B, Hanley W, Class-Specific Error Bounds for Ensemble Classifiers, KDD'10, Washington DC, USA, (2010)

[34] Robnik-Sikonja M, Improving Random Forests, in Boulicaut J F et al (eds): Machine Learning, ECML 2004 Proceedings, Springer, Berlin, (2004)

[35] Rodriguez J, Kuncheva L, Rotation Forest: A New Classifier Ensemble Method, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol 28, No 10, 1619-1630, (2006)

[36] Roli F, Giacinto G, Vernazza G, Methods for Designing Multiple Classifier Systems, Second International Workshop on Multiple Classifier Systems, Springer-Verlag, (2001)

[37] Saffari A, Leistner C, Santner J, Godec M, Bischof H, On-line Random Forests, ICCV, IEEE Conference Proceedings, 1393-1400, (2009)

[38] Tsymbal A, Pechenizkiy M, Cunningham P, Dynamic Integration with Random Forests, ECML, LNAI, 801-808, Springer-Verlag, (2006)

[39] Verikas A, Gelzinis A, Bacauskiene M, Mining Data with Random Forests: A Survey and Results of New Tests, Pattern Recognition, 44, 330-349, (2011)

[40] Wang A, Wan G, Cheng Z, Li S, An Incremental Extremely Random Forest Classifier for Online Learning and Tracking, 16th IEEE International Conference on Image Processing, 1449-1452, (2009)

[41] Wu X, Chen Z, Toward Dynamic Ensemble: The BAGA Approach, Proceedings of the ACS/IEEE International Conference on Computer Systems and Applications, (2005)

[42] Ye Y, Li H, Deng X, Huang J, Feature Weighting Random Forest for Detection of Hidden Web Search Interfaces, Computational Linguistics and Chinese Language Processing, Vol 13, No 4, 387-404, (2008)

[43] Zhang H, Wang M, Search for the Smallest Random Forest, Statistics and Its Interface, Vol 2, 381-388, (2009)

[44] Tripoliti E, Fotiadis D, Manis G, Dynamic Construction of Random Forests: Evaluation Using Biomedical Engineering Problems, IEEE, (2010)
