Random Forest (RF)
Background
Random forest is a machine learning algorithm for classification that is based on a combination of tree predictors. The individual decision trees are generated using a random selection of attributes at each node to determine the split. More formally, each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. During classification, each tree votes and the most popular class is returned.
Theory
In random forest, a bootstrap sample is first drawn from the original data (that is, a sample drawn with replacement), which is then used to build a decision tree, with a random subset of attributes selected as candidates for each tree split. The process is repeated for a prescribed number of rounds, and the final prediction is obtained by aggregating the votes of all the built decision trees.
Bootstrap^{[1]}
The bootstrap method samples the given training tuples uniformly with replacement. That is, each time a tuple is selected, it is equally likely to be selected again and re-added to the training set. For instance, imagine a machine that randomly selects tuples for our training set. In sampling with replacement, the machine is allowed to select the same tuple more than once.
There are several bootstrap methods. A commonly used one is the .632 bootstrap, which works as follows. Suppose we are given a data set of d tuples. The data set is sampled d times, with replacement, resulting in a bootstrap sample or training set of d samples. It is very likely that some of the original data tuples will occur more than once in this sample. The data tuples that did not make it into the training set end up forming the test set. Suppose we were to try this out several times. As it turns out, on average, 63.2% of the original data tuples will end up in the bootstrap sample, and the remaining 36.8% will form the test set (hence, the name, .632 bootstrap).
“Where does the figure, 63.2%, come from?” Each tuple has a probability of 1/d of being selected on each draw, so the probability of not being chosen in one draw is (1 − 1/d). We select d times, so the probability that a tuple is never chosen during this whole time is (1 − 1/d)^{d}. If d is large, the probability approaches e^{−1} ≈ 0.368. Thus, 36.8% of tuples will not be selected for training and thereby end up in the test set, and the remaining 63.2% will form the training set.
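The 36.8% figure is easy to check empirically. A minimal simulation in plain Python (no random-forest library assumed): draw d times with replacement and count how many distinct tuples never appear.

```python
import random

random.seed(0)
d = 10_000  # size of the original data set

# Bootstrap sample: d draws with replacement from the indices 0..d-1.
sample = [random.randrange(d) for _ in range(d)]

in_bag = len(set(sample))        # distinct tuples that entered the sample
out_frac = 1 - in_bag / d        # fraction left over for the test set

print(f"fraction not selected: {out_frac:.3f}")  # close to 1/e ≈ 0.368
```

With d this large, the observed fraction is consistently within a couple of percentage points of 1/e.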
We can repeat the sampling procedure k times, where in each iteration, we use the current test set to obtain an accuracy estimate of the model obtained from the current bootstrap sample. The overall accuracy of the model, M, is then estimated as

Acc(M) = (1/k) Σ_{i=1}^{k} (0.632 × Acc(M_{i})_{test_set} + 0.368 × Acc(M_{i})_{train_set})
where Acc(M_{i})_{test_set} is the accuracy of the model obtained with bootstrap sample i when it is applied to test set i. Acc(M_{i})_{train_set} is the accuracy of the model obtained with bootstrap sample i when it is applied to the original set of data tuples. Bootstrapping tends to be overly optimistic. It works best with small data sets.
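In code, the .632 estimate is just a weighted average of the per-iteration accuracies, with weights 0.632 on the test-set accuracy and 0.368 on the training-set accuracy. A minimal sketch; the helper name is ours, not from any library:

```python
def bootstrap_632_accuracy(acc_test, acc_train):
    """Hypothetical helper: .632 bootstrap estimate of overall accuracy.

    acc_test[i]  -- Acc(M_i) on test set i (tuples left out of bootstrap sample i)
    acc_train[i] -- Acc(M_i) on the original set of data tuples
    """
    k = len(acc_test)
    return sum(0.632 * t + 0.368 * r for t, r in zip(acc_test, acc_train)) / k

print(bootstrap_632_accuracy([0.80, 0.82], [0.90, 0.92]))
```

Note how the optimistic resubstitution accuracies (0.90, 0.92) pull the estimate above the out-of-sample accuracies alone.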
Bagging^{[1]}
Given a set, D, of d tuples, bagging works as follows. For iteration i (i = 1, 2,..., k), a training set, D_{i}, of d tuples is sampled with replacement from the original set of tuples, D. Note that the term bagging stands for bootstrap aggregation. Each training set is a bootstrap sample. Because sampling with replacement is used, some of the original tuples of D may not be included in D_{i}, whereas others may occur more than once. A classifier model, M_{i}, is learned for each training set, D_{i}. To classify an unknown tuple, X, each classifier, M_{i}, returns its class prediction, which counts as one vote. The bagged classifier, M∗, counts the votes and assigns the class with the most votes to X. Bagging can be applied to the prediction of continuous values by taking the average value of each prediction for a given test tuple.
The bagged classifier often has significantly greater accuracy than a single classifier derived from D, the original training data. It will not be considerably worse and is more robust to the effects of noisy data and overfitting. The increased accuracy occurs because the composite model reduces the variance of the individual classifiers.
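The bagging procedure above can be sketched in a few lines. This is an illustrative toy, not a library API: the base learner is a 1-nearest-neighbour classifier on one-dimensional points, and each of the k models is trained on its own bootstrap sample before voting.

```python
import random
from collections import Counter

def nn_classifier(train):
    """Base learner: 1-nearest-neighbour over (value, label) pairs."""
    def predict(x):
        return min(train, key=lambda p: abs(p[0] - x))[1]
    return predict

def bagged_classifier(data, k, seed=0):
    """Learn k base classifiers, each on a bootstrap sample of the data,
    and classify an unknown value by counting their votes."""
    rng = random.Random(seed)
    models = [nn_classifier([rng.choice(data) for _ in data]) for _ in range(k)]
    def predict(x):
        votes = Counter(m(x) for m in models)
        return votes.most_common(1)[0][0]  # class with the most votes
    return predict

data = [(0.1, "a"), (0.2, "a"), (0.3, "a"), (0.9, "b"), (1.0, "b"), (1.1, "b")]
model = bagged_classifier(data, k=25)
print(model(0.15), model(0.95))  # prints: a b
```

For continuous targets, the same structure applies with the vote replaced by the average of the k predictions.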
Boosting^{[1]}
In boosting,
 Weights are assigned to each training tuple
 A series of k classifiers is iteratively learned
 After a classifier, M_{i}, is learned, the weights are updated to allow the subsequent classifier, M_{i+1}, to “pay more attention” to the training tuples that were misclassified by M_{i}
 The final boosted classifier, M^{∗}, combines the votes of each individual classifier, where the weight of each classifier’s vote is a function of its accuracy
The boosting algorithm can also be extended to numeric prediction.
Compared with bagging, boosting tends to achieve greater accuracy, but it also risks overfitting the model to misclassified data.
AdaBoost^{[1]}
AdaBoost is short for Adaptive Boosting and it is a popular boosting algorithm.
Suppose we want to boost the accuracy of a learning method. We are given D, a data set of d class-labeled tuples, (X_{1},y_{1}),(X_{2},y_{2}),...,(X_{d},y_{d}), where y_{i} is the class label of tuple X_{i}. Initially, AdaBoost assigns each training tuple an equal weight of 1/d. Generating k classifiers for the ensemble requires k rounds through the rest of the algorithm. In round i, the tuples from D are sampled to form a training set, D_{i}, of size d. Sampling with replacement is used—the same tuple may be selected more than once. Each tuple’s chance of being selected is based on its weight. A classifier model, M_{i}, is derived from the training tuples of D_{i}. Its error is then calculated using D_{i} as a test set. The weights of the training tuples are then adjusted according to how they were classified.
If a tuple was incorrectly classified, its weight is increased. If a tuple was correctly classified, its weight is decreased. A tuple’s weight reflects how difficult it is to classify— the higher the weight, the more often it has been misclassified. These weights will be used to generate the training samples for the classifier of the next round. The basic idea is that when we build a classifier, we want it to focus more on the misclassified tuples of the previous round. Some classifiers may be better at classifying some “difficult” tuples than others. In this way, we build a series of classifiers that complement each other.
Now, let’s look at some of the math that’s involved in the algorithm. To compute the error rate of model M_{i}, we sum the weights of each of the tuples in D_{i} that M_{i} misclassified. That is,

error(M_{i}) = Σ_{j=1}^{d} w_{j} × err(X_{j})
where err(X_{j}) is the misclassification error of tuple X_{j}: If the tuple was misclassified, then err(X_{j}) is 1; otherwise, it is 0. If the performance of classifier M_{i} is so poor that its error exceeds 0.5, then we abandon it. Instead, we try again by generating a new D_{i} training set, from which we derive a new M_{i}.
The error rate of M_{i} affects how the weights of the training tuples are updated. If a tuple in round i was correctly classified, its weight is multiplied by error(M_{i})/(1 − error(M_{i})). Once the weights of all the correctly classified tuples are updated, the weights for all tuples (including the misclassified ones) are normalized so that their sum remains the same as it was before. To normalize a weight, we multiply it by the sum of the old weights, divided by the sum of the new weights. As a result, the weights of misclassified tuples are increased and the weights of correctly classified tuples are decreased, as described before.
Unlike bagging, where each classifier was assigned an equal vote, boosting assigns a weight to each classifier’s vote, based on how well the classifier performed. The lower a classifier’s error rate, the more accurate it is, and therefore, the higher its weight for voting should be. The weight of classifier M_{i}’s vote is

log((1 − error(M_{i})) / error(M_{i}))
For each class, c, we sum the weights of each classifier that assigned class c to X. The class with the highest sum is the “winner” and is returned as the class prediction for tuple X.
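The two formulas above can be exercised numerically. This sketch (helper names are ours, not from any library) multiplies the weights of correctly classified tuples by error/(1 − error), renormalizes so the total weight is unchanged, and computes a classifier's vote weight:

```python
import math

def classifier_vote_weight(error):
    """Weight of classifier M_i's vote: log((1 - error) / error)."""
    return math.log((1 - error) / error)

def update_tuple_weights(weights, correct, error):
    """Shrink the weights of correctly classified tuples by
    error / (1 - error), then renormalize so the total is unchanged."""
    factor = error / (1 - error)
    new = [w * factor if ok else w for w, ok in zip(weights, correct)]
    scale = sum(weights) / sum(new)
    return [w * scale for w in new]

# Four tuples with equal initial weights; only the last was misclassified.
w = update_tuple_weights([0.25] * 4, [True, True, True, False], error=0.25)
print([round(x, 3) for x in w])  # [0.167, 0.167, 0.167, 0.5]
```

As described, the misclassified tuple's relative weight rises (here from 0.25 to 0.5) while the correctly classified tuples' weights fall, so the next round's sampling favors the difficult tuple.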
Random Forest
In random forest, each of the classifiers in the ensemble is a decision tree classifier so that the collection of classifiers is a “forest.” The individual decision trees are generated using a random selection of attributes at each node to determine the split. More formally, each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. During classification, each tree votes and the most popular class is returned.
There are two methods to construct random forests:
ForestRI (random input selection)^{[1]}
Random forests can be built using bagging in combination with random attribute selection. A training set, D, of d tuples is given. The general procedure to generate k decision trees for the ensemble is as follows. For each iteration, i (i = 1, 2,..., k), a training set, D_{i}, of d tuples is sampled with replacement from D. That is, each D_{i} is a bootstrap sample of D so that some tuples may occur more than once in D_{i}, while others may be excluded. Let F be the number of attributes to be used to determine the split at each node, where F is much smaller than the number of available attributes. To construct a decision tree classifier, M_{i}, randomly select, at each node, F attributes as candidates for the split at the node. The trees are grown to maximum size and are not pruned. Random forests formed this way, with random input selection, are called ForestRI.
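The per-node step of ForestRI can be sketched as follows: draw F candidate attributes at random, then search only those candidates for the best split. The function names, the Gini impurity criterion, and the use of midpoint thresholds are our illustrative choices, not a prescribed implementation:

```python
import random

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_random_split(rows, labels, F, rng):
    """ForestRI node step: sample F candidate attributes, then search
    their midpoint thresholds for the lowest weighted Gini impurity."""
    candidates = rng.sample(range(len(rows[0])), F)
    best = None
    for a in candidates:
        values = sorted({r[a] for r in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for r, y in zip(rows, labels) if r[a] <= t]
            right = [y for r, y in zip(rows, labels) if r[a] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[0]:
                best = (score, a, t)
    return best  # (weighted impurity, attribute index, threshold)

rows = [(1.0, 5.0), (1.1, 1.0), (3.0, 5.1), (3.2, 0.9)]
labels = ["a", "a", "b", "b"]
print(best_random_split(rows, labels, F=1, rng=random.Random(0)))
```

With F smaller than the number of attributes, each node considers a different random subset, which is what decorrelates the trees.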
ForestRC (random linear combinations)^{[1]}
Another form of random forest, called ForestRC, uses random linear combinations of the input attributes. Instead of randomly selecting a subset of the attributes, it creates new attributes (or features) that are a linear combination of the existing attributes. That is, an attribute is generated by specifying L, the number of original attributes to be combined. At a given node, L attributes are randomly selected and added together with coefficients that are uniform random numbers on [−1,1]. F linear combinations are generated, and a search is made over these for the best split. This form of random forest is useful when there are only a few attributes available, so as to reduce the correlation between individual classifiers.
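The feature-generation step of ForestRC can be sketched directly from this description (the function name is ours): each of the F candidate features is a linear combination of L randomly chosen original attributes, with coefficients drawn uniformly from [−1, 1].

```python
import random

def random_combination_features(rows, L, F, rng):
    """ForestRC node step: generate F derived features, each a linear
    combination of L randomly chosen original attributes with
    uniform random coefficients on [-1, 1]."""
    n_attrs = len(rows[0])
    features = []
    for _ in range(F):
        attrs = rng.sample(range(n_attrs), L)
        coefs = [rng.uniform(-1, 1) for _ in attrs]
        features.append([sum(c * row[a] for a, c in zip(attrs, coefs))
                         for row in rows])
    return features  # F lists, one derived value per input row

rows = [(1.0, 2.0, 3.0), (0.5, 0.5, 0.5)]
feats = random_combination_features(rows, L=2, F=3, rng=random.Random(0))
print(len(feats), len(feats[0]))  # 3 2
```

The split search at the node would then run over these F derived features instead of the raw attributes.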
Out-of-bag error and bootstrap accuracy
Out-of-bag error is the mean prediction error on each training sample x_{i}, using only the trees that did not have x_{i} in their bootstrap sample.
Suppose there are M input variables. After each tree is constructed, the values of the m^{th} variable in the out-of-bag examples are randomly permuted and the out-of-bag data is run down the corresponding tree. The classification given for each x_{n} that is out of bag is saved. This is repeated for m = 1, 2,..., M. At the end of the run, the plurality of out-of-bag class votes for x_{n} with the m^{th} variable noised up is compared with the true class label of x_{n} to give a misclassification rate.^{[2]}
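The out-of-bag error computation can be sketched directly from its definition. Here the "trees" are stand-in constant classifiers and `in_bag` records which row indices each tree's bootstrap sample contained; all names are illustrative:

```python
from collections import Counter

def oob_error(trees, in_bag, rows, labels):
    """For each row, take the plurality vote of only those trees whose
    bootstrap sample did not contain it, and compare with the true label."""
    wrong = total = 0
    for i, (row, y) in enumerate(zip(rows, labels)):
        votes = Counter(t(row) for t, bag in zip(trees, in_bag) if i not in bag)
        if votes:  # row was out of bag for at least one tree
            total += 1
            wrong += votes.most_common(1)[0][0] != y
    return wrong / total

# Two stand-in "trees" (constant classifiers) with toy bootstrap index sets.
trees = [lambda row: "a", lambda row: "b"]
in_bag = [{1}, {0}]          # tree 0 never saw row 0; tree 1 never saw row 1
rows = [(0.0,), (1.0,)]
labels = ["a", "b"]
print(oob_error(trees, in_bag, rows, labels))  # 0.0
```

Because every row is scored only by trees that never trained on it, this estimate behaves like an internal cross-validation and needs no separate test set.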
Properties
RF has proven suitable for robust meta-analysis of highly complex and heterogeneous literature data.^{[3]}^{[4]} RF regression was employed for model development and attribute assessment given the following advantages: (1) capability of coping with the heterogeneity of compiled QD data that contains both categorical and numerical attributes; (2) robustness with respect to data noise; (3) ability to provide internal estimates of model predictive performance via out-of-bag (OOB) validation^{[3]}, which, together with resubstitution validation, provides a synthetic model performance estimator.^{[5]}
 It is unexcelled in accuracy among current algorithms.
 It runs efficiently on large data bases.
 It can handle thousands of input variables without variable deletion.
 It gives estimates of what variables are important in the classification.
 It generates an internal unbiased estimate of the generalization error as the forest building progresses.
 It has an effective method for estimating missing data and maintains accuracy when a large proportion of the data are missing.
 It has methods for balancing error in class population unbalanced data sets.
 Generated forests can be saved for future use on other data.
 Prototypes are computed that give information about the relation between the variables and the classification.
 It computes proximities between pairs of cases that can be used in clustering, locating outliers, or (by scaling) give interesting views of the data.
 The capabilities of the above can be extended to unlabeled data, leading to unsupervised clustering, data views and outlier detection.
 It offers an experimental method for detecting variable interactions.
Application
Variable importance
Random forests can be used to rank the importance of variables in a regression or classification problem in a natural way. The following technique was described in Breiman's original paper.^{[2]}
The first step in measuring the variable importance in a data set is to fit a random forest to the data. During the fitting process the out-of-bag error for each data point is recorded and averaged over the forest (errors on an independent test set can be substituted if bagging is not used during training).
To measure the importance of the jth feature after training, the values of the jth feature are permuted among the training data and the out-of-bag error is again computed on this perturbed data set. The importance score for the jth feature is computed by averaging the difference in out-of-bag error before and after the permutation over all trees. The score is normalized by the standard deviation of these differences.
Features which produce large values for this score are ranked as more important than features which produce small values.
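A compact sketch of the permutation measure, written against a plain held-out set rather than per-tree out-of-bag samples (that simplification, and every name below, is ours):

```python
import random

def permutation_importance(predict, rows, labels, j, rng, repeats=10):
    """Average drop in accuracy after randomly permuting column j."""
    def accuracy(rs):
        return sum(predict(r) == y for r, y in zip(rs, labels)) / len(labels)
    base = accuracy(rows)
    drops = []
    for _ in range(repeats):
        col = [r[j] for r in rows]
        rng.shuffle(col)  # break the link between column j and the labels
        permuted = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, col)]
        drops.append(base - accuracy(permuted))
    return sum(drops) / repeats

# Stand-in model that only ever looks at attribute 0.
predict = lambda r: "a" if r[0] < 2 else "b"
rows = [(1.0, 9.0), (1.5, 0.1), (3.0, 9.0), (3.5, 0.1)]
labels = ["a", "a", "b", "b"]
rng = random.Random(0)
print(permutation_importance(predict, rows, labels, 0, rng))  # positive
print(permutation_importance(predict, rows, labels, 1, rng))  # 0.0
```

Permuting the attribute the model relies on degrades accuracy, while permuting an ignored attribute changes nothing, which is exactly the ranking signal described above.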
This method of determining variable importance has some drawbacks. For data including categorical variables with different numbers of levels, random forests are biased in favor of those attributes with more levels. Methods such as partial permutations^{[6]} and growing unbiased trees^{[7]} can be used to solve the problem. If the data contain groups of correlated features of similar relevance for the output, then smaller groups are favored over larger groups.^{[8]}
Extensions
Unsupervised learning^{[9]}
In unsupervised learning the data consist of a set of x vectors of the same dimension with no class labels or response variables. There is no figure of merit to optimize, leaving the field open to ambiguous conclusions. The usual goal is to cluster the data, to see if it falls into different piles, each of which can be assigned some meaning.
The approach in random forests is to consider the original data as class 1 and to create a synthetic second class of the same size that will be labeled as class 2. The synthetic second class is created by sampling at random from the univariate distributions of the original data. Here is how a single member of class two is created: the first coordinate is sampled from the N values {x(1,n)}. The second coordinate is sampled independently from the N values {x(2,n)}, and so forth.
Thus, class two has the distribution of independent random variables, each one having the same univariate distribution as the corresponding variable in the original data. Class 2 thus destroys the dependency structure in the original data. But now, there are two classes and this artificial twoclass problem can be run through random forests. This allows all of the random forests options to be applied to the original unlabeled data set.
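The construction of the synthetic second class is a per-column independent resampling; a minimal sketch (the function name is ours):

```python
import random

def synthetic_class2(rows, rng):
    """Sample each coordinate independently from that coordinate's
    observed values: the univariate distributions are preserved, but
    the dependency structure between coordinates is destroyed."""
    cols = list(zip(*rows))  # one tuple of N observed values per coordinate
    return [tuple(rng.choice(col) for col in cols) for _ in rows]

original = [(x, 2 * x) for x in range(5)]  # strong dependence: x2 = 2 * x1
fake = synthetic_class2(original, random.Random(0))
print(fake)
```

Labeling `original` as class 1 and `fake` as class 2 then turns the unlabeled problem into the two-class problem described above.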
If the out-of-bag misclassification rate in the two-class problem is, say, 40% or more, it implies that the x variables look too much like independent variables to random forests. The dependencies do not have a large role and not much discrimination is taking place. If the misclassification rate is lower, then the dependencies are playing an important role.
Formulating it as a two-class problem has a number of payoffs. Missing values can be replaced effectively. Outliers can be found. Variable importance can be measured. Scaling can be performed (in this case, if the original data had labels, the unsupervised scaling often retains the structure of the original scaling). But the most important payoff is the possibility of clustering.
Implementation
R
In the R environment, there are many packages that provide tools for Random Forest model development and prediction. One of the most widely used packages is 'randomForest' by Andy Liaw and Matthew Wiener^{[10]}. This package contains many functions that support model development, model-based prediction, cross-validation, variable importance calculation, outlier detection, and missing-value imputation.
Some of the most important functions of the package are:
randomForest
Implementation of Breiman’s random forest algorithm (based on Breiman and Cutler’s original Fortran code) for classification and regression. It can also be used in unsupervised mode for assessing proximities among data points.

predict.randomForest
Prediction of test data using random forest.

rfcv
Shows the cross-validated prediction performance of models with a sequentially reduced number of predictors (ranked by variable importance) via a nested cross-validation procedure.

importance
The extractor function for variable importance measures as produced by randomForest.

outlier
Computes outlying measures based on a proximity matrix.

na.roughfix
Imputes missing values by median/mode.
RapidMiner
In RapidMiner, users can utilize the Random Forest algorithm with the Random Forest operator.^{[11]} The Random Forest operator generates a set of random trees. The number of trees parameter specifies the required number of trees. The resulting model is a voting model of all the random trees. The goal is to create a classification model that predicts the value of a target attribute (often called class or label) based on several input attributes of the ExampleSet.
Each interior node of the tree corresponds to one of the input attributes. The number of edges of a nominal interior node is equal to the number of possible values of the corresponding input attribute. Outgoing edges of numerical attributes are labeled with disjoint ranges. Each leaf node represents a value of the label attribute given the values of the input attributes represented by the path from the root to the leaf.
Pruning is a technique in which leaf nodes that do not add to the discriminative power of the tree are removed. This is done to convert an over-specific or overfitted tree to a more general form in order to enhance its predictive power on unseen datasets. Pre-pruning is a type of pruning performed parallel to the tree creation process. Post-pruning, on the other hand, is done after the tree creation process is complete.
Input
 training set (Data Table) This input port expects an ExampleSet. It is the output of the Retrieve operator in the attached Example Process. The output of other operators can also be used as input.
Output
 model (Random Forest Model) The Random Forest model is delivered from this output port. This model can be applied on unseen data sets for the prediction of the label attribute. This model is a voting model of all the random trees.
 example set (Data Table) The ExampleSet that was given as input is passed without changing to the output through this port. This is usually used to reuse the same ExampleSet in further operators or to view the ExampleSet in the Results Workspace.
Parameters
| Parameter | Description | Range |
|---|---|---|
| number_of_trees | This parameter specifies the number of random trees to generate. | integer |
| criterion | Selects the criterion on which attributes will be selected for splitting. | selection |
| maximal_depth | The depth of a tree varies depending upon the size and nature of the ExampleSet. This parameter is used to restrict the depth of each tree; tree generation stops once the tree depth equals the maximal depth. If its value is set to '-1', the maximal depth parameter puts no bound on the depth of the tree and a tree of maximum depth is generated. If its value is set to '1', a tree with a single node is generated. | integer |
| apply_prepruning | By default the decision tree is generated with pre-pruning. Setting this parameter to false disables pre-pruning and delivers a tree without any pre-pruning. | boolean |
| minimal_gain | The gain of a node is calculated before splitting it. The node is split if its gain is greater than the minimal gain. A higher value of minimal gain results in fewer splits and thus a smaller tree. Too high a value will completely prevent splitting, and a tree with a single node is generated. | real |
| minimal_leaf_size | The size of a leaf node is the number of examples in its subset. The tree is generated in such a way that every leaf node subset has at least the minimal leaf size number of instances. | integer |
| minimal_size_for_split | The size of a node is the number of examples in its subset. The size of the root node is equal to the total number of examples in the ExampleSet. Only those nodes are split whose size is greater than or equal to the minimal size for split parameter. | integer |
| number_of_prepruning_alternatives | As pre-pruning runs parallel to the tree generation process, it may prevent splitting at certain nodes when splitting at that node does not add to the discriminative power of the entire tree. In such a case, alternative nodes are tried for splitting. This parameter adjusts the number of alternative nodes tried for splitting when a split is prevented by pre-pruning at a certain node. | integer |
| apply_pruning | By default the decision tree is generated with pruning. Setting this parameter to false disables pruning and delivers an unpruned tree. | boolean |
| confidence | This parameter specifies the confidence level used for the pessimistic error calculation of pruning. | real |
| guess_subset_ratio | If this parameter is set to true then log(m) + 1 attributes are used, otherwise a ratio should be specified by the subset ratio parameter. | boolean |
| voting_strategy | Specifies the prediction strategy in case of dissenting tree model predictions. | selection |
| subset_ratio | This parameter specifies the ratio of randomly chosen attributes to test. | real |
| use_local_random_seed | This parameter indicates if a local random seed should be used for randomization. Using the same value of local random seed will produce the same randomization. | boolean |
| local_random_seed | This parameter specifies the local random seed. This parameter is only available if the use local random seed parameter is set to true. | integer |
References
1. Han, J., Pei, J., & Kamber, M. (2011). Data mining: concepts and techniques. Elsevier.
2. Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32.
3. Gernand, J. M., & Casman, E. A. (2014). Risk Analysis, 34(3), 583-597.
4. Svetnik, V., Liaw, A., Tong, C., Culberson, J. C., Sheridan, R. P., & Feuston, B. P. (2003). Random forest: a classification and regression tool for compound classification and QSAR modeling. Journal of Chemical Information and Computer Sciences, 43(6), 1947-1958.
5. Braga-Neto, U. M., & Dougherty, E. R. (2004). Is cross-validation valid for small-sample microarray classification? Bioinformatics, 20(3), 374-380.
6. Altmann, A., Toloşi, L., Sander, O., & Lengauer, T. (2010). Permutation importance: a corrected feature importance measure. Bioinformatics, 26(10), 1340-1347.
7. Strobl, C., Boulesteix, A. L., & Augustin, T. (2007). Unbiased split selection for classification trees based on the Gini index. Computational Statistics & Data Analysis, 52(1), 483-501.
8. Toloşi, L., & Lengauer, T. (2011). Classification with correlated features: unreliability of feature ranking and solutions. Bioinformatics, 27(14), 1986-1994.
9. https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
10. https://cran.r-project.org/web/packages/randomForest/index.html
11. http://docs.rapidminer.com/studio/operators/modeling/predictive/trees/parallel_random_forest.html