# Random Forest (RF)

## Background

Random forest is a classification machine learning algorithm that is based on combination of tree predictors. The individual decision trees are generated using a random selection of attributes at each node to determine the split. More formally, each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. During classification, each tree votes and the most popular class is returned.

## Theory

In random forest, a bootstrap sample is first drawn from the original data (that is, a sample drawn with replacement, which is then used to build a decision tree with a random subset of attributes selected for each tree split. The process is repeated according to a prescribed number of rounds until the prediction (aggregated from each of the built decision trees) is within a target convergence tolerance.

### Bootstrap

The bootstrap method samples the given training tuples uniformly with replacement. That is, each time a tuple is selected, it is equally likely to be selected again and re-added to the training set. For instance, imagine a machine that randomly selects tuples for our training set. In sampling with replacement, the machine is allowed to select the same tuple more than once.

There are several bootstrap methods. A commonly used one is the .632 bootstrap, which works as follows. Suppose we are given a data set of d tuples. The data set is sampled d times, with replacement, resulting in a bootstrap sample or training set of d samples. It is very likely that some of the original data tuples will occur more than once in this sample. The data tuples that did not make it into the training set end up forming the test set. Suppose we were to try this out several times. As it turns out, on average, 63.2% of the original data tuples will end up in the bootstrap sample, and the remaining 36.8% will form the test set (hence, the name, .632 bootstrap).

“Where does the figure, 63.2%, come from?” Each tuple has a probability of 1/d of being selected, so the probability of not being chosen is (1 − 1/d). We have to select d times, so the probability that a tuple will not be chosen during this whole time is (1 − 1/d)d. If d is large, the probability approaches e−1 = 0.368. Thus, 36.8% of tuples will not be selected for training and thereby end up in the test set, and the remaining 63.2% will form the training set.

We can repeat the sampling procedure k times, where in each iteration, we use the current test set to obtain an accuracy estimate of the model obtained from the current bootstrap sample. The overall accuracy of the model, M, is then estimated as

where Acc(Mi)test_set is the accuracy of the model obtained with bootstrap sample i when it is applied to test set i. Acc(Mi)train_set is the accuracy of the model obtained with bootstrap sample i when it is applied to the original set of data tuples. Bootstrapping tends to be overly optimistic. It works best with small data sets.

### Bagging

Given a set, D, of d tuples, bagging works as follows. For iteration i (i = 1, 2,..., k), a training set, Di, of d tuples is sampled with replacement from the original set of tuples, D. Note that the term bagging stands for bootstrap aggregation. Each training set is a bootstrap sample. Because sampling with replacement is used, some of the original tuples of D may not be included in Di, whereas others may occur more than once. A classifier model, Mi, is learned for each training set, Di. To classify an unknown tuple, X, each classifier, Mi, returns its class prediction, which counts as one vote. The bagged classifier, M∗, counts the votes and assigns the class with the most votes to X. Bagging can be applied to the prediction of continuous values by taking the average value of each prediction for a given test tuple.

The bagged classifier often has significantly greater accuracy than a single classifier derived from D, the original training data. It will not be considerably worse and is more robust to the effects of noisy data and overfitting. The increased accuracy occurs because the composite model reduces the variance of the individual classifiers.

### Boosting

In boosting,

• Weights are assigned to each training tuple
• A series of k classifiers is iteratively learned
• After a classifier, Mi, is learned, the weights are updated to allow the subsequent classifier, Mi+1, to “pay more attention” to the training tuples that were misclassified by Mi
• The final boosted classifier, M, combines the votes of each individual classifier, where the weight of each classifier’s vote is a function of its accuracy

Boosting algorithm can be extended for numeric prediction.

Comparing with bagging, boosting tends to have greater accuracy, but it also risks overfitting the model to misclassified data.

AdaBoost is short for Adaptive Boosting and it is a popular boosting algorithm.

Suppose we want to boost the accuracy of a learning method. We are given D, a data set of d class-labeled tuples, (X1,y1),(X2,y2),...,(Xd,yd), where yi is the class label of tuple Xi. Initially, AdaBoost assigns each training tuple an equal weight of 1/d. Generating k classifiers for the ensemble requires k rounds through the rest of the algorithm. In round i, the tuples from D are sampled to form a training set, Di, of size d. Sampling with replacement is used—the same tuple may be selected more than once. Each tuple’s chance of being selected is based on its weight. A classifier model, Mi, is derived from the training tuples of Di. Its error is then calculated using Di as a test set. The weights of the training tuples are then adjusted according to how they were classified.

If a tuple was incorrectly classified, its weight is increased. If a tuple was correctly classified, its weight is decreased. A tuple’s weight reflects how difficult it is to classify— the higher the weight, the more often it has been misclassified. These weights will be used to generate the training samples for the classifier of the next round. The basic idea is that when we build a classifier, we want it to focus more on the misclassified tuples of the previous round. Some classifiers may be better at classifying some “difficult” tuples than others. In this way, we build a series of classifiers that complement each other.

Now, let’s look at some of the math that’s involved in the algorithm. To compute the error rate of model Mi, we sum the weights of each of the tuples in Di that Mi misclassified. That is,

where err(Xj) is the misclassification error of tuple Xj: If the tuple was misclassified, then err(Xj) is 1; otherwise, it is 0. If the performance of classifier Mi is so poor that its error exceeds 0.5, then we abandon it. Instead, we try again by generating a new Di training set, from which we derive a new Mi.

The error rate of Mi affects how the weights of the training tuples are updated. If a tuple in round i was correctly classified, its weight is multiplied by error(Mi)/(1 − error(Mi)). Once the weights of all the correctly classified tuples are updated, the weights for all tuples (including the misclassified ones) are normalized so that their sum remains the same as it was before. To normalize a weight, we multiply it by the sum of the old weights, divided by the sum of the new weights. As a result, the weights of misclassified tuples are increased and the weights of correctly classified tuples are decreased, as described before.

Unlike bagging, where each classifier was assigned an equal vote, boosting assigns a weight to each classifier’s vote, based on how well the classifier performed. The lower a classifier’s error rate, the more accurate it is, and therefore, the higher its weight for voting should be. The weight of classifier Mi’s vote is

For each class, c, we sum the weights of each classifier that assigned class c to X. The class with the highest sum is the “winner” and is returned as the class prediction for tuple X.

### Random Forest

In random forest, each of the classifiers in the ensemble is a decision tree classifier so that the collection of classifiers is a “forest.” The individual decision trees are generated using a random selection of attributes at each node to determine the split. More formally, each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. During classification, each tree votes and the most popular class is returned.

There are two methods to construct random forest:

#### Forest-RI (random input selection)

Random forests can be built using bagging in combination with random attribute selection. A training set, D, of d tuples is given. The general procedure to generate k decision trees for the ensemble is as follows. For each iteration, i (i = 1, 2,..., k), a training set, Di, of d tuples is sampled with replacement from D. That is, each Di is a bootstrap sample of D so that some tuples may occur more than once in Di, while others may be excluded. Let F be the number of attributes to be used to determine the split at each node, where F is much smaller than the number of available attributes. To construct a decision tree classifier, Mi, randomly select, at each node, F attributes as candidates for the split at the node. The trees are grown to maximum size and are not pruned. Random forests formed this way, with random input selection, are called Forest-RI.

#### Forest-RC (random linear combinations)

Another form of random forest, called Forest-RC, uses random linear combinations of the input attributes. Instead of randomly selecting a subset of the attributes, it creates new attributes (or features) that are a linear combination of the existing attributes. That is, an attribute is generated by specifying L, the number of original attributes to be combined. At a given node, L attributes are randomly selected and added together with coefficients that are uniform random numbers on [−1,1]. F linear combinations are generated, and a search is made over these for the best split. This form of random forest is useful when there are only a few attributes available, so as to reduce the correlation between individual classifiers.

### Out-of-bag error and bootstrap accuracy

Out-of-bag error is the mean prediction error on each training sample xᵢ, using only the trees that did not have xᵢ in their bootstrap sample.

Suppose there are M input variables. After each tree is constructed, the values of the mth variable in the out-of-bag examples are randomly permuted and the out-of-bag data is run down the corresponding tree. The classification given for each xn that is out of bag is saved. This is repeated for m = 1, 2,..., M. At the end of the run, the plurality of out-of-bag class votes for xn with the mth variable noised up is compared with the true class label of xn to give a misclassification rate.

## Properties

RF has proven suitable for robust meta-analysis of highly complex and heterogeneous literature data. RF regression was employed for model development and attribute assessment given the following advantages: (1) capability of coping with the heterogeneity of compiled QD data that contains both categorical and numerical attributes; (2) robustness with respect to data noise; (3) ability to provide internal estimates of model predictive performance via out-of-bootstrap (OOB) validation, which, together with resubstitution validation, provides a synthetic model performance estimator.

• It is unexcelled in accuracy among current algorithms.
• It runs efficiently on large data bases.
• It can handle thousands of input variables without variable deletion.
• It gives estimates of what variables are important in the classification.
• It generates an internal unbiased estimate of the generalization error as the forest building progresses.
• It has an effective method for estimating missing data and maintains accuracy when a large proportion of the data are missing.
• It has methods for balancing error in class population unbalanced data sets.
• Generated forests can be saved for future use on other data.
• Prototypes are computed that give information about the relation between the variables and the classification.
• It computes proximities between pairs of cases that can be used in clustering, locating outliers, or (by scaling) give interesting views of the data.
• The capabilities of the above can be extended to unlabeled data, leading to unsupervised clustering, data views and outlier detection.
• It offers an experimental method for detecting variable interactions.

## Application

### Variable importance

Random forests can be used to rank the importance of variables in a regression or classification problem in a natural way. The following technique was described in Breiman's original paper.

The first step in measuring the variable importance in a data set is to fit a random forest to the data. During the fitting process the out-of-bag error for each data point is recorded and averaged over the forest (errors on an independent test set can be substituted if bagging is not used during training).

To measure the importance of the j-th feature after training, the values of the j-th feature are permuted among the training data and the out-of-bag error is again computed on this perturbed data set. The importance score for the j-th feature is computed by averaging the difference in out-of-bag error before and after the permutation over all trees. The score is normalized by the standard deviation of these differences.

Features which produce large values for this score are ranked as more important than features which produce small values.

This method of determining variable importance has some drawbacks. For data including categorical variables with different number of levels, random forests are biased in favor of those attributes with more levels. Methods such as partial permutations and growing unbiased trees can be used to solve the problem. If the data contain groups of correlated features of similar relevance for the output, then smaller groups are favored over larger groups.

## Extensions

### Unsupervised learning

In unsupervised learning the data consist of a set of x -vectors of the same dimension with no class labels or response variables. There is no figure of merit to optimize, leaving the field open to ambiguous conclusions. The usual goal is to cluster the data - to see if it falls into different piles, each of which can be assigned some meaning.

The approach in random forests is to consider the original data as class 1 and to create a synthetic second class of the same size that will be labeled as class 2. The synthetic second class is created by sampling at random from the univariate distributions of the original data. Here is how a single member of class two is created - the first coordinate is sampled from the N values {x(1,n)}. The second coordinate is sampled independently from the N values {x(2,n)}, and so forth.

Thus, class two has the distribution of independent random variables, each one having the same univariate distribution as the corresponding variable in the original data. Class 2 thus destroys the dependency structure in the original data. But now, there are two classes and this artificial two-class problem can be run through random forests. This allows all of the random forests options to be applied to the original unlabeled data set.

If the out-of-bag misclassification rate in the two-class problem is, say, 40% or more, it implies that the x -variables look too much like independent variables to random forests. The dependencies do not have a large role and not much discrimination is taking place. If the misclassification rate is lower, then the dependencies are playing an important role.

Formulating it as a two class problem has a number of payoffs. Missing values can be replaced effectively. Outliers can be found. Variable importance can be measured. Scaling can be performed (in this case, if the original data had labels, the unsupervised scaling often retains the structure of the original scaling). But the most important payoff is the possibility of clustering.

## Implementation

### R

In the R environment, there are many packages that provide tools for Random Forest model development and prediction. One of the most widely used packages is 'randomForest' by Andy Liaw and Matthew Wiener. This package contains many functions that allow model development, model-based prediction, cross-validation, variable importance calculation, outlier detection, or missing value solution.

Some of the most important functions of the package are:

• `randomForest` Implementation of Breiman’s random forest algorithm (based on Breiman and Cutler’s original Fortran code) for classification and regression. It can also be used in unsupervised mode for assessing proximities among data points.
• `predict.randomForest` Prediction of test data using random forest.
• `rfcv` This function shows the cross-validated prediction performance of models with sequentially reduced number of predictors (ranked by variable importance) via a nested cross-validation procedure.
• `importance` This is the extractor function for variable importance measures as produced by `randomForest`.
• `outlier` Computes outlying measures based on a proximity matrix.
• `na.roughfix` Imputes Missing Values by median/mode.

### Rapidminer

In Rapidminer, users can utilize the Random Forest algorithm with the Random Forest operator. The Random Forest operator generates a set of random trees. The number of trees parameter specifies the required number of trees. The resulting model is a voting model of all the random trees. The goal is to create a classification model that predicts the value of a target attribute (often called class or label) based on several input attributes of the ExampleSet. Each interior node of the tree corresponds to one of the input attributes. The number of edges of a nominal interior node is equal to the number of possible values of the corresponding input attribute. Outgoing edges of numerical attributes are labeled with disjoint ranges. Each leaf node represents a value of the label attribute given the values of the input attributes represented by the path from the root to the leaf. Pruning is a technique in which leaf nodes that do not add to the discriminative power of the tree are removed. This is done to convert an over-specific or over-fitted tree to a more general form in order to enhance its predictive power on unseen datasets. Pre-pruning is a type of pruning performed parallel to the tree creation process. Post-pruning, on the other hand, is done after the tree creation process is complete.

#### Input

• training set (Data Table) This input port expects an ExampleSet. It is the output of the Retrieve operator in the attached Example Process. The output of other operators can also be used as input.

#### Output

• model (Random Forest Model) The Random Forest model is delivered from this output port. This model can be applied on unseen data sets for the prediction of the label attribute. This model is a voting model of all the random trees
• example set (Data Table) The ExampleSet that was given as input is passed without changing to the output through this port. This is usually used to reuse the same ExampleSet in further operators or to view the ExampleSet in the Results Workspace.

#### Parameters

Parameter Description Range
number_of_trees This parameter specifies the number of random trees to generate. integer
criterion Selects the criterion on which attributes will be selected for splitting. It can have one of the following values:
• information_gain: The entropy of all the attributes is calculated. The attribute with minimum entropy is selected for split. This method has a bias towards selecting attributes with a large number of values.
• gain_ratio: It is a variant of information gain. It adjusts the information gain for each attribute to allow the breadth and uniformity of the attribute values.
• gini_index: This is a measure of impurity of an ExampleSet. Splitting on a chosen attribute gives a reduction in the average gini index of the resulting subsets.
• accuracy: Such an attribute is selected for a split that maximizes the accuracy of the whole Tree.
selection
maximal_depth The depth of a tree varies depending upon size and nature of the ExampleSet. This parameter is used to restrict the size of the Decision Tree. The tree generation process is not continued when the tree depth is equal to the maximal depth. If its value is set to '-1', the maximal depth parameter puts no bound on the depth of the tree, a tree of maximum depth is generated. If its value is set to '1', a Tree with a single node is generated. integer
apply_prepruning By default the Decision Tree is generated with prepruning. Setting this parameter to false disables the prepruning and delivers a tree without any prepruning. boolean
minimal_gain The gain of a node is calculated before splitting it. The node is split if its Gain is greater than the minimal gain. Higher value of minimal gain results in fewer splits and thus a smaller tree. A too high value will completely prevent splitting and a tree with a single node is generated. real
minimal_leaf_size The size of a leaf node is the number of examples in its subset. The tree is generated in such a way that every leaf node subset has at least the minimal leaf size number of instances. integer
minimal_size_for_split The size of a node is the number of examples in its subset. The size of the root node is equal to the total number of examples in the ExampleSet. Only those nodes are split whose size is greater than or equal to the minimal size for split parameter. integer
number_of_prepruning_alternatives As prepruning runs parallel to the tree generation process, it may prevent splitting at certain nodes when splitting at that node does not add to the discriminative power of the entire tree. In such a case alternative nodes are tried for splitting. This parameter adjusts the number of alternative nodes tried for splitting when split is prevented by prepruning at a certain node. integer
apply_pruning By default the Decision Tree is generated with pruning. Setting this parameter to false disables the pruning and delivers an unpruned Tree. boolean
confidence This parameter specifies the confidence level used for the pessimistic error calculation of pruning. real
guess_subset_ratio If this parameter is set to true then log(m) + 1 attributes are used, otherwise a ratio should be specified by the subset ratio parameter. boolean
voting_strategy Specifies the prediction strategy in case of dissenting tree model predictions:
• confidence_vote: Selects the class that has the highest accumulated confidence.
• majority_vote: Selects the class that was predicted by the majority of tree models.
selection
subset_ratio This parameter specifies the ratio of randomly chosen attributes to test. real
use_local_random_seed This parameter indicates if a local random seed should be used for randomization. Using the same value of local random seed will produce the same randomization. boolean
local_random_seed This parameter specifies the local random seed. This parameter is only available if the use local random seed parameter is set to true. integer