Creating a Random Forest from scratch
Tree-based methods segment the predictor space based on splitting rules and can be applied to both Classification and Regression. Individually, these models tend to have high variance - their results change significantly when trained on different training sets, so a single tree is not very reliable for prediction. To overcome this, we use ensemble classifiers, which fit many individual classifiers and combine their results to yield a single prediction. One such ensemble classifier is the Random Forest.
To gain a solid understanding of how Decision Trees and Random Forests work, we will build a Random Forest Classifier from scratch, i.e. without making use of existing Machine Learning libraries.
We will be using the UCI Breast Cancer dataset and performing Binary Classification (0/1) to determine if a tumour is benign or malignant.
We make use of the ID3 Algorithm to build the decision trees. Read more about it here. From the linked reference, the basic working of the ID3 algorithm is as follows-
- In the decision tree, each node corresponds to an attribute, each arc to a possible value of that attribute
- A leaf at any point specifies the expected value of the outcome for the records described by the path from the root to that leaf
- Each node in the tree should be associated with the attribute which is most informative among the attributes not yet considered. The informativeness of the attribute in determining the outcome is measured by Entropy
Let’s go through the ID3 algorithm in detail as we define our DecisionTree class and add its methods one by one.
To start, we define a Node class and initialize the tree to have a root node. We also define a learn method that calls a recursive function, growTree, to train the tree and store the result within it.
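A minimal sketch of these classes could look like the following; the attribute names, and the extra classify method we will need later for prediction, are my own choices rather than anything prescribed by the assignment:

```python
import numpy as np

class Node:
    """A single node of the decision tree."""
    def __init__(self, predictor=None, split_val=None, left=None, right=None, outcome=None):
        self.predictor = predictor  # index of the predictor used to split at this node
        self.split_val = split_val  # value of that predictor at which we split
        self.left = left            # subtree for observations below the split
        self.right = right          # subtree for observations at or above the split
        self.outcome = outcome      # 0/1 outcome if this node is a leaf, else None

class DecisionTree:
    def __init__(self):
        self.root = Node()  # initialize the tree with an empty root node

    def learn(self, X, y):
        """Train the tree by recursively growing it from the training data."""
        self.root = self.growTree(np.asarray(X), np.asarray(y))

    def classify(self, record):
        """Predict the outcome for a single record by walking from the root to a leaf."""
        node = self.root
        while node.outcome is None:
            node = node.left if record[node.predictor] < node.split_val else node.right
        return node.outcome
```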
Prior to building the growTree function, let’s go over ID3 basics-
Assume a training set T with p predictors. We are trying to predict a binary outcome 0/1 (Success/Failure).
We randomly choose m of the p predictors and iterate over them. For each predictor, we choose the split that gives the highest information gain, and return a tree with the best such predictor as its node.
Here’s the function definition of growTree-
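The version below is a sketch rather than the exact assignment code: it is written as a method of the DecisionTree class above, relies on the partition_classes and entropy helpers defined later in util.py, and choices such as returning the majority outcome when no split improves the gain are my own.

```python
# growTree belongs inside the DecisionTree class above; it assumes numpy is
# imported as np and that the module also imports:
#   from util import partition_classes, entropy

def growTree(self, X, y):
    # Base case: T is empty -> return a single node with value 0/Failure
    if len(y) == 0:
        return Node(outcome=0)

    # Base case: all observations share a single outcome -> return a leaf with it
    if len(np.unique(y)) == 1:
        return Node(outcome=int(y[0]))

    # Randomly choose m = sqrt(p) of the p predictors to consider at this node
    p = X.shape[1]
    m = max(1, int(np.sqrt(p)))
    candidates = np.random.choice(p, size=m, replace=False)

    parent_entropy = entropy(y)
    best_gain, best_predictor, best_split, best_parts = 0.0, None, None, None

    # For each candidate predictor, try every observed value as a split point
    for j in candidates:
        for split_val in np.unique(X[:, j]):
            X_left, X_right, y_left, y_right = partition_classes(X, y, j, split_val)
            # Information gain = entropy before the split - weighted entropy after it
            weighted = (len(y_left) * entropy(y_left) +
                        len(y_right) * entropy(y_right)) / len(y)
            gain = parent_entropy - weighted
            if gain > best_gain:
                best_gain, best_predictor, best_split = gain, j, split_val
                best_parts = (X_left, X_right, y_left, y_right)

    # If no split improves on the parent, return a leaf with the majority outcome
    if best_predictor is None:
        return Node(outcome=int(round(np.mean(y))))

    # Otherwise recurse on the left and right partitions
    X_left, X_right, y_left, y_right = best_parts
    return Node(predictor=best_predictor, split_val=best_split,
                left=self.growTree(X_left, y_left),
                right=self.growTree(X_right, y_right))
```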
Start by taking care of some exceptions-
- If T is empty, return a single node with value 0/Failure
- If T consists of observations with a single outcome, return a node with that outcome value
One feature that sets Random Forest apart from Bagging - another ensemble model combining several decision trees with the goal of reducing model variance - is the selection of a subset of predictors in each tree. This number (m) is typically taken to be the square root of the total number of predictors.
While iterating over the m predictors, we use the following helper functions-
- partition_classes - partition the dataset at a split point for a predictor
- entropy - computes entropy; this is a measure of the disorder in a dataset
Through each call to growTree, we find the predictor that delivers the highest information gain, the split point for that predictor, and the resulting left (less than the split) and right (greater than the split) datasets. We define the functions used in a separate util.py file.
Entropy is a measure of randomness of the information being processed and is calculated as \(E(S) = \sum_{i=1}^{c} -p_i \log_2 p_i\)
We can think of Entropy as a measure of the purity of the dataset. If the outcome can take 2 different values, e.g. A and B, entropy is low when the dataset contains only one class (either A or B) and high when it contains an equal distribution of both A and B.
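As a quick check on the formula: a dataset split evenly between A and B has \(E(S) = -0.5\log_2 0.5 - 0.5\log_2 0.5 = 1\), while a dataset containing only one class has \(E(S) = -1\log_2 1 = 0\).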
Information Gain is calculated as decrease in entropy after a data-set split on an attribute, represented by the formula-
\(IG(Y,X) = E(Y) - E(Y|X)\)
This gives us the reduction in uncertainty of Y (whether the tumour is benign or malignant) given knowledge of X (a predictor value, e.g. Clump Thickness = 2). Constructing a decision tree is based on finding the attribute and split value that return the highest information gain.
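As a hypothetical example, if the parent dataset has entropy 1 and a candidate split sends 60% of the observations to a left child with entropy 0.6 and 40% to a right child with entropy 0.8, then \(E(Y|X) = 0.6 \times 0.6 + 0.4 \times 0.8 = 0.68\), giving \(IG = 1 - 0.68 = 0.32\).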
Define these functions in a python file called util.py -
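One possible implementation, assuming numeric predictor values and the convention that values strictly below the split point go to the left partition:

```python
# util.py
import numpy as np

def partition_classes(X, y, split_attribute, split_val):
    """Partition the dataset at a split point for one predictor: rows whose value of
    split_attribute is less than split_val go left, the rest go right."""
    X, y = np.asarray(X), np.asarray(y)
    mask = X[:, split_attribute] < split_val
    return X[mask], X[~mask], y[mask], y[~mask]

def entropy(class_y):
    """Entropy of a vector of outcomes: E(S) = sum over classes of -p_i * log2(p_i)."""
    class_y = np.asarray(class_y)
    if len(class_y) == 0:
        return 0.0
    result = 0.0
    for c in np.unique(class_y):
        p = np.mean(class_y == c)
        result -= p * np.log2(p)
    return result
```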
Now all that’s left to do is build a RandomForest class (sketched after this list) with methods for
- Initializing the RandomForest object as a list of DecisionTree class objects
- Bootstrapping - a type of random sampling with replacement for creating a different dataset for each tree
- Fitting multiple decision trees by calling on the learn method defined earlier
- Classifying an OOB (out of bag) sample for error estimation using the mode of the trees' predictions (if this were a regression problem, we would use the mean value)
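Here is a sketch of the RandomForest class, built on the DecisionTree defined earlier; the method names bootstrapping, fitting and voting are my own choices:

```python
import numpy as np
from collections import Counter

class RandomForest:
    def __init__(self, num_trees):
        # The forest is simply a list of DecisionTree objects
        self.num_trees = num_trees
        self.decision_trees = [DecisionTree() for _ in range(num_trees)]
        self.bootstraps_datasets = []  # bootstrapped predictors for each tree
        self.bootstraps_labels = []    # corresponding outcomes for each tree

    def bootstrapping(self, X, y):
        """Sample with replacement to create a different dataset for each tree."""
        X, y = np.asarray(X), np.asarray(y)
        n = len(X)
        for _ in range(self.num_trees):
            idx = np.random.choice(n, size=n, replace=True)
            self.bootstraps_datasets.append(X[idx])
            self.bootstraps_labels.append(y[idx])

    def fitting(self):
        """Fit each decision tree on its own bootstrapped dataset."""
        for tree, Xb, yb in zip(self.decision_trees,
                                self.bootstraps_datasets, self.bootstraps_labels):
            tree.learn(Xb, yb)

    def voting(self, X):
        """Classify each record using only the trees for which it is out of bag,
        taking the mode of their predictions (the mean would be used for regression)."""
        predictions = []
        for record in np.asarray(X):
            votes = []
            for i, tree in enumerate(self.decision_trees):
                in_bag = any(np.array_equal(record, row)
                             for row in self.bootstraps_datasets[i])
                if not in_bag:  # record is out of bag for this tree
                    votes.append(tree.classify(record))
            # Fall back to 0 if the record appears in every bootstrap sample
            predictions.append(Counter(votes).most_common(1)[0][0] if votes else 0)
        return np.array(predictions)
```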
This can be used by the main function (sketched after this list) to accomplish the following
- Initialize a RandomForest object and specify the number of trees used
- Load the data and create the bootstrapped datasets
- Build the decision trees/fit the model
- Validate the model using an OOB error estimate
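Putting it together, a main function along these lines would exercise the pieces above; the CSV file name, the number of trees, and the assumption that the last column holds the 0/1 outcome are placeholders for your own setup:

```python
import csv
import numpy as np

def main():
    # Load the data; file name and column layout are placeholders
    X, y = [], []
    with open("breast_cancer.csv") as f:
        for row in csv.reader(f):
            X.append([int(value) for value in row[:-1]])
            y.append(int(row[-1]))

    # Initialize a RandomForest object and specify the number of trees used
    forest = RandomForest(num_trees=10)

    # Create the bootstrapped datasets and build the decision trees / fit the model
    forest.bootstrapping(X, y)
    forest.fitting()

    # Validate the model using an OOB error estimate
    predictions = forest.voting(X)
    accuracy = np.mean(predictions == np.array(y))
    print("OOB accuracy: %.4f" % accuracy)
    print("OOB error estimate: %.4f" % (1 - accuracy))

if __name__ == "__main__":
    main()
```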
I mentioned that Random Forest only considers a random subset of predictors (m < p) for each tree. This may seem odd, but in reality it targets the problem of high variance adeptly. Bagging (m = p), for instance, would create multiple correlated trees in which the strongest predictors are always used; while this gives us high quality individual trees, it does not do much to reduce the variance of the ensemble. Random Forest overcomes this issue and makes the average of the resulting trees more reliable.
Thank you for reading. This programming assignment was submitted as coursework for CSE6242 Data and Visual Analytics (Fall 2016), Georgia Tech College of Computing.