Introduction
This is the first post in our series "Beyond Fit and Predict". In this series, we will learn multiple concepts that are less often discussed but hold very important information about the Model. We will look at all the important Machine Learning algorithms, i.e. DecisionTree, LogisticRegression, LinearRegression, and Support Vector Machines, from a new perspective.
The prerequisite is a basic understanding of the algorithm, as we will not discuss "how the algorithm works". In this post, we will learn about RandomForest (RF).
We will understand,
- Why it is said that RandomForest never overfits
- What are Bagging and OOB
- Should we always trust the Feature Importance result of the RF algorithm
- How to look into individual Trees using Scikit-Learn
- Understanding a few of the important parameters of Scikit-Learn, e.g. max_features
What are Bootstrap and OOB
Bootstrap is a sampling approach where we resample with replacement. The advantage of this approach is that the samples are independent of each other. This leads to high variance among the different samples, which is the key requirement when building a variance-reducing Ensemble model, i.e. Bagging or RandomForest. Let's see the steps:
a. Select a single data point
b. Put it back
c. Select the next data point
d. Repeat
Yes, you are thinking absolutely right. This will result in many data points coming out multiple times and a few data points not coming out at all.
The data points which didn't get picked in a particular sampling round are called the Out-of-Bag (OOB) sample. This gives us a readily available validation sample when we use Bootstrap. At the same time, the sample on which we train our model has a smaller number of effective data points and hence more Bias. Let's generate the same using Scikit-Learn.
from sklearn.utils import resample
x_train = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
# We have kept the sample size the same as the training data, as we do in Bagging
sample = resample(x_train, replace=True, n_samples=len(x_train))
oob = [elem for elem in x_train if elem not in sample]
sample
Result
[10, 6, 6, 5, 3, 3, 10, 2, 4, 8]
6, 3, 10 are repeated. [1, 7, 9] are OOB.
In this way, you may generate as many samples as you want and build a RandomForest(actually Bagging model) from scratch using a DecisionTree model.
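For example, here is a minimal sketch (not code from this series, using an illustrative make_classification dataset) of a hand-rolled Bagging ensemble built from bootstrap samples and DecisionTree models:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

# Illustrative data - any tabular dataset would do
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

n_trees = 25
trees = []
for i in range(n_trees):
    # Bootstrap sample of the same size as the training data
    X_boot, y_boot = resample(X_train, y_train, replace=True,
                              n_samples=len(X_train), random_state=i)
    trees.append(DecisionTreeClassifier(random_state=i).fit(X_boot, y_boot))

# Bagging prediction = majority vote over the individual Trees
all_preds = np.array([tree.predict(X_test) for tree in trees])
bagged_pred = (all_preds.mean(axis=0) >= 0.5).astype(int)
print("Bagging accuracy:", (bagged_pred == y_test).mean())

Each Tree sees a different bootstrap sample, and the final prediction is a simple majority vote, which is essentially what a Bagging ensemble does for you.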
What is the % of OOB data points
You might be pondering over this question. It's roughly 37%. Why 37%?
Read this great SE thread for the detailed answer (Hint - It's a simple probability calculation)
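In short: a given data point is missed by a single draw with probability (1 - 1/n), so it is missed by all n draws with probability (1 - 1/n)^n, which tends to 1/e ≈ 0.368 as n grows. If you prefer to verify it empirically, here is a small simulation sketch (the sample size and seeds are arbitrary choices):

import numpy as np
from sklearn.utils import resample

n = 10_000
x = np.arange(n)

# Fraction of points that never appear in a bootstrap sample of size n
oob_fractions = []
for seed in range(20):
    sample = resample(x, replace=True, n_samples=n, random_state=seed)
    oob_fractions.append(1 - len(np.unique(sample)) / n)

print(round(np.mean(oob_fractions), 3))   # ~0.368, i.e. (1 - 1/n)^n -> 1/e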
Bagging Vs RandomForest
You can make an Ensemble of Trees based on bagged samples, as we did in the last section. RandomForest goes one step further and adds more Randomization (less correlation among Trees) by picking a "Random" subset of all the Features prior to each split.
An excerpt from "An Introduction to Statistical Learning" [Link]. A very good read for all who have just started their ML journey.
Random forests provide an improvement over bagged trees by way of a small tweak that decorrelates the trees. As in bagging, we build a number of decision trees on bootstrapped training samples. But when building these decision trees, each time a split in a tree is considered, a random sample of m predictors is chosen as split candidates from the full set of p predictors. The split is allowed to use only one of those m predictors. A fresh sample of m predictors is taken at each split.
max_features is the parameter for this task in scikit-learn's RandomForest. You may apply these words of wisdom from the Scikit-Learn documentation [Link] for this parameter:
The latter is the size of the random subsets of features to consider when splitting a node. The lower the greater the reduction of variance, but also the greater the increase in bias. Empirical good default values are max_features=None (always considering all features instead of a random subset) for regression problems, and max_features="sqrt" (using a random subset of size sqrt(n_features)) for classification tasks (where n_features is the number of features in the data)
Needless to say, this Randomization will further increase the Bias.
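As a quick illustration of how the parameter is passed (a sketch only; the right value is something you should tune on your own data):

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Classification: consider sqrt(n_features) candidate features at each split
clf = RandomForestClassifier(max_features="sqrt", random_state=0)

# Regression: consider all features at each split (no extra randomization)
reg = RandomForestRegressor(max_features=None, random_state=0)

# You can also pass an int (exact count) or a float (fraction of features)
clf_frac = RandomForestClassifier(max_features=0.3, random_state=0)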
Feature Importance and its issues
Now you have built your RandomForest and want to use its Feature importance capability.
Let us do a quick recap of the concept. Feature importance tells us about the importance of each feature based on the Model's perception while fitting itself on the training data.
It's based on Impurity reduction, i.e. how clean the split is when a particular feature is picked for the split. This is aggregated over all the splits and then over all the Trees (in the case of an Ensembled model).
Here we will not code the FI calculation but will understand a few of its caveats so that we are better aware while using it.
Correlated Features - This is a common issue for most approaches to calculating FI; it exists with LinearRegression's coefficients too. If your data has two correlated Features, then both will actually share the FI and you might get an incorrect perception while reviewing the FI ranking.
Let's say we have 4 Features and the FI scores are - 0.5, 0.3, 0.1, 0.1
To this data, we add a new feature that is highly correlated to the first feature. What this means is that wherever the Training process split on Feature#1 while building the previous model, it can now split on either Feature#1 or Feature#2, since both Features are correlated and hence will have similar split quality.
So, the model can pick Feature#1 and Feature#2 with almost equal probability, i.e. ~50% of the time, and hence the total share of the impurity reduction will be split between both features.
So, the new FI scores will be - 0.25, 0.25, 0.3, 0.1, 0.1
It means that Feature#3 now becomes the most important feature, which might be deceiving in most cases.
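The post does not walk through this experiment in code, but if you want to see the effect yourself, a minimal sketch is to append a near-copy of an informative feature and compare feature_importances_ before and after (the dataset and noise level are illustrative assumptions):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=4, n_informative=2,
                           n_redundant=0, random_state=0)

rf = RandomForestClassifier(random_state=0).fit(X, y)
print("Original importances:", rf.feature_importances_.round(2))

# Add a near-copy of Feature#0 (tiny noise keeps it from being identical)
X_corr = np.hstack([X, X[:, [0]] + np.random.normal(0, 0.01, (len(X), 1))])
rf_corr = RandomForestClassifier(random_state=0).fit(X_corr, y)
print("With correlated copy:", rf_corr.feature_importances_.round(2))
# Feature#0's importance is now roughly split with its copy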
Biased towards high-Cardinality features - This is another subtle issue that might distort your understanding of the FI score, and it is not as simple to comprehend as the previous one. The point is that the way the FI score is calculated in the Gini-Impurity-reduction approach favours features which have a high Cardinality over features which do not.
Check the examples in this Blog post - Beware Default Random Forest Importances.
Why it happens
Features with more potential cut-points are more likely to produce a good criterion value by chance. As a result, variables of higher cardinality are more likely to be chosen for splitting than those of lower cardinality. But this does not happen in the initial splits, as the important features have cut-offs which reduce the Impurity significantly more than a split chosen by Random chance could. This tendency dominates when the sample size reduces, i.e. after a few splits.
Check this excerpt from Understanding Random Forests: From Theory to Practice - Sec. 7.2
".....the origin of the problem stems from the fact that node impurity is misestimated when the size Nt of the node samples is too small. To a large extent, the issue is aggravated by the fact that trees are fully developed by default, making impurity terms collected near the leaves usually unreliable."
RandomForest never overfits?
This question surfaces quite regularly on many discussion forums, i.e. whether RandomForest never overfits, since something similar to this was claimed by its inventor, Leo Breiman.
This is what the author said in the paper [Link]
Random forests are an effective tool in prediction. Because of the Law of Large Numbers, they do not overfit
But neither is this the case, nor did the Author mean so. Let's review another point from the paper,
For random forests, an upper bound can be derived for the generalization error in terms of two parameters that are measures of how accurate the individual classifiers are and of the dependence between them. The interplay between these two gives the foundation for understanding the workings of random forests
As we also discussed, one of the key pillars is the independence of the different Trees. The less the correlation, the better the Forest. The second point simply means that if the individual Trees are not doing a great job, e.g. they are Overfitting, then neither will the Forest.
Hence, RF will definitely overfit if the individual fitted Trees do so.
This is a simple exercise that you can easily try with a small piece of code. You should!
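One way to try it (a sketch with an illustrative noisy dataset, not the author's experiment): compare the train/test accuracy of a Forest of fully grown Trees against one with depth-limited Trees.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Noisy labels (flip_y) make the effect easier to see
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in [None, 3]:
    rf = RandomForestClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    print(f"max_depth={depth}: train={rf.score(X_train, y_train):.2f}, "
          f"test={rf.score(X_test, y_test):.2f}")

# Fully grown Trees (max_depth=None) typically show a much larger train/test gap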
Some coding gems
We, at 10xAI, believe in balanced learning across the axes of Theory-Code and Width-Depth. So, let's review some code-related aspects of RandomForest.
Get all the Trees - If you want, you may individually review all the Trees, i.e. their predictions for individual data points. Using that, you may be able to figure out the confidence of the RF for a particular prediction. This is discussed in Jeremy Howard's book Deep Learning for Coders too.
You can get the individual trained Trees using the model.estimators_ attribute.
# Score - Individual Tree on the full test data
tree_score = np.array([accuracy_score(y_test, tree.predict(x_test)) for tree in model.estimators_])  #1

# Probability - Individual Tree on a single test data point
# Note: 30 is the number of features in this dataset
data_tree_proba = np.array([tree.predict_proba(data.reshape(-1, 30))
                            for tree in model.estimators_
                            for data in x_test.to_numpy()])  #2

# To a DataFrame
data_tree_proba = data_tree_proba[:, :, 0].reshape(-1, model.n_estimators)
data_tree_proba = pd.DataFrame(data_tree_proba,
                               columns=["col_" + str(i + 1) for i in range(model.n_estimators)])
Code explanation
#1 - A simple List comprehension over all the Trees to compute the score on the test data
#2 - A nested List comprehension over all the Trees and each test data point to predict the class probability
Get OOB score - The model will calculate the OOB score on the training dataset when we set the flag "oob_score=True" while initializing the Model. Then we can get the value using the "oob_score_" attribute.
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=4, n_informative=2,
                           n_redundant=0, random_state=0)
clf = RandomForestClassifier(max_depth=2, random_state=0, oob_score=True)
clf.fit(X, y)
clf.oob_score_
Build Trees in parallel - The key advantage of the Bagging-based approach over the Boosting-based approach is that in the former, Trees are built independently and hence can be built in parallel. In scikit-learn, you just need to set "n_jobs=-1". Below is a code snippet run on a 2-Core machine.
clf = RandomForestClassifier(n_estimators=500)
%timeit clf.fit(X, y)
# 1 loop, best of 3: 16 s per loop

clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
%timeit clf.fit(X, y)
# 1 loop, best of 3: 10.1 s per loop
Conclusion and Summary
We are sure that you now have an in-depth understanding of RandomForest, not just in terms of how it works but also many of its inner intricacies. You also understand the key parameters and attributes of Scikit-Learn's RandomForest well.
We will suggest you read the complementary user guide of Scikit-Learn models e.g. [This]. These are quite informative and also contain the link to original papers or best references.
Now you have all the required knowledge to answer almost any question regarding RandomForest and bagging. We will end this post with a question :-)
"....Good values for max_features are -
1. "all features" for regression problems, and 2. "sqrt(n_features)" for classification tasks.....".
So the question is - why doesn't the random sampling of Features work for the Regression task?