Decision Forest Classifier

When it comes to classification, there are many algorithms out there that can be put into practice. In my experience, there is never one right algorithm. It all depends on many factors, from how the data is preprocessed to the proportion of outliers in the dataset.

The Decision Forest Classifier (known in scikit-learn as RandomForestClassifier) is one of my favorite algorithms in the current slate of classification algorithms. The major reason is the way it deduces the class. It's really intuitive, I'd say.

Suppose you are told to guess the age of a random person on the street. How would you go about the task? There would surely be a number of parameters you would consider in order to come up with your prediction.


First, you'd look at his hair: if it's grey, he is likely above 35-40; otherwise, younger. Then, looking at his face: if he has wrinkles, chances are he is getting older, and this combined with the hair color gives us at least a decent estimate of his current age. If you classify the person this way, the task gets much easier. The figure might give you a better understanding of the working of a decision tree.
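The hair-and-wrinkles reasoning above is literally a decision tree: a chain of if/else splits, each narrowing down the class. A minimal sketch (the features, thresholds, and age buckets here are made up purely for illustration):

```python
def guess_age_group(grey_hair: bool, wrinkles: bool) -> str:
    """A hand-rolled two-level decision tree for the street example."""
    if grey_hair:
        # Root split: grey hair puts the person in the older branch
        if wrinkles:
            return "60+"
        return "40-60"
    # No grey hair: wrinkles still refine the younger branch
    if wrinkles:
        return "35-40"
    return "under 35"

print(guess_age_group(grey_hair=True, wrinkles=False))  # prints "40-60"
```

A real decision tree learns these splits and thresholds from data instead of having them hard-coded, but the prediction path is exactly this kind of cascade.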

Talking about the Decision Forest: it's nothing but a bunch of decision trees, all of whose results are taken into consideration, typically by majority vote, to decide the final class.

Let's take a real-world example. Companies like Google, TCS, Capgemini, and Directi conduct coding contests at the national level. Most students who finish in the top 10/20 are given an interview call for a job, or in some cases even offered one directly. The people who take the top positions are indeed among the best coders.

But what if the company had a fixed criterion to visit just a few reputed universities like the IITs and NITs and recruit students directly? Even worse, the recruited candidate might not prove to be good at the work. To avoid this, and to get the best people on board, they put a challenge online, open to everyone, irrespective of location. This opens up opportunities for both the employers and the job seekers. That is our Decision Forest! Given a problem, we let many solvers attack it independently, and the answer most of them agree on wins. Similarly, given a dataset, every individual tree takes different parameters into consideration, and the class that the most trees vote for is selected as the prediction.
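The voting idea can be sketched by hand: train several trees, each on its own bootstrap sample of the data with a random feature subset per split, and let the majority decide. This is a rough illustration using scikit-learn's DecisionTreeClassifier on a synthetic dataset, not a replacement for RandomForestClassifier itself:

```python
import numpy as np
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class dataset, just for demonstration
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

rng = np.random.default_rng(0)
trees = []
for _ in range(15):
    # Each tree sees a bootstrap sample of the rows and, via
    # max_features, a random subset of features at every split
    idx = rng.integers(0, len(X), len(X))
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=0)
    tree.fit(X[idx], y[idx])
    trees.append(tree)

def forest_predict(x):
    # Every tree casts a vote; the majority class wins
    votes = [t.predict(x.reshape(1, -1))[0] for t in trees]
    return Counter(votes).most_common(1)[0][0]

pred = forest_predict(X[0])
```

The diversity between trees (different rows, different features) is exactly what makes the combined vote more reliable than any single tree.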

Generally, this classifier doesn't suffer much from overfitting, but if the dataset is too biased it still might. The best remedy is careful sampling of the data and dropping the outliers.
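One common way to drop outliers before training is the interquartile-range rule of thumb: keep only rows within 1.5 × IQR of the quartiles. A small sketch (the column name and toy data are my own, for illustration):

```python
import pandas as pd

def drop_outliers_iqr(df: pd.DataFrame, col: str) -> pd.DataFrame:
    """Drop rows whose value in `col` lies outside 1.5 * IQR of the quartiles."""
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    # Keep rows inside the [q1 - 1.5*IQR, q3 + 1.5*IQR] band
    mask = df[col].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    return df[mask]

data = pd.DataFrame({"age": [22, 25, 27, 24, 26, 23, 120]})
clean = drop_outliers_iqr(data, "age")
# The 120 row is dropped; the six plausible ages remain
```

Whether this helps depends on the dataset; for a forest, pruning extreme rows mostly matters when the bias the paragraph above describes is severe.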

Let’s have a look at the implementation in brief :

# Library - scikit-learn

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Pre-process the dataset: separate the features from the target column
df = pd.read_csv('sample_Database.csv')
X = df.drop(columns=['target'])
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, train_size=0.8, random_state=2)

# Train the forest and predict on the held-out set
clf = RandomForestClassifier(min_samples_split=5, random_state=2)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Accuracy calculation
print("Accuracy Score :", accuracy_score(y_test, y_pred))

I used scikit-learn's RandomForestClassifier for my dataset. It gave me about 85% accuracy, whereas the DecisionTreeClassifier gave me about 80%. So I'd suggest you use the Random Forest in the case of datasets with features that can be bucketed into finite groups. For the same prediction, I tried a dense neural network (TensorFlow), and unexpectedly it didn't perform well, giving me just about 70% accuracy. So, as the numbers suggest, this classifier is actually DOPE 😉
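You can reproduce the same kind of tree-vs-forest comparison on any labeled dataset; here is a sketch on scikit-learn's built-in breast-cancer dataset (the exact scores will differ from the numbers above, which came from my own dataset):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, train_size=0.8, random_state=2)

# Fit a single tree and a whole forest on the same split
tree = DecisionTreeClassifier(random_state=2).fit(X_train, y_train)
forest = RandomForestClassifier(random_state=2).fit(X_train, y_train)

tree_acc = tree.score(X_test, y_test)
forest_acc = forest.score(X_test, y_test)
print(f"tree: {tree_acc:.3f}  forest: {forest_acc:.3f}")
```

On most splits of this dataset the forest edges out the single tree, which matches the pattern reported above.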

