DATA Science

Published in: Big Data & Hadoop

Rahul A / Pune

10 years of teaching experience

Qualification: B.Tech/B.E. (mit - 2010)

Teaches: Computer Science, Mathematics, Physics, Statistics, Big Data & Hadoop, AIEEE, IIT JEE Mains, .Net, C# (C Sharp), Java And J2EE, Python Programming, Java Script, PHP And MySQL

  1. The algorithm of Random Forest

     Random forest works like a bootstrapping algorithm applied to decision tree (CART) models. Say we have 1,000 observations in the complete population, with 10 variables. Random forest builds multiple CART models, each on a different sample and a different set of initial variables. For instance, it takes a random sample of 100 observations and 5 randomly chosen initial variables to build one CART model. It repeats the process (say) 10 times and then makes a final prediction for each observation. The final prediction is a function of the individual predictions; it can simply be their mean.

     Back to the case study

     Disclaimer: the numbers in this article are illustrative.

     Mexico has a population of 118 MM. Say the random forest algorithm picks 10k observations, with only one variable (for simplicity), to build each CART model. In total, 5 CART models are built, each with a different variable. In a real-life problem, you will have larger population samples and different combinations of input variables.

     Salary bands:
     Band 1: Below $40,000
     Band 2: $40,000 - $150,000
     Band 3: More than $150,000

     Following are the outputs of the 5 different CART models.

     CART 1: Variable: Age
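  2. [Tables: salary band distribution produced by each CART model; only part of each table survives.]

     CART 1 (Variable: Age): age groups are Below 18, 19-27, 28-40, 40-55 and More than 55. Of the Band 1 row, only the values 85% and 70% survive.
     CART 2 (Variable: Gender): Band 1 is 70% for Male and 75% for Female.
     CART 3 (Variable: Education) and the remaining CART tables: [values missing]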
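The sampling procedure described in the algorithm section can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the article's implementation: each "CART model" is replaced by a simple frequency table over one randomly chosen variable (mirroring the one-variable simplification of the case study), samples are drawn with replacement, and the names `fit_single_variable_model`, `build_forest` and the toy rows are hypothetical.

```python
import random
from collections import Counter, defaultdict


def fit_single_variable_model(rows, variable, target):
    """Stand-in for a one-variable CART: band frequencies per value of `variable`."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[variable]][row[target]] += 1
    # Convert raw counts into a probability distribution for each value.
    return {
        value: {band: n / sum(c.values()) for band, n in c.items()}
        for value, c in counts.items()
    }


def build_forest(rows, variables, target, n_models, sample_size, seed=0):
    """Build `n_models` models, each on its own random sample and random variable."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_models):
        sample = rng.choices(rows, k=sample_size)  # assumption: sampling with replacement
        variable = rng.choice(variables)           # one randomly chosen variable per model
        forest.append((variable, fit_single_variable_model(sample, variable, target)))
    return forest


# Hypothetical toy data, not taken from the article.
toy_rows = [
    {"gender": "Male", "education": "Diploma", "band": 1},
    {"gender": "Male", "education": "Graduate", "band": 2},
    {"gender": "Female", "education": "Diploma", "band": 1},
    {"gender": "Female", "education": "Graduate", "band": 3},
    {"gender": "Male", "education": "Diploma", "band": 1},
    {"gender": "Female", "education": "Graduate", "band": 2},
]

forest = build_forest(toy_rows, ["gender", "education"], "band",
                      n_models=5, sample_size=6, seed=1)
for variable, table in forest:
    print(variable, table)
```

A real random forest grows full decision trees rather than lookup tables; the sketch only shows how each model sees a different sample and a different variable.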
  3. Using these 5 CART models, we need to come up with a single set of probabilities of belonging to each of the salary classes. For simplicity, we will just take the mean of the probabilities in this case study. Other than a simple mean, a vote method can also be used to come up with the final prediction.

     To come up with the final prediction, let's locate the following profile in each CART model:
     1. Age: 35 years
     2. Gender: Male
     3. Highest educational qualification: Diploma holder
     4. Industry: Manufacturing
     5. Residence: Metro

     For each of these CART models, the distribution across salary bands is read off for the matching group: Age 28-40, Gender Male, Education Diploma, Industry Manufacturing, Residence Metro.

     [Table: salary band distribution per CART model, with a final probability column; only part of the values survive. Band 1: 60%, 70%. Band 2: 23%, 27%, 20%, final probability 24%. Band 3: 6%, 5%.]

     The final probability is simply the average of the probabilities for the same salary band across the different CART models. As this analysis shows, there is a 70% chance of this individual falling in class 1 (less than $40,000) and around a 24% chance of falling in class 2.

     End Notes

     Random forest gives much more accurate predictions than simple CART/CHAID or regression models in many scenarios. These cases generally have a high number of predictive variables and a huge sample size. This is because it captures the variance of several input variables at the same time and enables a high number of observations to participate in the prediction.
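The two ways of combining the models mentioned above, a simple mean and a vote, might look like the sketch below. The five distributions are hypothetical placeholders, not the values from the article's table, and the helper names `average_probabilities` and `majority_vote` are assumptions made for illustration.

```python
from collections import Counter


def average_probabilities(per_model):
    """Mean probability of each salary band across all models."""
    bands = sorted({band for dist in per_model for band in dist})
    return {
        band: sum(dist.get(band, 0.0) for dist in per_model) / len(per_model)
        for band in bands
    }


def majority_vote(per_model):
    """Each model votes for its most probable band; the most common vote wins."""
    votes = Counter(max(dist, key=dist.get) for dist in per_model)
    return votes.most_common(1)[0][0]


# Hypothetical band distributions for one profile, one dict per CART model
# (keys are salary bands 1 to 3). These are NOT the article's numbers.
per_model = [
    {1: 0.60, 2: 0.30, 3: 0.10},  # e.g. model built on Age
    {1: 0.70, 2: 0.25, 3: 0.05},  # e.g. model built on Gender
    {1: 0.80, 2: 0.15, 3: 0.05},  # e.g. model built on Education
    {1: 0.65, 2: 0.25, 3: 0.10},  # e.g. model built on Industry
    {1: 0.75, 2: 0.20, 3: 0.05},  # e.g. model built on Residence
]

mean_probs = average_probabilities(per_model)
vote = majority_vote(per_model)
print(mean_probs)
print(vote)
```

Averaging keeps a full probability for every band, while voting returns only the winning band, so the mean is the more natural choice when the goal is a probability per salary class, as in this case study.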