Random forest and gradient boosting are leading data mining techniques. They are designed to improve upon the poor predictive accuracy of decision trees. Random forest is by far the more popular, if the Google Trends chart below is anything to go by.
Correlation between predictors is the data miners’ bugbear. It is an inevitable fact of life in many situations. Multicollinearity can lead to misleading conclusions and degrade predictive power. A natural question is: Which approach handles multicollinearity better? Random forest or gradient boosting?
Suppose there are observations and potential predictors . Assume that
where is the amplitude of Gaussian noise (mean zero and unit variance). Only 2 of the potential predictors actually play a role in generating the observations. The are independently distributed () with the exception of which is correlated with (correlation ).
As the correlation increases, it becomes harder for a data mining algorithm to ignore , even though is not present in (A) and it is not a “true” explanatory variable.
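The setup described above can be sketched in a few lines of Python. This is only an illustration of the data-generating idea, not the exact experiment: the linear form of the response, the sample size `n_obs`, the number of predictors `n_pred`, the noise amplitude `sigma`, the correlation `rho`, and the choice of which columns are the "true" predictors are all assumed values picked for the example.

```python
import math
import random


def simulate(n_obs=2000, n_pred=6, rho=0.9, sigma=1.0, seed=0):
    """Generate predictors and a response for a toy multicollinearity study.

    Assumed setup (hypothetical): x[0] and x[1] are the true predictors,
    x[2] is correlated with x[0] (correlation rho) but does not enter y,
    and the remaining predictors are independent standard normal noise.
    """
    rng = random.Random(seed)
    rows, y = [], []
    for _ in range(n_obs):
        x = [rng.gauss(0.0, 1.0) for _ in range(n_pred)]
        # Mix x[0] with fresh noise so that corr(x[0], x[2]) = rho
        # while x[2] keeps unit variance.
        x[2] = rho * x[0] + math.sqrt(1.0 - rho ** 2) * x[2]
        # Assumed linear response: only x[0] and x[1] contribute.
        y.append(x[0] + x[1] + sigma * rng.gauss(0.0, 1.0))
        rows.append(x)
    return rows, y


def correlation(a, b):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


rows, y = simulate()
x0 = [r[0] for r in rows]
x2 = [r[2] for r in rows]
x3 = [r[3] for r in rows]
print(round(correlation(x0, x2), 2))  # close to rho
print(round(correlation(x2, y), 2))   # clearly nonzero, although x[2] is not in y
print(round(correlation(x3, y), 2))   # near zero for a pure noise predictor
```

The spurious predictor ends up visibly correlated with the response even though it plays no part in generating it, which is what tempts a tree-based learner to split on it. Feeding `rows` and `y` to a random forest and to a gradient boosting implementation, then comparing their variable importances, would be one way to run the kind of comparison discussed here.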
Variable importance charts for this class of problem show that gradient boosting does a better job of handling multicollinearity than random forest. The complex trees used by random forest tend to spread variable importance more widely, particularly to variables which are correlated with the “true” predictors. The simpler base learner trees of gradient boosting (4 terminal nodes in the above example) seem to have greater immunity from the evils of multicollinearity.
Random forest is an excellent data mining technique, but its greater popularity compared to boosting seems unjustified.
In stock market investing, beta is a widely used empirical measure of the riskiness of a particular stock or sector relative to the market as a whole. For example, an advertising stock has beta > 1 if it tends to outperform during a stock market rally but underperform during a sell-off. On the other hand, a utility stock has beta < 1 if it tends to follow the direction of the market but with reduced volatility.
I have not seen it used before, but “beta” might also be a useful measure of agricultural risk. Simply replace stock prices by crop yields of some agricultural unit (such as farm or county). The map shows the results of a calculation of corn yield “betas” at county level for six US corn belt states for the period 1910-2012. Data are from the National Agricultural Statistics Service (NASS). Here the benchmark index against which the yields of each county are compared is the total annual corn belt yield.
High values of beta are found in the Missouri River basin, and low values in Eastern Indiana.
Mathematically, $\beta = \operatorname{Cov}(r_s, r_m) / \operatorname{Var}(r_m)$, where $r_s$ and $r_m$ are individual stock and market returns respectively.
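As a rough illustration, beta can be computed as the covariance of a unit's returns with the benchmark's returns, divided by the variance of the benchmark's returns. The sketch below uses only the standard library. The function names `beta` and `pct_changes` and all of the numbers are made up for the example. For crop yields, it assumes that simple year-on-year percentage changes stand in for stock returns, which may not be how the map was calculated.

```python
def pct_changes(levels):
    """Turn a series of levels (prices or yields) into period-on-period changes."""
    return [(later - earlier) / earlier for earlier, later in zip(levels, levels[1:])]


def beta(unit_changes, benchmark_changes):
    """Beta = Cov(unit, benchmark) / Var(benchmark)."""
    n = len(unit_changes)
    if n != len(benchmark_changes) or n < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mean_u = sum(unit_changes) / n
    mean_b = sum(benchmark_changes) / n
    cov = sum((u - mean_u) * (b - mean_b)
              for u, b in zip(unit_changes, benchmark_changes))
    var = sum((b - mean_b) ** 2 for b in benchmark_changes)
    if var == 0:
        raise ValueError("benchmark series has zero variance")
    # The 1/(n-1) normalising factors cancel in the ratio.
    return cov / var


# Made-up benchmark changes, plus two hypothetical units that amplify
# or dampen the benchmark's moves.
benchmark = [0.10, -0.05, 0.02, -0.08, 0.06]
volatile_unit = [2.0 * r for r in benchmark]
steady_unit = [0.5 * r for r in benchmark]

print(beta(volatile_unit, benchmark))  # above 1: exaggerates benchmark moves
print(beta(steady_unit, benchmark))    # below 1: follows with reduced volatility
```

To apply this to yields, `pct_changes` would be called on a county's annual yield series and on the benchmark series over the same years before passing both to `beta`.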