A random forest regressor is one example of an "ensemble" machine learning model. Such models aggregate the predictions of a multiplicity of "weak" learners, meaning models that predict the value of the target variable only slightly better than would be accomplished by always predicting that the variable equals its mean value. They do so in a way that captures the complex interactions between the "target" variable, that is, the variable whose value one seeks to predict, and the "features," that is, the explanatory variables that presumably influence the value of the target variable. In our case, the target variable is the bicycle mode share variable, and the features are the bicycle network density, distance to CBD, university degree share, and blue collar share variables.
I use the word "explanatory" somewhat loosely in that last paragraph. Compared to standard regression models, machine learning models generally do a lousy job providing intelligible insights into exactly how each feature influences the value of the target variable. However, they almost invariably perform better than standard regression models at predicting the value of the target variable from the values of the features. And our goal here is to find ways to accurately predict how many commuters from an area unit will bicycle to work, based on the values of the features of that area unit. Although the random forest regressor will not directly inform us what general impact the accessibility of bicycle infrastructure has on the bike mode share variable, I will later show how it can be used in conjunction with GIS mapping software to find pockets of Auckland where the expansion of bicycle infrastructure would most likely boost the bicycle mode share.
The ingeniousness of ensemble machine learning models lies in their aggregating the predictions of a plethora of diverse weak learners. Such models, owing to the heterogeneity of their constituent elements, do a good job at integrating complex and nonlinear interactions among the variables into making their predictions, i.e., they reduce "bias," while mitigating the risk of overfitting the data, i.e., they reduce "variance," too, to use machine learning lingo. (A note: "Overfitting" manifests itself when a model estimated using the training dataset performs significantly worse on new data, such as the test dataset, than it did on the training dataset because it learned the intricacies in the training dataset "too" well. That is, the estimated model conforms so tightly to the idiosyncrasies of the training data that it misses the general patterns in data it was not trained on.)
The learners constituting a random forest regressor are "decision trees." A decision tree begins, at its top "node," by bifurcating the records in the full dataset in a way that minimizes the weighted average statistical variance of the target variable in the resulting two sub-sets, or sub-nodes. For those of you without a background in statistics, just think of "statistical variance" as variation, and you will not be too far off the mark. (Alternatively, one can have the decision tree split each node in such a way that minimizes the weighted average absolute errors of the sub-nodes. I also do this.) It does the same on each of the sub-sets, and keeps drilling down farther and farther until either it is directed to stop or the statistical variance in the target variable cannot be reduced further. In effect, a decision tree groups the values of the features in such a way that these values lead to very similar, if not identical, outcomes for the target variable. Once the decision tree has been constructed, one can feed a new set of values for the features into its top node. The tree will then run the set down the appropriate track of branches to arrive at a "leaf" containing similar values for the features, along with a corresponding prediction for the target variable. (A "leaf" is a node that cannot be split further.) A random forest regressor simply takes the mean of the predictions of the myriad decision trees constituting it in order to predict the target variable based on the values of the features inputted into it. Also, each decision tree in the forest uses a randomly selected sub-sample of the full set of features.
Obviously, I cannot go beyond the broad intuition of a random forest regressor here. For the details, I strongly recommend an excellent book on machine learning, An Introduction to Statistical Learning with Applications in Python by Gareth James et al., which you can download for free here. You will find Chapter 8 particularly useful.
For training and testing the random forest regressor, I used the same training and testing datasets I used for the Tobit and spatial autocorrelated regression approaches, but with two modifications. (Recall that I had to delete several "island" area units from these datasets, especially the test dataset.) First, I disaggregated the cycle network density variable into its eight constituent types of infrastructure: shared zone (SZ); local area traffic management (LATM); off-road shared path (ORSP); on-road protected cycle lane (ORPCL); on-road unbuffered cycle lane (ORUBCL); off-road cycleway (ORC); on-road buffered cycle lane (ORBCL); and off-road trail (ORT). I did this to ascertain whether certain types of infrastructure might be inducing commuters to bicycle to work more than others. Recall from our previous analyses that there is no evidence that the cycle network, considered in its entirety and without regard to the differences in types of infrastructure, is promoting commuter cycling. Perhaps one or more particular types of infrastructure are.
Second, I threw infrastructure classified as shared zone (SZ) out of the dataset. One reason is its minuscule coverage, as you can see in the map to the left. Another, more important reason is that shared zones are nothing more than side streets in urban retailing and dining areas where very slow-moving cars, pedestrians, and bicyclists intermingle. They have no conceivable connection with commuting.
Other than the two modifications I mention here, I used the exact same training and testing datasets as before.
Before running the random forest regressor on the training dataset, I first had to "tune" the "hyperparameters." Hyperparameters are user-specified values governing the structure of the forest as a whole and certain parameters of each decision tree within it. The algorithm does not determine these values; the user does. One hyperparameter is the number of trees in the forest. Too many trees significantly slows the computational process; too few means the regressor does not detect many of the intricacies in the relationships among the variables. Another is the number of randomly selected features that each tree uses. Empirical studies have indicated that log base 2 of the total number of features tends to be optimal. (In our case, that would be 3 features, rounded to the nearest integer.) Another is how deep each tree is allowed to grow. Too little depth means the trees do not learn much from the data; too much depth opens the door to overfitting. A fourth hyperparameter is the maximum number of leaf nodes each decision tree is allowed to grow. Allowing too many leaf nodes risks overfitting the data; allowing too few means the trees fail to capture certain complex interactions among the variables. Finally, the last hyperparameter of concern to us is the minimum number of samples (or records) that must be present in a node for it to be split. The greater this value, the lower the risk of overfitting, but at the cost of the trees learning less from the training dataset.
Fortunately, scikit-learn, a Python library, includes a function for randomized search with cross-validation over a parameter grid, which allowed me to try out combinations of user-inputted hyperparameters. You can see the selections I made for each hyperparameter in the code block at the top left. I used five-fold cross-validation to improve the robustness of the best-performing forest, as estimated by the algorithm. (For background on hyperparameter tuning and cross-validation, please see the Introduction to Statistical Learning book.)
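The tuning step looks roughly like the following sketch. The grid values and the toy training data here are illustrative stand-ins, not my actual selections, which appear in the code block mentioned above.

```python
# A hypothetical sketch of hyperparameter tuning with scikit-learn's
# RandomizedSearchCV and five-fold cross-validation. Grid values and
# (X_train, y_train) are placeholders, not the actual dataset.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(0)
X_train = rng.uniform(0, 1, size=(150, 10))   # 10 features, as in the text
y_train = X_train[:, :3].sum(axis=1) + rng.normal(0, 0.05, 150)

param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [5, 7, 9],
    "max_leaf_nodes": [8, 12, 16],
    "min_samples_split": [2, 4, 6],
    "max_features": ["log2"],   # log base 2 of the feature count, per the text
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_grid,
    n_iter=4,                          # random combinations to evaluate
    cv=5,                              # five-fold cross-validation
    scoring="neg_mean_squared_error",  # minimize squared prediction error
    random_state=0,
)
search.fit(X_train, y_train)
print(search.best_params_)
```

`best_params_` holds the combination with the best average score across the five folds; `best_estimator_` is the forest refit on the full training set with those values.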
Running the random search over the hyperparameter grid with five-fold cross-validation yielded an optimal number of decision trees of 300, a maximum tree depth of 9, a maximum number of leaf nodes of 12, and a minimum allowable number of samples for node-splitting of 4. (I used the exact same training dataset as I used in my spatial autoregressive model, which, remember, excludes "island" area units.) You can see a bar chart of the estimated feature importances at the left. The table of feature importances shows the portion of the variation in the bike mode share attributable to each feature; it does not report signs. (Again, machine learning models fall short when it comes to providing explanations.) Notice that the three most influential infrastructure interventions are local area traffic management, off-road shared paths, and on-road protected cycle lanes. Do not feel befuddled that off-road cycleways (ORC) appear near the bottom of the list. I suspect that their having limited access points, spread relatively far apart, accounts for this. Notice that on-road protected cycle lanes (ORPCL) and off-road shared paths (ORSP), which appear higher on the list, offer similar physical separation from motor vehicle traffic but are accessible at any intersecting side street.
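Feature importances of the kind tabulated here can be read straight off a fitted forest. The sketch below uses hypothetical data and feature names; only the tuned hyperparameter values come from the text.

```python
# A sketch of extracting feature importances from a fitted forest.
# Data and the target relationship are synthetic placeholders; the
# hyperparameters match the tuned values reported in the text.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

feature_names = ["LATM", "ORSP", "ORPCL", "ORUBCL", "ORC", "ORBCL", "ORT",
                 "dist_cbd", "uni_share", "blue_collar"]
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(300, len(feature_names)))
y = 0.4 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(0, 0.05, 300)

forest = RandomForestRegressor(n_estimators=300, max_depth=9,
                               max_leaf_nodes=12, min_samples_split=4,
                               max_features="log2", random_state=0).fit(X, y)

# Importances sum to 1: each is the share of total impurity (variance)
# reduction attributable to splits on that feature. They carry no sign.
importances = pd.Series(forest.feature_importances_,
                        index=feature_names).sort_values(ascending=False)
print(importances)
```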
In terms of model predictive power, the random forest regressor trained to minimize squared error with each splitting of a node and applied to the test dataset rendered the following metrics of predictive power:
Mean Squared Error: 0.00038
Mean Absolute Error: 0.01402
Notice that, on the squared error criterion, the random forest regressor significantly outperforms the simple regression and spatial Tobit regression models and ties the spatial autoregressive one. The spatial Tobit model beats it on the absolute error criterion. However, training the forest to minimize mean absolute error with each split of a node yields the following metrics of predictive power:
Mean Squared Error: 0.00046
Mean Absolute Error: 0.01037
Notice that the random forest trained in this manner predicts more accurately than all the standard regression models based on mean absolute error.
On balance, I would conclude that the random forest regressor performs better at predicting commuters' propensity to bicycle to work in the context of the type of data we have here. Based on the squared error criterion, it performs at least as well at predicting the target variable as any traditional regression approach applied here, and on the mean absolute error criterion, it does better.
Our random forest regressor does not directly reveal whether each type of bicycle infrastructure positively promotes commuter cycling and, if so, to what degree. It only tells us the relative importance of each feature. We have an easy workaround at hand, though. For each type of infrastructure intervention, I simply added 100 meters of that infrastructure to each area unit's record in the complete dataset (training and testing datasets combined), had the trained bot estimate the new bicycle mode share, and subtracted the area unit's actual mode share. The calculation estimates what the change in the area unit's mode share would be if 100 meters of that type of infrastructure were added within its boundaries. Notice that 100 meters are added to contiguous area units, too, so it is not a trivial addition.
I next calculated the median change in the mode share after each addition of infrastructure and obtained the results seen in the table to the left. Notice that all the results are positive, with on-road protected cycle lanes having the strongest median impact and off-road trails the weakest. It appears that all types of infrastructure interventions tend to boost ridership. Again, do not feel perplexed at the relatively low impact of off-road cycleways (ORC). Please see my explanation for this in the last section.
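The counterfactual calculation can be sketched as follows. The DataFrame, column names, units, and fitted forest here are all hypothetical placeholders; the logic (bump one infrastructure column, re-predict, subtract the baseline, take the median) is what matters.

```python
# A sketch of the 100-meter counterfactual: for each infrastructure
# type, add 100 m to every area unit's record, re-predict the bicycle
# mode share, and take the median change. Data and units are toy
# placeholders (lengths assumed stored in km here).
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

infra_cols = ["LATM", "ORSP", "ORPCL", "ORUBCL", "ORC", "ORBCL", "ORT"]
other_cols = ["dist_cbd", "uni_share", "blue_collar"]

rng = np.random.default_rng(3)
df = pd.DataFrame(rng.uniform(0, 1, size=(300, 10)),
                  columns=infra_cols + other_cols)
y = 0.02 * df["LATM"] + 0.01 * df["ORSP"] + rng.normal(0, 0.002, 300)
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(df, y)

baseline = forest.predict(df)
median_change = {}
for col in infra_cols:
    bumped = df.copy()
    bumped[col] += 0.1     # +100 m under the assumed km units
    median_change[col] = np.median(forest.predict(bumped) - baseline)
print(pd.Series(median_change).sort_values(ascending=False))
```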
One encouraging result is that the local area traffic management (LATM) infrastructure intervention does not lag too far behind the most potent one, that is, on-road protected cycle lanes. It appears to be about 25% less effective at boosting the bicycle mode share. Recall that LATM involves the placement of traffic-calming features, such as speed bumps, chicanes, and road narrowing, to reduce motor vehicle speeds on residential side streets. Compared to most other interventions, this type of infrastructure intervention is quite inexpensive. Its advantages over other on-road interventions include its not impeding the flow of motorized vehicle commuters using arterial roadways, which has a positive impact on fuel economy and commuting time. It offers notable benefits in other areas, too, such as enhancing pedestrian safety in residential neighborhoods and making them safer for children as they walk or bicycle to school or play.
With 7364 area units in the dataset after the previous cullings, adding 100 meters of bicycle infrastructure to each area unit would be a massive undertaking. Yet the overall bicycle mode share would increase only from roughly 1.2% to somewhere between 1.8% and 2.1%: such a minuscule benefit for such a vast expenditure of public funds, a rat-hole for taxpayer money if there ever was one. Fortunately, our trained random forest regressor, when used in conjunction with GIS mapping techniques, allows the targeting of a much smaller quantity of public funds in a way that maximizes results per dollar spent.
In order to identify which area units might see the highest uptake in commuter cycling by adding 100 meters of LATM features to the area unit and contiguous ones, I used the trained bot in conjunction with QGIS to map these area units. You can see the results in the map to the left, which shows the top quintile of the predicted changes in the bicycle mode share. Clusters of such area units clearly manifest themselves. One rather expansive cluster is situated across the neck of Waitemata Harbor directly north and northwest of the CBD. Smaller clusters emerge in the southeastern portion of Te Atatu Peninsula and in the far southwest portion of the urbanized area, near Waima. I would never assert that my dataset is sufficiently developed and my analysis strong enough to justify spending public funds on placing LATM infrastructure interventions in these areas. I am merely attempting to show how machine learning in conjunction with GIS techniques offers potential insights to transportation planners and civic associations focused on promoting commuter bicycling.
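Selecting the top quintile for mapping reduces to a percentile cutoff on the predicted changes. A sketch, assuming a Series `delta` of predicted mode-share changes indexed by area-unit ID (toy values here); the export path and ID format are hypothetical.

```python
# A sketch of selecting the top quintile of predicted mode-share gains
# for mapping in QGIS. `delta` is a toy stand-in for the per-area-unit
# predicted changes; the CSV could then be joined to an area-unit layer.
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
delta = pd.Series(rng.normal(0.001, 0.0005, 300),
                  index=[f"AU{i:04d}" for i in range(300)])

threshold = delta.quantile(0.8)          # 80th-percentile cutoff
top_quintile = delta[delta >= threshold]
print(len(top_quintile), "area units in the top quintile")
# top_quintile.to_csv("latm_top_quintile.csv")  # hypothetical export for QGIS
```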
I also mapped the top quintile of predicted changes in the bicycle mode share resulting from the addition of 100 meters of on-road protected cycle lanes to each area unit. This map appears to the left, also. The clusters differ in certain respects from the ones seen on the LATM map, especially in the eastern suburbs. Bear in mind that the ORPCL intervention tends to choke off rush-hour automobile traffic flows on arterial roadways, which reduces fuel economy, worsens urban air quality, and increases commute times. Nor does it offer the ancillary benefits of the LATM intervention. Therefore, although it appears to be more effective at promoting commuter cycling than the LATM intervention is, we have reason to question its benefits relative to the LATM intervention.