One of the defects afflicting the plain vanilla regression model is that it does not take into account the spatial interactions between the variables. For example, the regression equation undoubtedly omits certain explanatory variables, such as the hilliness of the terrain within an area unit and its surrounding ones, that would likely affect commuters' willingness to bicycle to work. Such omitted variables typically exhibit spatial autocorrelation, meaning that area units that are contiguous or near to one another exhibit similar values for the variable. For instance, if one area unit has above-average hilliness of terrain, then, most likely, so do contiguous ones. If excessive hilliness deters bicycle commuting, then the regression residuals of contiguous hilly area units should be biased toward negative values. The spatial autocorrelation of omitted variables thus manifests itself in spatially autocorrelated regression error terms. Indeed, the map to the top left shows spatial autocorrelation galore in the regression residuals from our plain vanilla ordinary least squares regression equation run on the test data: positive residuals cluster together, as do negative ones.
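The clustering of residuals described above is usually summarized with a single statistic, global Moran's I. The following is a minimal sketch of that statistic, not the calculation performed for this analysis. The six-unit chain of area units and both residual vectors are hypothetical toy values chosen only to show how clustered and alternating patterns differ in sign.

```python
import numpy as np

def morans_i(values, weights):
    """Global Moran's I of `values` under the spatial weights matrix `weights`."""
    z = np.asarray(values, dtype=float)
    w = np.asarray(weights, dtype=float)
    z = z - z.mean()                 # deviations from the mean
    n = z.size
    s0 = w.sum()                     # sum of all weights
    return (n / s0) * (z @ w @ z) / (z @ z)

# Hypothetical toy layout: six area units in a row, each contiguous
# only with its immediate neighbors.
n_units = 6
w = np.zeros((n_units, n_units))
for i in range(n_units - 1):
    w[i, i + 1] = w[i + 1, i] = 1.0

# Hypothetical residuals: one clustered pattern, one alternating pattern.
clustered = np.array([-3.0, -2.0, -1.0, 1.0, 2.0, 3.0])
alternating = np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0])

print("clustered residuals:  ", morans_i(clustered, w))    # positive
print("alternating residuals:", morans_i(alternating, w))  # negative
```

A clearly positive value indicates that similar residuals sit next to one another, which is the pattern the residual map displays.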
Another spatial relationship absent from our simple regression model is the positive impact that cycle infrastructure in contiguous area units likely has on the propensities of commuters from an area unit to bicycle to work. After all, these commuters would ride through some of this infrastructure. We need a "spatially lagged" variation of our bicycle network density variable to capture the influence of cycling infrastructure present in contiguous area units on the bike mode share variable.
Using GeoDa, a spatial statistics software package, I analyzed the data using a hybrid spatially autocorrelated regression model that accounts for both the spatial autocorrelation of regression error terms and the impact of bicycle infrastructure in contiguous area units. You see the model equation to the left. I discussed the construction of the spatial weights matrix, W, in The Transformed Data webpage. You can think of WZ as the average total length of bicycle infrastructure in contiguous area units, where contiguity is defined as queen order of 2 including the lower order of 1. I ran the model on the exact same training dataset used to estimate the plain vanilla regression model, except for the deletion of "island" area units. (A note: "Island" area units are those that do not have a queen order of 1 contiguous neighbor. If an area unit does not have any queen order of 1 neighbors, it cannot have any queen order of 2 neighbors, either, given the definition of queen order of 2. Nine such island area units existed in the training dataset. They had to be removed because otherwise nine rows of W would have consisted solely of zeros, and an all-zero row has no neighbors to average over, which would raise obvious mathematical problems.) The results appear to the left.
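To make the roles of W, WZ, and the island units concrete, here is a rough sketch in Python with NumPy. It is not the GeoDa workflow used for the analysis. It assumes a first-order queen adjacency matrix is already available, and the five-unit layout, the infrastructure lengths, and the helper names (`order2_inclusive`, `drop_islands`, `row_standardize`) are all hypothetical.

```python
import numpy as np

def order2_inclusive(adj1):
    """Units reachable in one or two contiguity steps, excluding the unit itself."""
    a = (np.asarray(adj1) > 0).astype(int)
    reach = ((a + a @ a) > 0).astype(float)   # order 1 plus order 2
    np.fill_diagonal(reach, 0.0)              # a unit is not its own neighbor
    return reach

def drop_islands(adj):
    """Remove units with no neighbors; return the reduced matrix and a keep mask."""
    keep = adj.sum(axis=1) > 0
    return adj[np.ix_(keep, keep)], keep

def row_standardize(adj):
    """Scale each row to sum to 1 so that W @ z is a neighborhood average."""
    return adj / adj.sum(axis=1, keepdims=True)

# Hypothetical first-order adjacency: units 0-1-2-3 form a chain, unit 4 is an island.
adj1 = np.zeros((5, 5), dtype=int)
for i, j in [(0, 1), (1, 2), (2, 3)]:
    adj1[i, j] = adj1[j, i] = 1

adj2, keep = drop_islands(order2_inclusive(adj1))
W = row_standardize(adj2)

# Hypothetical meters of cycle infrastructure per area unit.
Z = np.array([100.0, 200.0, 300.0, 400.0, 50.0])[keep]
lag = W @ Z   # average infrastructure length among each unit's neighbors

print("units kept:", keep)
print("spatial lag WZ:", lag)
```

Dropping the island before row-standardizing avoids dividing an all-zero row by zero, which is the mathematical problem the note above alludes to.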
You likely find it puzzling that neither the variable for the bicycle network density within the area unit, "CycNetDens", nor for the average density in contiguous area units, "LagCND", seems to have any impact at all on the bicycle mode share variable. (Two notes here: The regression diagnostics seen to the left indicate a set of issues that should raise questions about model adequacy. Also, as previously mentioned, the network does not appear to integrate residential areas with centers of employment well. A network better designed to serve this function may show a stronger impact on the bicycle mode share.) I attempt workarounds in response to this puzzle next.
Perhaps something to do with the data compilation process might account for our inability to find an influence of accessibility to bicycle infrastructure on the bike mode share variable. Recall that the dataset excludes cycle infrastructure categorized as "local area traffic management," "on-road buffered cycle lane," "on-road unbuffered cycle lane," and "shared zone" on the grounds that such infrastructure does not provide physical separation from the motor vehicle traffic and thus might lack attractiveness to potential bicycle commuters. Infrastructure classified as "off-road trail" was also excluded because it has rugged terrain and is designed for recreational users. Perhaps all this was a mistake. I therefore expanded the training dataset to include all infrastructure, no matter how categorized. Two maps appearing to the left provide a comparison.
Before running the spatially autocorrelated regression model on the expanded dataset, I chose to capitalize on the calculation of the spatially lagged bicycle network density variable by adding it to the (unlagged) cycle network density variable to create a new variable, broad cycle network density, which I abbreviate as "BroadCND". The value of BroadCND for an area unit is the sum of two quantities: the meters of cycle infrastructure (no matter how classified) within the area unit, including the 20 meter buffer, and the average meters of cycle infrastructure (again, no matter how classified) in contiguous area units, including their 20 meter buffers. Contiguity, as usual, is defined as queen order of 2 including the lower order of 1.
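The construction of BroadCND can be sketched in a few lines. This is only an illustration under stated assumptions: the three-unit weights matrix and the infrastructure lengths are hypothetical, the function name `broad_cnd` is mine, and W is assumed to be row-standardized so that multiplying by it yields the neighborhood average.

```python
import numpy as np

def broad_cnd(own_meters, weights):
    """Own infrastructure length plus the average length in contiguous units."""
    own = np.asarray(own_meters, dtype=float)
    w = np.asarray(weights, dtype=float)
    return own + w @ own

# Hypothetical row-standardized weights: three mutually contiguous area units.
W = np.array([
    [0.0, 0.5, 0.5],
    [0.5, 0.0, 0.5],
    [0.5, 0.5, 0.0],
])

# Hypothetical meters of cycle infrastructure in each unit (buffer included).
own = np.array([1000.0, 0.0, 500.0])

broad = broad_cnd(own, W)
print(broad)
```

Note that the second unit has no infrastructure of its own yet still receives a nonzero BroadCND, because its neighbors' infrastructure counts toward its accessibility.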
With the new BroadCND variable, I no longer used either the cycle network density variable or its lagged variant in the regression model. The new spatially autocorrelated regression equation appears to the left. Notice that it, like the previous iteration, attempts to capture the spatial autocorrelation of the regression residuals. Running the new regression model on the expanded training dataset yielded the results that appear to the left.
Hmm, still no sign that accessibility to bicycle infrastructure is encouraging commuters to bicycle to work. The big drivers appear to be distance to the CBD and the share of commuters holding a university degree. The regression diagnostics indicate issues with the model, though. Let's examine some data visualizations to see if our spatially autocorrelated regression model is fit for purpose before even considering running it on the test dataset.
As you can discern from the data visualizations to the left, our spatial autoregressive model is not quite up to the task. Notice that the regression residuals have a pronounced skewness: they are not normally distributed. To make matters worse, the residuals are correlated with the value of the bike mode share variable: they are not independently distributed.
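The two problems named above can also be checked numerically rather than by eye. The sketch below computes the skewness of the residuals and their correlation with the observed values. The observed and predicted bike mode shares are hypothetical toy numbers, not values from the dataset, and the function names are my own.

```python
import numpy as np

def sample_skewness(x):
    """Third standardized moment; roughly 0 for a symmetric distribution."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    return (d ** 3).mean() / (d ** 2).mean() ** 1.5

def residual_checks(observed, predicted):
    """Skewness of the residuals and their correlation with the observed values."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    resid = observed - predicted
    return {
        "skewness": sample_skewness(resid),
        "corr_with_observed": np.corrcoef(resid, observed)[0, 1],
    }

# Hypothetical bike mode shares: most units near zero, one unit far above.
observed = np.array([0.00, 0.01, 0.01, 0.02, 0.03, 0.12])
predicted = np.array([0.02, 0.02, 0.02, 0.03, 0.03, 0.05])

checks = residual_checks(observed, predicted)
print(checks)
```

In this toy case both numbers come out positive: the model underpredicts the one high-share unit, which produces a long right tail in the residuals and ties their size to the observed value.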
Notice, too, that the logic of the model allows for negative predicted values for the bike mode share variable. You can see this in the bottom visual to the left. This is the same absurdity we encountered in the plain vanilla regression model.
In light of these inadequacies, we can conclude that the spatially autocorrelated regression model is not quite fit for purpose for predicting the bike mode share among commuters with data like ours. We may not be ready yet, though, to throw up our hands and give up on traditional regression modeling to predict commuters' propensities to bicycle to work. There is one more variation on regression modeling, a Tobit regression model, that may work, especially when we combine it with certain features of the spatially autocorrelated regression model. We will try this in the next section. For now, though, it might be worth noting how well (or poorly) our estimated spatially autocorrelated regression model predicts the bike mode share values in the test data. The cycle network density variable is based on the all-encompassing categorization of bicycle infrastructure, which includes "local area traffic management," "on-road buffered cycle lane," "on-road unbuffered cycle lane," "off-road trail" and "shared zone" (despite doubts about how attractive some of this infrastructure is). The metrics of predictive power, when the estimated model is applied to the test dataset, are as follows:
Mean Squared Error: 0.00038
Mean Absolute Error: 0.01437
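For readers who want to reproduce these two metrics on their own predictions, here is a minimal sketch. The observed and predicted shares are hypothetical toy numbers, not the test data, so the printed values will not match the figures above.

```python
import numpy as np

def mean_squared_error(observed, predicted):
    """Average of the squared prediction errors."""
    err = np.asarray(observed, dtype=float) - np.asarray(predicted, dtype=float)
    return float(np.mean(err ** 2))

def mean_absolute_error(observed, predicted):
    """Average of the absolute prediction errors."""
    err = np.asarray(observed, dtype=float) - np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(err)))

# Hypothetical observed and predicted bike mode shares for four area units.
observed = np.array([0.00, 0.02, 0.05, 0.10])
predicted = np.array([0.01, 0.02, 0.03, 0.07])

mse = mean_squared_error(observed, predicted)
mae = mean_absolute_error(observed, predicted)
print("MSE:", mse)
print("MAE:", mae)
```

Because the MSE is in squared units, its square root is the easier number to compare with the MAE. For the reported MSE of 0.00038, that root is roughly 0.0195.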
(A note: Since the test dataset was only 20% of the full dataset, 393 area units were left stranded as "island" area units and had to be removed from the test dataset, leaving 1078 area units in the sample.) These results certainly represent an improvement over the predictive ability of the simple ordinary least squares regression model. Still nothing to write home about, though. Let's see if we can do better with a Tobit regression model, augmented by certain features of the spatially autocorrelated regression model.