One reason our previous regression models were so wanting is that the bike mode share variable is left-censored: many of its observed values in the dataset are zero, which is also the value that it cannot conceptually drop below. There cannot be a negative number of commuters who bicycle to work; therefore, there cannot be a negative bicycle mode share. However, the two regression models we have used so far both allow for negative predicted values for the bike mode share variable, the dependent variable in the regression equation.
A Tobit regression model is designed for censored data like ours. (Yes, the T is capitalized. "Tobit" is an amalgam of "probit" and the last name of the creator of the Tobit regression model, James Tobin.) The model version appropriate to a left-censored dependent variable appears to the left. Notice that it explicitly censors the value of the dependent variable at zero. Maximum likelihood estimation is used to estimate the model parameters.
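For readers who cannot see the figure, the standard textbook form of a Tobit model with left-censoring at zero can be written as below. The notation is an assumption on my part and may differ from the figure: $y_i^*$ is the latent (uncensored) bike mode share, $y_i$ is the observed share, $x_i$ holds the explanatory variables, and $\beta$ and $\sigma$ are the parameters to be estimated.

```latex
y_i^* = x_i^\top \beta + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2)

y_i = \max(0,\; y_i^*)
```

The second line is the censoring step: whenever the latent value falls below zero, the observed value is recorded as zero.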
In applying the Tobit model to the training dataset, I used the "broad" cycle network density variable developed on the previous webpage. Recall that this variable equals the total meters of bicycle infrastructure within the area unit, including the 20-meter buffer, plus the average total meters of infrastructure in contiguous area units, with contiguity defined as queen contiguity of order 2, including the lower order of 1. The bicycle network is the full network: it includes all types of bicycle infrastructure, regardless of whether they provide a physical barrier between the cyclist and motor vehicle traffic.
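The definition above can be sketched in plain Python. This is an illustration only, not the code used to build the variable: the function names, the dictionary-based adjacency structure, and the four-unit example are all hypothetical, and the per-unit meter totals are assumed to already include the 20-meter buffer.

```python
def neighbours_up_to_order_2(adjacency):
    """Return, for each area unit, its queen neighbours of order 1 and 2.

    `adjacency` maps each unit id to the set of units it touches
    (first-order queen contiguity: a shared edge or a shared corner).
    """
    result = {}
    for unit, first_order in adjacency.items():
        reachable = set(first_order)
        for neighbour in first_order:
            reachable |= adjacency[neighbour]  # neighbours of neighbours
        reachable.discard(unit)  # a unit is not its own neighbour
        result[unit] = reachable
    return result


def broad_network_density(own_meters, adjacency):
    """Own infrastructure meters plus the mean meters in neighbouring units."""
    neighbours = neighbours_up_to_order_2(adjacency)
    broad = {}
    for unit, meters in own_meters.items():
        nearby = neighbours[unit]
        mean_nearby = (
            sum(own_meters[n] for n in nearby) / len(nearby) if nearby else 0.0
        )
        broad[unit] = meters + mean_nearby
    return broad


# Hypothetical example: four area units in a row, A - B - C - D.
adjacency = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
own_meters = {"A": 100.0, "B": 200.0, "C": 300.0, "D": 400.0}
broad = broad_network_density(own_meters, adjacency)
```

In this toy layout, unit A has B as a first-order neighbour and C as a second-order neighbour, so its broad value is its own 100 meters plus the average of 200 and 300.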
The results of applying the Tobit regression model to the training dataset appear to the left. (My thanks to Dr. Tetsugen Haruyama and his colleagues at the Kobe University Graduate School of Economics for providing the Py4etrics Python package, which allowed me to run the Tobit model on my data.) Oddly enough, we are still not obtaining a signal distinguishable from random noise for the bicycle network density variable. The distance to CBD and university degree share variables seem to be the main drivers of the bike mode share variable. The share of blue collar workers also seems to have an influence, though, surprisingly, a negative one. (My hunch is that blue collar workers are less likely than other workers to have access to changing facilities at their workplace, which makes them less likely to bicycle to work.)
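To give a feel for what the maximum likelihood estimation is doing, here is a library-independent sketch of the Tobit log-likelihood for a dependent variable left-censored at zero. It is not the Py4etrics implementation; the function names and the toy data are hypothetical. Uncensored observations contribute a normal density term, while observations at zero contribute the probability that the latent value is at or below zero.

```python
import math


def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def tobit_loglik(beta, sigma, X, y):
    """Log-likelihood of a Tobit model left-censored at zero."""
    total = 0.0
    for row, observed in zip(X, y):
        xb = sum(b * x for b, x in zip(beta, row))  # linear predictor
        if observed > 0:
            # Uncensored: density of a normal centred on the linear predictor.
            total += math.log(normal_pdf((observed - xb) / sigma) / sigma)
        else:
            # Censored at zero: probability that the latent value is <= 0.
            total += math.log(normal_cdf(-xb / sigma))
    return total


# Hypothetical toy data: an intercept column plus one regressor.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [0.0, 0.0, 1.0, 2.0]
ll_good = tobit_loglik([-1.0, 1.0], 1.0, X, y)  # parameters that fit the data
ll_bad = tobit_loglik([1.0, -1.0], 1.0, X, y)   # parameters with the wrong sign
```

An estimator searches for the `beta` and `sigma` that make this quantity as large as possible; in the toy data, the first parameter set scores a much higher log-likelihood than the second.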
Before placing our trust in the Tobit model (and in the results we have derived from it), we should run some simple diagnostics in the form of histograms and scatterplots. You can see them to the left.
It comes as a relief that the model is not predicting any negative values for the bicycle mode share variable (as we would expect). You can see this on the scatterplot of the actual and predicted bike share values. But take a look at the histogram of the regression residuals: they are very oddly (that is, not normally) distributed. Also, the scatterplot of the regression residuals against the bike mode share variable exhibits the same problem we have been seeing all along: the values of the residuals correlate strongly with the values of the bike mode share variable.
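One way to see why a Tobit model does not return negative predictions is to write out the expected value of the censored variable, E[y | x] = Φ(xβ/σ)·xβ + σ·φ(xβ/σ), which is positive for any linear predictor. The sketch below assumes predictions are formed this way; whether the plots on this page use exactly this quantity is an assumption. It also measures the residual pattern described above with a Pearson correlation. All names and numbers are hypothetical.

```python
import math


def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def tobit_expected_value(xb, sigma):
    """E[y | x] for a Tobit model left-censored at zero."""
    z = xb / sigma
    return normal_cdf(z) * xb + sigma * normal_pdf(z)


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


# Hypothetical linear predictors and observed shares, for illustration only.
linear_predictors = [-0.05, -0.01, 0.0, 0.02, 0.04]
sigma = 0.02
actual = [0.0, 0.0, 0.01, 0.03, 0.09]

predicted = [tobit_expected_value(xb, sigma) for xb in linear_predictors]
residuals = [obs - fit for obs, fit in zip(actual, predicted)]
r = pearson(actual, residuals)  # correlation between actual values and residuals
```

A value of `r` far from zero signals the same issue visible in the scatterplot: the model's errors vary systematically with the observed bike mode share.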
I believe it is time we gave up on regression techniques to predict the share of commuters from an area unit who choose to bicycle to work and move on to machine learning techniques. Given the nature of the data we are working with here, a traditional regression approach just is not fit for purpose. Before moving on though, just one last matter at hand: let us see how well our estimated Tobit regression model fits the same test data as we used for our spatially autocorrelated regression model:
Mean Absolute Error: 0.01141
Mean Squared Error: 0.00054
Compared to the spatially autocorrelated model, the Tobit model represents an improvement in terms of mean absolute prediction error. However, the spatially autocorrelated model performs better in terms of mean squared prediction error.
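It may seem odd that one model can win on mean absolute error and lose on mean squared error. The sketch below, which uses made-up numbers rather than the actual test data, shows how this happens: squaring penalizes a few large misses far more heavily than many small ones.

```python
def mean_absolute_error(actual, predicted):
    """Average of the absolute prediction errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mean_squared_error(actual, predicted):
    """Average of the squared prediction errors."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical test-set values, for illustration only.
actual = [0.00, 0.02, 0.05, 0.10]
model_a = [0.01, 0.03, 0.04, 0.09]  # small error on every observation
model_b = [0.00, 0.02, 0.05, 0.07]  # perfect on three, one large miss

mae_a = mean_absolute_error(actual, model_a)
mae_b = mean_absolute_error(actual, model_b)
mse_a = mean_squared_error(actual, model_a)
mse_b = mean_squared_error(actual, model_b)
```

Here the second hypothetical model has the lower mean absolute error but the higher mean squared error, because its single large miss dominates once errors are squared.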