In this webpage, we will see why ordinary least squares multiple regression simply does not work well when analyzing data like ours: data where a preponderance of the observations on the dependent variable are zero and where the variable values are spatially autocorrelated. Not only do the standard assumptions behind ordinary least squares regression fail to hold, but the model also does a lousy job of predicting the bicycle mode share. In the following webpages, I attempt to overcome these problems using more sophisticated versions of regression modeling, namely a spatially autoregressive model and a Tobit regression model, but to little avail. Machine learning techniques work much better. First, though, let's take a look at multiple regression analysis.
Standard multiple regression modeling assumes a linear relationship between the dependent variable -- the bicycle mode share variable in our case -- and each explanatory variable: the bicycle network density, distance to CBD, university degree share, and blue collar share variables. The relationship is stochastic, with a random error term assumed to be independently and normally distributed with a mean of zero and constant variance.
The vector y contains the values of the dependent variable. The matrix X has a value of 1 in the first entry of each row, followed by the values of the explanatory variables in the remaining entries. X is postmultiplied by a vector beta, whose first element is the intercept term of the regression equation, followed by the coefficients for the explanatory variables. The vector epsilon, containing the random errors, appears at the end of the equation.
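The description above corresponds to the standard matrix form of the linear model:

```latex
y = X\beta + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2 I)
```

Here y is an n-by-1 vector of bicycle mode shares, X is an n-by-5 matrix (a column of ones followed by the four explanatory variables), beta is a 5-by-1 vector of the intercept and the coefficients, and epsilon is the n-by-1 vector of random errors.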
Although it is difficult to imagine that overfitting the data might be an issue in a linear model with a sample size of 7872 and only four explanatory variables, I nevertheless estimated the model on a randomly selected sub-sample, or training sample, containing 80 percent of the records. I held back the remaining 20 percent as a test sub-sample to ascertain model fit. I did this in order to facilitate comparisons with the machine learning approaches that I use later, where overfitting of the data can rear its head. (I use the exact same train-test split, with the exact same sub-samples, in the machine learning models.)
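Fixing the random seed is what allows the exact same sub-samples to be reused across all of the models. A minimal sketch of such an 80/20 split, using NumPy; the seed value and the way the records are indexed here are assumptions for illustration, not the actual code used:

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the identical split can be reused later

n = 7872                      # number of records in the full sample
indices = rng.permutation(n)  # shuffle the record indices once
n_train = int(0.8 * n)        # 80 percent for the training sub-sample

train_idx = indices[:n_train]  # rows used to estimate the model
test_idx = indices[n_train:]   # rows held back to assess predictive power

print(len(train_idx), len(test_idx))
```

Storing `train_idx` and `test_idx` (rather than re-splitting) guarantees that the later machine learning models see exactly the same training and test records.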
The results of the model estimation, using ordinary least squares and the Statsmodels library for Python, appear to the left. Do not feel befuddled, at least not yet, by the statistically insignificant coefficient estimate for the bicycle network density explanatory variable, our main variable of interest. Recall that it does not include bicycle infrastructure in contiguous area units. Also, the regression diagnostics visible in the report, along with the histogram and scatterplots that follow, show that a host of problems plague the plain vanilla multiple regression model when applied to data like ours. Notice that the regression residuals, or estimated random error terms, are not normally distributed: they show a visible rightward skew. In addition, they are positively correlated with the values of the dependent variable, which violates the model's assumption of independent errors. Most bizarrely of all, the model allows negative predicted values of the bicycle mode share variable -- as though the number of cyclists from an area unit could be negative! I will partially clean up these problems, along with the spatial correlation of variables problem, using further refinements of the model in later webpages.
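These pathologies can be reproduced in miniature. The sketch below uses NumPy on toy synthetic data -- a nonnegative response that is zero for many observations, loosely mimicking the bicycle mode share variable, not the actual census records -- and fits OLS by least squares. Nothing in the fitted model keeps predictions nonnegative, and the residuals end up correlated with the response:

```python
import numpy as np

# Toy stand-in for the data: a response that is exactly zero for half the
# observations and linear above a threshold (an assumption for illustration).
x = np.linspace(0.0, 1.0, 1001)
y = np.maximum(0.0, x - 0.5)

# OLS via least squares: design matrix with an intercept column of ones.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta
residuals = y - y_hat

# The fitted line dips below zero for small x...
print("minimum prediction:", y_hat.min())

# ...and the residuals are positively correlated with the response,
# violating the independence-of-errors assumption.
print("corr(residuals, y):", np.corrcoef(residuals, y)[0, 1])
```

The mass of zeros drags the fitted line downward at the low end, which is exactly the mechanism that produces negative predicted mode shares in the real model.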
Despite the inadequacies of the plain vanilla regression model, applying its estimated constant term and parameter values (coefficients) to the test data yields the following metrics of model predictive power:
Mean Squared Error: 0.00044
Mean Absolute Error: 0.01532
Given that the mean value of the bike mode share variable in the test dataset is 0.01215, the mean squared and mean absolute errors, each normalized by the mean value of the bike mode share variable, are 0.036 and 1.261, respectively. Not terribly impressive. Let's see if we can do better.
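The normalization is simple division of each error metric by the test-set mean of the bike mode share variable; as a quick check on the figures reported above:

```python
mse = 0.00044          # mean squared error on the test sub-sample
mae = 0.01532          # mean absolute error on the test sub-sample
mean_share = 0.01215   # mean bike mode share in the test dataset

print(round(mse / mean_share, 3))  # normalized MSE: 0.036
print(round(mae / mean_share, 3))  # normalized MAE: 1.261
```

A normalized mean absolute error above 1 means the model's typical prediction error exceeds the average value of the quantity being predicted.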