We begin our initial data exploration with a histogram of the bicycle mode share variable, which will be the dependent variable in all our analyses.
As you can see, the vast majority of the area units had a bicycle mode share of zero: nobody from these area units rode a bike to work on census day. Such a preponderance of zero entries suggests that standard, plain vanilla regression analysis is ill-suited to these data. The presence of spatial autocorrelation, discussed in the Transformed Data webpage, only adds to this problem. That didn't stop me from running a basic regression model as my first cut, though. 🙂 I display summary statistics for the bicycle mode share variable, along with the other variables, in the table below.
The table confirms the preponderance of zero values for the bicycle mode share variable, and for the bicycle network density variable, too. The other variables do not exhibit this characteristic. We examine their histograms in the next section.
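For readers who want to check how zero-heavy a variable is, here is a minimal sketch in plain Python. The numbers are made up for illustration and are not the census data; the function names `zero_share` and `summarize` are my own, hypothetical choices.

```python
from statistics import mean, median


def zero_share(values):
    """Return the fraction of observations that are exactly zero."""
    values = list(values)
    if not values:
        raise ValueError("values must not be empty")
    return sum(1 for v in values if v == 0) / len(values)


def summarize(values):
    """Return basic summary statistics plus the share of zero entries."""
    values = list(values)
    return {
        "n": len(values),
        "mean": mean(values),
        "median": median(values),
        "min": min(values),
        "max": max(values),
        "zero_share": zero_share(values),
    }


# Hypothetical bicycle mode shares for ten area units (NOT the real data).
bike_share = [0, 0, 0, 0, 0, 0, 0, 0.02, 0.05, 0.11]
summary = summarize(bike_share)
print(summary)
```

When the median is zero and the zero share is well above one half, as in this toy sample, a histogram will show the same spike at zero described above.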
The explanatory variable of primary interest to us is the bicycle network density variable. As we can discern from its histogram, the vast majority of the area units do not have any bicycle infrastructure at all. Again, this raises doubts about whether standard regression analysis is appropriate for these data.
The histograms for the remaining three explanatory variables appear below:
This variable appears to be approximately normally distributed.
This variable is roughly normally distributed.
Other than its slight rightward skewness, nothing noteworthy appears in this distribution.
Two puzzling entries appear in the correlation matrix. First, there is virtually no correlation between the bicycle mode share and bicycle network density variables. On second thought, though, this should not come as a surprise. Recall that the bicycle network density variable captures the total length of bicycle paths and lanes only within the area unit. However, the presence of bicycle infrastructure in nearby and contiguous area units likely also plays a role in a commuter's decision about whether to cycle to work. When we move on from a basic regression analysis to one that accounts for spatial autocorrelation, we will factor in this consideration explicitly. Second, the matrix shows a negative correlation between the bicycle mode share variable and the share of commuters who hold blue collar jobs. This is surprising because blue collar commuters would presumably be less likely to have to change clothing after arriving at work by bicycle than commuters holding non-blue collar jobs, so one might expect them to cycle more, not less. The matrix also shows a high degree of multicollinearity between the blue collar share variable and the other explanatory variables, however. This may account in part for the negative simple correlation between the bike mode share and blue collar share variables.
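To make the mechanics of a correlation matrix concrete, here is a small sketch that computes pairwise Pearson correlations in plain Python. The three columns and their values are hypothetical stand-ins, not the actual data, and `pearson` and `correlation_matrix` are names I made up for this illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return sxy / sqrt(sxx * syy)


def correlation_matrix(columns):
    """Return a nested dict: matrix[a][b] is the correlation of a with b."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names}
            for a in names}


# Hypothetical values for six area units (NOT the real data).
columns = {
    "bike_share": [0, 0, 0.02, 0, 0.05, 0.01],
    "network_density": [0, 0, 0, 1.2, 0, 0.4],
    "blue_collar_share": [0.40, 0.35, 0.20, 0.30, 0.15, 0.25],
}
matrix = correlation_matrix(columns)
for name, row in matrix.items():
    print(name, {k: round(v, 2) for k, v in row.items()})
```

Each entry is a simple, two-variable correlation: it ignores every other variable, which is why multicollinearity among the explanatory variables can produce a sign that looks wrong at first glance.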
Confirming the entry in the correlation matrix, the scatterplot of the bicycle mode share and bicycle network density variables shows no discernible correlation between the two. The black reference line was fitted using ordinary least squares. Notice that it is essentially horizontal.
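A near-zero correlation and a flat fitted line are two views of the same fact, since the least-squares slope equals the correlation times the ratio of the standard deviations of y and x. The sketch below fits a simple ordinary least squares line in plain Python; the data are invented so that the two variables are exactly uncorrelated, and `ols_line` is a hypothetical helper name.

```python
def ols_line(x, y):
    """Fit y = intercept + slope * x by ordinary least squares."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in x)
    if sxx == 0:
        raise ValueError("x must vary to fit a line")
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept


# Hypothetical, exactly uncorrelated data (NOT the real data).
network_density = [0, 0, 1, 1]
bike_share = [0, 0.1, 0, 0.1]
slope, intercept = ols_line(network_density, bike_share)
print(slope, intercept)
```

With a slope of zero, the fitted line sits at the mean of the dependent variable regardless of the value of x, which is the horizontal reference line seen in the scatterplot.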
Scatterplots pairing the bicycle mode share variable with the three control variables, each with a reference line fitted using ordinary least squares, appear below.
This section displays scatterplots depicting the multicollinearity between selected explanatory variables.
The scatterplots confirm the multicollinearity detected in the correlation matrix.
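One common way to put a number on multicollinearity, which is not part of the analysis above and is offered here only as an illustrative aside, is the variance inflation factor (VIF). In the special case of exactly two explanatory variables it reduces to 1 / (1 - r²), where r is their correlation. The sketch below implements only that two-variable case in plain Python; the function names and data are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return sxy / sqrt(sxx * syy)


def pairwise_vif(x, y):
    """VIF for a model with exactly two predictors: 1 / (1 - r^2)."""
    r = pearson(x, y)
    denom = 1.0 - r * r
    if denom <= 0:
        return float("inf")  # perfectly collinear predictors
    return 1.0 / denom


# Hypothetical predictor values (NOT the real data).
print(pairwise_vif([0, 0, 1, 1], [0, 1, 0, 1]))  # uncorrelated predictors
print(pairwise_vif([1, 2, 3], [2, 4, 6]))        # perfectly collinear
```

A value of 1 means the two predictors are uncorrelated, and the value grows without bound as they approach perfect collinearity. With more than two predictors, the VIF requires regressing each predictor on all the others, which this sketch does not attempt.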