The first decision in cleaning the data was which, if any, of the area units to exclude. I chose to exclude those that are not part of the contiguous urbanized area of Auckland. (Administratively, Auckland is a region of New Zealand that encompasses vast rural areas and small towns.) As a result, I excluded area units in satellite communities of Auckland, such as Swanson, Orewa, and Waiheke Island. After this first cleaning, 8053 area units remained in the data set. The map to the left displays them in random colors.
As explained in the Raw Data webpage, Statistics New Zealand sometimes enters a "C", meaning "confidential," in a field for an area unit's record. This occurred in several records in the raw journey-to-work dataset. It also occurred in some of the raw census dataset entries pertaining to my control variables. I therefore culled these records, too, when calculating the bicycle mode share for each area unit. After these cuts, 7872 area units remained in the dataset.
The bicycle mode share equals the ratio of the number of commuters in the area unit who bicycled to work on census day to the total number of respondents in the area unit who indicated that they commuted to work that day. I have mapped the compiled bicycle mode share data here. I have also mapped in green the bicycle infrastructure network, after the deletions I mention in the Raw Data webpage. Two maps appear, one at full scale and the other slightly zoomed in toward the central business district (CBD). The centroids of the area units, not their polygons, appear color-coded in accordance with their bicycle mode shares. I used Jenks natural breaks with five graduated categories to create the color scheme.
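The mode share calculation, together with the culling of confidential records, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout, the area unit ids, and the function name are hypothetical, and I assume here that a unit reporting zero commuters has no defined share.

```python
from typing import Optional

CONFIDENTIAL = "C"  # suppression marker described in the text


def bicycle_mode_share(cyclists: str, commuters: str) -> Optional[float]:
    """Return cyclists / commuters, or None when the record should be culled."""
    if CONFIDENTIAL in (cyclists, commuters):
        return None  # confidential field: drop the record
    total = int(commuters)
    if total == 0:
        return None  # assumption: no commuters means the share is undefined
    return int(cyclists) / total


# Hypothetical raw records: (area unit id, cyclists, all commuters on census day)
raw_records = [("AU-1", "12", "600"), ("AU-2", "C", "450"), ("AU-3", "3", "300")]

mode_shares = {}
for unit_id, cyclists, commuters in raw_records:
    share = bicycle_mode_share(cyclists, commuters)
    if share is not None:
        mode_shares[unit_id] = share

print(mode_shares)
```

In this sketch the record for AU-2 drops out because one of its fields is confidential, which mirrors the cut described above.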
Glancing at the full scale bicycle mode share map, one finds difficulty in discerning a relationship between the proximity of an area unit to the bicycle infrastructure network and the bicycle mode share of that area unit. Several area units with high bicycle mode shares are in close proximity to the network, but so are several that have low bicycle mode shares. We need a more careful investigation, one involving statistical analysis, before drawing any conclusions, however.
The map reveals one important pattern involving the bicycle mode share and the distance from the area unit to the CBD, indicated by the blue star. The slightly zoomed-in map displays this pattern more clearly. It also reveals a clustering of high mode share area units to the west and southwest of the central business district as well as northeast of the CBD in Devonport, across the mouth of the harbor. Another cluster appears directly east of the central business district in the vicinity of Mission Bay. Fewer clusters exist due south and southeast of the CBD.
Geographic patterning like this reveals one inadequacy of plain vanilla regression analysis to analyze our data, for the values in the dataset are obviously dependent on one another. Area units with high values for the bicycle mode share tend to cluster geographically, which violates the independence-of-observations assumption of standard-fare regression modeling. (That won't stop me from doing this, though, as the first cut in my analysis.) Fortunately, spatially autocorrelated regression modeling, which I also use later, overcomes this problem to some extent.
To quantify the accessibility of bicycle infrastructure to an area unit, I used QGIS to calculate how many meters of bicycle infrastructure existed within the boundary of each unit, including a 20 meter buffer extending beyond each boundary. Presumably, the greater the number of meters of infrastructure within the area unit and its 20 meter buffer, the more access commuters within the area unit have to bicycling infrastructure. (As you can see in the webpage devoted to the spatially autocorrelated regression analysis, I eventually include the meters of bicycle infrastructure in nearby area units.)
I included the 20 meter buffer when compiling the data for this network accessibility variable, which I call the network density variable, because the boundaries for the area units typically fall along road center lines. Consider an area unit having as one of its boundaries the center line of Road X. Also suppose that there is a cycle lane on the far side of Road X, just a few meters from the boundary. This cycle lane certainly serves the commuters within the area unit, but without the buffer, the meters of this cycle lane would not be included in the value of the network density variable for this area unit.
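The Road X example can be made concrete with a small Python sketch. It is not the QGIS procedure itself. To keep it self-contained, I assume a rectangular area unit and a square-cornered buffer (a true buffer has rounded corners and real area units are irregular polygons), and every coordinate and function name below is hypothetical. Segments are clipped to the box with the Liang-Barsky method.

```python
import math


def clipped_length(p0, p1, box):
    """Length of segment p0-p1 lying inside box = (xmin, ymin, xmax, ymax)."""
    (x0, y0), (x1, y1) = p0, p1
    xmin, ymin, xmax, ymax = box
    dx, dy = x1 - x0, y1 - y0
    t_enter, t_exit = 0.0, 1.0
    # Liang-Barsky: each (p, q) pair tests the segment against one box edge.
    for p, q in ((-dx, x0 - xmin), (dx, xmax - x0),
                 (-dy, y0 - ymin), (dy, ymax - y0)):
        if p == 0:
            if q < 0:
                return 0.0  # parallel to this edge and outside it
            continue
        t = q / p
        if p < 0:
            t_enter = max(t_enter, t)
        else:
            t_exit = min(t_exit, t)
    if t_enter > t_exit:
        return 0.0
    return (t_exit - t_enter) * math.hypot(dx, dy)


def network_meters(segments, box, buffer_m=20.0):
    """Meters of infrastructure inside the box grown by buffer_m on every side."""
    xmin, ymin, xmax, ymax = box
    grown = (xmin - buffer_m, ymin - buffer_m, xmax + buffer_m, ymax + buffer_m)
    return sum(clipped_length(a, b, grown) for a, b in segments)


# Hypothetical area unit whose eastern boundary (x = 500) is Road X's center line,
# with a 400 m cycle lane 4 m beyond that boundary, on the far side of the road.
area_unit = (0.0, 0.0, 500.0, 400.0)
cycle_lane = [((504.0, 0.0), (504.0, 400.0))]

without_buffer = network_meters(cycle_lane, area_unit, buffer_m=0.0)
with_buffer = network_meters(cycle_lane, area_unit, buffer_m=20.0)
print(without_buffer, with_buffer)
```

With no buffer the lane contributes nothing to the area unit, while with the 20 meter buffer its full length counts, which is the behavior the buffer is meant to produce.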
I chose not to normalize the value of the network density variable by the physical area of the area unit. Statistics New Zealand constructs the boundaries of the area units with the aim of keeping the resident population within a pre-specified range. As a result, there are some area units that are quite large in terms of physical area because they contain wide swaths of non-residential uses, such as parks and industrial areas. Bicycling infrastructure in such area units tends to be placed in close proximity to the residences there. Therefore, if I had normalized this variable by dividing by the square meters of the area unit, I would have generally understated how accessible bicycling infrastructure is to the residents of these physically large area units.
A full-scale map of the area units and a map zoomed in toward the center of Auckland, each color-coded in accordance with the value of the network density variable, appear here. I used Jenks natural breaks with five classes for the color coding. Unless you use the zoom function on your computer, it may be difficult to see in the full scale map how QGIS has captured the network densities. The zoomed-in map shows more clearly how it has.
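For readers unfamiliar with natural breaks classification, the idea is to split the sorted values into classes so that the total squared deviation of the values from their class means is as small as possible. The Python sketch below finds that split by brute force, which is workable only for a handful of values. QGIS uses a far more efficient routine, and its class boundaries may be reported differently. The function names and the sample values are hypothetical.

```python
from itertools import combinations


def sum_sq_dev(values):
    """Sum of squared deviations of the values from their mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def natural_breaks(values, n_classes):
    """Upper bound of each class in the partition of the sorted values
    that minimizes the total within-class squared deviation."""
    data = sorted(values)
    n = len(data)
    if not 1 <= n_classes <= n:
        raise ValueError("n_classes must be between 1 and the number of values")
    best_cost, best_edges = None, None
    # Try every way of placing n_classes - 1 cut points between sorted values.
    for cuts in combinations(range(1, n), n_classes - 1):
        edges = (0, *cuts, n)
        cost = sum(sum_sq_dev(data[a:b]) for a, b in zip(edges, edges[1:]))
        if best_cost is None or cost < best_cost:
            best_cost, best_edges = cost, edges
    return [data[b - 1] for b in best_edges[1:]]


# Hypothetical values of a mapped variable for ten area units.
sample = [0.0, 0.1, 1.0, 1.1, 2.0, 2.1, 4.0, 4.1, 8.0, 8.1]
upper_bounds = natural_breaks(sample, 5)
print(upper_bounds)
```

Each returned number is the largest value assigned to one of the five color classes.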
You can see the same issue arising here as we observed in the bicycle mode share maps: the values for the variable exhibit spatial autocorrelation, which violates the independence-of-observations assumption made by plain vanilla, ordinary least squares regression analysis. Again, though, a spatially autocorrelated regression model, which I also use, ameliorates this problem.
You undoubtedly noticed that some of the area units situated a short linear distance from the CBD are separated from it by water. These area units are situated on the southern shore of what the locals call "North Shore." Ferries run from three terminals on the southern shore of North Shore to the CBD, and commuters can take bicycles on these ferries. Therefore, the waterway does not impede commuter cycling. (It raises the issue of census data accuracy I mention in The Raw Data webpage, however.)
Commuting distance undoubtedly influences a commuter's decision about whether to bicycle to work. After all, cycling is a strenuous activity. Auckland, like most cities, exhibits centralization of employment, and approximately sixteen percent of Auckland's total employment occurs in the CBD. Therefore, the proximity of an area unit to the CBD should affect the bicycle mode share of the area unit, with closer area units exhibiting higher mode shares, and we need to control for this in our statistical analyses. Some complexity undoubtedly exists in this relationship, however. Bear in mind that in area units extremely close to the CBD or within it, walking is oftentimes a more attractive alternative to bicycling to work.
Using QGIS, I compiled data pertaining to the area units' proximity to the CBD and displayed the centroids of the area units, color coded in accordance with their linear distances to the CBD, on the map displayed here. By virtue of its very construction, the distance to CBD variable is highly spatially autocorrelated.
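The linear distance itself is simple to express. The sketch below assumes that the centroids and the CBD point are given in a projected coordinate system measured in meters, so that the straight-line formula applies directly. The coordinates and names are hypothetical, not values from my data.

```python
import math


def distance_to_cbd(centroid, cbd):
    """Straight-line distance between two (x, y) points, in coordinate units."""
    return math.hypot(centroid[0] - cbd[0], centroid[1] - cbd[1])


# Hypothetical projected coordinates in meters.
cbd_point = (0.0, 0.0)
centroids = {"AU-1": (3000.0, 4000.0), "AU-2": (-600.0, 800.0)}

distances = {unit: distance_to_cbd(xy, cbd_point) for unit, xy in centroids.items()}
print(distances)
```

Because neighboring centroids necessarily lie at similar distances from the same point, any variable built this way is spatially autocorrelated by construction.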
For reasons explained in The Raw Data webpage, I use, as a control variable, the fraction of the resident population holding a university degree as a proxy variable for the fraction of the resident population having an above-average willingness to engage in deferred gratification. Recall that I assume that enduring the pain in the present of cycling to work in order to obtain the distant benefits of a fit, healthy body and financial peace-of-mind generally requires a relatively high propensity to engage in deferred gratification.
The map shown in this section, constructed using QGIS, displays data on the fraction of the area unit’s resident population holding a bachelor's degree or higher. I use color-coded centroids of the area units, rather than their polygons. Notice that in addition to the issue of spatial autocorrelation, the problem of multicollinearity rears its head: there exists a visible negative correlation between the distance to the CBD variable and the proportion of university degree holders variable. Area units with relatively high values for the university degree holders variable tend to be situated relatively close to the CBD. I deal with this as efficaciously as I can in the analyses.
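One common way to put a number on this kind of visible relationship is the Pearson correlation coefficient. The Python sketch below is offered only as an illustration of that check, not as a description of the diagnostics I ran. The data values and names are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 values")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values for five area units.
distance_km = [1.0, 3.0, 5.0, 8.0, 12.0]
degree_share = [0.55, 0.48, 0.40, 0.30, 0.22]

r = pearson_r(distance_km, degree_share)
print(round(r, 3))
```

A coefficient near -1 between two explanatory variables would signal the kind of multicollinearity described above.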
One noteworthy pattern we see in the map of blue collar shares, when considered in conjunction with the map showing the share of the resident population holding a university degree, is the negative geographic correlation between the share working at blue collar jobs and the share of residents holding a university degree. This should come as no surprise. We see again the issue of multicollinearity among the explanatory variables in our regression modeling, which we will deal with later.
Not having to change clothes after arriving at work may encourage a commuter to bicycle in. Therefore, one useful control variable is the fraction of the commuters from an area unit who work in jobs that would likely not entail a change in clothing after bicycling in to work. I categorize such jobs as “blue collar,” if only because I desire a short variable name. The New Zealand Census provides data, at the area unit level, on people’s occupations. The occupational categories used are: managers; professionals; technicians and trades workers; community and personal service workers; clerical and administrative workers; sales workers; machinery operators and drivers; and labourers. Although it involved a considerable degree of speculation, I thought it best to classify technicians and trades workers, machinery operators and drivers, and labourers as “blue collar” and the rest not.
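The classification just described can be written out as a short Python sketch. The occupational categories and the three I treat as "blue collar" come from the text above. The counts, the area unit, and the function name are hypothetical.

```python
OCCUPATIONS = (
    "managers",
    "professionals",
    "technicians and trades workers",
    "community and personal service workers",
    "clerical and administrative workers",
    "sales workers",
    "machinery operators and drivers",
    "labourers",
)

# The three categories classified as "blue collar" in the text.
BLUE_COLLAR = {
    "technicians and trades workers",
    "machinery operators and drivers",
    "labourers",
}


def blue_collar_share(counts):
    """Fraction of an area unit's workers in blue collar categories, or None."""
    unknown = set(counts) - set(OCCUPATIONS)
    if unknown:
        raise ValueError(f"unknown occupational categories: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        return None  # assumption: the share is undefined with no workers
    blue = sum(n for occupation, n in counts.items() if occupation in BLUE_COLLAR)
    return blue / total


# Hypothetical counts for one area unit.
example_counts = {
    "managers": 100,
    "professionals": 150,
    "technicians and trades workers": 60,
    "community and personal service workers": 40,
    "clerical and administrative workers": 60,
    "sales workers": 50,
    "machinery operators and drivers": 20,
    "labourers": 20,
}

share = blue_collar_share(example_counts)
print(share)
```

Changing the membership of the BLUE_COLLAR set is all it would take to test a different, equally speculative, classification.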
In the map seen here, the centroids of the area units are color coded in accordance with the share of commuters from the area unit whom I classified as “blue collar.” Presumably, the higher this share, the higher the fraction of commuters who bicycle to work should be. (On the other hand, such workers are less likely than office workers to have access to changing facilities after arriving at work, so they may be less inclined to bicycle in. We'll see what the data analysis reveals.)
One of the old-fashioned analyses I undertook was a spatial autoregressive, or spatially autocorrelated, regression model, which posits interactions between variables based on their relative locations in space. To do this, I first had to construct a spatial weights matrix that characterizes whether two area units are contiguous, based on a contiguity criterion. I chose "queen" contiguity (as in how the queen piece moves on a chessboard). After conducting some data exploration, I opted to kick the order of queen contiguity up to queen order of 2 contiguity, including the lower order of 1. Two area units are queen order of 1 contiguous if and only if they share a common boundary or vertex. The area units shaded in darker green in the image seen here are queen order of 1 contiguous with area unit 7002470, a coastal area unit, which is shaded in red. An area unit is queen order of 2 contiguous with another area unit if and only if it is not order of 1 contiguous with it but has an order of 1 neighbor that is order of 1 contiguous with the latter area unit. The area units shaded in light green in the image seen here are queen order of 2 contiguous with area unit 7002470. Under the contiguity criterion I have selected, all area units shaded in green are contiguous to area unit 7002470.
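The contiguity criterion can be illustrated on a toy layout. In the Python sketch below, a 5 by 5 grid of square cells stands in for the area units, which is an assumption made purely for illustration, since real area units are irregular polygons. Two cells are queen order of 1 contiguous when they touch along an edge or at a corner, and the second function adds the neighbors of neighbors to obtain order of 2 contiguity including the lower order. All names are hypothetical.

```python
def queen_order_one(cells):
    """Map each (row, col) cell to the set of cells sharing an edge or corner."""
    cell_set = set(cells)
    neighbors = {}
    for row, col in cell_set:
        neighbors[(row, col)] = {
            (row + dr, col + dc)
            for dr in (-1, 0, 1)
            for dc in (-1, 0, 1)
            if (dr, dc) != (0, 0) and (row + dr, col + dc) in cell_set
        }
    return neighbors


def queen_order_two_inclusive(order_one):
    """Order 1 neighbors plus neighbors of those neighbors, excluding the unit."""
    result = {}
    for unit, first_order in order_one.items():
        reach = set(first_order)
        for neighbor in first_order:
            reach |= order_one[neighbor]
        reach.discard(unit)  # a unit is never its own neighbor
        result[unit] = reach
    return result


grid = [(row, col) for row in range(5) for col in range(5)]
first = queen_order_one(grid)
second = queen_order_two_inclusive(first)

print(len(first[(2, 2)]), len(second[(2, 2)]))  # center cell
print(len(first[(0, 0)]), len(second[(0, 0)]))  # corner cell
```

As with the coastal area unit in the image, a cell on the edge of the grid ends up with fewer neighbors than one in the interior.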
In spatially autocorrelated regression equations, a spatial weights matrix expresses contiguity. Before row standardization, an entry of 1 appears in a cell where the area units represented by the row and column are contiguous. The orange shaded image shows selected entries from the row for area unit 7002470. Each row in the spatial weights matrix is then standardized by dividing each entry by the number of contiguous neighbors the area unit represented by the row has. The green shaded image seen here shows the same selected entries after standardization. (Note from the map that area unit 7002470 has nine contiguous neighbors under queen order of 2 including the lower order of 1 contiguity and that 1/9 ≈ 0.111.)
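Row standardization is easy to reproduce. The Python sketch below stores each row of the weights matrix as a dictionary rather than a full matrix, which is a choice of mine for brevity. The area unit id and its count of nine neighbors come from the text above, while the neighbor ids and the function name are hypothetical placeholders.

```python
def row_standardize(neighbors):
    """Turn neighbor sets into row-standardized weights.

    neighbors maps each unit to the set of units contiguous with it.
    Each neighbor receives 1 / (number of neighbors), so every non-empty
    row sums to 1.
    """
    weights = {}
    for unit, contiguous in neighbors.items():
        if contiguous:
            weights[unit] = {nb: 1.0 / len(contiguous) for nb in contiguous}
        else:
            weights[unit] = {}  # a unit with no neighbors keeps an empty row
    return weights


# Area unit 7002470 has nine contiguous neighbors; their ids here are made up.
contiguity = {"7002470": {f"neighbor-{i}" for i in range(1, 10)}}

w = row_standardize(contiguity)
row = w["7002470"]
print(round(row["neighbor-1"], 3), round(sum(row.values()), 6))
```

Each of the nine entries in the row becomes 1/9, matching the standardized values shown in the green shaded image.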