Chapter 4 Mapping & Spatial Analysis

David

jaspatial, leaflet, mapbox

4.1 Setup

Geospatial analysis in R can require significant setup of your computer and installation of a variety of packages. We have built an R package, jaspatial, that installs and loads the required packages, including sf, leaflet, tigris, mapview, and rmapshaper.

To get started, install jaspatial from GitHub using the following code:

devtools::install_github("januaryadvisors/jaspatial")

After installation, load the jaspatial library and run two additional commands to get your environment ready for analysis: load_geo_packages(), which loads sf and the other packages you’ll need, and set_geo_options(), which sets the correct options for packages like tigris.

#Let's load up some standard packages first
library(tidyverse)
library(here)
library(janitor)

#Then jaspatial
library(jaspatial)
load_geo_packages()
set_geo_options()

4.2 Reading shapefiles

There are several ways to read shapefiles into R, but the one I would recommend is sf::read_sf(). It accommodates several different kinds of spatial files (shapefiles, GeoJSON, etc.) and also allows you to import files directly from the web.

Below is an example of how you can import Harris County Commissioner Precinct boundaries directly from the City of Houston’s GIS website.

sf <- read_sf("https://opendata.arcgis.com/datasets/7237db114eeb416cb481f4450d8a0fa6_11.geojson")

If you’re reading in a shapefile, you only need to point read_sf() at the file that ends in .shp. Shapefiles come with several companion files (e.g., .dbf, .shx, .prj) that need to stay in the same folder, but you don’t read them in separately.
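
For example, reading a local shapefile looks like this (the path below is just a placeholder; swap in your own):

#Hypothetical local path -- point read_sf() at the .shp file
precincts <- read_sf(here("data", "precincts", "precincts.shp"))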

4.3 Cleaning shapefiles

After you read in your shapefile, you’ll want to clean it up. Spatial files often come with different map projections. If you want to create a map or analyze multiple shapefiles together (e.g., a point-in-polygon analysis), you’ll want to standardize your shapefiles so that they share the same projection.

The jaspatial package comes with a function called clean_shape() that performs several cleaning tasks. First, it converts the projection to WGS84. Second, it cleans the field names using janitor::clean_names() to make them more R-friendly.

clean_sf <- clean_shape(.data = sf)

names(sf)
## [1] "OBJECTID"      "PCT_NO"        "AREA_IN_MI"    "AREA_IN_SQ_FT"
## [5] "ShapeSTArea"   "ShapeSTLength" "geometry"
names(clean_sf)
## [1] "objectid"        "pct_no"          "area_in_mi"      "area_in_sq_ft"  
## [5] "shape_st_area"   "shape_st_length" "geometry"
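
Under the hood, this is roughly equivalent to the following sf and janitor calls (a sketch of the idea; the exact internals of clean_shape() may differ):

#Approximate equivalent of clean_shape() using sf and janitor directly
clean_sf_manual <- sf %>% 
  st_transform(crs = 4326) %>% #Reproject to WGS84
  janitor::clean_names() #Lowercase, snake_case field names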

You also have the option to use a different map projection. To do so, simply add an additional argument to clean_shape():

#This is an alternative projection that's good for Texas
utm14n <- st_crs("+proj=utm +zone=14 +ellps=GRS80 +datum=NAD83 +units=m +no_defs +towgs84=0,0,0")

clean_sf_utm14n <- clean_shape(.data = sf, .crs = utm14n)

4.4 Mapping your data

When you’re starting out with a new shapefile, you’ll often want to take a quick peek and make sure your data looks right. Rather than building a leaflet map, which requires several lines of code (see below), you can use mapview to pull up a quick map.

mapview(clean_sf)

TODO: Leaflet map section here
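
A bare-bones leaflet version might look something like this (the basemap and styling choices are just illustrative):

#A minimal leaflet map of the precinct polygons
leaflet(clean_sf) %>% 
  addProviderTiles(providers$CartoDB.Positron) %>% 
  addPolygons(
    color = "#444444", weight = 1, fillOpacity = 0.3,
    label = ~as.character(pct_no) #Show the precinct number on hover
  )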

4.5 Point-in-polygon analysis

One of the most basic geospatial techniques is a point-in-polygon analysis. This type of analysis is useful when you have two spatial files — one with location points (longitude and latitude), and the other with polygons (e.g., administrative boundaries).

One question you might want to answer is: which locations fall within which geographic boundaries? Let’s use the example of Health Department facilities in Harris County. We’d like to know which Commissioner Precinct each facility falls in.

#First, we need to read in the facilities file and clean it up
facilities <- read_sf("https://opendata.arcgis.com/datasets/dc3d27e2fb8b4c76ab6713909e7269a9_11.geojson") %>% 
  clean_shape()

#Next, we perform a spatial join
joined_sf <- st_join(facilities, clean_sf)
## although coordinates are longitude/latitude, st_intersects assumes that they are planar
## although coordinates are longitude/latitude, st_intersects assumes that they are planar
#Let's take a look at two of the columns
select(joined_sf, name, pct_no)
## Simple feature collection with 32 features and 2 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -95.60304 ymin: 29.6563 xmax: -95.22971 ymax: 29.90181
## CRS:           +proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs +towgs84=0,0,0
## # A tibble: 32 × 3
##    name                                 pct_no             geometry
##    <chr>                                 <int>          <POINT [°]>
##  1 MAGNOLIA HEALTH CENTER (Dental Only)      2 (-95.30104 29.73468)
##  2 KASHMERE MULTI-SERVICE CENTER             1 (-95.31653 29.80433)
##  3 FIFTH WARD MULTI-SERVICE CENTER           1 (-95.33082 29.77226)
##  4 MAGNOLIA MULTI-SERVICE CENTER             2 (-95.30101 29.73463)
##  5 MAGNOLIA MSC WIC CENTER                   2 (-95.30104 29.73455)
##  6 AIRLINE WIC CENTER                        2 (-95.38422 29.85171)
##  7 SOUTHWEST WIC CENTER                      3 (-95.49388 29.71094)
##  8 NORTHSIDE WIC CENTER                      1 (-95.34071 29.83726)
##  9 LA NUEVA CASA DE AMIGOS WIC CENTER        2 (-95.36171 29.77545)
## 10 BRAESNER WIC CENTER                       1  (-95.5289 29.67224)
## # … with 22 more rows

That might be enough for some analyses. But you might also want to summarize the data and count up how many facilities fall within each precinct.

summary <- joined_sf %>% 
  st_drop_geometry() %>% #You don't have to drop the geometry but it improves runtime
  count(pct_no)

summary
## # A tibble: 3 × 2
##   pct_no     n
##    <int> <int>
## 1      1    16
## 2      2    10
## 3      3     6

An alternative to count() is group_by() with summarise(); both are tidyverse verbs that you would normally use to summarize non-spatial data. This approach is useful if you want to compute additional summaries beyond just counting up the number of points in each polygon.

summary_alt <- joined_sf %>% 
  st_drop_geometry() %>% 
  group_by(pct_no) %>% 
  summarise(
    n = n()
  )

summary_alt
## # A tibble: 3 × 2
##   pct_no     n
##    <int> <int>
## 1      1    16
## 2      2    10
## 3      3     6

4.6 Polygon-in-polygon analysis

Another common geospatial technique is a polygon-in-polygon analysis. This comes up most often when you have summary data at one level of geography (e.g., census tracts) and wish to summarize the data at a different level of geography (e.g., neighborhoods).

In these cases, the first step is to create a crosswalk between the two geographies: determine what share of each unit in the original layer falls within each unit of the new layer. In the above example, we would want to know what share of each census tract falls within each neighborhood.

Once you’ve calculated the overlap, you redistribute the data from the original layer to the new layer based on the amount of overlap. For count data, this is a weighted sum; for percentage or rate data, a weighted average.
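
As a toy illustration with made-up numbers: if a 1,000-person tract is split 70/30 between two neighborhoods, its count contributes 700 to the first and 300 to the second.

#Toy example (made-up numbers): a 1,000-person tract split 70/30
tibble(snbname = c("A", "B"), coverage = c(0.7, 0.3), total_pop = 1000) %>% 
  mutate(weighted_pop = total_pop * coverage) #Weighted sum: 700 and 300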

(Sidenote: This approach assumes that the population is evenly distributed within each census tract. This is a fair assumption for most data and is oftentimes the best we can do. But if you’re using population data, this approach can be improved by using residential location information. Rather than weighting based on the amount of area overlap, you weight based on how many residential homes overlap between geographies. See David’s repo for Vision Galveston for an example of this method.)

Here’s an example using census tract and Super Neighborhood data. In this example, we have Census 2020 data on the number of residents who identify as Hispanic/Latino and we want to generate estimates at the Super Neighborhood level.

We’ll start by pulling in the Census 2020 redistricting file (which includes a census tract shapefile) and the City of Houston Super Neighborhood file. We’re going to pull the number of Hispanic residents and the total number of residents. Remember that we need to clean and standardize these shapefiles before using them together.

library(tidycensus)

cen2020 <- get_decennial(
  geography = "tract", 
  variables = c(total_pop = "P1_001N", hisp = "P2_003N"),
  state = 48, county = c(201, 157, 167, 339), #City of Houston stretches across multiple counties
  cache_table = T, year = 2020, geometry = T , #This argument means we also grab the shapefile
  sumfile="pl"
) %>% 
  spread(variable, value) %>% #turn from long to wide
  clean_shape()
## Getting data from the 2020 decennial Census
supern <- read_sf("https://opendata.arcgis.com/datasets/c3bfee99cbc14a899e4a603ee73203ee_3.geojson") %>% 
  clean_shape()

Next, we need to create the crosswalk. The jaspatial package provides a quick and easy function to create a crosswalk (TO BE ADDED), but here’s what’s going on under the hood:

options(scipen = 999) #I like to turn off scientific notation first

#First, find the intersection and calculate the overlap area
intersection = st_intersection(cen2020, supern) %>% #Intersect
  mutate(intersect_area = st_area(.)) %>% #Calculate the intersecting area
  dplyr::select(geoid, snbname, intersect_area) %>% #Only keep what you need
  st_drop_geometry() #You can drop the geometry at this point
## although coordinates are longitude/latitude, st_intersection assumes that they are planar
## Warning: attribute variables are assumed to be spatially constant throughout all
## geometries
#Second, join the intersection with original shapefile and calculate the % overlap (coverage)
crosswalk = cen2020 %>% 
  mutate(total_area = st_area(.)) %>% 
  left_join(., intersection, by="geoid") %>% 
  mutate(coverage = as.numeric(intersect_area/total_area)) %>% 
  filter(!is.na(snbname)) %>% 
  st_drop_geometry() %>% 
  dplyr::select(geoid, snbname, coverage)

crosswalk
## # A tibble: 1,416 × 3
##    geoid       snbname                      coverage
##    <chr>       <chr>                           <dbl>
##  1 48201411100 GREENWAY / UPPER KIRBY AREA   0.00151
##  2 48201411100 AFTON OAKS / RIVER OAKS AREA  0.994  
##  3 48201411100 NEARTOWN - MONTROSE           0.00469
##  4 48201522102 SPRING BRANCH NORTH           0.999  
##  5 48201522102 SPRING BRANCH CENTRAL         0.00100
##  6 48201423402 BRAYS OAKS                    1      
##  7 48201433100 SHARPSTOWN                    0.709  
##  8 48201433100 WESTWOOD                      0.256  
##  9 48201433100 ALIEF                         0.0348 
## 10 48201222602 GREATER GREENSPOINT           1      
## # … with 1,406 more rows

The coverage column now indicates the share (0-1) of each census tract that falls in each neighborhood. This is the weight you should use to redistribute the census tract data and calculate neighborhood estimates. Now you’re ready to do that!

neighborhood_estimate1 = left_join(cen2020, crosswalk, by="geoid") %>% 
  st_drop_geometry() %>% 
  filter(!is.na(snbname)) %>% 
  mutate(
    #Use across and multiply each variable by the coverage
    across(c("hisp", "total_pop"), ~.*coverage)
  ) %>% 
  group_by(snbname) %>% 
  summarise(
    across(c("hisp", "total_pop"), sum, na.rm=T)
  )

neighborhood_estimate1
## # A tibble: 88 × 3
##    snbname                        hisp total_pop
##    <chr>                         <dbl>     <dbl>
##  1 ACRES HOME                   19503.    30765.
##  2 ADDICKS PARK TEN             14468.    22766.
##  3 AFTON OAKS / RIVER OAKS AREA 14146.    16212.
##  4 ALIEF                        53430.   106706.
##  5 ASTRODOME AREA               13951.    16040.
##  6 BRAEBURN                      6858.    18907.
##  7 BRAESWOOD                    18904.    22286.
##  8 BRAYS OAKS                   34936.    57542.
##  9 BRIAR FOREST                 31931.    41909.
## 10 CARVERDALE                    1984.     4938.
## # … with 78 more rows

The above example is what you would do if you’re using count data. The following example shows a slight modification for data stored as percentages. However, you MUST have the proper denominator for the original level of geography.

#Just going to make that dataset real quick
percent_data = cen2020 %>% 
  mutate(hisp_pct = hisp/total_pop * 100) 

#Now let's use it to aggregate up to the neighborhood level
neighborhood_estimate2 = left_join(percent_data, crosswalk, by="geoid") %>% 
  st_drop_geometry() %>% 
  filter(!is.na(snbname)) %>% 
  mutate(
    #First, calculate the numerator for each tract
    hisp = (hisp_pct/100) * total_pop,
    
    #Then weight the numerator and denominator
    across(c("hisp", "total_pop"), ~.*coverage)
  ) %>% 
  group_by(snbname) %>% 
  summarise(
    across(c("hisp", "total_pop"), sum, na.rm=T)
  ) %>% 
  mutate(
    #Then calculate the percentage again
    hisp_pct = hisp/total_pop * 100
  )

  
neighborhood_estimate2
## # A tibble: 88 × 4
##    snbname                        hisp total_pop hisp_pct
##    <chr>                         <dbl>     <dbl>    <dbl>
##  1 ACRES HOME                   19503.    30765.     63.4
##  2 ADDICKS PARK TEN             14468.    22766.     63.6
##  3 AFTON OAKS / RIVER OAKS AREA 14146.    16212.     87.3
##  4 ALIEF                        53430.   106706.     50.1
##  5 ASTRODOME AREA               13951.    16040.     87.0
##  6 BRAEBURN                      6858.    18907.     36.3
##  7 BRAESWOOD                    18904.    22286.     84.8
##  8 BRAYS OAKS                   34936.    57542.     60.7
##  9 BRIAR FOREST                 31931.    41909.     76.2
## 10 CARVERDALE                    1984.     4938.     40.2
## # … with 78 more rows